Gazetteers are dictionaries of geographic names that store, for each named place, a triple of place name, geographic footprint, and feature type. As an important element of Geospatial Information Retrieval (GIR), these resources need to be enriched to serve new applications. Identifying new place names, adding them to the gazetteer, and keeping it up to date are central issues in gazetteer enrichment. The main challenge in this area is that most gazetteers are built with a top-down approach only; consequently, most local place names are missing from them. In addition, updating gazetteers is time-consuming and expensive. Since the emergence of Web 2.0, harvesting place names from Volunteered Geographic Information (VGI) and social media has attracted the attention of many researchers, because these sources contain local and recently created place names. Similarly, online property ads published by the public contain such place names. This article presents a data-driven method for identifying urban place names, including neighborhoods and main streets, using online real estate advertisements.
Materials and Methods
The online real estate ads of four metropolises (Tehran, Mashhad, Isfahan, and Shiraz) were mined from the Divar website. After n-gram extraction and the required pre-processing, the n-grams were labeled. To remove outlier points from the point set of each n-gram, and to account for the case where several places in a city share the same name, the point set of each n-gram was clustered. Based on a set of spatial statistics, a random forest model was trained on the housing data of each city and then tested on the ads data of the other cities.
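The first steps of this pipeline (n-gram extraction, collecting the ad locations of each n-gram, and clustering them) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names, the whitespace tokenizer, the planar coordinates, and the distance-threshold clustering with its `eps` and `min_size` parameters are all hypothetical, since the clustering algorithm and pre-processing steps are not specified here.

```python
from collections import defaultdict
from math import hypot


def extract_ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def collect_points(ads, max_n=3):
    """Map each n-gram to the coordinates of the ads that mention it.

    `ads` is an iterable of (text, x, y) tuples; x and y are assumed to be
    planar (projected) coordinates so that Euclidean distance is meaningful.
    """
    points = defaultdict(list)
    for text, x, y in ads:
        for gram in extract_ngrams(text, max_n):
            points[gram].append((x, y))
    return points


def cluster_points(points, eps, min_size):
    """Cluster points by chaining neighbours that lie within `eps` of each other.

    Groups with fewer than `min_size` points are treated as outliers and
    dropped. Each returned cluster is a sorted list of indices into `points`.
    This simple threshold clustering is an assumption, not the paper's method.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [
                j for j in unvisited
                if hypot(points[i][0] - points[j][0],
                         points[i][1] - points[j][1]) <= eps
            ]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
            members.extend(near)
        if len(members) >= min_size:
            clusters.append(sorted(members))
    return sorted(clusters)
```

With such a scheme, an n-gram that names two distinct places in the same city would yield two clusters, while isolated mentions far from both would be discarded as outliers.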
Discussion and Results
The results show that, for both main streets and neighborhoods, a model trained on ads data from one city predicts successfully on the others. For example, the models trained on the data of Tehran and tested on the data of Mashhad achieved 61% and 74% in identifying streets and neighborhoods, respectively. However, precision decreased for reasons such as dataset imbalance, data labeling challenges and, in some cases, non-spatial n-grams being identified as place names because of clustering. Recall also decreased slightly because of differences in urban patterns and place naming patterns between the cities.
A place can be referenced in two ways: by its name or by its coordinates. Gazetteers serve as a bridge between these two types of georeferencing. Given the importance of these resources in geospatial applications, their enrichment is a necessity. Because they contain local place names, online property listings are a valuable resource for harvesting toponyms and enriching gazetteers. Most users who publish online property ads mention a neighborhood or main street name that is well known to readers, so these place names are usually written without any textual clue that would let a text-processing method identify them as locations. Instead, the behavior of an n-gram with respect to a set of spatial statistics can serve as a spatial signature for recognizing it as a neighborhood or street name.
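To make the idea of a spatial signature concrete, the sketch below computes a few statistics for the point set of one n-gram cluster. The specific statistics (point count, standard distance, and the ratio of minor to major axis of the point distribution) and the function name are hypothetical choices made for illustration; the actual feature set used to train the random forest models is not listed here. Coordinates are assumed to be planar.

```python
from math import sqrt


def spatial_signature(points):
    """Compute illustrative spatial statistics for a non-empty list of (x, y) points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # Entries of the 2x2 covariance matrix of the point coordinates.
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n

    # Eigenvalues of the covariance matrix: variances along the major and
    # minor axes of the point distribution.
    half_trace = (sxx + syy) / 2
    root = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    major = half_trace + root
    minor = max(half_trace - root, 0.0)

    return {
        "count": n,
        # Root mean squared distance of the points from their mean center.
        "standard_distance": sqrt(sxx + syy),
        # Close to 0 for a linear (street-like) pattern, close to 1 for a
        # compact (neighborhood-like) pattern.
        "axis_ratio": sqrt(minor / major) if major > 0 else 1.0,
    }
```

Under these assumptions, ads mentioning a street name would tend to line up and give a small axis ratio, while ads mentioning a neighborhood would spread more evenly in both directions. A feature vector of this kind is what a classifier such as a random forest could be trained on.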