Introduction
In the journalism sector, hundreds of news articles are published every day, but unfortunately not all of them include a proper geographic reference. Articles that do reference a geographic location open the door to new ways of distribution, and also to new kinds of analysis based on this extra parameter, the location. As an example, imagine that a developer wants a user with a smartphone or tablet to automatically receive the news related to the place where the user is. Without a reference to where the events took place, this is a hard problem; with such a reference, it becomes tractable.
Although this project is rooted in the concepts of Data Science, I am not explicitly using Data Science algorithms. I do, however, follow its main principles: look for databases that can help solve the problem, clean the data so that it can be used, and finally present the results in an understandable way in order to draw conclusions.
Objectives
- Obtain a set of news to analyze
- Create a locations database
- Identify cities in a text
- Show the results in a map
Limitations to consider
This project is not about applying machine learning methods or natural language processing; it is an implementation based on simple characteristics that town names have. Keep in mind that the city recognition, as well as all the databases, is prepared to identify locations in Catalan, because each language has its own characteristics. Adapting this project to English would require changing the locations database (by simply loading the names of cities and towns in English) and switching the analyzed texts to English ones. Although it has hardly been tested in that setting, the location recognition algorithm should also give a good approximation in English.
First things first
For the locations database I consider two sets of data: one for Catalan places (a more detailed database, because Catalonia is the focus of the digital newspapers used) and a "rest of the world" database.
Catalan cities database
For Catalan cities I used the database from the Institut Cartogràfic de Catalunya (ICC), the file called Nomenclàtor. It is an Excel file, so the first step was to reduce all that information to what is really necessary and save it as a cleaned database. I preserve the following fields:
- Name of the town
- Kind of place (town, city,...)
- Coordinates: to obtain them I first had to convert from UTM to lat/lon
All this information is saved into a MongoDB database in order to be able to make queries to it.
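The cleaning step above can be sketched roughly as follows. This is a minimal sketch: the field names, the `transform` callback, and the suggested use of pyproj for the UTM-to-WGS84 conversion are my assumptions, not details taken from the original code.

```python
def clean_row(name, kind, utm_x, utm_y, transform):
    """Turn one raw Nomenclàtor row into a record ready for MongoDB.

    `transform` converts UTM easting/northing to (lon, lat). In a real
    pipeline this would likely be pyproj's Transformer for the Catalan
    UTM zone, e.g. EPSG:25831 -> EPSG:4326 (an assumption on my part).
    """
    lon, lat = transform(float(utm_x), float(utm_y))
    return {
        "name": name.strip(),
        "kind": kind.strip(),  # town, city, ...
        "location": {"type": "Point", "coordinates": [lon, lat]},
    }

# Hypothetical insertion with pymongo (database/collection names invented):
# from pymongo import MongoClient
# MongoClient().geonews.catalan_places.insert_one(clean_row(...))
```

Storing the coordinates as a GeoJSON Point makes it possible to add a 2dsphere index in MongoDB and run geospatial queries later.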
World cities
For the rest of the world I used the database from geonames.org, specifically the file cities15000.txt, which contains all cities with a population over 15,000 inhabitants. By consulting the file alternateNames.txt I obtained the Catalan names of towns around the world; then, using the geonames id of each Catalan town name, I parsed cities15000.txt to obtain more information about each city, such as its coordinates. For each city I preserve the following information:
- Name of the town
- Coordinates
- Population
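The join between the two geonames files can be sketched like this. The field positions follow the published geonames readme (tab-separated files; in cities15000.txt the geonameid is the first field, latitude and longitude are at positions 4 and 5, and population at 14; in alternateNames.txt the geonameid is at position 1 and the language code at 2). The function names are hypothetical.

```python
def catalan_names(alternate_lines):
    """Map geonameid -> Catalan alternate name from alternateNames.txt rows."""
    names = {}
    for line in alternate_lines:
        fields = line.rstrip("\n").split("\t")
        # fields: alternateNameId, geonameid, isolanguage, alternate name, ...
        if len(fields) >= 4 and fields[2] == "ca":
            names[fields[1]] = fields[3]
    return names

def world_cities(city_lines, ca_names):
    """Join cities15000.txt rows with the Catalan names found above."""
    cities = []
    for line in city_lines:
        f = line.rstrip("\n").split("\t")
        if f[0] in ca_names:
            cities.append({
                "name": ca_names[f[0]],
                "lat": float(f[4]),
                "lon": float(f[5]),
                "population": int(f[14]),
            })
    return cities
```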
Obtaining the news database
To build a database of news I created an automated process that fetches articles from each digital newspaper's RSS feed. The data was saved into a database with the following information for each article:
- Title
- Subtitle
- Text
- Publication date
The newspapers used were ara.cat, regio7.cat, and vilaweb.cat.
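A minimal sketch of the RSS step, using only the standard library; the fetching code and the exact feed fields of each newspaper are assumptions here. In a real run the XML would come from `urllib.request.urlopen` on each feed URL.

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Extract title, subtitle and publication date from each RSS <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", ""),
            # mapping <description> to the subtitle is an assumption
            "subtitle": item.findtext("description", ""),
            "pubDate": item.findtext("pubDate", ""),
        })
    return items
```

The full article text is usually not included in the feed, so it would have to be scraped from the article page separately.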
Identifying cities in a text
Creating a "key words" dictionary
For each city name I took the first word that begins with an uppercase letter, e.g. from the name "Sant Fruitós de Bages" I get the key word "Sant". For each key word I record the maximum and minimum number of words that a town name containing that key word can have; for the previous town, that amount is 4 words.
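The key-word dictionary can be built in a few lines. This sketch stores, for each key word, the minimum and maximum word count of the town names that use it; the function name is mine.

```python
def build_keyword_index(town_names):
    """Map each key word to the (min, max) word count of towns using it."""
    index = {}
    for name in town_names:
        words = name.split()
        # the key word is the first capitalised word of the name
        key = next((w for w in words if w[0].isupper()), None)
        if key is None:
            continue
        n = len(words)
        lo, hi = index.get(key, (n, n))
        index[key] = (min(lo, n), max(hi, n))
    return index
```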
A simple algorithm to identify cities
- First of all, I identify all words beginning with an uppercase letter.
- I ignore words that are on a blacklist (words that can produce false positives, for example the city of "Un" in India: it is a small city that rarely appears in Catalan newspapers, but "un" is also an article in Catalan, so it would confuse the results).
- I check whether the word is a person's first name, or whether it is preceded by one (to filter out surnames).
- For each candidate identified at this point, I take the X following words, where X is the maximum length of a town name with that key word.
- For each key word, I first check the longest option and progressively reduce the number of words until a city is identified. For example, the key word "Abella" has a maximum length of 3 and a minimum of 1. First I take the key word and the next two words, "Abella era una" (length 3), and check whether it is in the list of cities; it is not. I reduce the length by one, giving "Abella era", which is not a town name either. I continue until I reach the minimum length registered in the dictionary for this key word, which is 1. "Abella" is in the list of cities, so I save it as a found city.
- Finally, for each found city I query the cities database and save its coordinates and related information into a CSV file.
- These files are then uploaded to CartoDB to show the results on a map.
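The uppercase, blacklist and shrinking-window steps above can be sketched as below (the person-name check is left out for brevity). The tokenization and function names are a simplified reading of the description, not the original code.

```python
def find_cities(tokens, keyword_index, known_cities, blacklist=frozenset()):
    """Scan tokens for city names using the key-word window strategy."""
    found = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        if word[0].isupper() and word not in blacklist and word in keyword_index:
            lo, hi = keyword_index[word]
            matched = False
            # try the longest candidate first, then shrink by one word
            for n in range(min(hi, len(tokens) - i), lo - 1, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in known_cities:
                    found.append(candidate)
                    i += n
                    matched = True
                    break
            if matched:
                continue
        i += 1
    return found
```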
Extra step: clustering the results with the DBSCAN algorithm
To present the results more cleanly, I ran a clustering algorithm over the previous results. I used DBSCAN, which consists of the following steps:
- Look for the neighbours of each point within a given distance parameter, Epsilon.
- If the neighbourhood contains at least as many cities as specified by a parameter named minPoints, a cluster is created.
- The algorithm iteratively joins to each cluster the points reachable from its core points.
- When no more points can be joined to a cluster, the algorithm ends.
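The steps above correspond to the classic DBSCAN procedure; a compact pure-Python version (Euclidean distance, label -1 for noise) could look like this. A real project would more likely use scikit-learn's `DBSCAN`, ideally with haversine distance for lat/lon points.

```python
import math

def dbscan(points, eps, min_points):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster = 0

    def neighbours(i):
        # all points within eps of point i (the point itself included)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_points:
            labels[i] = -1          # noise, unless reclaimed later
            continue
        labels[i] = cluster         # i is a core point: start a cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_points:
                queue.extend(jn)     # j is also core: keep expanding
        cluster += 1
    return labels
```

Note that with raw lat/lon coordinates Epsilon is expressed in degrees, which distorts distances away from the equator; projecting the points or using haversine distance avoids that.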
Results
Over a sample database of 200 news articles, the accuracy was above 93%. The algorithm is not perfect, but it gives results with a high degree of reliability.
Cities with the most appearances
1. Barcelona 826
2. Madrid 298
3. Manresa 148
4. Lleida 116
5. Girona 113
6. Tarragona 82
7. Berga 76
8. Badalona 73
9. Berlín 65
10. Sabadell 61
11. París 58
12. Nova York 52
13. Terrassa 48
14. Sevilla 43
15. Londres 42
16. Reus 42
17. Munic 38
18. Mataró 37
19. Vic 33
20. Roma 29
Final maps
General map with all locations found.
Map with two layers, a heat map, and an "evolution over time" map.
Map with three layers. One to see the final results with automatic clustering that CartoDB applies, another to see the distribution regarding the newspaper used, and the last one that shows the places found over time.
Map with the results of clustering using the DBSCAN algorithm. The different layers correspond to different Epsilon distances.