Introduction
In the journalism sector, hundreds of news articles are published every day, but unfortunately not all of them include a proper geographic reference. Articles that do reference a geographic location open the door to new ways of distribution, and also to new kinds of analysis based on this extra parameter, the location. As an example, imagine that a developer wants a user with a smartphone or tablet to automatically receive the news related to the place where the user is. Without a reference to where the events took place, this is a hard problem; with such a reference, it becomes tractable.
Although this project is rooted in the concepts of Data Science, I am not explicitly using Data Science algorithms. I do, however, follow its main principles: look for databases that can help solve the problem, clean the data so that it can be used, and finally present the results in an understandable way in order to draw conclusions.
Objectives
- Obtain a set of news to analyze
- Create a locations database
- Identify cities in a text
- Show the results in a map
Limitations to consider
This project is not about applying machine learning methods or natural language processing; it is an implementation based on simple characteristics that town names have. Keep in mind that the city recognition, as well as all the databases, is prepared to identify locations in Catalan, because each language has its own characteristics. Adapting this project to English would require changing the locations database (by simply loading the names of cities and towns in English) and switching the analyzed texts to English ones. Although it has hardly been tested in that setting, the location recognition algorithm should also give a good approximation in English.
First things first
For the locations database I consider two sets of data: one for Catalan places (a more detailed database, because Catalonia is the focus of the digital newspapers used) and a "rest of the world" database.
Catalan cities database
For Catalan cities I used the database from the Institut Cartogràfic de Catalunya (ICC), the file called Nomenclàtor. It is an Excel file, so the first step was to reduce all that information to what is really necessary and save it as a cleaned database. I preserve the following fields:
- Name of the town
- Kind of place (town, city,...)
- Coordinates: to obtain them I first had to convert from UTM to lat/lon
All this information is saved into a MongoDB database in order to be able to make queries to it.
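The cleaning step above can be sketched roughly as follows. This is a minimal sketch: the field names, the `transform` callback, and the suggested use of pyproj for the UTM-to-WGS84 conversion are my assumptions, not details taken from the original code.

```python
def clean_row(name, kind, utm_x, utm_y, transform):
    """Turn one raw Nomenclàtor row into a record ready for MongoDB.

    `transform` converts UTM easting/northing to (lon, lat). In a real
    pipeline this would likely be pyproj's Transformer for the Catalan
    UTM zone, e.g. EPSG:25831 -> EPSG:4326 (an assumption on my part).
    """
    lon, lat = transform(float(utm_x), float(utm_y))
    return {
        "name": name.strip(),
        "kind": kind.strip(),  # town, city, ...
        "location": {"type": "Point", "coordinates": [lon, lat]},
    }

# Hypothetical insertion with pymongo (database/collection names invented):
# from pymongo import MongoClient
# MongoClient().geonews.catalan_places.insert_one(clean_row(...))
```

Storing the coordinates as a GeoJSON Point makes it possible to add a 2dsphere index in MongoDB and run geospatial queries later.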
World cities
For the rest of the world I used the database from geonames.org, specifically the file cities15000.txt, which contains all cities with a population over 15,000 inhabitants. By consulting the file alternateNames.txt I obtained the Catalan names of towns around the world; then, using the geonames id of each Catalan town name, I parsed cities15000.txt to obtain more information about each city, such as its coordinates. For each city I preserve the following information:
- Name of the town
- Coordinates
- Population
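The join between the two geonames files can be sketched like this. The field positions follow the published geonames readme (tab-separated files; in cities15000.txt the geonameid is the first field, latitude and longitude are at positions 4 and 5, and population at 14; in alternateNames.txt the geonameid is at position 1 and the language code at 2). The function names are hypothetical.

```python
def catalan_names(alternate_lines):
    """Map geonameid -> Catalan alternate name from alternateNames.txt rows."""
    names = {}
    for line in alternate_lines:
        fields = line.rstrip("\n").split("\t")
        # fields: alternateNameId, geonameid, isolanguage, alternate name, ...
        if len(fields) >= 4 and fields[2] == "ca":
            names[fields[1]] = fields[3]
    return names

def world_cities(city_lines, ca_names):
    """Join cities15000.txt rows with the Catalan names found above."""
    cities = []
    for line in city_lines:
        f = line.rstrip("\n").split("\t")
        if f[0] in ca_names:
            cities.append({
                "name": ca_names[f[0]],
                "lat": float(f[4]),
                "lon": float(f[5]),
                "population": int(f[14]),
            })
    return cities
```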
Obtaining the news database
To build a database of news I created an automated process that fetches articles from each digital newspaper's RSS feed. The data was saved into a database with the following information for each article:
- Title
- Subtitle
- Text
- Publication date
The newspapers used were ara.cat, regio7.cat, and vilaweb.cat.
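A minimal sketch of the RSS step, using only the standard library; the fetching code and the exact feed fields of each newspaper are assumptions here. In a real run the XML would come from `urllib.request.urlopen` on each feed URL.

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Extract title, subtitle and publication date from each RSS <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", ""),
            # mapping <description> to the subtitle is an assumption
            "subtitle": item.findtext("description", ""),
            "pubDate": item.findtext("pubDate", ""),
        })
    return items
```

The full article text is usually not included in the feed, so it would have to be scraped from the article page separately.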
Identifying cities in a text
Creating a "key words" dictionary
For each city name I took the first word that begins with an uppercase letter, e.g. from the name "Sant Fruitós de Bages" I get the key word "Sant". For each key word I record the maximum and minimum number of words that a town name containing that key word can have; for the previous town, that amount is 4 words.
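The key-word dictionary can be built in a few lines. This sketch stores, for each key word, the minimum and maximum word count of the town names that use it; the function name is mine.

```python
def build_keyword_index(town_names):
    """Map each key word to the (min, max) word count of towns using it."""
    index = {}
    for name in town_names:
        words = name.split()
        # the key word is the first capitalised word of the name
        key = next((w for w in words if w[0].isupper()), None)
        if key is None:
            continue
        n = len(words)
        lo, hi = index.get(key, (n, n))
        index[key] = (min(lo, n), max(hi, n))
    return index
```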
A simple algorithm to identify cities
- First of all, I identify all words beginning with an uppercase letter.
- I ignore words that are on a blacklist (words that can produce false positives, for example the city of "Un" in India: it is a small city that rarely appears in Catalan newspapers, but "un" is also an article in Catalan, so it would confuse the results).
- I check whether the word is a person's first name, or whether it is preceded by one (to filter out surnames).
- For each candidate identified at this point, I take the X following words, where X is the maximum length of a town name with that key word.
- For each key word, I first check the longest option and progressively reduce the number of words until a city is identified. For example, the key word "Abella" has a maximum length of 3 and a minimum of 1. First I take the key word and the next two words, "Abella era una" (length 3), and check whether it is in the list of cities; it is not. I reduce the length by one, giving "Abella era", which is not a town name either. I continue until I reach the minimum length registered in the dictionary for this key word, which is 1. "Abella" is in the list of cities, so I save it as a found city.
- Finally, for each found city I query the cities database and save its coordinates and related information into a CSV file.
- These files are then uploaded to CartoDB to show the results on a map.
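The uppercase, blacklist and shrinking-window steps above can be sketched as below (the person-name check is left out for brevity). The tokenization and function names are a simplified reading of the description, not the original code.

```python
def find_cities(tokens, keyword_index, known_cities, blacklist=frozenset()):
    """Scan tokens for city names using the key-word window strategy."""
    found = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        if word[0].isupper() and word not in blacklist and word in keyword_index:
            lo, hi = keyword_index[word]
            matched = False
            # try the longest candidate first, then shrink by one word
            for n in range(min(hi, len(tokens) - i), lo - 1, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in known_cities:
                    found.append(candidate)
                    i += n
                    matched = True
                    break
            if matched:
                continue
        i += 1
    return found
```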
Extra step: clustering the results with the DBSCAN algorithm
To present the results more cleanly, I ran a clustering algorithm over the previous results. I used DBSCAN, which consists of the following steps:
- Look for the neighbours of each point within a given distance parameter, Epsilon.
- If the neighbourhood contains at least as many cities as specified by a parameter named minPoints, a cluster is created.
- The algorithm iteratively joins to each cluster the points reachable from its core points.
- When no more points can be joined to a cluster, the algorithm ends.
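The steps above correspond to the classic DBSCAN procedure; a compact pure-Python version (Euclidean distance, label -1 for noise) could look like this. A real project would more likely use scikit-learn's `DBSCAN`, ideally with haversine distance for lat/lon points.

```python
import math

def dbscan(points, eps, min_points):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster = 0

    def neighbours(i):
        # all points within eps of point i (the point itself included)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_points:
            labels[i] = -1          # noise, unless reclaimed later
            continue
        labels[i] = cluster         # i is a core point: start a cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_points:
                queue.extend(jn)     # j is also core: keep expanding
        cluster += 1
    return labels
```

Note that with raw lat/lon coordinates Epsilon is expressed in degrees, which distorts distances away from the equator; projecting the points or using haversine distance avoids that.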
Results
Over a sample database of 200 news articles, the accuracy was above 93%. The algorithm is not perfect, but it gives results with a high degree of reliability.
Cities with the most appearances
1. Barcelona 826
2. Madrid 298
3. Manresa 148
4. Lleida 116
5. Girona 113
6. Tarragona 82
7. Berga 76
8. Badalona 73
9. Berlín 65
10. Sabadell 61
11. París 58
12. Nova York 52
13. Terrassa 48
14. Sevilla 43
15. Londres 42
16. Reus 42
17. Munic 38
18. Mataró 37
19. Vic 33
20. Roma 29
Final maps
General map with all locations found.
Map with two layers, a heat map, and an "evolution over time" map.
Map with three layers. One to see the final results with automatic clustering that CartoDB applies, another to see the distribution regarding the newspaper used, and the last one that shows the places found over time.
Map with the results of clustering using the DBSCAN algorithm. The different layers correspond to different Epsilon distances.