25 August 2016

City Similarity

A long time ago, I read an analysis that stated a person’s zip code is the most accurate predictor of obesity.
That made me wonder what else can be predicted from location alone. Can we predict which kinds of brick-and-mortar businesses are favored in a particular area?
When I started with Insight, that was my initial idea for the project. Of course, I had to scale it down significantly, because the whole Insight project had to be done in two weeks at most.
But I did not give up on the idea.
So I used Yelp data and decided to see whether I could establish a difference between cities based on their most common businesses.
In essence, the first step is counting business categories. And since Yelp data is not really clean, most of the work went into cleaning it up.
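The counting step can be sketched roughly like this. It is only an illustration: the `businesses` records and the `count_categories` helper are made up for the example, and the real Yelp data has a different layout.

```python
from collections import Counter, defaultdict

# Hypothetical records: (city, list of categories) pairs standing in for Yelp businesses.
businesses = [
    ("Phoenix", ["Restaurants", "Mexican"]),
    ("Phoenix", ["Restaurants", "Pizza"]),
    ("Phoenix", ["Auto Repair"]),
    ("Las Vegas", ["Restaurants", "Buffets"]),
    ("Las Vegas", ["Nightlife", "Bars"]),
]

def count_categories(records):
    """Return a dict mapping each city to a Counter of its business categories."""
    counts = defaultdict(Counter)
    for city, categories in records:
        counts[city].update(categories)
    return counts

category_counts = count_categories(businesses)
print(category_counts["Phoenix"].most_common(1))  # [('Restaurants', 2)]
```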
City names and business categories vary across the data set, reflecting different ideas of how people define their business and even how they write the name of their own city.
Anyway, using natural language processing methods, I performed the data munging (you can see the Python notebook on GitHub).
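To give a feel for the kind of cleanup involved, here is a small sketch that folds spelling variants of a name into one key. The `normalize_name` helper and the sample variants are assumptions for illustration; the actual munging steps are the ones in the notebook.

```python
import re
import unicodedata

def normalize_name(raw):
    """Collapse spelling variants of a city or category name into one key."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, turn punctuation into spaces, and squeeze whitespace.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()

# Hypothetical variants of the same two city names.
variants = ["Las Vegas", "las vegas ", "LAS-VEGAS", "Montréal", "Montreal"]
print(sorted({normalize_name(v) for v in variants}))  # ['las vegas', 'montreal']
```

Stripping accents and punctuation is a blunt choice: it merges "Montréal" with "Montreal", but it would also merge names that are genuinely different, so a real cleanup probably needs a manual check on top.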
Then I formed a corpus of the words matching the most common categories for each city.
From my previous analysis, I established that restaurants are quite typical in most towns. Not surprising, considering that most of us prefer to buy stuff online. And since Google Hangouts, Skype and the rest are still not good enough to replace a face-to-face meeting, we have restaurants.
Yeah, so this means I needed to do something to keep restaurants from dominating every city, or at least to pick out which kind of restaurant is most typical in a city. To do so, I needed to use phrases of two or three words instead of a single word. So I determined the most common two-word and three-word phrases per city.
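A rough sketch of the phrase counting, assuming each city's corpus is a list of category labels and that phrases are taken within a single label. The `ngrams` and `top_phrases` helpers and the sample labels are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_phrases(labels, n, k):
    """Count n-word phrases within each label and return the k most common."""
    counts = Counter()
    for label in labels:
        counts.update(ngrams(label.lower().split(), n))
    return counts.most_common(k)

# Hypothetical category labels for one city.
city_labels = [
    "Mexican Restaurants",
    "Mexican Restaurants",
    "Fast Food Restaurants",
    "Italian Restaurants",
]
print(top_phrases(city_labels, 2, 1))  # [('mexican restaurants', 2)]
print(top_phrases(city_labels, 3, 1))  # [('fast food restaurants', 1)]
```

With single words, "restaurants" would win in this toy city; with two-word phrases, the kind of restaurant shows up instead.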
And then I attempted clustering with k-means. K-means is the simplest method, so of course one should at least try it.
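For reference, the idea behind k-means fits in a few lines. This is a bare-bones sketch of Lloyd's algorithm, not the code I ran: the `kmeans` function, the 2-D toy points, and the choice of the first k points as starting centroids (to keep the result repeatable) are all assumptions for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points are used as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical 2-D vectors; real city vectors would have many more dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 1, 1]
```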
In my particular case, this was a disastrous choice. You see, the Yelp data I have contains only 362 cities, and the corpus I form from the categories ends up producing vectors with many dimensions. Too many dimensions.
I limited myself to the top 3 words per city, and two phrases per city. That gave me vector dimensions from 160 to 66. Still too many dimensions for my number of cities.
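Why does the dimension count grow so fast even with only a few words per city? Each distinct word across all cities becomes its own dimension. A tiny sketch, with made-up cities and words and a simple binary encoding that is only an assumption about how the vectors could be built:

```python
# Hypothetical top-3 words per city after cleaning.
top_words = {
    "phoenix": ["restaurants", "mexican", "auto"],
    "las vegas": ["restaurants", "nightlife", "buffets"],
    "pittsburgh": ["restaurants", "bars", "pizza"],
}

# The vocabulary is the union of every city's top words; one dimension per word.
vocabulary = sorted({word for words in top_words.values() for word in words})

def to_vector(words):
    """Binary bag-of-words vector over the shared vocabulary."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]

vectors = {city: to_vector(words) for city, words in top_words.items()}
print(len(vectors), "cities,", len(vocabulary), "dimensions")  # 3 cities, 7 dimensions
```

Three cities with three words each already need seven dimensions, because only "restaurants" is shared.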
Therefore I decided to play with a new package in Anaconda called Orange and do hierarchical clustering instead.
This approach worked.
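To show what hierarchical (agglomerative) clustering does under the hood, here is a plain-Python sketch. It is not Orange's implementation: the Jaccard distance, the average linkage, the `agglomerate` helper, and the sample cities are all assumptions picked to keep the example short, and every city is assumed to have a non-empty set of phrases.

```python
def jaccard_distance(a, b):
    """1 minus the share of items two (non-empty) sets have in common."""
    return 1.0 - len(a & b) / len(a | b)

def average_linkage(c1, c2, items):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard_distance(items[i], items[j]) for i, j in pairs) / len(pairs)

def agglomerate(items, n_clusters):
    """Start with one cluster per item and merge the two closest until n_clusters remain."""
    clusters = [[name] for name in items]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: average_linkage(clusters[xy[0]], clusters[xy[1]], items),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return clusters

# Hypothetical top phrases per city.
cities = {
    "phoenix": {"mexican restaurants", "fast food", "auto repair"},
    "tempe": {"mexican restaurants", "fast food", "coffee tea"},
    "las vegas": {"night clubs", "hotels travel", "fast food"},
    "henderson": {"night clubs", "hotels travel", "coffee tea"},
}
print(agglomerate(cities, 2))  # [['phoenix', 'tempe'], ['las vegas', 'henderson']]
```

Unlike k-means, this only needs pairwise distances between cities, and it never has to average vectors into a centroid.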


Here you can see a heat map of the similarities between cities using single-word categories. Blue represents the most similar cities (that's why we have this diagonal line: each city is very akin to itself ;-) ). Yellow represents the most different cities. Colors pass through red and green between those two, making a rather lime-colored heat map. So even with restaurants as the most common category, when we take three categories into consideration there is not much similarity between cities.
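The numbers behind such a heat map are just a city-by-city matrix of pairwise similarities. Here is a sketch using cosine similarity, which is an assumption on my part for the example and not necessarily the measure behind the picture; the vectors are made up too.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical category-count vectors, one per city.
vectors = {
    "phoenix": [3, 1, 0],
    "tempe": [3, 1, 0],
    "las vegas": [0, 1, 4],
}
names = list(vectors)
matrix = [[cosine_similarity(vectors[a], vectors[b]) for b in names] for a in names]

# Print one row per city; the diagonal is every city compared with itself.
for name, row in zip(names, matrix):
    print(f"{name:10s}", " ".join(f"{value:.2f}" for value in row))
```

The matrix is symmetric and its diagonal is all ones, which is exactly the diagonal line in the heat map.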

Heatmap of two-word phrases

The situation improves drastically when two-word phrases are analyzed. This heat map shows far more similarities between cities.

Heatmap of three-word phrases

But there is not much change when I analyze three-word phrases. The dissimilar green lines are a bit finer across the middle of the heat map, but that's all.
Anyway, this grouping of the cities into similar clusters will help with the next step: developing a machine learning algorithm that uses US census data.
Yeah, I'm greedy. I would like to expand this to all cities, not just this puny sample of 362 cities I found in the Yelp data!
