Practical Naive Bayes — Classification of Amazon Reviews

If you search around the internet for how to apply Naive Bayes classification to text, you’ll find a ton of articles that talk about the intuition behind the algorithm, maybe some slides from a lecture about the math and notation behind it, and a bunch of articles I’m not going to link here that pretty much just paste some code and call it an explanation.

So I’m going to try to do a little more here and, hopefully, write and explain enough to let you write a working Naive Bayes classifier yourself.

There are three sections here. First is setup, and what format I’m expecting your text to be in for the classification. Second, I’ll talk about how to run naive Bayes on your own, using slow Python data structures. Finally, we’ll use Python’s NLTK and its classifier so you can see how to use that, since, let’s be honest, it’s gonna be quicker. Note that you wouldn’t want to use either of these in production, so look for a follow up post about how you might go about doing that.

As always, I’m on twitter, and check out the full code on github.

Setup

Data for this post is going to be from this UCSD Amazon review data set. I swear one of the biggest issues with running these algorithms on your own is finding a data set big and varied enough to get interesting results. Otherwise you’ll spend so much of your time scraping and cleaning data that by the time you get to the ML part of the project, you’re sufficiently annoyed. So big thanks that this data already exists.

You’ll notice that this set has millions of reviews for products across 24 different classes. In order to keep the complexity down here (this is a tutorial post after all), I’m sticking with two classes, ones different enough from each other to show that classification works: we’ll be classifying baby reviews against tools and home improvement reviews.

Preprocessing

First thing I want to do now, after unpacking the .gz file, is to get a train and test set that’s smaller than the full 160,792 baby reviews and 134,476 tool reviews. For purposes here, I’m going to use 1000 of each, with 800 used for training, and 200 used for testing. The algorithms are able to support any number of training and test reviews, but for demonstration purposes, we’re keeping that number low.

Check the github repo if you want to see the code, but I wrote a script that just takes the full file, picks 1000 random numbers, segments 800 into the training set, and 200 into the test set, and saves them to files with the names “train_CLASSNAME.json” and “test_CLASSNAME.json” where classname is either “baby” or “tool”.

Also, the files from that dataset are really nice, in that each line is already a Python dict literal. So to get them into a script, all you have to do is run “eval” on each line of the file if you want the dict object.
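
Putting that together, here’s a minimal sketch of the sampling step, assuming the raw review file has one dict literal per line (the function name, input path, and output format are my own; check the repo for the actual script):

import json
import random

def make_train_test(input_path, classname, total=1000, num_train=800):
  # pick `total` random reviews, eval each line into a dict, and split them
  with open(input_path) as f:
    lines = f.readlines()
  sample = random.sample(lines, total)
  reviews = [eval(line) for line in sample]
  with open("train_%s.json" % classname, 'w') as f:
    json.dump(reviews[:num_train], f)
  with open("test_%s.json" % classname, 'w') as f:
    json.dump(reviews[num_train:], f)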

Features

There really wasn’t a good place to talk about this, so I’ll mention it here before getting into either the self-written or the NLTK version of the algorithm. The features we’re going to use are simply the lowercased versions of all the words in the review. This means that, in order to get a list of these words from the block of text, we remove punctuation, lowercase every word, split on spaces, and then remove words that are in the NLTK corpus of stopwords (basically boring words that don’t carry any information about class).

import string
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS.add('') # splitting on spaces can leave empty strings behind

def clean_review(review):
  exclude = set(string.punctuation)
  review = ''.join(ch for ch in review if ch not in exclude)
  split_sentence = review.lower().split(" ")
  clean = [word for word in split_sentence if word not in STOP_WORDS]
  return clean

Realize here that there are tons of different ways to do this, and ways to get more sophisticated that can hopefully get you better results! Things like stemming, which takes words down to their root word (wikipedia gives the example of “stems”, “stemmer”, “stemming”, and “stemmed” all being based on “stem”). You might also want to include n-grams, for n larger than 1, as well.
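
If you want to play with those ideas, NLTK has them built in. Here’s a quick sketch of what stemming and bigrams might look like on top of clean_review (not used anywhere else in this post):

from nltk.stem.porter import PorterStemmer
from nltk.util import ngrams

stemmer = PorterStemmer()
words = clean_review("My stemmer keeps stemming all the stemmed stems")
stems = [stemmer.stem(word) for word in words] # e.g. "stemming" -> "stem"
bigrams = list(ngrams(words, 2)) # pairs of adjacent words as extra features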

Basically, there’s tons of processing on the text that you could do here. But since I’m just talking about how Naive Bayes works, I’m sticking with simplicity. Maybe in the future I can get fancy and see how well I can do in classifying these reviews.

Ok, on to the actual algorithm.

Continue reading

Classifying Country Music Songs is an Art — Getting Training Data

If you’ve been following along recently, I’ve been writing about my theory of country music, and how unlike most other genres out there, country music song topics are, let’s just say, much more centralized. And so in my continuing effort to automatically classify country song topics, I need to take all the song lyrics I downloaded and manually classify them so I have some training data.

This is actually the third post I’ve written on this topic. In the first post I showed how to get song lyrics using Genius’s API and scraping, and then in the second post I gathered up all the lyrics from country artists, removed the duplicates, and realized that Lee Brice talks about beer and trucks much more than he does about love. The stats I ran at the end of the second entry are fine and all, but really what I have at the moment is some 5 thousand songs that are uncategorized, which isn’t going to allow me to do any more sophisticated classification than simple word analysis.

What this means is I’m going to need some help classifying those 5000 songs. To do this, I wrote a rails app deployed on Heroku free mode that will allow anyone to sign up and help with this task. Obviously I’m not expecting people to get through all 5000 themselves (other than me of course), but hopefully if I can get enough people to do more than a few songs, I can get a good representation from which I can get interesting results.

The rest of the article is as follows. First, I’ll have a section where I talk about my theory of country music song topics, the one I’ve been annoying my friends with whenever we talk about country music. Then in the next / last section, I’ll talk about how I’m looking to get these songs classified, what the interface is like, and what’s going on behind the scenes.

As an aside, I am somewhat of a fan of country music. I usually just say it’s pop music with a slide, and by definition, pop music is catchy. But still, those country song lyrics can get quite ridiculous, which is definitely fun to laugh at.

Also, follow me along on twitter for more updates on this, and other topics.

Country Music Song Topics

Topic 1: Love

Love. The classic song topic — universally relatable, unbounded in subtopics, and somewhat of a default topic for any story, song or otherwise. Now that I think about it, I’m not sure there’s any song genre out there that doesn’t have love as a main song topic. So it makes sense that love is quite prevalent in many of the country songs.

Whether happy songs, about how Brett Young can’t go to sleep unless his girl is next to him at night,

or sad songs, about how the guy in Billy Currington’s song, “It Don’t Hurt Like It Used To”, was broken up with, but got over it eventually. Or somewhat, cause it still hurts.

Topic 2: Small Town Life

Nothing says small town life like boots, dirt roads, dumpy bars with a cover band, railroad tracks, barns, white churches, and crop fields. No, I’m not making this up; those are just some of the things the band LoCash sings about in their recent song titled “I Love this Life”.

If that wasn’t enough, how about picket fences, blue sky and green grass, old Ford trucks, back porches, homemade wine, tire swings, fireworks, and dead deer heads waiting to be hung on the wall. Yup, that’s just what Drake White is singing about in his song “Livin’ The Dream”. I admit, if I hear this when scrolling through stations on the radio in my car that doesn’t have an aux port, I’ll turn it up.

Continue reading

Talkin’ ‘Bout Trucks, Beer, and Love in Country Songs — Analyzing Genius Lyrics

Trucks, beer, and love, all things that make country music go round. I’ve said before that country music is just pop music with a slide, and then lyrics about slightly different topics than what you’ll hear in hip hop or “normal” pop music on the radio.

In my continuing quest to validate my theory that all country songs can fit into one of four different topics, in this post, I go through lyrics to see which artists talk about trucks, beer, and love the most. In my first post on this topic, I talked about how to get song lyrics from genius and print them out on the command line.

The goal here, and what I’m going to walk you through, is how I stored info and lyrics for all the songs from the country artists, how I made sure that all the lyrics were unique, and how I then ran some stats on the songs. Another note before we go is that a lot of data work is just janitorial. The actual code for getting “interesting” results is fairly simple. The key is to enjoy doing the janitor-style coding and then you’ll be good.

If you’re interested in which country artists talk most about trucks, beer, alcohol, or small towns, skip to the end where I list out some stats. For the rest, here’s some code.

https://www.pinterest.com/pin/59180182578213991/

I wonder how they feel about beer trucks. I’m guessing they’d all be fans of them.

Step 1 — Save the Lyrics!

When doing anything with web scraping, the one thing to always, always keep in mind is that you want to hit the server as little as possible. With that in mind, what we’re going to do here is assume the inputs are names of artists. For each of those artists, find all of their songs, and then for each of those songs, grab the lyrics the way I did in the first post, and then save them locally along with some meta information the API provides.

Now when I post the following code, don’t imagine that I knew what I wanted from the start. Everything in here was created iteratively. Here’s a list of the features of this piece of code that were created along the way.

Directory structure — Within the folder that contains the main .py file, there’s a folder named artists. And within that folder, when the code runs, a folder with the artist’s name is created (if it doesn’t already exist). And within that folder, there are two more folders, info and lyrics. When we run the code, I put the lyrics in /artists/artist_name/lyrics/Song Title.txt and the info from the API, containing information about the song, like annotations, title, and song API id so we can grab it again if need be, in the file /artists/artist_name/info/Song Title.txt. The key, again, being saving all the info given to avoid unnecessary requests.

Redundancy Checking — Along with making sure to save all the info given, if we run an artist for a second time, we don’t want to re-fetch lyrics that we already have. So once we have all the songs for that artist, I run a check to see if we have a non-empty file with the name of the song already. If the file is there, we continue to the next song (there’s a small sketch of this below, after the cleaning function).

Lyric Error Checking — Ahh unicode. While great for allowing multitudes of different characters beyond the standard English alphabet and a few specialty characters, it’s not ideal when I’m trying to deal with simple song lyrics. And when saving the lyrics, I encountered more than a few random, unnecessary characters that made Python throw encoding errors. In a semi-janky rule-based solution (which isn’t great to use, see below), when I saw these errors being thrown, I would specifically replace the offending characters with the closest “normal” character. I assume there’s some library out there that would take care of all the encoding issues, but this worked for me. Also, on Genius’s end, it would be sweet if they, you know, checked for abnormal characters when lyrics were uploaded and didn’t have them in the first place. Also would be cool if they included the lyrics in the API.

def clean_lyrics(lyrics):
  lyrics = lyrics.replace(u"\u2019", "'") #right quotation mark
  lyrics = lyrics.replace(u"\u2018", "'") #left quotation mark
  lyrics = lyrics.replace(u"\u02bc", "'") #modifier letter apostrophe
  lyrics = lyrics.replace(u"\xe9", "e") #e with an accent
  lyrics = lyrics.replace(u"\xe8", "e") #e with a grave accent
  lyrics = lyrics.replace(u"\xe0", "a") #a with an accent
  lyrics = lyrics.replace(u"\u2026", "...") #ellipsis apparently
  lyrics = lyrics.replace(u"\u2012", "-") #hyphen or dash
  lyrics = lyrics.replace(u"\u2013", "-") #other type of hyphen or dash
  lyrics = lyrics.replace(u"\u2014", "-") #other type of hyphen or dash
  lyrics = lyrics.replace(u"\u201c", '"') #left double quote
  lyrics = lyrics.replace(u"\u201d", '"') #right double quote
  lyrics = lyrics.replace(u"\u200b", ' ') #zero width space ?
  lyrics = lyrics.replace(u"\x92", "'") #different quote
  lyrics = lyrics.replace(u"\x91", "'") #still different quote
  lyrics = lyrics.replace(u"\xf1", "n") #n with tilde!
  lyrics = lyrics.replace(u"\xed", "i") #i with accent
  lyrics = lyrics.replace(u"\xe1", "a") #a with accent
  lyrics = lyrics.replace(u"\xea", "e") #e with circumflex
  lyrics = lyrics.replace(u"\xf3", "o") #o with accent
  lyrics = lyrics.replace(u"\xb4", "") #just an accent, so remove
  lyrics = lyrics.replace(u"\xeb", "e") #e with dots on top
  lyrics = lyrics.replace(u"\xe4", "a") #a with dots on top
  lyrics = lyrics.replace(u"\xe7", "c") #c with squigly bottom
  return lyrics
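
To make the directory structure and redundancy check above concrete, here’s a minimal sketch of the save step (the song dict, its “title” key, and the get_song_lyrics helper are hypothetical stand-ins; the real main function is in the gist linked below):

import os

def save_song(artist_name, song):
  lyrics_dir = os.path.join("artists", artist_name, "lyrics")
  info_dir = os.path.join("artists", artist_name, "info")
  for directory in (lyrics_dir, info_dir):
    if not os.path.exists(directory):
      os.makedirs(directory)
  # redundancy check: skip the song if we already have non-empty lyrics for it
  lyrics_path = os.path.join(lyrics_dir, "%s.txt" % song["title"])
  if os.path.exists(lyrics_path) and os.path.getsize(lyrics_path) > 0:
    return
  lyrics = clean_lyrics(get_song_lyrics(song)) # hypothetical scraping helper
  with open(lyrics_path, 'w') as f:
    f.write(lyrics)
  # the info dict from the API gets written into info_dir the same way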

Check out most of the main function below. If you’re looking for the actual full file, check out this gist. It’s easier to post that on Github than to format the entire thing here.

Continue reading

Getting Song Lyrics from Genius’s API + Scraping

Genius is a great resource. At a high level, Genius has song lyrics and allows users to comment on what the artist meant. Starting as Rap Genius, where users annotated rap lyrics, the site rebranded as “Genius”, allowing all songs to be talked about. According to their website, “Genius is the world’s biggest collection of song lyrics and crowdsourced musical knowledge.” Recently even, they’ve moved to allowing annotations of pretty much anything posted online.

I’ve used it a bunch recently while trying to figure out what the hell Frank Ocean was trying to say in his new album Blond. Users of the site explained tons of Frank’s references that went whoosh right over my head when I listened the first time and all the times after.

And recently, when I had some ideas for mini projects using song lyrics, I was pretty happy to find that Genius had an API for getting the data on their site. Whenever I’m trying to get data elsewhere, I’m much happier with an API, or at least being able to get it from JSON responses rather than parsing HTML. It’s just cleaner to look at, and with an API, I can expect good documentation that isn’t going to change with css updates.

Their API docs looked pretty good at first glance, with endpoints for artists, songs, albums, and annotations. One thing I did notice was that they don’t have an artist search entry point. A lot of what I want to do is artist based, meaning I need to know the artist id for everyone. And in order to get that, I have to search for the artist, grab a song from the results, hit the song endpoint for that song’s information, and then grab the artist id from there. It’d be nice if you could specify what type of thing you’re searching for when you hit the search endpoint so you don’t have to go through that whole charade just to get the artist. But that’s a blog post for another time. Overall, they give out tons of information pretty easily.
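
To make that charade concrete, here’s roughly what it looks like with the requests setup shown later in this post (the JSON field names are my reading of the docs, so treat them as assumptions):

import requests

base_url = "http://api.genius.com"
headers = {'Authorization': 'Bearer TOKEN'} # your API token, as described below

# search for the artist and grab a song id from the first hit...
search = requests.get(base_url + "/search", params={'q': "Lee Brice"}, headers=headers)
song_id = search.json()['response']['hits'][0]['result']['id']

# ...then hit the song endpoint and pull the artist id out of that response
song = requests.get("%s/songs/%s" % (base_url, song_id), headers=headers)
artist_id = song.json()['response']['song']['primary_artist']['id']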

But why, Genius, why don’t you have an endpoint for getting the raw lyrics of a song?! You have a songs endpoint on the API, and you give me a ton of information from there — the song title, album name, featured artists on the song, number of annotations, images associated with the song, album information, page views for that song, and a whole host of more data. But the one thing you don’t give me, and the one thing that people using the API probably want the most, is plain text lyrics!

Pre-Genius, I was stuck with these jankily laid out sites with super old looking css that would have the lyrics, but not necessarily correct, and definitely no annotations. Those sites are probably easily scrapeable considering their simplicity, but searching for the right song would be more difficult, and the lyrics might not be correct. Genius solved this all now for a web user, but dammit, I want the lyrics in the API!

Now you might be able to get the entire set of lyrics by using the annotations endpoint, which has information about all the annotations for a certain song or article, but that would require a song to have annotations for every word in the song. For someone like Chance the Rapper, who like Frank Ocean (and most other hip hop artists) uses tons of references in his lyrics, having complete annotations might not be an issue. But for Jake Owen, whose new single “American Country Love Song” has probably the most self explanatory lyrics ever (sorry for throwing you under the bus here, Jake. Still a fan), there’s no need to annotate anything, and getting the lyrics in this manner wouldn’t work.

The lyrics are there on the internet however, and I can get at them by hitting the song endpoint, and using the web url that it returns. The rest of this article will show you how to do that using Python and its requests and BeautifulSoup libraries. But I shouldn’t have to resort to HTML parsing, and I don’t think Genius wants users doing that either.

I’m left here wondering why they don’t want to give up the lyrics so easily, and I really don’t have much to go on. Genius’s goal seems to be annotating the internet. They’ve already moved on from their initial site of Rap Genius, into all music, and now into speech transcripts, as well as pretty much any other content on the web. Their value comes from those annotations themselves, not the information they’re annotating. They give away the annotations freely, but not the information (lyrics) in this case.

Enough speculation on why Genius doesn’t spit out the lyrics to a song when you ask for the other information. And as I’m writing this, I realize I easily could have overlooked something in their API and Genius might return the full lyrics after all. In that case, half of this article will be pointless and I’ll hang my head in shame for yelling at them like I did.

For purposes here, I’m going to show you how to get the song lyrics from Genius if you have the song title, and also talk through my process of getting there.

Note of clarification, just to make sure I’m not violating their terms of service, this post is for informational purposes only. Hopefully this can help programmers out there learn. Don’t do something bad with this knowledge. Code time!

First thing you’re going to need is an account set up with Genius. You can sign up from the upper right hand corner of the genius.com homepage. After that, navigate to the api docs where you’ll then see your Bearer token that you’ll need for all API requests.

I’m using the requests library here, and once you have the bearer token, here’s what all the API requests to Genius should look like if, for example, you’re searching for a song title.

import requests

#TOKEN below should be the string that the API docs tells you
#Clearly I'm not giving mine out here on the internet. That'd be dumb
base_url = "http://api.genius.com"
#Key line below here, this is how to authorize your request when
#using the API
headers = {'Authorization': 'Bearer TOKEN'}
search_url = base_url + "/search"
song_title = "In the Midst of It All"
data = {'q': song_title}
response = requests.get(search_url, params=data, headers=headers)

The response, according to the Genius API, would be a list of songs that match that string passed in, with the first result being the Tom Misch song that I was going for. By changing around the url that is passed into the request method, you can access all the information that Genius supplies from the API (pretty much everything but the lyrics).
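
And since the lyrics themselves still aren’t in that response, the scraping half looks roughly like this, assuming the hit’s “url” field points at the public song page and the lyrics sit in a div with class “lyrics” (true when I checked, but markup like that can change out from under you):

from bs4 import BeautifulSoup

song_info = response.json()['response']['hits'][0]['result']
page_url = song_info['url']

# grab the public song page and pull the lyrics text out of the html
page = requests.get(page_url)
html = BeautifulSoup(page.text, "html.parser")
lyrics = html.find("div", class_="lyrics").get_text()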

Continue reading

Predicting PGA Tour Scoring Average from Statistics Using Linear Regression

First off, I admit, that’s probably the most boring title for a blog post ever. It gets a negative value on the clickbait scale that is generally unseen in the modern, “every click equals dollars” era that we live in. On the other hand, it tells you exactly what this article is about — predicting scoring average using stats.

In this article, I’ll go through getting the data from the database, cleaning that data for use, and then running a linear regression in order to generate coefficients for each of the stats to generate scoring average predictions. Oh, and some analysis and commentary at the end!

Shameless shoutout to my other blog, Golf on the Mind. Check it out and subscribe to the newsletter / twitter / instagram if you’re into golf at all. Or ignore, and keep reading for some code!

Here’s a pic of a golf course to get you in the mood.

Getting the data

Last time if you remember, I spent all this effort taking the csv stat files, and putting the information into a database. Start there if you haven’t read that post yet. It’ll show how I grabbed the stats and formatted them.

Now that you’re back in the present we need to create a query that gets the stats for the players for a specific year. An example row in a CSV file of the data would be something like:

player_id, player_name, stat_1_value, stat_2_value, … , stat_n_value

for stats 1 to n, where n (the number of stats) and the stats themselves (driving distance, greens in regulation, etc.) vary depending on inputs.

Now let me say, I am not an expert in writing sql queries. And since people on the internet loooove to dole out hate in comments sections, I’m just going to say that there’s probably a better way of writing this query. Feel free to let me know and I can throw an edit in here, but this query works just fine.

select players.id,
  players.name,
  max(case when stat_lines.stat_id=330 then stat_lines.raw else null end) as putting_average,
  max(case when stat_lines.stat_id=157 then stat_lines.raw else null end) as driving_distance,
  max(case when stat_lines.stat_id=250 then stat_lines.raw else null end) as gir,
  max(case when stat_lines.stat_id=156 then stat_lines.raw else null end) as driving_accuracy,
  max(case when stat_lines.stat_id=382 then stat_lines.raw else null end) as scoring_average
from players
  join stat_lines on stat_lines.player_id = players.id
  join stats on stat_lines.stat_id=stats.id
where stat_lines.year=2012 and (stats.id=157 or stats.id=330 or stats.id=382 or stats.id=250 or stats.id=156) and stat_lines.raw is not null
group by players.name,players.id;

High level overview time! We’re selecting player id, and player name, along with their stats for putting average, driving distance, greens in regulation, driving accuracy and scoring average for the year 2012. In order to get the right stats, we need to know the stat id for the stats.

One more thing. This query is funky, and I probably could have designed the schema differently to make this prettier. For example, I could have just gone with one table, stat_lines, with fields for player_name and stat_name (along with all the current fields) and then the sql would be very simple. But there are other applications to keep in mind. What if you wanted to display all stats for a player? Or all of a player’s stats for a certain year? With the way I have the schema set up, those queries are simple and logical. For this specific case, I’ll deal with the complexity.

Loading the Data

That query above is great, but it’s not going to cut it if I have to specify the year and the stat ids in that string every time I run the script. Gotta be dynamic here.
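
As a rough sketch of what I mean (my guess at the approach, not the exact code from the repo), you could build the query string from a year and a dict mapping column names to stat ids:

def build_stats_query(year, stat_columns):
  # stat_columns is e.g. {'putting_average': 330, 'driving_distance': 157, ...}
  case_lines = ["max(case when stat_lines.stat_id=%d then stat_lines.raw else null end) as %s" % (stat_id, name)
                for name, stat_id in stat_columns.items()]
  id_list = ",".join(str(stat_id) for stat_id in stat_columns.values())
  return """select players.id, players.name, %s
    from players
      join stat_lines on stat_lines.player_id = players.id
    where stat_lines.year=%d and stat_lines.stat_id in (%s) and stat_lines.raw is not null
    group by players.name, players.id;""" % (",\n      ".join(case_lines), year, id_list)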

Continue reading

Python, Postgres, SQLAlchemy, and PGA Tour Stats

A little while ago, I wrote an article about scraping a bunch of PGA Tour stats. The end result of that was writing those stats out into CSV files. While that was suitable for the task of gathering the stats, let’s face it, you’re probably going to want to put them into some database to allow for easier querying, or possibly integrate them into a web app in the future. There are a bunch of different reasons for wanting this, so I’m going to go through the process I took to put all the data in the CSV files into the database.

Adding players to the database

First step is to fire up postgres! I’m not going to cover starting postgres since there’s so much good content about it, for example, this super good tutorial here by Digital Ocean. I created a database called ‘pgatour’, created a user named ‘pgatour_user’ with password ‘pgatour_user_password’, logged in, and created the first table, players.

pgatour=# create table players (
  id serial PRIMARY KEY,
  name varchar (255) NOT NULL
);

Ok, now, as a test, I’m going to add myself into the database from the psql command line.

pgatour=# insert into players (name) values ('Jack Schultz');

Note that since id is serial, we don’t need to insert that value, just the name. Alas, I am not on the PGA Tour, so I’m going to need to delete myself.

pgatour=# select * from players;
id | name
----+--------------
1 | Jack Schultz
(1 row)

pgatour=# delete from players where name = 'Jack Schultz';
DELETE 1
pgatour=# select * from players;
id | name
----+------
(0 rows)

Looking good. Now onto the Python side.

Python SQL Alchemy Interface

Now that the Players table in the database is set up, we’re going to want to be able to modify the contents of it in Python.
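
Here’s a minimal sketch of what that interface can look like with SQLAlchemy’s declarative models, pointed at the pgatour database and user created above (not necessarily the exact setup from the repo):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

engine = create_engine('postgresql://pgatour_user:pgatour_user_password@localhost/pgatour')
Base = declarative_base()

class Player(Base):
  __tablename__ = 'players'
  id = Column(Integer, primary_key=True)
  name = Column(String(255), nullable=False)

# add, then remove, a test player, mirroring the psql session above
Session = sessionmaker(bind=engine)
session = Session()
me = Player(name='Jack Schultz')
session.add(me)
session.commit()
session.delete(me)
session.commit()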

Continue reading

The Special Relationship Between Noodles and Qdoba

I’ve had a theory that for every Noodles, there’s a Qdoba right next door. It might be some sort of selection bias, however, since I can think of a couple locations where they’re directly next to each other. To me, Noodles and Qdoba have a special relationship, at least compared to other restaurants. I figured now was about the time I should test this, and I can use Chipotle as the comparison.

The question is: Which restaurant is more special to Noodles, Qdoba or Chipotle?

Finding the Noodles, Qdoba, and Chipotle locations

Initially, I went to Noodles’ website and their locations page and was planning on getting the data from there. But what I realized was that it just used the Google Maps API to get its data, so I might as well go right to the Google source and use their API directly.

Google’s docs are pretty good in this case, and after grabbing an API key, I started in on finding the Dobas. For prototyping, I just started with the latitude and longitude of Milwaukee, my home town, and a place where I know there are multiple Qdoba / Noodles pairs.

import requests
url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'
location_milwaukee = '43.0389,-87.9065' #Milwaukee
params = {}
params['key'] = GOOGLE_PLACES_API
params['type'] = 'restaurant'
params['radius'] = 50000 #in meters, and going to be an issue
params['keyword'] = 'Qdoba'
params['location'] = location_milwaukee
r = requests.get(url, params=params)
results = r.json()['results']
print results

Put your Google Places API key in the ‘key’ param, run those lines of code (assuming you pip installed requests) and you’ll see 20 Qdoba locations along with some extra information spit out on your console.

Issues

Two obstacles came up with this part of the project – one simple to fix, the other decently tough. First the simple one.

In order to limit the amount of information coming across the wire, Google limits each API request to 20 results. When there are more than 20 results, they also pass back in the json a param named “next_page_token”. So when we see this param passed back, we need to stick with the same location, add the param “pagetoken”, and hit the same endpoint again. There’s also a time aspect to this request where we need to wait a couple seconds before hitting the endpoint to grab the remaining locations. Not too bad.

Second issue here, and somewhat of an annoying one, is the radius parameter. 50 km is not quite the size of the entire US. This is actually a really interesting problem where, after talking with work colleagues, there isn’t a straightforward solution. What we really need here is a set of latitudes and longitudes that, with the 50 km radius, will cover the entirety of the United States. Sure, you could put a location every mile or so, but that would take forever to search. Finding a proper solution to this problem isn’t in the scope of this article (maybe later). Instead, I found this nice gist of the top 246 metro locations in the US and their latitudes and longitudes, and I’m just going to use that and hope it covers enough of the country to be useful.

Complete code for this part of the project includes writing the locations of the restaurants to a tab separated values (tsv) file. Normally I would use a csv, but since the addresses have commas in them, it could get confusing.

import csv
import time
import requests

from major_city_list import major_cities

# url and GOOGLE_PLACES_API are the same values defined in the snippet above
keyword_qdoba = 'Qdoba Mexican Eats'
keyword_noodles = 'Noodles & Company'
keyword_chipotle = 'Chipotle'
search_keywords = [keyword_qdoba, keyword_noodles, keyword_chipotle]

params = {}
params['key'] = GOOGLE_PLACES_API
params['type'] = 'restaurant'
params['radius'] = 50000
for keyword in search_keywords:
  params['keyword'] = keyword
  keyword_info = {}
  for city in major_cities:
    print city["city"]
    location = "%s,%s" % (city["latitude"], city["longitude"])
    params['location'] = location
    while True:
      r = requests.get(url, params=params)
      results = r.json()['results']
      num_results = len(results)
      print "results: %s" % num_results
      for result in results:
        lat = result["geometry"]["location"]["lat"]
        lng = result["geometry"]["location"]["lng"]
        key = "%s%s" % (lat, lng * -1)
        address = result["vicinity"]
        info = {"lat": lat, "lng": lng, "address": address}
        keyword_info[key] = info
      try:
        # more than 20 results for this location, so grab the next page token
        next_page_token = r.json()['next_page_token']
        params["pagetoken"] = next_page_token
        time.sleep(2)
      except KeyError:
        params.pop("pagetoken", None)
        break

  filename = "%s.tsv" % keyword
  filename = filename.lower().replace(" ", "_")
  with open(filename, 'wb') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for key, info in keyword_info.iteritems():
      writer.writerow([info['lat'],info['lng'],info['address']])

Final thing to point out here is why I have this be a multi step process. I could have written a script that does this part and then all the rest of the project at once. But you’ll find that when working on things and bugfixing, it’s better to split tasks up, save the intermediate results, and then use those results without having to go back out to the internet.

Finding nearest companion

Step two of this process is finding the closest Qdoba and Chipotle for each Noodles. With that information, we can figure out how far away the nearest companion is. At first, I was tempted to go right back to the Google Places API since, well, it was designed for this purpose. But first, I decided to see if I could brute force it with the n^2 loop over every pair of locations to find the shortest distance. Turns out that was a great decision because it was way quicker and more accurate.

Code steps are 1) read in the noodles.tsv file generated above, 2) read in the chipotle and qdoba .tsv files, 3) for each Noodles, loop over the entire other file and store the closest location, 4) store that information in another tsv file. In this case, the code is easier to figure out than an explanation.

import csv
from geopy.distance import vincenty

keywords = ['chipotle', 'qdoba']
noodles_locations = []
filename = "noodles.tsv"
with open(filename, 'rb') as tsvfile:
  reader = csv.reader(tsvfile, delimiter='\t')
  for row in reader:
    noodles_locations.append(row)
for keyword in keywords:
  information = []
  filename = "%s.tsv" % keyword
  keyword_locations = []
  with open(filename, 'rb') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for row in reader:
      keyword_locations.append(row)
  count = 0
  for noodle_location in noodles_locations:
    print count
    test_loc = (noodle_location[0], noodle_location[1])
    best_distance = 100000 #something large
    for location in keyword_locations:
      found_loc = (location[0], location[1])
      distance = vincenty(test_loc, found_loc).miles
      if distance < best_distance:
        best_distance = distance
        best_location = [location[0], location[1], location[2]]
    info_row = [noodle_location[0], noodle_location[1], noodle_location[2], best_location[0], best_location[1], best_location[2]]
    information.append(info_row)
    count += 1
  # write each Noodles location out along with its closest companion
  filename = "noodles_closest_%s.tsv" % keyword
  with open(filename, 'wb') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for info in information:
      writer.writerow(info)

Analyze!

For my dumb theory to be true, there needs to be a disproportionate number of Qdobas and Noodles within walking distance of each other, and specifically, right next to each other compared to Chipotle.

After analyzing the data, I’m totally right.

I found 418 Noodles, 790 Chipotles, and 618 Qdobas. Even with the extra 172 Chipotles, a Noodles is still more likely to have a Qdoba closer to it than a Chipotle.

Some numbers. If you’re at a Noodles, there’s a 12.7% chance you’re within 0.1 miles of a Qdoba, 19.9% chance you’re within 0.25 miles, and 35.9% chance you’re within 1 mile. Chipotle has percentages of 6.4%, 12.7%, 30.6% respectively.
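
For reference, here’s a rough sketch of how you could compute those percentages from the noodles_closest_qdoba.tsv file written above, recomputing the distance since the file only stores the two locations:

import csv
from geopy.distance import vincenty

distances = []
with open("noodles_closest_qdoba.tsv", 'rb') as tsvfile:
  for row in csv.reader(tsvfile, delimiter='\t'):
    noodles_loc = (float(row[0]), float(row[1]))
    qdoba_loc = (float(row[3]), float(row[4]))
    distances.append(vincenty(noodles_loc, qdoba_loc).miles)

for cutoff in (0.1, 0.25, 1.0):
  within = sum(1 for d in distances if d <= cutoff)
  print "within %s miles: %.1f%%" % (cutoff, 100.0 * within / len(distances))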

Check out the histograms:

[Histograms of the distance from each Noodles to the nearest Chipotle and to the nearest Qdoba]

While not much of a difference, you can see a little more action on the left side of the Qdoba histogram compared to the Chipotle one.

As a final, final test, I went through each Noodles location again, found the nearest Qdoba and nearest Chipotle, and counted the number of Noodles that had a Qdoba closer versus a Chipotle closer. Final tally: 214 had a Qdoba closer, 204 had a Chipotle closer.

So how close are Qdobas and Chipotles from each other?

For fun, I ran the code to see how close the nearest Chipotle was from each Qdoba.

6.6% of Qdobas had a Chipotle within 0.1 miles, 12.8% had one within 0.25 miles, and 28% had one within 1 mile. Semi-surprising that it was this high, but I guess people don’t want to go far for food.

The histogram definitely tells the story that Chipotles and Qdobas are further apart. Check out the y axis scaling here.

[Histogram of the distance from each Qdoba to the nearest Chipotle]

What’s the point of this?

Knowing this kind of information really isn’t all that useful. Fun, sure, but not particularly useful. But what it does show is how powerful knowledge of the internet and programming can be. In just a short amount of time, we went from a dumb theory about restaurants to finding an answer. Also, maybe you’re looking to open a Qdoba somewhere in the US, and want to know if there’s a lonely Noodles that needs a companion!

Follow along on twitter, and get in contact if there’s information you want gathered from the internet. I can help you out!