Category Archives: Python

General Tips for Web Scraping with Python

The great majority of the projects about machine learning or data analysis I write about here on Bigish-Data start with scraping data from websites. And I get a bunch of contact emails asking me either for the data I've scraped myself, or for help getting the code to work for them. Because of that, I figured I should write something here about the process of web scraping!

There are plenty of other things to talk about when scraping, such as specifics on how to grab the data from a particular site, which Python libraries to use and how to use them, how to write code that scrapes the data in a daily job, where exactly to look to figure out how to get the data from random sites, etc. But since there are tons of specific tutorials online already, I'm going to talk about overall thoughts on how to scrape. There are three parts to this post: how to grab the data, how to save the data, and how to be nice.

As with everything programming-wise, if you're looking to learn scraping, you can't just read tutorials and think to yourself that you know how to do it. Pick a project, practice grabbing the data, and then write a blog post about what you learned.

There definitely are tons of different thoughts on scraping, but these are the ones I've learned from doing it for a while. If you have questions, comments, or want to call me out, feel free to comment or get in contact!

Grabbing the Data

The first step for scraping data from websites is to figure out where the sites keep their data, and what method they use to display that data in the browser. For this part of your project, I suggest writing your code in a file named gather.py, which performs all of these tasks.
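To make that concrete, here's a minimal sketch of what a gather.py could start out as, using requests and BeautifulSoup (the URL and the selector here are placeholders, not a real site):

import requests
from bs4 import BeautifulSoup

def gather(url):
  # Grab the raw html and hand it off to a parser
  response = requests.get(url, headers={"User-Agent": "my-scraper"})
  response.raise_for_status()
  soup = BeautifulSoup(response.text, "html.parser")
  # Pull the data out of whatever element the site uses to hold it
  return [row.get_text(strip=True) for row in soup.select("table tr")]

if __name__ == "__main__":
  print(gather("http://example.com/stats"))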

Continue reading

A Practical Use For Python Decorators — Logging, Error Checks, and Timing

When using a Python decorator, especially one defined in another library, they can seem somewhat magical. Take for example Flask's routing mechanism. If I put a statement like @app.route("/") above my logic, then poof, suddenly that code will be executed when I go to the root url on the server. And sure, decorators make sense when you read the many tutorials out there that describe them. But for the most part, those tutorials just explain what's going on, mostly by printing out some text, not why you might want to use a decorator yourself.

I was of that opinion before, but recently I realized I have the perfect use for a decorator in a project of mine. In order to get the content for Product Mentions, I have Python scrapers that go through Reddit looking for links to Amazon products, and once I find one, I gather up the link and use the Amazon Product API to get information on the product. Once that's in the database, I use Rails to display the items to the user.

While doing the scraping, I also wanted a web interface so I can check for errors, see how long the jobs are taking, and overall make sure I haven't missed anything. So along with the actual Python script that grabs the html and parses it, I created a table in the database for logging the scraping runs, and I update that for each job. Simple, and it does the job I want.

The issue I came across here, and where decorators come into play, is code reuse. After some refactoring, I have a few different jobs, all of which follow the same format: create an object for this job, commit it to the db so I can see that it's running in real time, try some code that does the actual work, except and log any error so we don't crash the process, and then record the end time of the job.

def gather_comments():
  # Log the start of the job so it shows up in the web interface right away
  scrape_log = ScrapeLog(start_time=datetime.now(), job_type="comments")
  session.add(scrape_log)
  session.commit()

  try:
    rg = RedditGatherer()
    rg.gather_comments()
  except Exception as e:
    # Record the error instead of letting it crash the process
    scrape_log.error = True
    scrape_log.error_message = str(e)

  # Record the end time whether or not the job errored
  scrape_log.end_time = datetime.now()
  session.add(scrape_log)
  session.commit()

def gather_threads():
  scrape_log = ScrapeLog(start_time=datetime.now(), job_type="threads")
  session.add(scrape_log)
  session.commit()

  try:
    rg = RedditGatherer()
    rg.gather_threads()
  except Exception as e:
    scrape_log.error = True
    scrape_log.error_message = str(e)

  scrape_log.end_time = datetime.now()
  session.add(scrape_log)
  session.commit()

If you know a bit about how decorators work, you can already see how perfect an opportunity this is, because decorators allow you to extend and reuse functionality on top of functions you already have. I want to log, time, and error check my scraping, and copy-pasting the same code is not ideal. But a decorator is. Here's how to write one.

Decorator Time

The first thing to do is write a function that takes a function as a parameter and calls that function at the appropriate time. Since the functions above all follow the same format, this turns out really nicely.
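To give you a sense of where this is headed before the full walkthrough, here's a rough sketch of the kind of decorator I ended up with. The log_scrape name is mine, and ScrapeLog, session, and RedditGatherer are the same objects used in the snippets above:

from datetime import datetime
from functools import wraps

def log_scrape(job_type):
  # Sketch of a logging / error-checking / timing decorator
  def decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
      scrape_log = ScrapeLog(start_time=datetime.now(), job_type=job_type)
      session.add(scrape_log)
      session.commit()
      result = None
      try:
        result = func(*args, **kwargs)
      except Exception as e:
        scrape_log.error = True
        scrape_log.error_message = str(e)
      scrape_log.end_time = datetime.now()
      session.add(scrape_log)
      session.commit()
      return result
    return wrapper
  return decorator

@log_scrape("comments")
def gather_comments():
  rg = RedditGatherer()
  rg.gather_comments()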

Continue reading

Running Python Background Jobs with Heroku

Recently, I’ve been working on a project that scrapes Reddit looking for links to products on Amazon. The basic idea is that there’s valuable info in what people are linking to and talking about online, and a starting point is looking for links to Amazon products on Reddit. The result of that work turned into Product Mentions.

To build this (and I can talk more about this later), I have two parts. The first is a basic Rails app that displays the products and where they’re talked about, and the second is a Python app that does the scraping and also displays the scraping logs for me using Flask. I thought of just combining the two at first, but decided it was easier in both regards to separate them. The scraper populates the database, and the Rails app displays what’s in there. I hosted the Rails app on Heroku, and after some poking around, decided to also run the Python scraper on Heroku (for now at least!).

Also, if at this point you’re thinking to yourself, “why the hell is he using an overpriced, web app hosting service like Heroku when there are so many other options available?”, you’re probably half right, but in terms of ease of getting started, Heroku was by far the easiest PaaS to get this churning. Heroku is nice, and this setup is really simple, especially compared to some of the other PaaS options out there that require more configuration. You can definitely look at different options if you’re doing a fuller web crawl, but this’ll work for a lot of purposes.

So what I’m going to describe here today is how I went about running the scrapers on Heroku as background jobs, using clock and worker processes. I’ll also talk a little about what’s going on, so it makes a little more sense than those copy-paste tutorials I see a lot (though that type of tutorial from Heroku’s docs is what I used here, so I can’t trash them too badly!).

worker.py

The first file you’re going to need here is a worker file, which will perform whatever function it sees coming off a queue. For ease, I’ll name this file worker.py. It will connect to Redis, just wait for a job to be put on the queue, and then run whatever it sees. First, we need rq, the library that deals with Redis in the background (all of this is assuming you’re in a virtualenv).

$ pip install rq
$ pip freeze > requirements.txt

This is the only external library you’re going to need for a functioning worker.py file, as specified by the nice Heroku doc. The code below imports the required objects from rq, connects to Redis using either an environment variable (which would be set in a production / Heroku environment) or a local default, creates a worker, and then calls work. So in the end, running python worker.py will just sit there waiting for jobs to run, in this case, scraping Reddit. We also have ‘high’, ‘default’, and ‘low’ job types so the queue will know which ones to run first, but we aren’t going to need that here.

import os

import redis
from rq import Worker, Queue, Connection

# Queues the worker will listen on, in priority order
listen = ['high', 'default', 'low']

# Use the Redis To Go URL on Heroku, fall back to a local Redis otherwise
redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')

conn = redis.from_url(redis_url)

if __name__ == '__main__':
  with Connection(conn):
    worker = Worker(map(Queue, listen))
    worker.work()

clock.py

Now that we have the worker set up, here’s the clock.py file that I’m using to schedule the scraping. It imports the conn variable from the worker.py file and uses that to make sure we’re connected to the same Redis queue. We also import the functions that run the scrapers from run.py, and in this file create functions that enqueue those respective functions. Then we use apscheduler to schedule when we want to call these functions, and start the scheduler. If we run python clock.py, the scheduler will run in perpetuity (hopefully) and will call the correct code on the intervals we defined.
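Since the full walkthrough continues below, here’s just a rough sketch of the shape clock.py takes. The interval lengths and the run.py function names are placeholders, not the actual project code:

from apscheduler.schedulers.blocking import BlockingScheduler
from rq import Queue

from worker import conn  # same Redis connection the worker uses
from run import gather_comments, gather_threads  # placeholder scraper entry points

q = Queue(connection=conn)
sched = BlockingScheduler()

@sched.scheduled_job('interval', minutes=15)
def enqueue_comments():
  # Put the comment scraping job on the queue; the worker picks it up from there
  q.enqueue(gather_comments)

@sched.scheduled_job('interval', hours=1)
def enqueue_threads():
  q.enqueue(gather_threads)

sched.start()  # blocks and fires the jobs on the intervals above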

Continue reading

Classifying Amazon Reviews with Scikit-Learn — More Data is Better Turns Out

Last time, I went through some basics of how the naive Bayes algorithm works and the logic behind it, and implemented the classifier myself, as well as with NLTK. That’s great and all, and hopefully people reading it got a better understanding of what was going on, and possibly how to play along with classification for their own text documents.

But if you’re looking to train and actually deploy a model (say, a website where people can copy and paste reviews from Amazon and see how our classifier performs), you’re going to want to use a library like Scikit-Learn. So with this post, I’ll walk through training a Scikit-Learn model and testing various classifiers and parameters to see how we do. At the end, we’ll have an initial, version 1 Amazon review classifier that we can use in a production setting.

Some notes before we get going:

  • For a lot of the testing, I only use 5 or 10 of the full 26 classes that are in the dataset.
  • Keep in mind that what works here might not be the same for other data sets. We’re specifically looking at Amazon product reviews. For a different set of texts (you’ll also see the word corpus being thrown around), a different classifier or parameter set might work better.
  • The resulting classifier we come up with is, well, really really basic, and probably what we’d have guessed would perform the best at the outset. All the time and effort that goes into checking all the combinations mostly just confirms that guess.
  • I’m going to mention here this good post that popped up when I was looking around for other people who wrote about this. It really nicely outlines how to classify text with Scikit-learn. To reduce redundancy, something that we all should work towards, I’m going to point you to that article to get up to speed on Scikit-learn and how it can apply to text. In this article, I’m going to start at the end of that one, where we’re working with Scikit-learn pipelines (see the sketch right after this list).
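For reference, the kind of pipeline that article ends with, and that this post starts from, looks something like this. The train_texts and train_labels names are placeholders for lists of raw review strings and their categories:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Chains vectorizing, tf-idf weighting, and the classifier into one estimator
text_clf = Pipeline([
  ('vect', CountVectorizer()),
  ('tfidf', TfidfTransformer()),
  ('clf', MultinomialNB()),
])

# text_clf.fit(train_texts, train_labels)
# predictions = text_clf.predict(test_texts)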

As always, you can say hi on twitter, or yell at me there for messing up as well if you want.

How many grams?

The first step to think about is how we want to represent the reviews in naive Bayes world, in this case as a bag of words / n-grams. In the other post, I simply used word counts since I wasn’t going into how to make the best model we could have. But besides word counts, we can also bump up the representation to include something called bigrams, which are two word combos. The idea behind that is that there’s information in two word combos that we aren’t using with just single words. With Scikit-learn, this is very simple to do, and they take care of it for you. Oh, and besides bigrams, we can ask for trigrams, 4-grams, etc., which we’ll do to see if that improves performance. Take a look at the wikipedia article for n-grams here.

For example, if a review mentions “coconut oil cream”, as in some sort of face cream (yup, I actually saw this as a mis-classified review), using only the single words we might get a classification of food, since we just see “coconut”, “oil”, and “cream”. But if we use bigrams as well as the unigrams, we’re also using “coconut oil” and “oil cream” as information. Now this might not get us all the way to a classification of beauty, but it could tip us over the edge.
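The one-parameter change in scikit-learn that gets you bigrams along with the unigrams looks like this:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps the single words and adds the two word combos
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(["coconut oil cream"])
# get_feature_names() on older scikit-learn versions
print(vectorizer.get_feature_names_out())
# ['coconut' 'coconut oil' 'cream' 'oil' 'oil cream']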

Continue reading

Practical Naive Bayes — Classification of Amazon Reviews

If you search around the internet looking for applying Naive Bayes classification on text, you’ll find a ton of articles that talk about the intuition behind the algorithm, maybe some slides from a lecture about the math and some notation behind it, and a bunch of articles I’m not going to link here that pretty much just paste some code and call it an explanation.

So I’m going to try to do a little more here, and hopefully write and explain enough to let you write a working Naive Bayes classifier yourself.

There are three sections here. First is setup, and what format I’m expecting your text to be in for the classification. Second, I’ll talk about how to run naive Bayes on your own, using slow Python data structures. Finally, we’ll use Python’s NLTK and its classifier so you can see how to use that, since, let’s be honest, it’s gonna be quicker. Note that you wouldn’t want to use either of these in production, so look for a follow up post about how you might go about doing that.

As always, twitter, and check out the full code on github.

Setup

The data for this is going to be from this UCSD Amazon review data set. I swear one of the biggest issues with running these algorithms on your own is finding a data set big and varied enough to get interesting results. Otherwise you’ll spend most of your time scraping and cleaning data, so that by the time you get to the ML part of the project, you’re sufficiently annoyed. So big thanks that this data already exists.

You’ll notice that this set has millions of reviews for products across 24 different classes. In order to keep the complexity down here (this is a tutorial post after all), I’m sticking with two classes, ones that are different enough from each other to show that classification works: we’ll be classifying baby reviews against tools and home improvement reviews.

Preprocessing

The first thing I want to do now, after unpacking the .gz file, is to get a train and test set that’s smaller than the full 160,792 baby reviews and 134,476 tool reviews. For purposes here, I’m going to use 1000 of each, with 800 used for training and 200 used for testing. The algorithms are able to support any number of training and test reviews, but for demonstration purposes, we’re making that number lower.

Check the github repo if you want to see the code, but I wrote a script that just takes the full file, picks 1000 random numbers, segments 800 into the training set, and 200 into the test set, and saves them to files with the names “train_CLASSNAME.json” and “test_CLASSNAME.json” where classname is either “baby” or “tool”.

Also, the files from that dataset are really nice, in that each line is already a Python dict literal. So to get them into a script, all you have to do is run “eval” on each line of the file if you want the dict object.
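As a small sketch of that loading step (using ast.literal_eval, which is just a safer stand-in for eval on these dict literals; the file name matches the naming scheme above):

import ast

def load_reviews(path):
  reviews = []
  with open(path) as f:
    for line in f:
      # each line in the dataset is a Python dict literal
      reviews.append(ast.literal_eval(line))
  return reviews

train_baby = load_reviews("train_baby.json")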

Features

There really wasn’t a good place to talk about this, so I’ll mention it here before getting into either the hand-rolled or the NLTK run of the algorithm. The features we’re going to use are simply the lowercased versions of all the words in the review. This means, in order to get a list of these words from the block of text, we remove punctuation, lowercase every word, split on spaces, and then remove words that are in the NLTK corpus of stopwords (basically boring words that don’t carry any information about class).

import string

from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS.add('')

def clean_review(review):
  # Strip punctuation, lowercase, split on spaces, and drop stopwords
  exclude = set(string.punctuation)
  review = ''.join(ch for ch in review if ch not in exclude)
  split_sentence = review.lower().split(" ")
  clean = [word for word in split_sentence if word not in STOP_WORDS]
  return clean

Realize that there are tons of different ways to do this, and ways to get more sophisticated that hopefully can get you better results! Things like stemming, which takes words down to their root word (wikipedia gives the example of “stems”, “stemmer”, “stemming”, “stemmed” as based on “stem”). You might also want to include n-grams, for an n larger than 1 in our case, as well.
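If you want to try stemming, NLTK ships a Porter stemmer that collapses those variants down for you:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# variants like "stemming" and "stemmed" collapse down to "stem"
print([stemmer.stem(word) for word in ["stems", "stemming", "stemmed"]])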

Basically, there’s tons of processing on the text that you could do here. But since I’m just talking about how Naive Bayes works, I’m sticking with simplicity. Maybe in the future I can get fancy and see how well I can do in classifying these reviews.

Ok, on to the actual algorithm.

Continue reading

Talkin’ ‘Bout Trucks, Beer, and Love in Country Songs — Analyzing Genius Lyrics

Trucks, beer, and love, all things that make country music go round. I’ve said before that country music is just pop music with a slide, and then lyrics about slightly different topics than what you’ll hear in hip hop or “normal” pop music on the radio.

In my continuing quest to validate my theory that all country songs can fit into one of four different topics, in this post, I go through lyrics to see which artists talk about trucks, beer, and love the most. In my first post on this topic, I talked about how to get song lyrics from genius and print them out on the command line.

The goal here, and what I’m going to walk you through, is how I stored info and lyrics for all the songs for the country artists, how I made sure that all the lyrics were unique, and then how I ran some stats on the songs. Another note before we go: a lot of data work is just janitorial. The actual code for getting “interesting” results is fairly simple. The key is to enjoy doing the janitor-style coding, and then you’ll be good.

If you’re interested in which country music people talk most about trucks, beer, alcohol, or small towns, skip to the end where I list out some stats. For the rest, here’s some code.

https://www.pinterest.com/pin/59180182578213991/

I wonder how they feel about beer trucks. I’m guessing they’d all be fans of them.

Step 1 — Save the Lyrics!

When doing anything with web scraping, the one thing to always, always keep in mind is that you want to hit the server as little as possible. With that in mind, what we’re going to do here is assume the inputs are names of artists. For each of those artists, find all of their songs, and then for each of those songs, grab the lyrics the way I did in the first post, and save them locally along with some meta information the API provides.

Now when I post the following code, don’t imagine that I knew what I wanted from the start. Everything in here was created iteratively. Here’s a list of the features of this piece of code that came out of that process.

Directory structure — Within the folder that contains the main .py file, there’s a folder named artists. And within that folder, when the code runs, a folder with the artist’s name is created (if it isn’t there already). And within that folder, there are two more folders, info and lyrics. When we run the code, I put the lyrics in /artists/artist_name/lyrics/Song Title.txt and the info from the API, containing information about the song, like annotations, title, and song API id so we can grab it again if need be, in the file /artists/artist_name/info/Song Title.txt. The key, again, being saving all the info given to avoid unnecessary requests.

Redundancy Checking — Along with making sure to save all the info given, if we run an artist for the second time, we don’t want to get lyrics that we already have. So once we have all the songs for that artist, I run a check to see if we have a file with the name of the song already, and that the file isn’t empty. If the file is there, we continue to the next song. (A rough sketch of both of these points follows below.)
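Here’s what those first two pieces might look like in isolation — the helper names here are mine, not from the actual script:

import os

def song_paths(artist_name, song_title):
  # Mirrors the layout above: artists/<artist>/lyrics/<title>.txt and artists/<artist>/info/<title>.txt
  lyrics_dir = os.path.join("artists", artist_name, "lyrics")
  info_dir = os.path.join("artists", artist_name, "info")
  for directory in (lyrics_dir, info_dir):
    os.makedirs(directory, exist_ok=True)  # create the folders if they aren't there already
  return (os.path.join(lyrics_dir, song_title + ".txt"),
          os.path.join(info_dir, song_title + ".txt"))

def already_scraped(lyrics_path):
  # Skip a song if we already have a non-empty lyrics file for it
  return os.path.exists(lyrics_path) and os.path.getsize(lyrics_path) > 0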

Lyric Error Checking — Ahh, Unicode. While great for allowing multitudes of different characters beyond the standard English alphabet and a few specialty characters, it’s not ideal when I’m trying to deal with simple song lyrics. And when saving the lyrics, I encountered more than a few random, unnecessary characters that made Python throw encoding errors. In a semi-janky rule-based solution (which isn’t great to use, see below), when I saw these errors being thrown, I would specifically replace the offending characters with the correct “normal” character. I assume there’s some library out there that would take care of all the encoding issues, but this worked for me. Also, on Genius’s end, it would be sweet if they, you know, checked for abnormal characters when lyrics were uploaded and didn’t have them in the first place. Also would be cool if they included the lyrics in the API.

def clean_lyrics(lyrics):
  lyrics = lyrics.replace(u"\u2019", "'") #right quotation mark
  lyrics = lyrics.replace(u"\u2018", "'") #left quotation mark
  lyrics = lyrics.replace(u"\u02bc", "'") #a with dots on top
  lyrics = lyrics.replace(u"\xe9", "e") #e with an accent
  lyrics = lyrics.replace(u"\xe8", "e") #e with an backwards accent
  lyrics = lyrics.replace(u"\xe0", "a") #a with an accent
  lyrics = lyrics.replace(u"\u2026", "...") #ellipsis apparently
  lyrics = lyrics.replace(u"\u2012", "-") #hyphen or dash
  lyrics = lyrics.replace(u"\u2013", "-") #other type of hyphen or dash
  lyrics = lyrics.replace(u"\u2014", "-") #other type of hyphen or dash
  lyrics = lyrics.replace(u"\u201c", '"') #left double quote
  lyrics = lyrics.replace(u"\u201d", '"') #right double quote
  lyrics = lyrics.replace(u"\u200b", ' ') #zero width space ?
  lyrics = lyrics.replace(u"\x92", "'") #different quote
  lyrics = lyrics.replace(u"\x91", "'") #still different quote
  lyrics = lyrics.replace(u"\xf1", "n") #n with tilde!
  lyrics = lyrics.replace(u"\xed", "i") #i with accent
  lyrics = lyrics.replace(u"\xe1", "a") #a with accent
  lyrics = lyrics.replace(u"\xea", "e") #e with circumflex
  lyrics = lyrics.replace(u"\xf3", "o") #o with accent
  lyrics = lyrics.replace(u"\xb4", "") #just an accent, so remove
  lyrics = lyrics.replace(u"\xeb", "e") #e with dots on top
  lyrics = lyrics.replace(u"\xe4", "a") #a with dots on top
  lyrics = lyrics.replace(u"\xe7", "c") #c with squigly bottom
  return lyrics

Check out most of the main function below. If you’re looking for the actual full file, check out this gist. It’s easier to post that on Github than to format the entire thing here.

Continue reading

Predicting PGA Tour Scoring Average from Statistics Using Linear Regression

First off, I admit, that’s probably the most boring title for a blog post ever. It gets a negative value on the clickbait scale that is generally unseen in the modern, “every click equals dollars” era that we live in. On the other hand, it tells you exactly what this article is about — predicting scoring average using stats.

In this article, I’ll go through getting the data from the database, cleaning that data for use, and then running a linear regression in order to generate coefficients for each of the stats to generate scoring average predictions. Oh, and some analysis and commentary at the end!

Shameless shoutout to my other blog, Golf on the Mind. Check it out and subscribe to the newsletter / twitter / instagram if you’re into golf at all. Or ignore, and keep reading for some code!


Here’s a pic of a golf course to get you in the mood.

Getting the data

Last time, if you remember, I spent all this effort taking the csv stat files and putting the information into a database. Start there if you haven’t read that post yet. It’ll show how I grabbed the stats and formatted them.

Now that you’re back in the present, we need to create a query that gets the stats for the players for a specific year. An example row in a CSV file of the data would be something like:

player_id, player_name, stat_1_value, stat_2_value, … , stat_n_value

for stats 1 to n, where both n (the number of stats) and which stats they are (driving distance, greens in regulation, etc.) vary depending on the inputs.

Now let me say, I am not an expert in writing sql queries. And since people on the internet loooove to dole out hate in comments sections, I’m just going to say that there’s probably a better way of writing this query. Feel free to let me know and I can throw an edit in here, but this query works just fine.

select players.id,
  players.name,
  max(case when stat_lines.stat_id=330 then stat_lines.raw else null end) as putting_average,
  max(case when stat_lines.stat_id=157 then stat_lines.raw else null end) as driving_distance,
  max(case when stat_lines.stat_id=250 then stat_lines.raw else null end) as gir,
  max(case when stat_lines.stat_id=156 then stat_lines.raw else null end) as driving_accuracy,
  max(case when stat_lines.stat_id=382 then stat_lines.raw else null end) as scoring_average
from players
  join stat_lines on stat_lines.player_id = players.id
  join stats on stat_lines.stat_id=stats.id
where stat_lines.year=2012 and (stats.id=157 or stats.id=330 or stats.id=382 or stats.id=250 or stats.id=156) and stat_lines.raw is not null
group by players.name,players.id;

High level overview time! We’re selecting player id and player name, along with their stats for putting average, driving distance, greens in regulation, driving accuracy and scoring average for the year 2012. In order to get the right stats, we need to know the stat ids for each of those stats.

One more thing. This query is funky, and I probably could have designed the schema differently to make this prettier. For example, I could have just gone with one table, stat_lines, with fields for player_name and stat_name (along with all the current fields), and then the sql would be very simple. But there are other applications to keep in mind. What if you wanted to display all stats by a player? Or all of a player’s stats for a certain year? With the way I have the schema set up, those queries are simple and logical. For this specific case, I’ll deal with the complexity.

Loading the Data

The query above is great, but it’s not going to cut it if I have to hard-code the year and the stat ids in that string every time I run the script. Gotta be dynamic here.
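Here’s a rough sketch of one way you might build that query dynamically — the stat id mapping comes from the query above, but the build_stats_query helper and the %s placeholder for a psycopg2-style cursor are my assumptions, not the post’s actual code:

# Map readable column names to PGA Tour stat ids (same ids as in the query above)
STATS = {
  'putting_average': 330,
  'driving_distance': 157,
  'gir': 250,
  'driving_accuracy': 156,
  'scoring_average': 382,
}

def build_stats_query(stats):
  # One max(case ...) column per stat we want back
  cases = ",\n  ".join(
    "max(case when stat_lines.stat_id={sid} then stat_lines.raw else null end) as {name}".format(sid=sid, name=name)
    for name, sid in stats.items()
  )
  id_filter = " or ".join("stats.id={sid}".format(sid=sid) for sid in stats.values())
  return """select players.id, players.name,
  {cases}
from players
  join stat_lines on stat_lines.player_id = players.id
  join stats on stat_lines.stat_id = stats.id
where stat_lines.year = %s and ({id_filter}) and stat_lines.raw is not null
group by players.name, players.id;""".format(cases=cases, id_filter=id_filter)

# usage with a psycopg2-style cursor:
# cursor.execute(build_stats_query(STATS), (2012,))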

Continue reading