Category Archives: Machine Learning

Classifying Amazon Reviews with Scikit-Learn — More Data is Better Turns Out

Last time, I went through some basics of how the naive Bayes algorithm works and the logic behind it, and implemented the classifier myself as well as with NLTK. That’s great and all, and hopefully people reading it got a better understanding of what was going on, and possibly how to play along with classification for their own text documents.

But if you’re looking to train and actually deploy a model, say, a website where people can copy and paste reviews from Amazon and see how our classifier performs, you’re going to want to use a library like Scikit-Learn. So in this post, I’ll walk through training a Scikit-Learn model and testing various classifiers and parameters to see how we do. At the end, we’ll have an initial version 1 of an Amazon review classifier that we can use in a production setting.

Some notes before we get going:

  • For a lot of the testing, I only use 5 or 10 of the full 26 classes that are in the dataset.
  • Keep in mind that what works here might not be the same for other data sets. We’re specifically looking at Amazon product reviews. For a different set of texts (you’ll also see the word corpus being thrown around), a different classifier or parameter set might work better.
  • The resulting classifier we come up with is, well, really really basic, and probably what we’d have guessed would perform best at the outset. All the time and effort that goes into checking all the combinations
  • I’m going to mention here this good post that popped up when I was looking around for other people who wrote about this. It really nicely outlines how to classify text with Scikit-learn. To reduce redundancy, something that we all should work towards, I’m going to point you to that article to get up to speed on Scikit-learn and how it can apply to text. In this article, I’m going to start where that article ends, where we’re working with Scikit-learn pipelines.

As always, you can say hi on twitter, or yell at me there for messing up as well if you want.

How many grams?

The first step is to think about how we want to represent the reviews in naive Bayes world, in this case, a bag of words / n-grams. In the other post, I simply used word counts since I wasn’t going into how to make the best model we could have. But besides word counts, we can also bump up the representation to include something called a bigram, which is a two-word combo. The idea is that there’s information in two-word combos that we aren’t using with just single words. With Scikit-learn, this is very simple to do, and they take care of it for you. Oh, and besides bigrams, we can say we want trigrams, fourgrams, etc., which we’ll do, to see if that improves performance. Take a look at the wikipedia article for n-grams here.

For example, say a review mentions “coconut oil cream”, as in some sort of face cream (yup, I actually saw this as a mis-classified review). Simply using the words, we might get a classification of food, since we just see “coconut”, “oil” and “cream”. But if we use bigrams as well as the unigrams, we’re also using “coconut oil” and “oil cream” as information. Now this might not get us all the way to a classification of beauty, but it could tip us over the edge.
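To make the coconut example concrete, here’s a rough sketch of pulling unigrams and bigrams out of a phrase in plain Python. The `ngrams` and `bag_of_ngrams` helpers are names I made up for illustration, not anything from Scikit-learn; in Scikit-learn itself the same idea is controlled by the `ngram_range` parameter on `CountVectorizer`.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Collect all 1-grams up through max_n-grams for a piece of text."""
    words = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(words, n))
    return features


features = bag_of_ngrams("coconut oil cream")
print(features)
# ['coconut', 'oil', 'cream', 'coconut oil', 'oil cream']
```

Passing `max_n=3` would also add the trigram “coconut oil cream” to the list.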

Practical Naive Bayes — Classification of Amazon Reviews

If you search around the internet looking for applying Naive Bayes classification on text, you’ll find a ton of articles that talk about the intuition behind the algorithm, maybe some slides from a lecture about the math and some notation behind it, and a bunch of articles I’m not going to link here that pretty much just paste some code and call it an explanation.

So I’m going to try to do a little more here: hopefully, by writing and explaining enough, I can let you write a working Naive Bayes classifier yourself.

There are three sections here. First is setup, and what format I’m expecting your text to be in for the classification. Second, I’ll talk about how to run naive Bayes on your own, using slow Python data structures. Finally, we’ll use Python’s NLTK and its classifier so you can see how to use that, since, let’s be honest, it’s gonna be quicker. Note that you wouldn’t want to use either of these in production, so look for a follow-up post about how you might go about doing that.

As always, twitter, and check out the full code on github.


The data here is going to come from this UCSD Amazon review data set. I swear one of the biggest issues with running these algorithms on your own is finding a data set big and varied enough to get interesting results. Otherwise you’ll spend so much of your time scraping and cleaning data that by the time you get to the ML part of the project, you’re sufficiently annoyed. So big thanks that this data already exists.

You’ll notice that this set has millions of reviews for products across 24 different classes. In order to keep the complexity down here (this is a tutorial post after all), I’m sticking with two classes that are different enough from each other to show that classification works: we’ll be classifying baby reviews against tools and home improvement reviews.


First thing I want to do now, after unpacking the .gz file, is to get a train and test set that’s smaller than the 160,792 and 134,476 of baby and tool reviews respectively. For purposes here, I’m going to use 1000 of each, with 800 used for training, and 200 used for testing. The algorithms are able to support any number of training and test reviews, but for demonstration purposes, we’re making that number lower.

Check the github repo if you want to see the code, but I wrote a script that just takes the full file, picks 1000 random numbers, segments 800 into the training set, and 200 into the test set, and saves them to files with the names “train_CLASSNAME.json” and “test_CLASSNAME.json” where classname is either “baby” or “tool”.
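If you don’t want to dig through the repo, the splitting step looks roughly like the sketch below. This is not the actual script from github, just a minimal version under the assumption that the input file has one review per line; the function names and the `seed` argument are mine.

```python
import random


def split_reviews(lines, sample_size=1000, train_size=800, seed=None):
    """Pick sample_size random lines, then split them into train and test lists."""
    rng = random.Random(seed)
    sample = rng.sample(lines, sample_size)  # sampling without replacement
    return sample[:train_size], sample[train_size:]


def write_split(input_path, class_name, **kwargs):
    """Read one review per line and save train_CLASSNAME.json / test_CLASSNAME.json."""
    with open(input_path) as f:
        lines = f.readlines()
    train, test = split_reviews(lines, **kwargs)
    with open("train_%s.json" % class_name, "w") as f:
        f.writelines(train)
    with open("test_%s.json" % class_name, "w") as f:
        f.writelines(test)


# Small demonstration on fake lines so this runs without the real file.
fake_lines = ["review %d\n" % i for i in range(5000)]
train, test = split_reviews(fake_lines, seed=42)
print(len(train), len(test))
# 800 200
```

With the real data you would call something like `write_split(path_to_unpacked_file, "baby")`, where the path is whatever you named the unpacked file.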

Also, the files from that dataset are really nice, in that they’re already python objects. So to get them into a script, all you have to do is run “eval” on each line of the file if you want the dict object.
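One hedge on the “eval” approach: `eval` will run whatever code is on the line, so for files you didn’t create yourself, `ast.literal_eval` from the standard library is a safer swap that only accepts plain literals like dicts, strings and numbers. Here’s a small sketch; the two sample lines and their field values are made up for the demonstration.

```python
import ast


def load_reviews(lines):
    """Turn each non-empty line (a Python dict literal) into a dict."""
    return [ast.literal_eval(line) for line in lines if line.strip()]


# Made-up lines in the same "one dict per line" shape.
sample_lines = [
    "{'reviewerID': 'A1', 'reviewText': 'Great crib, easy to assemble.', 'overall': 5.0}\n",
    "{'reviewerID': 'A2', 'reviewText': 'Drill bit snapped on first use.', 'overall': 1.0}\n",
]
reviews = load_reviews(sample_lines)
print(reviews[0]['reviewText'])
# Great crib, easy to assemble.
```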


There really wasn’t a good place to talk about this, so I’ll mention it here before getting into either the do-it-yourself or the NLTK version of the algorithm. The features we’re going to use are simply the lowercased versions of all the words in the review. This means, in order to get a list of these words from the block of text, we remove punctuation, lowercase every word, split on spaces, and then remove words that are in the NLTK corpus of stopwords (basically boring words that don’t have any information about class).

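import string
from nltk.corpus import stopwords
STOP_WORDS = set(stopwords.words('english'))
def clean_review(review):
  # strip punctuation, lowercase, split on whitespace, then drop stopwords
  exclude = set(string.punctuation)
  review = ''.join(ch for ch in review if ch not in exclude)
  # split() with no argument avoids empty strings when there are repeated spaces
  split_sentence = review.lower().split()
  clean = [word for word in split_sentence if word not in STOP_WORDS]
  return clean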

Realize here that there are tons of different ways to do this, and ways to get more sophisticated that hopefully can get you better results! Things like stemming, which takes words down to their root word (wikipedia gives the example of “stems”, “stemmer”, “stemming”, “stemmed” as based on “stem”). You might want to include n-grams, for an n larger than 1 in our case as well.

Basically, there’s tons of processing on the text that you could do here. But since I’m just talking about how Naive Bayes works, I’m sticking with simplicity. Maybe in the future I can get fancy and see how well I can do in classifying these reviews.

Ok, on to the actual algorithm.

Predicting PGA Tour Scoring Average from Statistics Using Linear Regression

First off, I admit, that’s probably the most boring title for a blog post ever. It gets a negative value on the clickbait scale that is generally unseen in the modern, “every click equals dollars” era that we live in. On the other hand, it tells you exactly what this article is about — predicting scoring average using stats.

In this article, I’ll go through getting the data from the database, cleaning that data for use, and then running a linear regression in order to generate coefficients for each of the stats to generate scoring average predictions. Oh, and some analysis and commentary at the end!

Shameless shoutout to my other blog, Golf on the Mind. Check it out and subscribe to the newsletter / twitter / instagram if you’re into golf at all. Or ignore, and keep reading for some code!

Here's a pic of a golf course to get you in the mood.

Getting the data

Last time if you remember, I spent all this effort taking the csv stat files, and putting the information into a database. Start there if you haven’t read that post yet. It’ll show how I grabbed the stats and formatted them.

Now that you’re back in the present, we need to create a query that gets the stats for the players for a specific year. An example row in a CSV file of the data would be something like:

player_id, player_name, stat_1_value, stat_2_value, … , stat_n_value

for stats 1 to n, where n (the number of stats) and the stats themselves (driving distance, greens in regulation, etc.) vary depending on inputs.

Now let me say, I am not an expert in writing sql queries. And since people on the internet loooove to dole out hate in comments sections, I’m just going to say that there’s probably a better way of writing this query. Feel free to let me know and I can throw an edit in here, but this query works just fine.

  max(case when stat_lines.stat_id=330 then stat_lines.raw else null end) as putting_average,
  max(case when stat_lines.stat_id=157 then stat_lines.raw else null end) as driving_distance,
  max(case when stat_lines.stat_id=250 then stat_lines.raw else null end) as gir,
  max(case when stat_lines.stat_id=156 then stat_lines.raw else null end) as driving_accuracy,
  max(case when stat_lines.stat_id=382 then stat_lines.raw else null end) as scoring_average
from players
  join stat_lines on stat_lines.player_id =
  join stats on
where stat_lines.year=2012 and ( or or or or and stat_lines.raw is not null
group by,;

High level overview time! We’re selecting player id, and player name, along with their stats for putting average, driving distance, greens in regulation, driving accuracy and scoring average for the year 2012. In order to get the right stats, we need to know the stat id for the stats.

One more thing. This query is funky, and I probably could have designed the schema differently to make this prettier. For example, I could have just gone with one table, stat_lines, with fields for player_name and stat_name (along with all the current fields) and then the sql would be very simple. But there are other applications to keep in mind. What if you wanted to display all stats by a player? Or all of a player’s stats for a certain year? With the way I have the schema set up, those queries are simple and logical. For this specific case, I’ll deal with the complexity.

Loading the Data

That query above is great, but it’s not going to cut it if I have to specify what the year, and the stat ids in that string every time I run the script. Gotta be dynamic here.
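As a rough idea of what “dynamic” could look like, here’s a sketch that builds the pivot columns from a dict of stat ids and aliases. It is not the code from the full post. The `players.id` and `players.name` columns, the join condition, and the `%s` placeholder style for the year are all assumptions on my part, and I filter on `stat_lines.stat_id` directly instead of joining the stats table to keep it short.

```python
def build_stats_query(stat_aliases):
    """Build the pivot query for a dict of {stat_id: column_alias}.

    Stat ids are forced to int and aliases are checked before being placed in
    the SQL text; the year is left as a placeholder for the database driver.
    """
    columns = []
    for stat_id, alias in stat_aliases.items():
        if not alias.isidentifier():
            raise ValueError("bad column alias: %r" % alias)
        columns.append(
            "  max(case when stat_lines.stat_id=%d then stat_lines.raw else null end) as %s"
            % (int(stat_id), alias)
        )
    id_list = ", ".join(str(int(stat_id)) for stat_id in stat_aliases)
    return (
        "select players.id, players.name,\n"  # assumed column names
        + ",\n".join(columns) + "\n"
        + "from players\n"
        + "  join stat_lines on stat_lines.player_id = players.id\n"  # assumed join
        + "where stat_lines.year = %s\n"  # placeholder, filled in by the driver
        + "  and stat_lines.stat_id in (" + id_list + ")\n"
        + "  and stat_lines.raw is not null\n"
        + "group by players.id, players.name;"
    )


query = build_stats_query({
    330: "putting_average",
    157: "driving_distance",
    382: "scoring_average",
})
print(query)
```

The year would then be passed separately when executing, something like `cursor.execute(query, (2012,))` with a driver that uses `%s` placeholders.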

Defense Matters in the NBA, Apparently

Interesting article here from Arxiv titled Finding Common Characteristics Among NBA Playoff Teams: A Machine Learning Approach.

Everyone should skim the paper because it’ll show you that these academic papers aren’t overwhelming, and a lot of times, they’re good at showing steps that go into tackling a machine learning problem. This paper, for example, goes over the basics of decision trees, pruning decision trees, as well as more high powered decision trees. Great progression.

As for what they found, apparently opponent’s stats are the most important determinants in whether a team makes the playoffs, with opponent field goal percentage and opponent points per game leading the way. Play good defense, and it doesn’t matter how many points you score yourselves.

Only comment is that if you’re sitting out there thinking “well NBA teams don’t play defense anyway blah blah blah” you’re wrong. Defense is probably the most impressive aspect of a game to watch, especially in the playoffs currently going on.

Importance of variables is a really interesting topic in machine learning, sports especially. Knowing which variables matter can help someone in charge figure out who to draft (Moneyball style) or possibly what aspects to focus on during games or practice. Considering I have a bunch of PGA Tour data, maybe figuring out which stats are important for golfers should be something to focus on here…

Funny, they say that their “dataset was quite large”, so they only provided a sample of the data. 30 teams * 15 years * 44 variables = 19,800 data points. Too big to fit on a table in a paper sure, but don’t think I agree it’s quite large. More like big-ish. Fits in perfectly here.

Weekly Fantasy Golf Results and a System of Linear Equations

I had an idea recently about how to use the “wisdom of the crowds” in forecasting performance for weekly fantasy golf. I see the benefits of crowdsourcing all the time working with prediction markets at Inkling Markets, so I figured I could leverage that to get an edge when choosing my lineups. I’ll get into the details of how I’m doing that in a later post since that’s worthy of a post on its own, but I found a subproblem that I’d have to solve in order to make it work.

While some of the sites offer player salaries in an easily digestible csv format, they’re a little stingy when it comes to results. They only keep contest results for the last 30 days, and the only downloadable results they offer are from contests in the past week, which just show point totals for an entire lineup, not the point totals for individual golfers. And I need those players’ points for testing my forecasting algorithm, as well as for making the actual forecasts. One way to figure this out is to scrape the hole by hole data and apply the site’s scoring algorithm to each of those holes, but I’ve found a better, simpler, and much more elegant solution.

I can model the results csv file as a system of linear equations, and by converting the different users’ lineups into a matrix and a vector representing those equations, numpy can solve for the point totals that each golfer earned during the tournament.

Here’s an example row from that file, with the person’s username hidden.

263,85826104, "UNAME (1/100)", 0, 430.50, "(G) Ryan Palmer ,(G) Pat Perez ,(G) Kevin Kisner ,(G) Ben Martin ,(G) Shawn Stefani ,(G) John Peterson "

In order to solve, we need to create two data structures. The first is a num_lineups by num_players matrix of coefficients, where the value is a 1 if the player was used in that lineup, and 0 if he wasn’t. The second is an array of total points scored by that lineup.

The idea is that if we had an array of the points the players scored over the course of the tournament, we should be able to take the dot product of that and the corresponding row in the coefficient matrix to generate the point total of that lineup.
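Here’s a tiny worked version of that idea in plain Python, going in the “forward” direction: if we already knew each golfer’s points, the dot product with each coefficient row gives the lineup totals. The three golfers, their point values and the two-golfer lineups are all made up for the illustration, and the helper names are mine.

```python
def build_coefficient_row(lineup, player_list):
    """Return a 0/1 list marking which players in player_list are in the lineup."""
    return [1 if player in lineup else 0 for player in player_list]


def dot(row, values):
    """Dot product of two equal-length lists."""
    return sum(r * v for r, v in zip(row, values))


# Hypothetical three-golfer tournament with two-golfer lineups.
player_list = ["Golfer A", "Golfer B", "Golfer C"]
player_points = [50.0, 30.5, 12.0]  # the unknowns we would normally solve for
lineups = [
    ["Golfer A", "Golfer B"],
    ["Golfer B", "Golfer C"],
    ["Golfer A", "Golfer C"],
]

coefficient_matrix = [build_coefficient_row(lineup, player_list) for lineup in lineups]
lineup_points = [dot(row, player_points) for row in coefficient_matrix]
print(coefficient_matrix)
# [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
print(lineup_points)
# [80.5, 42.5, 62.0]
```

The real problem runs this backwards: the results file gives us the coefficient matrix and the lineup totals, and we solve for the per-golfer points.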

Here’s the code to generate the coefficient matrix and the point array:

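import csv
points_label = "Points"
lineup_label = "Lineup"
player_coefficients = []
lineup_points = []
# player_list (every golfer's name) and player_count (its length) are defined elsewhere in the full script
with open('outcome.csv', newline='') as csvfile:
  rows = csv.reader(csvfile)
  headers = next(rows)  # the first row holds the column names
  points_index = headers.index(points_label)
  lineup_index = headers.index(lineup_label)
  for row in rows:
    points = float(row[points_index])
    # "(G) Ryan Palmer ,(G) Pat Perez" becomes ["Ryan Palmer", "Pat Perez"]
    names = [name.strip() for name in row[lineup_index].replace('(G)','').split(',')]
    lineup_players = [0] * player_count  # 1 if the player is in this lineup, 0 if not
    for name in names:
      lineup_players[player_list.index(name)] = 1
    # one coefficient row and one point total per lineup
    player_coefficients.append(lineup_players)
    lineup_points.append(points)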

From here, all we need to do is convert those arrays into numpy arrays, and run numpy’s linear algebra least squares algorithm to get the solution array!

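import numpy as np
coefficient_matrix = np.array(player_coefficients)
point_array = np.array(lineup_points)
# lstsq returns (solution, residuals, rank, singular values); we only need the solution
solution = np.linalg.lstsq(coefficient_matrix, point_array, rcond=None)
player_points = list(solution[0])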

Initially, I tried to use numpy’s solve function, but by looking at the docs I realized that solve deals with square coefficient matrices (something that I still remember from all those math classes). The lstsq function is used to get approximate results from rectangular matrices, as is the case here.

Printing out the solution yields the following for the top and bottom 10:

Chris Kirk: 126.0
Jason Bohn: 105.5
Brandt Snedeker: 102.0
Jordan Spieth: 101.5
Kevin Kisner: 99.0
George McNeill: 96.5
Pat Perez: 95.0
Adam Hadwin: 92.0
Ian Poulter: 87.5
Brian Harman: 87.0
Kenny Perry: 18.0
Jason Kokrak: 17.5
Jonas Blixt: 16.5
Corey Pavin: 16.0
Bo Van Pelt: 15.0
David Toms: 14.0
Brian Davis: 11.5
Tom Watson: 4.61852778244e-13
Tom Purtzer: 3.87245790989e-13
Scott Stallings: 2.84217094304e-13

Chris Kirk won so him having the highest point total makes sense, and the three guys at the bottom with zero points all withdrew so they should be at 0 points. Unfortunate for the guys who didn’t take them out of their lineups, but they gotta pay a little more attention!

Only issue with the final result is that I only end up with point totals from 117 players, when 133 teed it up at the beginning of the tournament. That means that we’re missing point totals from some of the guys. That being said, I’m going to assume that those players weren’t picked because they weren’t likely to play well, so hopefully those 117 offer a good representation of point totals. Also, it could easily be that DraftKings in this case only listed 117 players to choose from. I’ll need to investigate this week.

In the end, it took about 3 hours to write the code and write this post. It’s fun little problems like this that really remind you that programming is fun and has value. Being able to create a technical solution to something you’re interested in is probably the best part about being a programmer. Check out the entire script in this gist.

Follow on twitter.