NLTK | Big-Ish Data

Last time, I went through some basics of how the naive Bayes algorithm works and the logic behind it, and implemented the classifier myself as well as with NLTK. That’s great and all, and hopefully people reading it got a better understanding of what was going on, and possibly of how to play along with classification for their own text documents.

But if you’re looking to train and actually deploy a model, say, for a website where people can copy and paste reviews from Amazon and see how our classifier performs, you’re going to want to use a library like Scikit-Learn. So in this post, I’ll walk through training a Scikit-Learn model and testing various classifiers and parameters to see how we do. By the end, we’ll have an initial version 1 of an Amazon review classifier that we can use in a production setting.

Some notes before we get going:

For a lot of the testing, I only use 5 or 10 of the full 26 classes that are in the dataset.
Keep in mind that what works here might not work for other data sets. We’re specifically looking at Amazon product reviews. For a different set of texts (you’ll also see the word corpus being thrown around), a different classifier or parameter set might work better.
The resulting classifier we come up with is, well, really really basic, and probably what we’d have guessed would perform best at the outset. All the time and effort that goes into checking all the combinations
I’m going to mention here this good post that popped up when I was looking around for other people who wrote about this. It nicely outlines how to classify text with Scikit-learn. To reduce redundancy, something that we all should work towards, I’m going to point you to that article to get up to speed on Scikit-learn and how it can apply to text. In this article, I’m going to start where that one ends, working with Scikit-learn pipelines.

As always, you can say hi on Twitter, or yell at me there for messing up if you want.

How many grams?

The first step is to think about how we want to represent the reviews in naive Bayes world: in this case, as a bag of words / n-grams. In the other post, I simply used word counts, since I wasn’t going into how to make the best model we could. But besides word counts, we can also bump up the representation to include bigrams, which are two-word combinations. The idea is that there’s information in two-word combos that we aren’t using with just single words. Scikit-learn makes this very simple and takes care of it for you. Oh, and besides bigrams, we can ask for trigrams, four-grams, etc., which we’ll do to see if that improves performance. Take a look at the wikipedia article for n-grams here.

For example, if a review mentions “coconut oil cream”, as in some sort of face cream (yup, I actually saw this as a misclassified review), using only the single words might give us a classification of food, since we just see “coconut”, “oil”, and “cream”. But if we use bigrams as well as unigrams, we’re also using “coconut oil” and “oil cream” as information. Now this might not get us all the way to a classification of beauty, but it could tip us over the edge.
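To make the unigram-plus-bigram idea concrete, here is a minimal pure-Python sketch of the features the “coconut oil cream” example would produce. It is only an illustration: the helper names `ngrams` and `unigrams_and_bigrams` are made up for this example, and this is not how Scikit-learn builds its features internally.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(text):
    """Lowercase, split on whitespace, and return unigrams followed by bigrams."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


features = unigrams_and_bigrams("coconut oil cream")
print(features)
# ['coconut', 'oil', 'cream', 'coconut oil', 'oil cream']
```

With only unigrams the classifier sees the first three entries; adding bigrams gives it the last two as extra evidence.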


If you search around the internet looking for applying Naive Bayes classification on text, you’ll find a ton of articles that talk about the intuition behind the algorithm, maybe some slides from a lecture about the math and some notation behind it, and a bunch of articles I’m not going to link here that pretty much just paste some code and call it an explanation.

So I’m going to try to do a little more here and, hopefully, write and explain enough to let you write a working Naive Bayes classifier yourself.

There are three sections here. First is setup, and what format I’m expecting your text to be in for the classification. Second, I’ll talk about how to run naive Bayes on your own, using slow Python data structures. Finally, we’ll use Python’s NLTK and its classifier so you can see how to use that, since, let’s be honest, it’s gonna be quicker. Note that you wouldn’t want to use either of these in production, so look for a follow-up post about how you might go about doing that.

As always, say hi on Twitter, and check out the full code on GitHub.

Setup

The data for this is going to come from this UCSD Amazon review data set. I swear one of the biggest issues with running these algorithms on your own is finding a data set big and varied enough to get interesting results. Otherwise you’ll spend so much of your time scraping and cleaning data that by the time you get to the ML part of the project, you’re sufficiently annoyed. So big thanks that this data already exists.

You’ll notice that this set has millions of reviews for products across 24 different classes. To keep the complexity down here (this is a tutorial post after all), I’m sticking with two classes that are different enough from each other to show that classification works: we’ll be classifying baby reviews against tools and home improvement reviews.

Preprocessing

First thing I want to do now, after unpacking the .gz file, is to get a train and test set that’s smaller than the full 160,792 baby reviews and 134,476 tool reviews. For our purposes here, I’m going to use 1000 of each, with 800 used for training and 200 used for testing. The algorithms can handle any number of training and test reviews, but for demonstration purposes we’re keeping that number low.

Check the GitHub repo if you want to see the code, but I wrote a script that just takes the full file, picks 1000 reviews at random, puts 800 into the training set and 200 into the test set, and saves them to files named “train_CLASSNAME.json” and “test_CLASSNAME.json”, where CLASSNAME is either “baby” or “tool”.
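The actual script lives in the repo, so what follows is only a rough sketch of the sampling step as described, not the author’s code. The function names, the `seed` parameter, and the in-memory stand-in data are all assumptions made for illustration.

```python
import random


def split_reviews(lines, sample_size=1000, train_size=800, seed=None):
    """Randomly sample `sample_size` lines and split them into train and test lists."""
    rng = random.Random(seed)
    sample = rng.sample(lines, sample_size)  # sampling without replacement
    return sample[:train_size], sample[train_size:]


def save_lines(path, lines):
    """Write one review per line, e.g. to train_baby.json or test_baby.json."""
    with open(path, "w") as handle:
        for line in lines:
            handle.write(line.rstrip("\n") + "\n")


# Stand-in for the lines of the unpacked review file.
fake_lines = ["review %d" % i for i in range(5000)]
train, test = split_reviews(fake_lines, seed=0)
print(len(train), len(test))
# 800 200
```

Sampling without replacement keeps the training and test sets disjoint, which is what you want when the test set is supposed to measure performance on unseen reviews.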

Also, the files from that dataset are really nice, in that each line is already a Python object. So to get them into a script, all you have to do is run “eval” on each line of the file to get the dict object.
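If you’d rather not run “eval” on a downloaded file (it executes whatever code the line contains), the standard library’s `ast.literal_eval` is one alternative that only accepts literals such as dicts, strings, and numbers. This is a small sketch, not the code from the repo; the sample line is made up and its field names are assumptions about the data set’s format.

```python
import ast


def parse_review(line):
    """Parse one line holding a Python dict literal into a dict."""
    return ast.literal_eval(line.strip())


# Made-up line in the style of the data set; the field names are assumptions.
sample_line = "{'reviewText': 'Great stroller, easy to fold.', 'overall': 5.0}\n"
review = parse_review(sample_line)
print(review["reviewText"])
# Great stroller, easy to fold.
```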

Features

There really wasn’t a good place to talk about this, so I’ll mention it here before getting into either the do-it-yourself or the NLTK run of the algorithm. The features we’re going to use are simply the lowercased versions of all the words in the review. This means that to get a list of these words from the block of text, we remove punctuation, lowercase every word, split on spaces, and then remove words that are in the NLTK corpus of stopwords (basically boring words that don’t carry any information about class).

import string
from nltk.corpus import stopwords

# NLTK's English stop words, plus the empty string left over from double spaces
STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS.add('')

def clean_review(review):
  # strip punctuation, lowercase, split on spaces, then drop stop words
  exclude = set(string.punctuation)
  review = ''.join(ch for ch in review if ch not in exclude)
  split_sentence = review.lower().split(" ")
  clean = [word for word in split_sentence if word not in STOP_WORDS]
  return clean
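To see what comes out of these steps without downloading the NLTK corpus, here is a self-contained variant of the function above. The tiny stop word set is a hand-picked stand-in, not NLTK’s real list, and `clean_review_demo` is a name made up for this sketch.

```python
import string

# Tiny hand-picked stand-in for NLTK's English stop word list (an assumption,
# used only so the example runs without downloading the corpus).
DEMO_STOP_WORDS = {"this", "is", "and", "my", "it", "the", "a", ""}


def clean_review_demo(review, stop_words=DEMO_STOP_WORDS):
    """Same steps as clean_review above, with the stop word set passed in."""
    exclude = set(string.punctuation)
    review = "".join(ch for ch in review if ch not in exclude)
    words = review.lower().split(" ")
    return [word for word in words if word not in stop_words]


print(clean_review_demo("This crib is great, and my baby LOVES it!"))
# ['crib', 'great', 'baby', 'loves']
```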

Realize that there are tons of different ways to do this, and ways to get more sophisticated that can hopefully get you better results! One is stemming, which takes words down to their root word (wikipedia gives the example of “stems”, “stemmer”, “stemming”, “stemmed” as based on “stem”). You might also want to include n-grams for an n larger than 1.

Basically, there’s tons of processing on the text that you could do here. But since I’m just talking about how Naive Bayes works, I’m sticking with simplicity. Maybe in the future I can get fancy and see how well I can do in classifying these reviews.

Ok, on to the actual algorithm.


Big-Ish Data

Writings about data

Tag Archives: NLTK

Classifying Amazon Reviews with Scikit-Learn — More Data is Better Turns Out

Practical Naive Bayes — Classification of Amazon Reviews
