Classifying Amazon Reviews with Scikit-Learn — More Data Is Better, It Turns Out

Last time, I went through the basics of how the naive Bayes algorithm works and the logic behind it, and implemented the classifier myself as well as with NLTK. That’s great and all, and hopefully people reading it got a better understanding of what was going on, and possibly how to play along with classification for their own text documents.

But if you’re looking to train and actually deploy a model, say for a website where people can copy and paste reviews from Amazon and see how our classifier performs, you’re going to want to use a library like Scikit-learn. So with this post, I’ll walk through training a Scikit-learn model and testing various classifiers and parameters to see how we do, and at the end we’ll have an initial, version 1 Amazon review classifier that we can use in a production setting.

Some notes before we get going:

  • For a lot of the testing, I only use 5 or 10 of the full 26 classes that are in the dataset.
  • Keep in mind that what works here might not be the same for other data sets. We’re specifically looking at Amazon product reviews. For a different set of texts (you’ll also see the word corpus thrown around), a different classifier or set of parameters might work better.
  • The resulting classifier we come up with is, well, really, really basic, and probably what we’d have guessed would perform best at the outset. All the time and effort that goes into checking the combinations mostly ends up confirming that guess.
  • I’m going to mention here this good post that popped up when I was looking around for other people who have written about this. It really nicely outlines how to classify text with Scikit-learn. To reduce redundancy, something we should all work towards, I’m going to point you to that article to get up to speed on Scikit-learn and how it applies to text. In this article, I’m going to pick up where that one leaves off, working with Scikit-learn pipelines.

As always, you can say hi on twitter, or yell at me there for messing up as well if you want.

How many grams?

The first step is to think about how we want to represent the reviews in naive Bayes world, in this case as a bag of words / n-grams. In the other post, I simply used word counts, since I wasn’t trying to build the best model we could. But besides single-word counts, we can also bump up the representation to include bigrams, which are two-word combos. The idea is that there’s information in two-word combos that we aren’t using when we only look at single words. With Scikit-learn, this is very simple to do, and they take care of it for you. Oh, and besides bigrams, we can ask for trigrams, fourgrams, etc., which we’ll do, to see if that improves performance. Take a look at the Wikipedia article on n-grams here.

For example, if a review mentions “coconut oil cream”, as in some sort of face cream (yup, I actually saw this in a misclassified review), then using single words we might get a classification of food, since we just see “coconut”, “oil”, and “cream”. But if we use bigrams as well as the unigrams, we’re also using “coconut oil” and “oil cream” as information. Now this might not get us all the way to a classification of beauty, but it could tip us over the edge.

This is achieved with the ngram_range=(1, 2) parameter in the vectorizer code below. If we pass that into the initializer, our bag will include all the unigrams (single words) as well as the bigrams. If we had ngram_range=(2, 2), we’d only have a bag of bigrams.
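To make that concrete, here’s a quick sketch of what a vectorizer with ngram_range=(1, 2) pulls out of a single string. The sentence is made up, and the exact method for listing the features depends on your Scikit-learn version.

from sklearn.feature_extraction.text import CountVectorizer

# Toy example of what ngram_range=(1, 2) actually gives us
demo_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
demo_vectorizer.fit(["This coconut oil cream smells great"])

# get_feature_names() in older Scikit-learn versions, get_feature_names_out() in newer ones
print(demo_vectorizer.get_feature_names())
# roughly: ['coconut', 'coconut oil', 'cream', 'cream smells', 'great', 'oil', 'oil cream', ...]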

For this test, like I said, I’m going to train classifiers on uni-, bi-, tri-, and four-grams and see what kind of results we get!

# Imports and data used throughout: train_data / test_data are pandas DataFrames
# with 'text' and 'class' columns, labels is the list of class names
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix

actual = test_data['class'].values

classifier = MultinomialNB()

unigram_vectorizer = CountVectorizer(min_df=1, stop_words='english')
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1, stop_words='english')
trigram_vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=1, stop_words='english')
fourgram_vectorizer = CountVectorizer(ngram_range=(1, 4), min_df=1, stop_words='english')


unigram_pipeline = Pipeline([
  ('count_vectorizer', unigram_vectorizer),
  ('classifier' , classifier)
])

bigram_pipeline = Pipeline([
  ('count_vectorizer', bigram_vectorizer),
  ('classifier' , classifier)
])

trigram_pipeline = Pipeline([
  ('count_vectorizer', trigram_vectorizer),
  ('classifier' , classifier)
])

fourgram_pipeline = Pipeline([
  ('count_vectorizer', fourgram_vectorizer),
  ('classifier' , classifier)
])

unigram_pipeline.fit(train_data['text'].values, train_data['class'].values)
unigram_predictions = unigram_pipeline.predict(test_data['text'].values)
unigram_score = accuracy_score(actual, unigram_predictions)
unigram_cmat = confusion_matrix(actual, unigram_predictions, labels)

print "Unigram"
print unigram_score
print labels
print unigram_cmat

bigram_pipeline.fit(train_data['text'].values, train_data['class'].values)
bigram_predictions = bigram_pipeline.predict(test_data['text'].values)
bigram_score = accuracy_score(actual, bigram_predictions)
bigram_cmat = confusion_matrix(actual, bigram_predictions, labels)

print
print "Bigram"
print bigram_score
print labels
print bigram_cmat

trigram_pipeline.fit(train_data['text'].values, train_data['class'].values)
trigram_predictions = trigram_pipeline.predict(test_data['text'].values)
trigram_score = accuracy_score(actual, trigram_predictions)
trigram_cmat = confusion_matrix(actual, trigram_predictions, labels)

print
print "Trigram"
print trigram_score
print labels
print trigram_cmat

fourgram_pipeline.fit(train_data['text'].values, train_data['class'].values)
fourgram_predictions = fourgram_pipeline.predict(test_data['text'].values)
fourgram_score = accuracy_score(actual, fourgram_predictions)
fourgram_cmat = confusion_matrix(actual, fourgram_predictions, labels)

print
print "Fourgram"
print fourgram_score
print labels
print fourgram_cmat

#output
Unigram
0.90215625
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes']
[[2922 57   92   32   12   20   3    36   9    54  ]
 [ 57  2793 93   19   2    177  4    21   2    60  ]
 [ 176 171  2635 44   72   53   10   25   7    130 ]
 [ 90  57   69   2844 23   35   2    48   3    44  ]
 [ 18  2    82   23   3041 5    2    16   1    8   ]
 [ 39  292  63   19   0    2667 1    39   4    40  ]
 [ 12  4    3    5    4    8    3149 6    19   8   ]
 [ 46  39   43   13   23   20   2    2913 2    96  ]
 [ 3   2    2    1    2    0    25   4    3079 2   ]
 [ 106 58   34   13   3    20   8    30   2    2826]]

Bigram
0.954125
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes']
[[3110 25   39   6    5    8    0    25   10   9   ]
 [ 19  3098 34   6    1    36   2    12   4    16  ]
 [ 48  73   3083 14   31   18   3    16   10   27  ]
 [ 61  44   35   2992 12   8    2    36   2    23  ]
 [ 9   1    39   5    3122 2    1    11   2    6   ]
 [ 13  162  24   2    2    2911 1    27   8    14  ]
 [ 6   2    5    3    1    3    3168 4    23   3   ]
 [ 26  27   28   5    9    1    0    3069 4    28  ]
 [ 0   1    4    0    0    2    4    2    3107 0   ]
 [ 76  65   29   5    2    6    3    37   5    2872]]

Trigram
0.96740625
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes']
[[3141 21   31   4    3    1    0    19   12   5   ]
 [ 15  3143 17   5    0    26   2    7    6    7   ]
 [ 27  71   3143 8    20   9    3    15   9    18  ]
 [ 45  25   22   3064 9    3    2    25   2    18  ]
 [ 8   1    27   5    3148 0    0    5    2    2   ]
 [ 6   125  15   0    1    2984 1    21   6    5   ]
 [ 6   2    3    2    1    3    3171 2    28   0   ]
 [ 19  18   20   4    6    0    0    3111 4    15  ]
 [ 0   0    3    0    0    2    0    2    3113 0   ]
 [ 53  42   23   3    2    5    2    27   4    2939]]

Fourgram
0.97021875
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes']
[[3129 21   13   3    2    0    0    4    0    2   ]
 [ 16  3056 19   6    2    23   0    8    2    7   ]
 [ 36  70   3078 8    17   8    4    6    3    12  ]
 [ 28  40   26   3113 16   1    0    24   2    6   ]
 [ 10  0    18   9    3098 2    0    7    2    0   ]
 [ 16  124  32   2    2    2955 0    12   5    11  ]
 [ 5   0    0    0    3    1    3121 2    49   2   ]
 [ 15  18   12   4    4    2    0    3190 5    8   ]
 [ 0   0    1    0    1    0    3    0    3189 0   ]
 [ 39  34   16   2    0    14   0    17   9    3118]]

And what do you know, having more information about the documents (in the form of more phrases) gets us better accuracy!

Now, in terms of overfitting, I’m not sure that’s an issue here when using 4 or more grams in the model. The reason we’re getting better performance from them is that these reviews are often of the same form and use the same phrases. Reviews are sort of a technical document, and if we can capture those technicalities in longer grams, we should. Of course, in production, if we notice we’re not getting the kind of results we expected from training, then we might rethink this, but for now, the more grams the better.

For the rest of the testing, however, I’m going to stick with the bigram vectorizer, since training the larger-gram vectorizers takes longer and I don’t want to sit and wait that long when all we’re doing is comparing accuracies. At the end, I can bump up to tri- or four-grams to get the final numbers.

Type of Naive Bayes?

Naive Bayes classifiers, specifically the P(word|class) values, don’t come in just a single form. In fact, there are three main forms for that term: Gaussian, Multinomial, and Bernoulli. You can read about them on Wikipedia here, and see how they work in Scikit-learn land in the docs here.

In this case, Gaussian doesn’t work since our matrices are sparse (and Scikit-learn will spit back an angry error if we even try to use that classifier), so we want to evaluate Multinomial (which is what I used at first) and also Bernoulli.

Below is the code to test each of the classifiers. We’re running them with the bigram_vectorizer since we’re just looking at relative performance, not overall performance, so we don’t want to waste time training the tri- or four-gram models.

mn_classifier = MultinomialNB()
b_classifier = BernoulliNB()

mn_pipeline = Pipeline([
  ('count_vectorizer', bigram_vectorizer),
  ('classifier' , mn_classifier)
])

b_pipeline = Pipeline([
  ('count_vectorizer', bigram_vectorizer),
  ('classifier' , b_classifier)
])

mn_pipeline.fit(train_data['text'].values, train_data['class'].values)
mn_predictions = mn_pipeline.predict(test_data['text'].values)
mn_score = accuracy_score(actual, mn_predictions)
mn_cmat = confusion_matrix(actual, mn_predictions, labels)

print
print "Multinomial"
print mn_score
print labels
print mn_cmat

b_pipeline.fit(train_data['text'].values, train_data['class'].values)
b_predictions = b_pipeline.predict(test_data['text'].values)
b_score = accuracy_score(actual, b_predictions)
b_cmat = confusion_matrix(actual, b_predictions, labels)

print
print "Bernoulli"
print b_score
print labels
print b_cmat

#output
Multinomial
0.9659375
['baby', 'tool', 'home', 'pet', 'food']
[[3124 20   21   7    7   ]
 [ 33  3113 54   13   2   ]
 [ 62  71   3011 17   26  ]
 [ 50  35   57   3025 12  ]
 [ 18  1    29   10   3182]]

Bernoulli
0.91925
['baby', 'tool', 'home', 'pet', 'food']
[[3113 2    5    55   4   ]
 [ 93  2567 85   468  2   ]
 [ 101 18   2764 269  35  ]
 [ 12  16   8    3143 0   ]
 [ 15  0    19   85   3121]]

Multinomial
0.95765625
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes']
[[3101 18   32   1    5    7    2    18   6    15  ]
 [ 23  3055 44   4    4    42   0    11   9    9   ]
 [ 54  63   2984 8    34   14   6    17   6    24  ]
 [ 33  34   33   2999 7    13   0    33   5    12  ]
 [ 13  0    27   7    3156 1    0    12   3    0   ]
 [ 19  119  45   8    3    2959 3    18   2    14  ]
 [ 8   1    3    2    5    2    3135 5    25   1   ]
 [ 29  14   19   4    12   1    0    3120 8    41  ]
 [ 0   0    3    0    3    0    4    0    3150 1   ]
 [ 79  49   34   11   2    12   5    22   10   2986]]

Bernoulli
0.80146875
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes']
[[2571 1    8    13   0    6    0    11   0    595 ]
 [ 5   2038 17   12   0    196  0    3    0    930 ]
 [ 24  12   2093 17   32   24   0    6    0    1002]
 [ 7   6    9    2725 2    19   0    9    0    392 ]
 [ 3   0    6    3    2881 3    0    9    0    314 ]
 [ 2   12   7    7    0    2657 0    1    0    504 ]
 [ 0   0    1    1    0    0    2645 0    0    540 ]
 [ 9   3    0    4    4    1    0    2640 0    587 ]
 [ 1   0    0    1    1    6    187  1    2206 758 ]
 [ 7   6    3    0    0    1    0    2    0    3191]]

It thinks everything is clothes! Look at that last column. You might be tricked into thinking that clothes is just one problem class, so to compare against multinomial, look at the 5-class results above, where the clothes class isn’t in the mix.


It’s still worse than multinomial.

What if we ran everything through Bernoulli and used the class it predicts, unless it thinks it’s clothes, in which case we’d fall back to multinomial?
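I haven’t actually tried that, but a sketch of the two-stage idea might look something like this, reusing the fit mn_pipeline and b_pipeline from above:

import numpy as np

# Trust the Bernoulli prediction unless it says clothes, then fall back to multinomial
b_predictions = b_pipeline.predict(test_data['text'].values)
mn_predictions = mn_pipeline.predict(test_data['text'].values)

combined_predictions = np.where(b_predictions == 'clothes', mn_predictions, b_predictions)
print(accuracy_score(actual, combined_predictions))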

Tfidf

Term frequency-inverse document frequency is another way of describing a document. Rather than turning each document into a dictionary of raw counts, we turn it into a dictionary of weights, where words that show up across lots of documents get scaled down. You can read up on it more in the Wikipedia article, and in plenty of other articles around the internet.

In the code below, note that there’s also a TfidfVectorizer we could have used instead, but having the two steps in the pipeline gets us to the same spot.
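For reference, a sketch of what that single-step version would look like (not what I actually ran, but it should land in roughly the same place):

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer is roughly CountVectorizer + TfidfTransformer in one step
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1, stop_words='english')

tfidf_vectorizer_pipeline = Pipeline([
  ('tfidf_vectorizer', tfidf_vectorizer),
  ('classifier', classifier)
])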

pipeline = Pipeline([
  ('count_vectorizer', bigram_vectorizer),
  ('classifier' , classifier)
])

pipeline.fit(train_data['text'].values, train_data['class'].values)
predictions = pipeline.predict(test_data['text'].values)
score = accuracy_score(actual, predictions)
cmat = confusion_matrix(actual, predictions, labels)

print "Bigram"
print score
print labels
print cmat

tfidf_pipeline = Pipeline([
  ('count_vectorizer', bigram_vectorizer),
  ('tfidf_transformer', TfidfTransformer()),
  ('classifier' , classifier)
])

tfidf_pipeline.fit(train_data['text'].values, train_data['class'].values)
tfidf_predictions = tfidf_pipeline.predict(test_data['text'].values)
tfidf_score = accuracy_score(actual, tfidf_predictions)
tfidf_cmat = confusion_matrix(actual, tfidf_predictions, labels)

print
print "Using TFIDF"
print tfidf_score
print labels
print tfidf_cmat

#output
Using TFIDF
0.9493125
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes']
[[3114 23   22   1    8    14   3    16   4    26  ]
 [ 27  3051 49   5    2    89   1    13   1    28  ]
 [ 63  95   2916 17   60   30   4    13   6    52  ]
 [ 41  46   39   2928 15   27   0    48   3    23  ]
 [ 20  1    22   7    3069 6    1    7    0    1   ]
 [ 16  116  42   3    1    2880 3    32   5    32  ]
 [ 7   2    2    2    8    2    3104 8    25   5   ]
 [ 23  31   21   5    19   4    2    3124 6    40  ]
 [ 0   0    1    0    2    1    6    2    3188 6   ]
 [ 57  27   17   2    4    19   5    27   5    3004]]

Not Using TFIDF
0.95303125
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes']
[[3124 22   23   3    3    11   3    17   3    22  ]
 [ 21  3099 49   2    2    57   1    10   3    22  ]
 [ 56  91   2993 3    32   25   2    15   6    33  ]
 [ 38  49   31   2962 8    19   0    44   2    17  ]
 [ 17  1    30   8    3061 4    1    11   0    1   ]
 [ 20  125  45   6    5    2879 3    25   3    19  ]
 [ 6   2    6    0    9    1    3101 9    29   2   ]
 [ 21  29   27   6    18   6    2    3129 3    34  ]
 [ 2   0    2    0    2    1    5    2    3188 4   ]
 [ 79  31   29   10   0    14   5    27   11   2961]]

Using the TFIDF transform, we take about a 0.4% hit in accuracy. Not much, but I’m not going to use it if we’re not getting better results. Really, it seems TFIDF is more of a help on smaller data sets where you aren’t accounting for stopwords, for example. Here we remove stopwords and have a large data set, so we don’t really need the TFIDF mechanism.

Effect of amount of training examples

So far, I’ve been using 80,000 total data points for the original 5 classes: 64,000 for training and 16,000 for testing, following the “train with 4/5, test with 1/5” mantra.

I figure now would be a good time to see the effect on accuracy score using variable amounts of data. Common sense would say that the more samples you use, the better the accuracy. Let’s see if that’s the case.
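The loop for this test looks something like the sketch below. I’m assuming the full 5-class set of reviews is sitting in a dataframe I’m calling data here, and depending on your Scikit-learn version, train_test_split lives in sklearn.model_selection or the older sklearn.cross_validation.

from sklearn.model_selection import train_test_split

# `data` is the full 5-class dataframe of reviews (hypothetical name)
for total_samples in [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000]:
    sample = data.sample(n=total_samples)
    train_texts, test_texts, train_classes, test_classes = train_test_split(
        sample['text'].values, sample['class'].values, test_size=0.2)

    bigram_pipeline.fit(train_texts, train_classes)
    predictions = bigram_pipeline.predict(test_texts)

    print("Total Samples: %d" % total_samples)
    print(accuracy_score(test_classes, predictions))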

Total Samples: 10000
0.891
['baby', 'tool', 'home', 'pet', 'food']
[[414 4   8   8   4  ]
 [ 7  324 13  5   3  ]
 [ 34 46  289 14  16 ]
 [ 24 14  1   352 5  ]
 [ 2  0   5   5   403]]

Total Samples: 20000
0.9015
['baby', 'tool', 'home', 'pet', 'food']
[[746 19  15  7   2  ]
 [ 28 732 33  7   3  ]
 [ 54 58  640 8   20 ]
 [ 33 32  29  713 8  ]
 [ 9  2   23  4   775]]

Total Samples: 30000
0.915333333333
['baby', 'tool', 'home', 'pet', 'food']
[[1122 21   26   16   5   ]
 [ 23  1125 47   12   4   ]
 [ 78  67   1026 17   31  ]
 [ 30  43   36   1064 18  ]
 [ 4   1    20   9    1155]]

Total Samples: 40000
0.92375
['baby', 'tool', 'home', 'pet', 'food']
[[1566 32   42   8    3   ]
 [ 28  1523 53   6    6   ]
 [ 88  81   1352 25   59  ]
 [ 53  34   32   1441 9   ]
 [ 8   3    26   14   1508]]

Total Samples: 50000
0.9351
['baby', 'tool', 'home', 'pet', 'food']
[[1912 29   38   14   4   ]
 [ 27  1905 45   16   5   ]
 [ 93  104  1727 16   39  ]
 [ 57  44   49   1830 17  ]
 [ 9   3    27   13   1977]]

Total Samples: 60000
0.948
['baby', 'tool', 'home', 'pet', 'food']
[[2352 28   39   9    6   ]
 [ 30  2339 53   10   2   ]
 [ 85  82   2166 19   54  ]
 [ 38  49   43   2183 13  ]
 [ 13  4    31   16   2336]]

Total Samples: 70000
0.950357142857
['baby', 'tool', 'home', 'pet', 'food']
[[2654 33   34   5    6   ]
 [ 44  2659 52   6    3   ]
 [ 99  96   2534 17   54  ]
 [ 52  75   44   2624 8   ]
 [ 17  1    38   11   2834]]

Total Samples: 80000
0.9574375
['baby', 'tool', 'home', 'pet', 'food']
[[3010 33   32   2    5   ]
 [ 40  3072 51   6    2   ]
 [ 85  89   2944 26   35  ]
 [ 73  74   51   3080 15  ]
 [ 13  3    36   10   3213]]

Would you look at that, more data means more accuracy. Nothing too interesting here, but a good test to make sure things are looking right. If accuracy had gotten worse with more data, that would be a sign of a problem worth looking into.

Priors

Up until now, it’s been mostly plug and chug — there are a few different options for how to describe the documents, and a few different options for classifiers. All we have to do is try each and see which ones perform the best and then go with that.

Going back to Bayes’ Rule, remember that besides the P(word|class) terms, there’s also a leading term, P(class), which we’ve basically been ignoring up until now. That term accounts for the probability of seeing a document of that class in the first place, and it’s basically the foundation of Bayes’ Rule. If we don’t expect to see a document of a certain class, say we see reviews for tools 5 times more often than reviews for baby products, then in order to classify a review as baby, we need to be really certain, based on the words in the review, that we’re hearing about babies.

In our case, what priors do we want, if any? For our use case here, a simple test of the system where people copy and paste reviews to see if we guess correctly for fun, we pretty much have no idea what they’re going to input. So for now, the classifier is going to start with an even distribution for the prior probabilities, meaning we think there’s an even chance a person will paste in a review from any of the 26 classes. Perhaps later versions of the classifier can use information about which types of reviews we actually see from people, but for now, no guessing.
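With Scikit-learn’s MultinomialNB, that just means not letting it learn the priors from the training data. A minimal sketch:

# fit_prior=False gives a uniform prior: every class starts out equally likely
classifier = MultinomialNB(fit_prior=False)

# or, equivalently, spell the prior out by hand
# classifier = MultinomialNB(class_prior=[1.0 / 26] * 26)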

The Big One

At this point, we’ve done pretty well figuring out which vectorizer and which classifier type give us the best results, so now is a good time to go all out: all 26 classes, with the four-gram vectorizer.

0.932907608696
['baby', 'tool', 'home', 'pet', 'food', 'automotive', 'instant_video', 'beauty', 'cds_vinyl', 'clothes', 'digital_music', 'cell_phones', 'electronics', 'kindle', 'movies_tv', 'instruments', 'office', 'patio', 'health', 'sports', 'toys', 'video_games', 'books']
[[3052 10   7    0    2    3    0    3    2    0    0    3    5    0    0    1    29   32   9    0    40   6    0   ]
 [ 4   3025 14   0    0    11   0    7    0    4    0    9    18   0    2    3    59   108  0    13   4    20   0   ]
 [ 25  31   2940 2    7    3    0    1    0    8    4    0    3    0    0    2    45   138  8    0    3    11   2   ]
 [ 14  5    4    2886 8    6    0    24   1    6    0    2    6    0    0    1    9    121  24   0    15   8    0   ]
 [ 3   0    18   0    3134 0    0    6    2    0    1    0    0    0    1    2    4    14   20   0    5    2    6   ]
 [ 6   51   2    0    0    2868 0    2    0    1    3    3    33   0    0    2    22   92   2    9    2    19   0   ]
 [ 0   0    0    0    4    0    2735 0    2    2    20   0    1    7    276  0    1    0    0    0    25   48   44  ]
 [ 5   2    6    2    8    0    0    3112 0    7    3    0    3    0    1    2    21   21   19   0    7    18   0   ]
 [ 0   0    0    0    0    0    0    0    2831 0    403  0    0    0    14   0    0    3    0    0    1    3    0   ]
 [ 17  7    7    2    0    3    0    10   0    2936 4    4    16   0    4    1    37   26   3    20   45   36   0   ]
 [ 0   1    0    0    0    0    0    2    77   0    3121 0    2    0    0    0    0    0    0    0    0    0    0   ]
 [ 4   2    3    0    1    6    0    0    0    3    7    3013 82   0    2    2    37   15   2    0    7    18   0   ]
 [ 2   6    0    0    0    5    0    0    3    5    3    58   3008 0    2    8    47   23   2    4    6    40   0   ]
 [ 0   0    0    0    9    0    0    0    0    0    2    5    6    2755 14   0    2    1    2    0    0    7    349 ]
 [ 0   0    2    0    1    0    33   3    20   2    8    0    8    4    3136 0    1    0    0    0    0    20   8   ]
 [ 0   19   0    3    0    9    0    5    5    2    0    2    22   0    0    3054 21   24   2    0    3    21   0   ]
 [ 4   3    0    0    0    2    0    0    0    0    0    2    6    0    0    2    3133 0    0    0    0    4    2   ]
 [ 2   6    5    0    0    10   0    0    0    0    0    2    2    0    0    0    5    3182 2    2    0    3    0   ]
 [ 18  9    8    6    53   2    0    72   2    11   6    2    17   0    0    2    34   62   2793 4    3    32   0   ]
 [ 10  37   15   4    6    9    0    8    0    33   2    8    19   0    0    6    56   118  4    2853 10   48   3   ]
 [ 23  2    4    2    0    1    0    2    1    2    2    2    4    0    8    2    16   24   0    4    2963 103  2   ]
 [ 0   0    0    0    0    3    2    0    2    0    2    11   15   0    0    3    12   4    0    0    13   3118 3   ]
 [ 1   0    2    0    4    0    0    0    6    0    4    0    0    127  24   0    2    2    0    0    8    20   3014]]

Why Are We Still Getting Mistakes?

93% is pretty damn good. But what about that 7%? Why are we still getting errors? And is there any way to fix that within the naive Bayes constraints?

Length of reviews

More data means better classification rates for sure. So does that hold for the length of reviews? Check out the histogram of review word counts here. I expected an exponential-decay type of distribution, and that’s what we got.

Looks nice, except for that beginning part.

Zooming in, looking for the word count where the distribution bumps up, it looks like 20 words is some sort of cutoff.

Looks like about 20 words is a cutoff!

Who writes 20 word reviews anyway?

Running the classifier on ranges of review lengths, we get the following accuracy scores with the bigram vectorizer and 10 classes. Again, with more data (words in a review, in this case), we get higher accuracy.
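The bucketing itself is just a word count on the text column. A sketch of one way to do it, filtering the test set here (the training set would work the same way):

# Word counts for each review in the test set
word_counts = test_data['text'].str.split().str.len()

short_reviews = test_data[word_counts <= 20]
medium_reviews = test_data[(word_counts >= 25) & (word_counts <= 50)]
long_reviews = test_data[(word_counts >= 50) & (word_counts <= 100)]
longest_reviews = test_data[word_counts >= 100]

short_predictions = bigram_pipeline.predict(short_reviews['text'].values)
print(accuracy_score(short_reviews['class'].values, short_predictions))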

20 Word Max
0.792656587473

Between 25 and 50 Words
0.937370020797

Between 50 and 100 Words
0.964392717216

100 Word Min
0.981591937474

Clearly shorter reviews are tougher to classify. One word in there that’s usually associated with a different class and we’re screwed. Not much you can do in this case.

“My cats love it too! Exactly what I needed for my babies. Bright light, long battery life, compact, easy to use- It’s a laser- what else is there to say?”
Actual class: pet, Guessed class: tool

You can absolutely see how the algorithm gets this one wrong. Lights, battery, compact, lasers: all words that more often describe tools than pet products. But being able to explain the mistake doesn’t make the prediction right.

Another example:

Very gentle on my dog. Smells nice and leaves her coat nice and soft. Would buy again
Actual: pet, Guessed: beauty

Though the review mentions “dog”, that one word alone apparently wasn’t enough to overcome the words “nice” and “soft”.

Similar Classes

Movies and TV and Instant Video. Tools and Automotive. Books and Kindle. Digital Music and CDs and Vinyl. These classes are so close together that you could reasonably argue each pair should just be one class, and you can see in the confusion matrices that they get confused with each other way more often than normal. It’s also worth noting that even though they’re basically the same thing, the classifier still gets the correct class way more often than not. Impressive.
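If we did decide to collapse those pairs, it would just be a label mapping before training, something like the sketch below (the particular merges are my guesses at reasonable pairings):

# Hypothetical merges of the classes that keep getting confused with each other
merged_classes = {
    'instant_video': 'movies_tv',
    'automotive': 'tool',
    'kindle': 'books',
    'digital_music': 'cds_vinyl',
}

train_data['class'] = train_data['class'].replace(merged_classes)
test_data['class'] = test_data['class'].replace(merged_classes)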

Generic Reviews

Sometimes, like with the reviews above, the algorithm messes up and guesses the incorrect class even though we humans know the real answer. But sometimes, humans would get the classification wrong too.

The battery dies in a matter of months. Shame on you, Black and Decker!!!!!
Actual: home, Guessed: tool

Wouldn’t have guessed home here either.

I bought this as a treat for my cat. I mix it into his wet food to help him eat it. He says it’s delicious.
Actual: food, Guessed: pet

Shoutout to a talking cat!

Research for Version 2

With all that in mind, I’m going to say that for our case here, a simple web interface that classifies reviews people paste in, we have a pretty good first version of the classifier. But as all good tutorials should, I’m going to mention some possible ways to improve classification accuracy for a second version.

Knowing when you don’t know

Naive Bayes calculates a probability for each class, and Scikit-learn pipelines actually have functionality for outputting those relative probabilities for each class. There are some questions here about how much we can trust those numbers.

For example, if the classifier thinks there’s more than a 50% chance the review belongs to a class, what percentage of those guesses are correct?

Might there be some better way to get a read on when the classifier just doesn’t know?
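Something like this sketch is where I’d start, using the predict_proba the pipeline exposes (keeping in mind that naive Bayes probabilities tend to be overconfident):

import numpy as np

# Probability per class for each test review, plus the winning class
probabilities = fourgram_pipeline.predict_proba(test_data['text'].values)
predictions = fourgram_pipeline.predict(test_data['text'].values)
max_probabilities = probabilities.max(axis=1)

# Accuracy on just the reviews where the classifier is fairly sure of itself
confident = max_probabilities > 0.5
print(accuracy_score(actual[confident], predictions[confident]))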

Knowing error rates for the individual classes

Are there some classes that we get incorrect more often than others? Maybe there’s a way to know that beforehand, and adjust the model accordingly.
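One way to check, a sketch using Scikit-learn’s classification_report on the predictions from the big run:

from sklearn.metrics import classification_report

# Precision and recall per class instead of one overall accuracy number
print(classification_report(actual, fourgram_predictions, labels=labels))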

Ensembles

When you hear ‘ensemble’ in terms of machine learning algorithms, all that means is combining classifiers, or using different classifiers in different cases.
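Scikit-learn has a VotingClassifier that handles the simple version of this, so a first pass might look something like the sketch below. Which classifiers to combine, and whether to vote on labels or on probabilities, is completely up for experimentation.

from sklearn.ensemble import VotingClassifier

# Soft voting averages the predicted probabilities from each classifier
ensemble = VotingClassifier(estimators=[
    ('multinomial', MultinomialNB()),
    ('bernoulli', BernoulliNB()),
], voting='soft')

ensemble_pipeline = Pipeline([
  ('count_vectorizer', bigram_vectorizer),
  ('classifier', ensemble)
])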

If you know of any more ways to improve the model, reach out on twitter, or the contact page on the site, and I’d love to hear from you. Next time, productionizing the model so we can use it in a web app.
