How to Build Your Own Blockchain Part 1 — Creating, Storing, Syncing, Displaying, Mining, and Proving Work

I can actually look up how long it's been by logging into my Coinbase account, looking at the history of the Bitcoin wallet, and seeing this transaction I got back in 2012 after signing up for Coinbase. Bitcoin was trading at about $6.50 per coin. If I still had that 0.1 BTC, it'd be worth over $500 at the time of this writing. In case people are wondering, I ended up selling it when a Bitcoin was worth $2000. So I only made $200 out of it rather than the $550 it'd be worth now. Should have held on.

Thank you Brian.

Despite knowing about Bitcoin’s existence, I never got much involved. I saw the rises and falls of the $/BTC ratio, saw people talk about how it’s the future, and saw a few articles about how pointless BTC is. I never had an opinion either way; I only somewhat followed along.

Similarly, I have barely followed blockchains themselves. Recently, my dad has mentioned multiple times that the CNBC and Bloomberg stations he watches in the mornings bring up blockchains often, and he has no idea what the term means.

And then suddenly, I figured I should try to learn more about the blockchain than the top level information I had. I started by doing a lot of “research”, which means I searched all around the internet trying to find other articles explaining the blockchain. Some were good, some were bad, some were dense, some were super high level.

Reading only goes so far, and if there’s one thing I know, it’s that reading to learn doesn’t get you even close to the knowledge you get from programming to learn. So I figured I should go through and try to write my own basic local blockchain.

A big thing to mention here is that there are differences between a basic blockchain like the one I’m describing here and a ‘professional’ blockchain. This chain will not create a cryptocurrency. Blockchains do not require producing coins that can be traded and exchanged for physical money. Blockchains are used to store and verify information; coins help incentivize nodes to participate in validation, but they don’t need to exist.

The reason I’m writing this post is 1) so people reading this can learn more about blockchains themselves, and 2) so I can try to learn more by explaining the code and not just writing it.

In this post, I’ll show the way I want to store the blockchain data and generate an initial block, how a node can sync up with the local blockchain data, how to display the blockchain (which will be used in the future to sync with other nodes), and then how to go through and mine and create valid new blocks. For this first post, there are no other nodes. There are no wallets, no peers, no important data. Information on those will come later.

TL;DR

If you don’t want to get into specifics and read the code, or if you came across this post while searching for an article that describes blockchains understandably, I’ll attempt to write a summary of how a blockchain works.

At a super high level, a blockchain is a database where everyone participating in the blockchain is able to store, view, confirm, and never delete the data.

On a somewhat lower level, the data in these blocks can be anything, as long as that specific blockchain allows it. For example, the data in the Bitcoin blockchain is only transactions of Bitcoin between accounts. The Ethereum blockchain allows similar transactions of Ether, but also transactions that are used to run code.

Slightly more downward, before a block is created and linked into the blockchain, it is validated by a majority of people working on the blockchain, referred to as nodes. The true blockchain is the chain containing the greatest number of blocks that is correctly verified by the majority of the nodes. That means if a node attempts to change the data in a previous block, the newer blocks will not be valid and nodes will not trust the data from the incorrect block.
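
If a toy sketch helps make that concrete (and feel free to skip it if you’re here to avoid code), the idea is that each block’s hash covers the previous block’s hash, so changing old data breaks every link after it. This is not the code from this post, just an illustration:

import hashlib

def block_hash(index, prev_hash, data):
  # a block's hash covers its own contents plus the previous block's hash
  return hashlib.sha256(("%s%s%s" % (index, prev_hash, data)).encode()).hexdigest()

# build a tiny three-block chain
chain = []
prev_hash = ''
for index, data in enumerate(['first', 'second', 'third']):
  block = {'index': index, 'prev_hash': prev_hash, 'data': data}
  block['hash'] = block_hash(index, prev_hash, data)
  chain.append(block)
  prev_hash = block['hash']

# tamper with the first block's data and the later links stop matching
chain[0]['data'] = 'tampered'
recomputed = block_hash(chain[0]['index'], chain[0]['prev_hash'], chain[0]['data'])
print(recomputed == chain[1]['prev_hash']) # False -- the rest of the chain rejects it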

Don’t worry if this is all confusing. It took me a while to figure that out myself and a much longer time to be able to write this in a way that my sister (who has no background in anything blockchain) understands.

If you want to look at the code, check out the part 1 branch on Github. Anyone with questions, comments, corrections, or praise (if you feel like being super nice!), get in contact, or let me know on twitter.

Step 1 — Classes and Files

Step 1 for me is to write a class that handles the blocks when a node is running. I’ll call this class Block. Frankly, there isn’t much to do with this class. In the __init__ function, we’re going to trust that all the required information is provided in a dictionary. If I were writing a production blockchain, this wouldn’t be smart, but it’s fine for the example where I’m the only one writing all the code. I also want to write a method that spits out the important block information into a dict, and then have a nicer way to show block information if I print a block to the terminal.

class Block(object):
  def __init__(self, dictionary):
    '''
      We're looking for index, timestamp, data, prev_hash, nonce
    '''
    for k, v in dictionary.items():
      setattr(self, k, v)
    if not hasattr(self, 'hash'): # in creating the first block, needs to be removed in future
      self.hash = self.create_self_hash() # create_self_hash is defined in the full code on Github

  def __dict__(self):
    # dump the important block information into a plain dict of strings
    info = {}
    info['index'] = str(self.index)
    info['timestamp'] = str(self.timestamp)
    info['prev_hash'] = str(self.prev_hash)
    info['hash'] = str(self.hash)
    info['data'] = str(self.data)
    return info

  def __str__(self):
    # nicer way to show a block when printing it to the terminal
    return "Block<prev_hash: %s, hash: %s>" % (self.prev_hash, self.hash)

When we’re looking to create the first block, we can run this simple code.
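
Here’s a minimal sketch of what that could look like (the real version lives in the part 1 branch linked above; this one just fills in the fields the Block class expects and assumes create_self_hash is defined):

import time

def create_first_block():
  # the genesis block sits at index 0 and has no previous hash to point to
  first_block_data = {
    'index': 0,
    'timestamp': time.time(),
    'data': 'First block data',
    'prev_hash': '',
    'nonce': 0,
  }
  return Block(first_block_data)

first_block = create_first_block()
print(first_block)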

Continue reading

NPR Sunday Puzzle Solving, And Other Baby Name Questions

If you have a long drive and no Bluetooth or aux cord to listen to podcasts, NPR is easily the best alternative. Truck drivers agree with this statement no matter their overall views. For me, this was the case when driving home to Milwaukee from Ann Arbor, where I went for a college friend’s wedding.

While driving back I listened to NPR and heard Weekend Edition Sunday and their Sunday Puzzle pop up. If you haven’t heard of it before, at the end of every week’s episode they state a puzzle. Throughout the next week, listeners can submit their answer, and one random correct submitter is chosen to be recorded doing a mini puzzle on air.

The puzzle they stated for the week after the wedding was as follows:

Think of a familiar 6-letter boy’s name starting with a vowel. Change the first letter to a consonant to get another familiar boy’s name. Then change the first letter to another consonant to get another familiar boy’s name. What names are these?

They’ve already released the show for this question (I didn’t win of course) so I figure I can write about how I found out the answer!

Solving The Name Question

The first step, as always for these types of posts, is gathering the required list of familiar boys’ names. Searching on Google for lists will show that there are a ton of sites that exist to SEO themselves for the money. When scraping, you should poke around and make sure to choose the page that has the correct data and is also the simplest to gather from. I went with this one.

Since there’s only one page with the data, there’s no need to use the requests library to hit different pages. Saving the HTML file into the folder you’re programming in is the simplest way to get the data.

The scraping code itself is pretty simple.

from bs4 import BeautifulSoup

filename = 'boy_names.html'
vowels = ('A', 'E', 'I', 'O', 'U')

vowel_starters = []
consonant_starters = []

# parse the saved page and bucket the six-letter names by their first letter
with open(filename, 'r') as f:
  page = f.read()
  html = BeautifulSoup(page.replace('\n', ''), 'html.parser')
  for name_link in html.find_all("li", class_="p1"):
    name = name_link.text
    first_letter = name[0]
    if len(name) == 6:
      if first_letter in vowels:
        vowel_starters.append(name)
      else:
        consonant_starters.append(name)

# for each vowel-starting name, find consonant-starting names that share the last five letters
for vname in vowel_starters:
  cname_same = []
  for cname in consonant_starters:
    if vname[1:] == cname[1:]:
      cname_same.append(cname)
  if cname_same:
    print(vname)
    for match in cname_same:
      print(match)

And the results are…

Austin, Justin, Dustin

Justin and Dustin rhyme, which makes it easier to realize that they match, but Austin isn’t exactly on the same page. If I didn’t have the code, there’s zero chance I’d have gotten this correct.

That’s it, right? Nope. Since I have all the code, I figured I should check to see if there’s a match for girls’ names with the same rules. All there was to do was save the popular girl names page to the same folder, change the filename to ‘girl_names.html’, run the code, and we get Ariana and Briana. A and B are the starting letters, and if Criana were a popular name (at this moment), we’d be good for a full three-name answer.

By going through this part, I came up with some other fun questions that could be answered with this list of names, and the rest of the post is about those.

Continue reading

Web Scraping with Python — Part Two — Library overview of requests, urllib2, BeautifulSoup, lxml, Scrapy, and more!

Welcome to part 2 of the Big-Ish Data general web scraping writeups! I wrote the first one a little bit ago, got some good feedback, and figured I should take some time to go through some of the many Python libraries that you can use for scraping, talk about them a little, and then give suggestions on how to use them.

If you want to check the code I used and not just copy and paste from the sections below, I pushed the code to github in my bigishdata repo. In that folder you’ll find a requirements.txt file with all the libraries you need to pip install, and I highly suggest using a virtualenv to install them. Gotta keep it all contained and easier to deploy if that’s the type of project you’re working on. On this front, also let me know if you’re running this and have any issues!

Overall, the goal of the scraping project in this post is to grab all the information – text, headings, code segments and image urls – from the first post on this subject. We want to get the headings (both h1 and h3), paragraphs, and code sections and print them into local files, one for each tag. This task is very simple overall which means it doesn’t require super advanced parts of the libraries. Some scraping tasks require authentication, remote JSON data loading, or scheduling the scraping tasks. I might write an article about other scraping projects that require this type of knowledge, but it does not apply here. The goal here is to show basics of all the libraries as an introduction.

In this article, there’ll be three sections. First, I’ll talk about libraries that execute http requests to obtain HTML. Second, I’ll talk about libraries that are great for parsing HTML to allow you to scrape the data. Third, I’ll write about libraries that perform both actions at once. And if you have more suggestions of libraries to show, let me know on twitter and I’ll throw them in here.

Finally, a couple notes:

Note 1: There are many different ways of web scraping. People like using different methods, different libraries, different code structures, etc. I understand that, and I recognize that there are other useful methods out there – this is just what I’ve found to be successful over time.

Note 2: I’m not here to tell you that it’s legal to scrape every website. There are laws about what data is copyrighted, what data is owned by the company, and whether or not public data is actually legal to scrape. You’ll want to check things like the site’s robots.txt, its Terms of Service, and maybe a frequently asked questions page.

Note 3: If you’re looking for data, or any other data engineering task, get in contact and we’ll see what I can do!

Ok! That all being said, it’s time to get going!

Requesting the Page

The first section here shows a few libraries that can hit web servers and ask nicely for the HTML.

For all the examples here, I request the page and then save the HTML in a local file. The other note on this section is that if you’re going to use one of these libraries, this is only part one of the scraping! I talked about that a lot in the first post of this series: you need to make sure you split up getting the HTML and then, separately, work on scraping the data out of the HTML.

The first library in this section is the famous, popular, and very simple to use requests.

requests

Let’s see it in action.

import requests
import helpers # small helper module from the bigishdata repo that saves HTML locally

url = "https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/"
params = {"limit": 48, 'p': 2} # query string (?) values; pass params=params to requests.get if you need them
headers = {'user-agent' : 'Jack Schultz, bigishdata.com, contact@bigishdata.com'}
page = requests.get(url, headers=headers)
helpers.write_html('requests', page.text.encode('UTF-8'))

Like I said, incredibly simple.

Requests also has the ability to use more advanced features like SSL, credentials, cookies, and more. Like I said, I’m not going to go into those features (but maybe later). For now, it’s time for simple examples on an actual project.

Overall, even before talking about the other libraries below, requests is the way to go.

urllib / urllib2

Ok, time to ignore that last sentence in the requests section and move on to another simple library, urllib2. If you’re using Python 2.X, it’s very simple to request a single page. And by simple, I mean a couple of lines.
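
A minimal sketch of what that looks like in Python 2 (the output filename here is just a placeholder):

import urllib2

url = "https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/"
request = urllib2.Request(url, headers={'User-Agent': 'Jack Schultz, bigishdata.com, contact@bigishdata.com'})
response = urllib2.urlopen(request)

# save the raw HTML locally, same idea as the requests example above
with open('urllib2.html', 'w') as f:
  f.write(response.read())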

Continue reading

General Tips for Web Scraping with Python

The great majority of the projects about machine learning or data analysis I write about here on Bigish-Data have an initial step of scraping data from websites. And since I get a bunch of contact emails asking me either for the data I’ve scraped myself or for help getting the code to work for them, I figured I should write something here about the process of web scraping!

There are plenty of other things to talk about when scraping, such as specifics on how to grab the data from a particular site, which Python libraries to use and how to use them, how to write code that scrapes the data in a daily job, where exactly to look to figure out how to get the data from random sites, etc. But since there are tons of other specific tutorials online, I’m going to talk about overall thoughts on how to scrape. There are three parts to this post – how to grab the data, how to save the data, and how to be nice.

As is the case with everything programming-wise, if you’re looking to learn scraping, you can’t just read tutorials and think to yourself that you now know how to do it. Pick a project, practice grabbing the data, and then write a blog post about what you learned.

There definitely are tons of different thoughts on scraping, but these are the ones I’ve learned from doing it for a while. If you have questions, comments, or want to call me out, feel free to comment, or get in contact!

Grabbing the Data

The first step for scraping data from websites is to figure out where the site keeps its data and what method it uses to display the data in the browser. For this part of your project, I suggest writing the code in a file named gather.py, which performs all of these tasks.
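
As a rough sketch of that idea (the URL and output filename are placeholders), gather.py can start as small as fetching the raw HTML and saving it for a separate parsing step:

# gather.py -- minimal sketch: fetch the raw HTML and save it for later parsing
import requests

URL = "https://example.com/page-with-the-data" # placeholder target

def gather(url, filename):
  headers = {'user-agent': 'your name and contact info here'} # be up front about who is scraping
  response = requests.get(url, headers=headers)
  with open(filename, 'wb') as f:
    f.write(response.text.encode('utf-8'))

if __name__ == '__main__':
  gather(URL, 'page.html')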

Continue reading

Product Mentions Update — Thoughts When Reviewing the Reddit Mentions

More than a few months ago, I created a Python script and a Rails website that track links to Amazon that people put in their comments and posts on Reddit. Clearly, a great name for this type of site is Product Mentions. Now that the site has been gathering mentions for a while, I figure it’s time to look through them and talk about some interesting findings!

And before we get started, if you’re looking for information about Reddit comments on your site, blog, company, etc., shoot me an email and we can get started.

Technology!

Obviously the first thing to check is what Amazon product groups are the most mentioned on Reddit, and when you check the page, it’s incredibly clear that people love mentioning specific computer technology. Check out the frequency of product mentions of personal computers. Laptops on laptops, and apparently so many mentions of Acer brand laptops.

Books!

Since books are the second most mentioned product, it’s also very interesting to see what types of subreddits are the ones linking to books. And there are tons of them, but they’re much more specific subreddits.

Continue reading

A Practical Use For Python Decorators — Logging, Error Checks, and Timing

When using a Python decorator, especially one defined in another library, it can seem somewhat magical. Take for example Flask’s routing mechanism. If I put some statement like @app.route("/") above my logic, then poof, suddenly that code will be executed when I go to the root url on the server. And sure, decorators make sense when you read the many tutorials out there that describe them. But for the most part, those tutorials just explain what’s going on, mostly by printing out some text, and not why you might want to use a decorator yourself.

I was of that opinion before, but recently I realized I have the perfect use for a decorator in a project of mine. In order to get the content for Product Mentions, I have Python scrapers that go through Reddit looking for links to an Amazon product, and once I find one, I gather up the link and use the Amazon Product API to get information on the product. Once that’s in the database, I use Rails to display the items to the user.

While doing the scraping, I also wanted a web interface so I could check for errors, see how long the jobs are taking, and overall make sure I haven’t missed anything. So along with the actual Python script that grabs the html and parses it, I created a table in the database for logging the scraping runs, and I update that for each job. Simple, and does the job I want.

The issue I came across here, and where decorators come into play, is code reuse. After some refactoring, I have a few different jobs, all of which have the following format: create a log object for this job, commit it to the db so I can see that it’s running in real time, try the code that does the job’s work (excepting and logging any error so we don’t crash the process), and then record the end time of the job.

def gather_comments():
  scrape_log = ScrapeLog(start_time=datetime.now(), job_type="comments")
  session.add(scrape_log)
  session.commit()

  try:
    rg = RedditGatherer()
    rg.gather_comments()
  except Exception as e:
    scrape_log.error = True
    scrape_log.error_message = e.message

  scrape_log.end_time = datetime.now()
  session.add(scrape_log)
  session.commit()

def gather_threads():
  scrape_log = ScrapeLog(start_time=datetime.now(), job_type="threads")
  session.add(scrape_log)
  session.commit()

  try:
    rg = RedditGatherer()
    rg.gather_threads()
  except Exception as e:
    scrape_log.error = True
    scrape_log.error_message = e.message

  scrape_log.end_time = datetime.now()
  session.add(scrape_log)
  session.commit()

If you know a bit about how decorators work, you can already see how perfect an opportunity this is, because decorators allow you to extend and reuse functionality on top of functions you already have. For me, I want to log, time, and error-check my scraping, and copying the same code around is not ideal. But a decorator is. Here’s how to write one.

Decorator Time

The first thing to do is write a function that takes a function as a parameter and calls that function at the appropriate time. Since the work of the functions above is done in the same format, this turns out really nicely.
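
To give a sense of where this ends up, here’s a minimal sketch of such a decorator (reusing the ScrapeLog, session, and RedditGatherer objects from the snippets above; the decorator name itself is just a placeholder):

from datetime import datetime
from functools import wraps

def logged_job(job_type):
  # wraps a scraping job with the create-log / try-except / end-time pattern from above
  def decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
      scrape_log = ScrapeLog(start_time=datetime.now(), job_type=job_type)
      session.add(scrape_log)
      session.commit()
      try:
        func(*args, **kwargs)
      except Exception as e:
        scrape_log.error = True
        scrape_log.error_message = str(e)
      scrape_log.end_time = datetime.now()
      session.add(scrape_log)
      session.commit()
    return wrapper
  return decorator

@logged_job("comments")
def gather_comments():
  rg = RedditGatherer()
  rg.gather_comments()

Same behavior as before, but the logging lives in one place instead of being copied into every job.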

Continue reading

Running Python Background Jobs with Heroku

Recently, I’ve been working on a project that scrapes Reddit looking for links to products on Amazon. The basic idea is that there’s valuable info in what people are linking to and talking about online, and a starting point is looking for links to Amazon products on Reddit. The result of that work turned into Product Mentions.

To build this (and I can talk more about this later), I have two parts. The first is a basic Rails app that displays the products and where they’re talked about, and the second is a Python app that does the scraping and also displays the scraping logs for me using Flask. I thought of just combining the two at first, but decided it was easier in both regards to separate the functionalities. The scraper populates the database, and the Rails app displays what’s in there. I hosted the Rails app on Heroku, and after some poking around, decided to also run the Python scraper on Heroku as well (for now at least!)

Also, if at this point you’re thinking to yourself, “why the hell is he using an overpriced web app hosting service like Heroku when there are so many other options available?” you’re probably half right, but in terms of ease of getting started, Heroku was by far the easiest PaaS to get this churning. Heroku is nice, and this setup is really simple, especially compared to some of the other PaaS options out there that require more configuration. You can definitely look for different options if you’re doing a more full web crawl, but this’ll work for a lot of purposes.

So what I’m going to describe here today is how I went about running the scrapers on Heroku as background jobs, using clock and worker processes. I’ll also talk a little about what’s going on so it makes a little more sense than those copy-paste tutorials I see a lot (though that type of tutorial from Heroku’s docs is what I used here, so I can’t trash them too badly!).

worker.py

The first file you’re going to need here is a worker file, which will perform whatever function it sees coming off a queue. For ease, I’ll name this file worker.py. It will connect to Redis and just wait for a job to be put on the queue, and then run whatever it sees. First, we need rq, the library that deals with the Redis queueing in the background (all of this assumes you’re in a virtualenv).

$ pip install rq
$ pip freeze > requirements.txt

This is the only external library you’re going to need for a functioning worker.py file, as specified by the nice Heroku doc. The file imports the required objects from rq, connects to Redis using either an environment variable (which would be set in a production / Heroku environment) or a local default, creates a worker, and then calls work. So in the end, running python worker.py will just sit there waiting for jobs to run, in this case scraping Reddit. We also have ‘high’, ‘default’, and ‘low’ queues so the worker knows which jobs to run first, but we aren’t going to need that here.

import os

import redis
from rq import Worker, Queue, Connection

# queue names the worker will listen on, in priority order
listen = ['high', 'default', 'low']

# use the Heroku Redis URL if it's set, otherwise fall back to a local Redis
redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')

conn = redis.from_url(redis_url)

if __name__ == '__main__':
  with Connection(conn):
    worker = Worker(map(Queue, listen))
    worker.work()

clock.py

Now that we have the worker set up, here’s the clock.py file that I’m using to do the scraping. It imports the conn variable from the worker.py file and uses that to make sure we’re connected to the same Redis queue. We also import the functions that use the scrapers from run.py, and in this file create functions that will enqueue those respective functions. Then we use apscheduler to schedule when we want to call these functions and start the scheduler. If we run python clock.py, the scheduler will run in perpetuity (hopefully) and will call the correct code on the intervals we defined.
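
Here’s a rough sketch of that structure (the function names imported from run.py and the intervals are assumptions on my part):

from apscheduler.schedulers.blocking import BlockingScheduler
from rq import Queue

from worker import conn
from run import run_comments_gathering, run_threads_gathering # placeholder names

q = Queue(connection=conn)
sched = BlockingScheduler()

@sched.scheduled_job('interval', minutes=15)
def queue_comments():
  # put the comment-scraping job on the queue for the worker to pick up
  q.enqueue(run_comments_gathering)

@sched.scheduled_job('interval', minutes=30)
def queue_threads():
  q.enqueue(run_threads_gathering)

sched.start()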

Continue reading

Classifying Amazon Reviews with Scikit-Learn — More Data is Better Turns Out

Last time, I went through some basics of how the naive Bayes algorithm works and the logic behind it, and implemented the classifier myself, as well as using NLTK’s. That’s great and all, and hopefully people reading it got a better understanding of what was going on, and possibly how to play along with classification for their own text documents.

But if you’re looking to train and actually deploy a model (say, a website where people can copy and paste reviews from Amazon and see how our classifier performs), you’re going to want to use a library like Scikit-Learn. So with this post, I’ll walk through training a Scikit-Learn model and testing various classifiers and parameters to see how we do, and at the end we’ll have an initial, version 1 Amazon review classifier that we can use in a production setting.

Some notes before we get going:

  • For a lot of the testing, I only use 5 or 10 of the full 26 classes that are in the dataset.
  • Keep in mind that what works here might not be the same for other data sets. We’re specifically looking at Amazon product reviews. For a different set of texts (you’ll also see the word corpus being thrown around), a different classifier or parameter set might be used.
  • The resulting classifier we come up with is, well, really really basic, and probably what we’d have guessed would perform the best at the onset. All the time and effort that goes into checking all the combinations is still worth it, though, to confirm that guess.
  • I’m going to mention here this good post that popped up when I was looking around for other people who wrote about this. It really nicely outlines how to classify text with Scikit-learn. To reduce redundancy, something that we all should work towards, I’m going to point you to that article to get up to speed on Scikit-learn and how it can apply to text. In this article, I’m going to start at the end of that one, where we’re working with Scikit-learn pipelines.

As always, you can say hi on twitter, or yell at me there for messing up as well if you want.

How many grams?

The first step to think about is how we want to represent the reviews in naive Bayes world, in this case as a bag of words / n-grams. In the other post, I simply used word counts since I wasn’t going into how to make the best model we could. But besides word counts, we can also bump up the representation to include something called a bigram, which is a two-word combo. The idea behind that is that there’s information in two-word combos that we aren’t using with just single words. With Scikit-learn, this is very simple to do, and they take care of it for you. Oh, and besides bigrams, we can ask for trigrams, fourgrams, etc., which we’ll do to see if that improves performance. Take a look at the Wikipedia article on n-grams here.

For example, if a review mentions “coconut oil cream”, as in some sort of face cream (yup, I actually saw this in a misclassified review), simply using the words means we might get a classification of food since we just see “coconut”, “oil”, and “cream”. But if we use bigrams as well as the unigrams, we’re also using “coconut oil” and “oil cream” as information. Now this might not get us all the way to a classification of beauty, but it could tip us over the edge.
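
As a quick sketch of what that looks like with Scikit-learn (the training variables here are placeholders, and the article linked above covers pipelines in more depth):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# unigrams plus bigrams: CountVectorizer handles the n-gram work for you
pipeline = Pipeline([
  ('vect', CountVectorizer(ngram_range=(1, 2))),
  ('clf', MultinomialNB()),
])

# train_texts is a list of review strings, train_labels the matching class names
# pipeline.fit(train_texts, train_labels)
# pipeline.predict(["This coconut oil cream made my skin so soft"])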

Continue reading

Practical Naive Bayes — Classification of Amazon Reviews

If you search around the internet looking for applying Naive Bayes classification on text, you’ll find a ton of articles that talk about the intuition behind the algorithm, maybe some slides from a lecture about the math and some notation behind it, and a bunch of articles I’m not going to link here that pretty much just paste some code and call it an explanation.

So I’m going to try to do a little more here and, by hopefully writing and explaining enough, let you write a working Naive Bayes classifier yourself.

There are three sections here. First is setup, and what format I’m expecting your text to be in for the classification. Second, I’ll talk about how to run naive Bayes on your own, using slow Python data structures. Finally, we’ll use Python’s NLTK and its classifier so you can see how to use that, since, let’s be honest, it’s gonna be quicker. Note that you wouldn’t want to use either of these in production, so look for a follow-up post about how you might go about doing that.

As always, say hi on twitter, and check out the full code on github.

Setup

Data for this is going to come from this UCSD Amazon review data set. I swear one of the biggest issues with running these algorithms on your own is finding a data set big and varied enough to get interesting results. Otherwise you’ll spend most of your time scraping and cleaning data, and by the time you get to the ML part of the project, you’re sufficiently annoyed. So big thanks that this data already exists.

You’ll notice that this set has millions of reviews for products across 24 different classes. In order to keep the complexity down here (this is a tutorial post after all), I’m sticking with two classes, and ones that are different enough from each other to show that classification works: we’ll be classifying baby reviews against tools and home improvement reviews.

Preprocessing

The first thing I want to do now, after unpacking the .gz file, is to get a train and test set that’s smaller than the 160,792 baby reviews and 134,476 tool reviews. For purposes here, I’m going to use 1000 of each, with 800 used for training and 200 used for testing. The algorithms are able to support any number of training and test reviews, but for demonstration purposes, we’re keeping that number low.

Check the github repo if you want to see the code, but I wrote a script that takes the full file, picks 1000 random reviews, puts 800 of them into the training set and 200 into the test set, and saves them to files named “train_CLASSNAME.json” and “test_CLASSNAME.json”, where CLASSNAME is either “baby” or “tool”.
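
The script in the repo is the source of truth, but a rough sketch of that splitting step looks something like this (the input filename is an assumption):

import random

def split_reviews(input_filename, class_name, sample_size=1000, train_size=800):
  # sample 1000 reviews (one per line) and split them 800 / 200 into train and test files
  with open(input_filename) as f:
    lines = f.readlines()
  sample = random.sample(lines, sample_size)
  with open('train_%s.json' % class_name, 'w') as train_file:
    train_file.writelines(sample[:train_size])
  with open('test_%s.json' % class_name, 'w') as test_file:
    test_file.writelines(sample[train_size:])

split_reviews('reviews_Baby.json', 'baby') # the unpacked filename here is a guess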

Also, the files from that dataset are really nice in that each line is already a Python object. So to get them into a script, all you have to do is run eval on each line of the file to get the dict object.
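
For example, loading one of the saved files back into a list of dicts is just a couple of lines:

# each line of the saved file is a Python dict literal, so eval turns it back into a dict
with open('train_baby.json') as f:
  reviews = [eval(line) for line in f]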

Features

There really wasn’t a good place to talk about this, so I’ll mention it here before getting into either the self-implemented or the NLTK run of the algorithm. The features we’re going to use are simply the lowercased versions of all the words in the review. This means, in order to get a list of these words from the block of text, we remove punctuation, lowercase every word, split on spaces, and then remove words that are in the NLTK corpus of stopwords (basically boring words that don’t carry any information about the class).

import string

from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS.add('') # splitting on spaces can leave empty strings, so toss those too

def clean_review(review):
  # strip punctuation, lowercase, split on spaces, and drop the stopwords
  exclude = set(string.punctuation)
  review = ''.join(ch for ch in review if ch not in exclude)
  split_sentence = review.lower().split(" ")
  clean = [word for word in split_sentence if word not in STOP_WORDS]
  return clean

Realize here that there are tons of different ways to do this, and ways to get more sophisticated that hopefully can get you better results! Things like stemming, which takes words down to their root word (Wikipedia gives the example of “stems”, “stemmer”, “stemming”, and “stemmed” all being based on “stem”). You might want to include n-grams, for an n larger than 1, as well.
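
As a quick illustration of that (not something used in this post’s classifier), NLTK’s Porter stemmer does the root-word reduction:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ['stems', 'stemming', 'stemmed']:
  print(stemmer.stem(word)) # each of these prints "stem"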

Basically, there’s tons of processing on the text that you could do here. But since I’m just talking about how Naive Bayes works, I’m sticking with simplicity. Maybe in the future I can get fancy and see how well I can do in classifying these reviews.

Ok, on to the actual algorithm.

Continue reading

Classifying Country Music Songs is an Art — Getting Training Data

If you’ve been following along recently, I’ve been writing about my theory of country music, and how, unlike most other genres out there, country music song topics are, let’s just say, much more centralized. And so in my continuing effort to automatically classify country song topics, I need to take all the song lyrics I downloaded and manually classify them so I have some training data.

This is actually the third post on this topic I’ve written. The first post showed how to get song lyrics using Genius’s API and scraping, and in the second post, I gathered up all the lyrics from country artists, removed the duplicates, and realized that Lee Brice talks about beer and trucks much more than he does about love. The stats I ran at the end of the second entry are fine and all, but really what I have at the moment is some 5 thousand songs that are uncategorized, which isn’t going to allow me to do any more sophisticated classification than simple word analysis.

What this means is I’m going to need some help classifying those 5000 songs. To do this, I wrote a Rails app, deployed on Heroku’s free tier, that will allow anyone to sign up and help with this task. Obviously I’m not expecting people to get through all 5000 themselves (other than me of course), but hopefully if I can get enough people to do more than a few songs, I can get a good enough representation to get interesting results.

The rest of the article is as follows. First, I’ll have a section where I talk about my theory of country music song topics, which I’ve been annoying my friends with whenever we talk about country music. Then in the next / last section, I’ll talk about how I’m looking to get these songs classified, what the interface is like, and what’s going on behind the scenes.

As an aside, I am somewhat of a fan of country music. I usually just say it’s pop music with a slide, and by definition, pop music is catchy. But still, those country song lyrics can get quite ridiculous, which is definitely fun to laugh at.

Also, follow me along on twitter for more updates on this, and other topics.

Country Music Song Topics

Topic 1: Love

Love. The classic song topic — universally relatable, unbounded in subtopics, and somewhat of a default topic for any story, song or otherwise. So it makes sense that love is quite prevalent in many of the country songs.

Whether happy songs, about how Brett Young can’t go to sleep unless his girl is next to him at night,

or sad songs, about how the guy in Billy Currington’s song, “It Don’t Hurt Like It Used To”, was broken up with, but got over it eventually. Or somewhat, cause it still hurts.

Now that I think about it, I’m not sure there’s any song genre out there that doesn’t have love as a main song topic.

Topic 2: Small Town Life

Nothing says small town life like boots, dirt roads, dumpy bars with a cover band, railroad tracks, barns, white churches, and crop fields. No, I’m not making this up; those are just some of the things the band LoCash sings about in their recent song titled “I Love this Life.”

If that wasn’t enough, how about picket fences, blue sky and green grass, old Ford trucks, back porches, homemade wine, tire swings, fireworks, and dead deer heads waiting to be hung on the wall. Yup, that’s just what Drake White is singing about in his song “Livin’ The Dream”. I admit, if I hear this when scrolling through stations on the radio in my car that doesn’t have an aux port, I’ll turn it up.

Continue reading