Talkin’ ‘Bout Trucks, Beer, and Love in Country Songs — Analyzing Genius Lyrics

Trucks, beer, and love: all things that make country music go round. I’ve said before that country music is just pop music with a slide, plus lyrics about slightly different topics than what you’ll hear in hip hop or “normal” pop music on the radio.

In my continuing quest to validate my theory that all country songs fit into one of four different topics, in this post I go through lyrics to see which artists talk about trucks, beer, and love the most. In my first post on this topic, I talked about how to get song lyrics from Genius and print them out on the command line.

The goal here, and what I’m going to walk you through, is how I stored the info and lyrics for all the songs by the country artists, how I made sure that all the lyrics were unique, and then how I ran some stats on the songs. Another note before we go: a lot of data work is just janitorial. The actual code for getting “interesting” results is fairly simple. The key is to enjoy doing the janitor-style coding, and then you’ll be good.

If you’re just interested in which country artists talk the most about trucks, beer, alcohol, or small towns, skip to the end where I list out some stats. For the rest of you, here’s some code.

https://www.pinterest.com/pin/59180182578213991/

I wonder how they feel about beer trucks. I’m guessing they’d all be fans of them.

Step 1 — Save the Lyrics!

When doing anything with web scraping, the one thing to always, always keep in mind is that you want to hit the server as little as possible. With that in mind, what we’re going to do here is assume the inputs are names of artists. For each of those artists, find all of their songs, then for each of those songs, grab the lyrics the way I did in the first post, and save them locally along with some meta information the API provides.

Now when I post the following code, don’t imagine that I knew what I wanted from the start. Everything in here was created iteratively. Here’s a list of the features of this piece of code that came out of that iteration.

Directory structure — Within the folder that contains the main .py file, there’s a folder named artists. Within that folder, when the code runs, a folder with the artist’s name is created (if it doesn’t exist already). And within that folder, there are two more folders, info and lyrics. When the code runs, I put the lyrics in /artists/artist_name/lyrics/Song Title_api_id.txt and the info from the API, containing information about the song like annotations, title, and the song’s API id so we can grab it again if need be, in /artists/artist_name/info/Song Title_api_id.txt. The key, again, is saving all the info we’re given so we avoid unnecessary requests.
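To make that layout concrete, here’s roughly how those paths get built for one hypothetical song (the api id 12345 is made up for illustration):

artist_name = "Sam Hunt"
song_title = "Speakers"
api_id = "12345" #hypothetical Genius song id, just for this example

artist_folder_path = "artists/%s" % artist_name.replace(' ', '_').lower()
lyric_file_path = "%s/lyrics/%s_%s.txt" % (artist_folder_path, song_title, api_id)
info_file_path = "%s/info/%s_%s.txt" % (artist_folder_path, song_title, api_id)

print lyric_file_path #artists/sam_hunt/lyrics/Speakers_12345.txt
print info_file_path #artists/sam_hunt/info/Speakers_12345.txt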

Redundancy Checking — Along with making sure to save all the info given, if we run an artist for the second time, we don’t want to get lyrics that we already have. So once we have all the songs for that artist, I run a check to see if we have a file with the name of the song already, and that the file isn’t empty. If the file is there, we continue to the next song.

Lyric Error Checking — Ahh unicode. While great for allowing multitudes of different characters rather than the standard English alphabet along with a few specialty characters, they’re not ideal when I’m trying to deal with simple song lyrics. And when saving the lyrics, I encountered more than a few random, unnecessary characters that Python threw errors for encoding problems. In a semi-janky rule-based solution (which isn’t great to use, see below), when I saw these errors being thrown, I would specifically replace them with the correct “normal” character. I assume there’s some library out there that would take care of all the encoding issues, but this worked for me. Also, on Genius’s end, it would be sweet if they, you know, checked for abnormal characters when lyrics were uploaded and didn’t have them in the first place. Also would be cool if they included the lyrics in the API.

def clean_lyrics(lyrics):
  lyrics = lyrics.replace(u"\u2019", "'") #right quotation mark
  lyrics = lyrics.replace(u"\u2018", "'") #left quotation mark
  lyrics = lyrics.replace(u"\u02bc", "'") #a with dots on top
  lyrics = lyrics.replace(u"\xe9", "e") #e with an accent
  lyrics = lyrics.replace(u"\xe8", "e") #e with an backwards accent
  lyrics = lyrics.replace(u"\xe0", "a") #a with an accent
  lyrics = lyrics.replace(u"\u2026", "...") #ellipsis apparently
  lyrics = lyrics.replace(u"\u2012", "-") #hyphen or dash
  lyrics = lyrics.replace(u"\u2013", "-") #other type of hyphen or dash
  lyrics = lyrics.replace(u"\u2014", "-") #other type of hyphen or dash
  lyrics = lyrics.replace(u"\u201c", '"') #left double quote
  lyrics = lyrics.replace(u"\u201d", '"') #right double quote
  lyrics = lyrics.replace(u"\u200b", ' ') #zero width space ?
  lyrics = lyrics.replace(u"\x92", "'") #different quote
  lyrics = lyrics.replace(u"\x91", "'") #still different quote
  lyrics = lyrics.replace(u"\xf1", "n") #n with tilde!
  lyrics = lyrics.replace(u"\xed", "i") #i with accent
  lyrics = lyrics.replace(u"\xe1", "a") #a with accent
  lyrics = lyrics.replace(u"\xea", "e") #e with circumflex
  lyrics = lyrics.replace(u"\xf3", "o") #o with accent
  lyrics = lyrics.replace(u"\xb4", "") #just an accent, so remove
  lyrics = lyrics.replace(u"\xeb", "e") #e with dots on top
  lyrics = lyrics.replace(u"\xe4", "a") #a with dots on top
  lyrics = lyrics.replace(u"\xe7", "c") #c with squigly bottom
  return lyrics
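As an aside, the “some library” route I mentioned could look something like this minimal sketch using Python’s built-in unicodedata module. This isn’t what I ran, and note that it silently drops characters it can’t map (the curly quotes disappear instead of becoming straight ones), so the explicit replacements above are arguably kinder to the lyrics:

import unicodedata

def clean_lyrics_normalized(lyrics):
  #decompose accented characters (e with an accent becomes e + a combining mark),
  #then drop anything that doesn't fit in plain ascii
  normalized = unicodedata.normalize("NFKD", lyrics)
  return normalized.encode("ascii", "ignore")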

Check out most of the main function below. If you’re looking for the actual full file, check out this gist. It’s easier to post that on Github than to format the entire thing here.

def song_ids_already_scraped(artist_folder_path, force=False):
  #check for ids already scraped so we don't redo
  if force:
    return []
  song_ids = []
  files = os.listdir(artist_folder_path)
  for file_name in files:
    #split on the last dot so song titles that contain periods still work
    dot_split = file_name.rsplit('.', 1)
    #sometimes the file is empty, we don't want to include it if that's the case
    if dot_split[-1] == 'txt':
      try:
        song_id = dot_split[0].split("_")[-1]
        if os.path.getsize(artist_folder_path + '/' + file_name) != 0:
          song_ids.append(song_id)
      except OSError:
        pass
  return song_ids

def info_from_song_api_path(song_api_path):
  song_url = base_url + song_api_path
  response = requests.get(song_url, headers=headers)
  json = response.json()
  return json

def songs_from_artist_api_path(artist_api_path):
  api_paths = []
  artist_url = base_url + artist_api_path + "/songs"
  data = {"per_page": 50, "page": 1}
  while True:
    response = requests.get(artist_url, data=data, headers=headers)
    json = response.json()
    songs = json["response"]["songs"]
    for song in songs:
      api_paths.append(song["api_path"])
    if len(songs) < 50:
      break #no more songs for artist
    data["page"] += 1 #ask for the next page instead of re-requesting page 1
  return list(set(api_paths))

if __name__ == "__main__":
  for artist_name in artist_names:
    #setting up path to artist's directories
    artist_folder_path = "artists/%s" % artist_name.replace(' ', '_').lower()
    artist_lyrics_path = "%s/lyrics" % artist_folder_path
    artist_info_path = "%s/info" % artist_folder_path
    if not os.path.exists(artist_folder_path):
      os.makedirs(artist_folder_path)
    if not os.path.exists(artist_lyrics_path):
      os.makedirs(artist_lyrics_path)
    if not os.path.exists(artist_info_path):
      os.makedirs(artist_info_path)

    #only using lyrics since they're saved second
    prev_song_ids = song_ids_already_scraped(artist_lyrics_path)

    #find the artist!
    search_url = base_url + "/search"
    data = {'q': artist_name}
    response = requests.get(search_url, data=data, headers=headers)
    artist_info = response.json()
    for hit in artist_info["response"]["hits"]:
      song_api_path = hit["result"]["api_path"]
      artist_api_path = artist_id_from_song_api_path(song_api_path, artist_name)
      if artist_api_path: #done searching if we found the guy
        break

    if not artist_api_path:
      print "Could not find %s" % artist_name
      continue #move on to the next artist

    #find the song api ids for the artist
    song_api_paths = songs_from_artist_api_path(artist_api_path)

    #print out how many songs we have left
    print len(song_api_paths) - len(prev_song_ids)

    for song_api_path in song_api_paths:
      api_id = song_api_path.split('/')[2]
      if api_id in prev_song_ids:
        continue #don't redo
      full_song_info = info_from_song_api_path(song_api_path)
      song_title = full_song_info["response"]["song"]["title"]
      song_title_path = song_title.replace('/', '_')#.replace(' ', '_').lower()
      song_web_path = full_song_info["response"]["song"]["path"]

      lyrics = lyrics_from_song_web_path(song_web_path)

      lyric_path = "%s/lyrics/%s_%s.txt" % (artist_folder_path, song_title_path, api_id)
      info_path = "%s/info/%s_%s.txt" % (artist_folder_path, song_title_path, api_id)

      #for record keeping purposes
      print lyric_path

      with open(info_path, "w") as ifile:
        ifile.write(json.dumps(full_song_info))
      with open(lyric_path, "w") as lfile:
        try:
          lfile.write(lyrics)
        except UnicodeEncodeError as error:
          print error

Run this with a giant array of country music artists and, after a while, you’ll have a giant directory full of lyrics to play with.

Step 2 — Creating a Copy of the Lyrics

First thing, I want to copy the lyrics directory, which I named “lyrics”, over to another one I’ll call “lyrics_orig” because I couldn’t think of a better name at the moment. The reason for this is that I want to keep a record of all the lyrics I downloaded in the first place. That’s valuable information if, for example, I ever want to go back and look at the full range of songs that I gathered the first time I ran the script. Just like with saving the info from the API, I don’t want to throw away information if I don’t have to. Below is the code for looping through the artists and copying the files over to the new dir.

import os
import shutil

artist_path = "artists"
lyric_path = "artists/%s/lyrics"
lyric_orig_path = "artists/%s/lyrics_orig"
song_path = "artists/%s/lyrics/%s"
lyric_song_orig_path = "artists/%s/lyrics_orig/%s"

artists = os.listdir(artist_path)
for artist in artists:
  artist_lyrics_path = lyric_path % artist
  artist_lyric_orig_path = lyric_orig_path % artist
  if not os.path.exists(artist_lyric_orig_path): #create the backup folder
    os.makedirs(artist_lyric_orig_path)
  for f in os.listdir(artist_lyrics_path):
    orig_song_path = song_path % (artist, f)
    backup_song_path = lyric_song_orig_path % (artist, f)
    shutil.copy2(orig_song_path, backup_song_path) #copy2 keeps the file metadata too

Cool, now I feel comfortable destroying some of the files in the lyrics folders since I know I have a backup.

Step 3 — Removing Duplicates

This is the meat of what I’m trying to do here, so listen up. In order to get accurate information on who sings about trucks, we need to make sure that we don’t have any duplicate song lyrics, so lyrics don’t get double counted.

I’m pretty happy with the solution I came up with, but I also want to point out that I didn’t get there on my first attempt. This is real world data, and finding an angle of attack doesn’t just come on the first try. So I’m going to outline my failures first, before showing the code I came up with that actually worked.

Attempt 1 — Title Rules

Most of the duplicate songs I see are the same song, just recorded on a different album: a song released as a single (Raise Your Bottle and Raise Your Bottle (Single)), a live version of a song (Who Knows and Who knows – live from bonnaroo), a title that also credits the featured artists (Texas Boys and Texas Boys with Pat Green & Josh Abbott), or just different spellings of the name (Chattahoochee and Chattahoochie). Man, some of those song names are pretty brutal.

Anyway, my first thought for removing those songs was to look for keywords like “(Single)” or “Live” in the titles, figuring I should be able to pick off the obvious re-releases that way.
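A hypothetical version of that check might look like the snippet below. This is just a sketch of the idea, not code I actually kept:

def title_looks_like_duplicate(title):
  #naive rule-based check for obvious re-releases, based on keywords in the title
  release_keywords = ["(single)", "live"]
  lowered = title.lower()
  return any(keyword in lowered for keyword in release_keywords)

And of course “live” happily matches any title that just happens to contain that word, which is a preview of why this went nowhere.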

In general, rule based learning is tough to get right because you need deep knowledge of your data, and you’ll quickly get overwhelmed by the number of rules you need as well as the number of one-off cases that present themselves. It’s best to avoid this. Remember above when I used rules to remove the Unicode oddities? Yeah, that was no fun, but the difference there is that there’s a limited number of Unicode characters, and each one has a correct replacement.

Attempt 2 — Beginning of Song Titles

If you look above, many of the duplicate songs have the same title as the “original” song, but with an extra phrase tacked on to the end, phrases like (Single) or with Pat Green & Josh Abbott. So I figured I might be able to grab the duplicates by comparing the title of a song to the other titles, and if any of the other titles starts with the full title of the song being compared, we’d have a duplicate. It worked alright, but I saw I was missing too many songs. That Chattahoochie/y song wouldn’t be caught because it’s a different spelling, and even Who Knows wouldn’t be caught because the ‘K’ in ‘Knows’ is lowercase for some reason on Genius for the live version of the song. I could have just lowercased all the letters in the titles to catch that, but it seemed inelegant, like forcing a method that wasn’t ideal.
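For reference, that prefix comparison was roughly along these lines (again a sketch, not the exact code I ran):

def titles_to_remove_by_prefix(titles):
  #flag any title that starts with another, shorter title, e.g.
  #"Texas Boys with Pat Green & Josh Abbott" starts with "Texas Boys"
  to_remove = set()
  for title in titles:
    for other in titles:
      if other != title and other.startswith(title):
        to_remove.add(other) #the longer title is assumed to be the duplicate
  return to_remove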

There are just too many different reasons beyond what I listed above for duplicate titles, so there isn’t a simple way of going through and writing rules for removing those songs. Also, no way I’m going to go through all of that by hand.

Attempt 3 — Lyric Matching

I didn’t want to initially, but after failing at everything having to do with titles, I finally succumbed to the call of the lyrics and used those to remove the duplicate songs.

Here’s what I did. For each song, I read in the lyrics, remove the newlines, make all the letters lowercase, split the text into words, and then put those into a set. Those are the different tokens in the song. Once that’s done, I loop through all the different song word sets and use difflib’s SequenceMatcher to compare the similarity of the words in each pair of songs. The SequenceMatcher gives me a ratio for each comparison. If the ratio is greater than 0.5, I consider that a match, use some logic to pick which of the two titles is up for deletion (based on the length of the title), and save the path of that song for later deletion!

Quick note on the 0.5 cutoff. Because of the nature of data scraped from the internet, I can’t just assume that the sets of words in the lyrics will be identical for duplicate songs. So once I had the measurement I wanted, I played around with the cutoff and looked at the matches it returned, and 0.5 seemed like a good one. The ratios of the matches were pretty much all either above 0.7 or under 0.3, with a nice chasm between the two. If those ratios had been spread out continuously, finding a cutoff would have been difficult, because we want perfect removal of the duplicate songs, and in that case I’d have needed a different way to measure the similarity of the lyrics.

Here’s the code.

import os
from difflib import SequenceMatcher as sm

artist_path = "artists"
lyric_path = "artists/%s/lyrics"
song_path = "artists/%s/lyrics/%s"

artists = os.listdir(artist_path)
for artist in artists:
  artist_lyrics_path = lyric_path % artist

  paths_for_removal = []
  songs = []
  for title in os.listdir(artist_lyrics_path):
    artist_lyrics_song_path = song_path % (artist, title)
    words = open(artist_lyrics_song_path).read()
    words = list(set(words.strip().replace('\n', ' ').lower().split()))
    songs.append((title, words))
  for index, (title, word_list) in enumerate(songs):
    #compare against every song after this one so each pair is only checked once
    for compare_title, compare_word_list in songs[index+1:]:
      sm_instance = sm(None, word_list, compare_word_list)
      ratio = float(sm_instance.ratio())
      if ratio > 0.5 and ratio < 1.0:
        print "%s : %s : %s" % (str(ratio), title, compare_title)
        if len(title) == len(compare_title):
          title_to_remove = compare_title if title < compare_title else title
        elif len(title) < len(compare_title):
          title_to_remove = compare_title
        else:
          title_to_remove = title
        path_for_removal = song_path % (artist, title_to_remove)
        paths_for_removal.append(path_for_removal)
  print set(paths_for_removal)

  for path in list(set(paths_for_removal)):
    os.remove(path)

I didn’t realize it at the time, but there was one duplicate song that didn’t even have remotely the same title. Barbed Wire Halo and Philippians 3:12-14 by Aaron Watson were the same song, but I would have had no idea if I had just used the titles to remove duplicates. Pretty happy that song was found. And with that, I now have a directory of lyrics that I’m confident contains only one copy of each song.

Step 4 — Trucks!

Now for the main event of this post: which country artist talks about trucks the most? Well, I guess the main event was really dealing with the duplicate songs, but this is the payoff.

Here’s the code for finding the average number of truck mentions per song that a singer has in their song arsenal.

import os

artist_path = "artists"
lyric_path = "artists/%s/lyrics"
song_path = "artists/%s/lyrics/%s"

keyword = "truck"

artists = os.listdir(artist_path)
artist_counts = []
for artist in artists:
  counts = {keyword: 0}
  artist_lyrics_path = lyric_path % artist
  song_titles = os.listdir(artist_lyrics_path)
  num_songs = len(song_titles)
  for song in song_titles:
    artist_lyrics_song_path = song_path % (artist, song)
    words = [word.lower() for line in open(artist_lyrics_song_path, 'r') for word in line.split()]
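    #note this only counts exact whole-word matches, so "trucks" or "truck," with a comma attached won't add to the tally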
    for key in counts.keys():
      for word in words:
        if word.lower() == key:
          counts[key] += 1
  artist_counts.append((artist, counts[keyword]/float(num_songs)))
for artist, val in sorted(artist_counts, key=lambda x: x[1], reverse=True):
  full_artist_name = ' '.join(artist.split('_')).title()
  print "%s: %s" % (full_artist_name, val)

Change the keyword from ‘truck’ to anything you’re trying to look at, and this snippet will spit out the average number of references to that keyword the artist has in their song library!
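And since counts is already a dict, it’s a small tweak to tally a few keywords in one pass, say ‘truck’, ‘beer’, and ‘love’. Here’s a sketch of that variation (not exactly the script I used for the numbers in this post):

import os

artist_path = "artists"
lyric_path = "artists/%s/lyrics"
song_path = "artists/%s/lyrics/%s"

keywords = ["truck", "beer", "love"]

artists = os.listdir(artist_path)
artist_counts = []
for artist in artists:
  counts = dict((keyword, 0) for keyword in keywords)
  artist_lyrics_path = lyric_path % artist
  song_titles = os.listdir(artist_lyrics_path)
  num_songs = len(song_titles)
  for song in song_titles:
    artist_lyrics_song_path = song_path % (artist, song)
    words = [word.lower() for line in open(artist_lyrics_song_path, 'r') for word in line.split()]
    for keyword in keywords:
      counts[keyword] += words.count(keyword) #same exact-match counting as above
  per_song = dict((keyword, counts[keyword] / float(num_songs)) for keyword in keywords)
  artist_counts.append((artist, per_song))

#sort by truck mentions per song, but print all the averages
for artist, per_song in sorted(artist_counts, key=lambda x: x[1]["truck"], reverse=True):
  full_artist_name = ' '.join(artist.split('_')).title()
  print "%s: %s" % (full_artist_name, per_song)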

Without waiting any longer, here’s the list of trucks per song for the artists I have in my file:

Sam Hunt: 0.619047619048
Cole Swindell: 0.470588235294
Thomas Rhett: 0.46875
Lee Brice: 0.45652173913
Brantley Gilbert: 0.36
Jason Aldean: 0.266666666667
Luke Bryan: 0.243697478992
Justin Moore: 0.2
Florida Georgia Line: 0.171875
Jake Owen: 0.159420289855
Aaron Watson: 0.153153153153
Jon Pardi: 0.137931034483
Kip Moore: 0.135135135135
Keith Urban: 0.133333333333
Billy Currington: 0.130434782609
Randy Houser: 0.130434782609
Chris Young: 0.123076923077
Toby Keith: 0.115241635688
Tim Mcgraw: 0.113636363636
Dierks Bentley: 0.110169491525
Eric Church: 0.106666666667
Thompson Square: 0.103448275862
Eli Young Band: 0.0933333333333
Joe Nichols: 0.0877192982456
Garth Brooks: 0.0862068965517
Blake Shelton: 0.0857142857143
Trace Adkins: 0.0855263157895
Kellie Pickler: 0.08
Josh Turner: 0.078125
Alan Jackson: 0.0669144981413
Kenny Chesney: 0.0536585365854
Hunter Hayes: 0.0526315789474
Zac Brown Band: 0.0519480519481
Gary Allan: 0.0353982300885
Little Big Town: 0.0333333333333
Brad Paisley: 0.031746031746
George Strait: 0.0303867403315
Miranda Lambert: 0.0288461538462
Chris Stapleton: 0.0212765957447
Randy Travis: 0.0150943396226
Clint Black: 0.0141843971631
Reba Mcentire: 0.0139664804469
Shania Twain: 0.0119047619048
Brett Eldredge: 0.0
Brett Young: 0.0
Carrie Underwood: 0.0
Darius Rucker: 0.0
Jennifer Nettles: 0.0
Kacey Musgraves: 0.0
Kalie Shorr: 0.0
Lady Antebellum: 0.0
Maren Morris: 0.0
Martina Mcbride: 0.0
The Band Perry: 0.0

Big props to Sam Hunt for winning the award for most trucks per song (TPS)! He was aided by the chorus of his song “Speakers”, which has two truck mentions in the chorus alone, meaning 6 trucks in that song.

Also interesting: it doesn’t seem like the women artists sing about trucks all that much. Kellie Pickler wins the award for most trucks per song among the women, at a measly 0.08 TPS. I went through those mentions too, and of the 5 songs where she mentions trucks, only one is about how she herself likes trucks; the others talk about the men in her life who own trucks.

And cause I know you want to know who sings about beer the most, Cole Swindell crushes the competition with a comical 0.94 mentions per song, a full 0.3 mentions more than the second place singer, Kip Moore. On a hunch, I tried ‘love’ as the keyword, and Cole Swindell came in second to last with only 0.559 mentions of love per song (Brett Young had 2.75 loves per song for reference). So I guess the moral of this article is Cole Swindell loves trucks and beer, but really hates love.

Tune in next time for the final article in this country series, classifying the songs according to their subjects, and my theory of the 4 subjects that a country song can be about!
