Category Archives: Scraping

NPR Sunday Puzzle Solving, And Other Baby Name Questions

If you have a long drive and no bluetooth or aux cord to listen to podcasts, NPR is easily the best alternative. Truck drivers agree with this statement no matter their overall views. For me, this was the case when driving home to Milwaukee from Ann Arbor where I went to a college friend’s wedding.

While driving back I listened to NPR and heard Weekend Edition Sunday and their Sunday Puzzle pop up. If you haven’t heard of it before, at the end of every week’s episode they state a puzzle. Throughout the next week, listeners can submit their answers, and one random correct submitter is chosen to be recorded doing a mini puzzle on air.

The puzzle they stated for the week after the wedding was as follows:

Think of a familiar 6-letter boy’s name starting with a vowel. Change the first letter to a consonant to get another familiar boy’s name. Then change the first letter to another consonant to get another familiar boy’s name. What names are these?

They’ve already released the show for this question (I didn’t win of course) so I figure I can write about how I found out the answer!

Solving The Name Question

The first step, as always for these types of posts, is gathering the required list of familiar boys’ names. Searching Google for lists shows that there are a ton of sites that exist mostly to SEO themselves for the money. When scraping, you should poke around and choose the page that has the correct data and is also the simplest to gather from. I went with this one.

Since there’s only one page with the data, there’s no need to use the requests library to scrape multiple pages. Saving the page as an HTML file from the browser into the folder you’re programming in is the easiest way to get the data.

The scraping code itself is pretty simple.

from bs4 import BeautifulSoup

filename = 'boy_names.html'
vowels = ('A', 'E', 'I', 'O', 'U')

vowel_starters = []
consonant_starters = []

# Read the saved page from disk.
with open(filename, 'r') as file:
    page = file.read()

# Each name sits in an <li class="p1"> element; keep only the 6-letter ones
# and sort them by whether they start with a vowel or a consonant.
html = BeautifulSoup(page.replace('\n', ''), 'html.parser')
for name_link in html.find_all("li", class_="p1"):
    name = name_link.text.strip()
    if len(name) == 6:
        if name[0] in vowels:
            vowel_starters.append(name)
        else:
            consonant_starters.append(name)

# For each vowel name, collect the consonant names that differ only in the first letter.
for vname in vowel_starters:
    cname_same = []
    for cname in consonant_starters:
        if vname[1:] == cname[1:]:
            cname_same.append(cname)
    if cname_same:
        print(vname)
        for match in cname_same:
            print(match)

And the results are…

Austin, Justin, Dustin

Justin and Dustin rhyme, which makes it easier to realize that they match, but Austin isn’t exactly on the same page. If I didn’t have the code, there’s zero chance I’d have gotten this correct.

That’s it, right? Nope. Since I had all the code, I figured I should check whether there’s a match for girls’ names under the same rules. All there was to do was save the popular girl names page to the same folder, change the filename to ‘girl_names.html’, and run the code, which gives Ariana and Briana. A and B are the starting letters, and if Criana were a popular name (at this moment), we’d be good to go for the full three-name answer.
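Rather than editing the filename by hand for each list, the matching step could be wrapped in a function that takes any list of names. This is only a rough sketch: the function name find_name_families and the sample lists are made up for illustration (the lists are not the scraped data), and it groups names by everything after the first letter instead of using the nested loops above.

```python
from collections import defaultdict

VOWELS = set("AEIOU")


def find_name_families(names, length=6):
    """Group names of the given length that differ only in their first letter.

    Hypothetical helper: returns one list per group that contains at least
    one vowel-starting name and at least one consonant-starting name.
    """
    groups = defaultdict(list)
    for name in names:
        name = name.strip()
        if len(name) == length:
            # Key on everything after the first letter, ignoring case.
            groups[name[1:].lower()].append(name)

    families = []
    for members in groups.values():
        vowel_names = [n for n in members if n[0].upper() in VOWELS]
        consonant_names = [n for n in members if n[0].upper() not in VOWELS]
        if vowel_names and consonant_names:
            families.append(vowel_names + consonant_names)
    return families


# Illustrative sample input only, not the real scraped lists.
boys = ["Austin", "Justin", "Dustin", "Oliver", "Mason"]
girls = ["Ariana", "Briana", "Sophia"]

print(find_name_families(boys))
print(find_name_families(girls))
```

With the real data, the lists passed in would come from the same BeautifulSoup loop, run once per saved HTML file.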

By going through this part, I came up with some other fun questions that could be answered with this list of names, and the rest of the post is about those.


General Tips for Web Scraping with Python

The great majority of the machine learning and data analysis projects I write about here on Bigish-Data have an initial step of scraping data from websites. I also get a bunch of contact emails asking me either to share the data I’ve scraped or to help with getting the code to work for them. Because of that, I figured I should write something here about the process of web scraping!

There are plenty of other things to talk about when scraping, such as specifics on how to grab the data from a particular site, which Python libraries to use and how to use them, how to write code that would scrape the data in a daily job, where exactly to look as to how to get the data from random sites, etc. But since there are tons of other specific tutorials online, I’m going to talk about overall thoughts on how to scrape. There are three parts of this post – How to grab the data, how to save the data, and how to be nice.

As is the case with everything, programming-wise, if you’re looking to learn scraping, you can’t just read tutorials and think to yourself that you know how to program. Pick a project, practice grabbing the data, and then write a blog post about what you learned.

There are definitely tons of different thoughts on scraping, but these are the ones I’ve learned from doing it for a while. If you have questions or comments, or want to call me out, feel free to comment or get in contact!

Grabbing the Data

The first step for scraping data from websites is to figure out where the sites keep their data, and what method they use to display the data on the browser. For this part of your project, I’ll suggest writing in a file named gather.py, which should perform all these tasks.
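One quick way to check how a site displays its data is to take a value you can see in the browser and test whether it appears in the visible text of the raw HTML. If it doesn’t, the page is probably filling it in with JavaScript. The sketch below is just one possible starting point for gather.py, using only the standard library; the helper name value_in_static_html and the two sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text that sits outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def value_in_static_html(raw_html, sample_value):
    """Hypothetical helper: True if sample_value is in the page's visible text."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return sample_value in " ".join(parser.chunks)


# Made-up pages: one with the data in the markup, one that builds it in JavaScript.
static_page = "<ul><li class='p1'>Austin</li></ul>"
scripted_page = "<div id='app'></div><script>var names = ['Austin'];</script>"

print(value_in_static_html(static_page, "Austin"))
print(value_in_static_html(scripted_page, "Austin"))
```

If the check fails on a real page, the data is likely coming from a separate request that the browser makes after loading, which is worth tracking down before writing any parsing code.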
