Gather all the PGA Tour stats

As someone who likes writing and investigating data sets, and as a huge fan of golf (and writer of a golf blog, Golf on the Mind), when I realized that the PGA Tour website has a crap ton of stats about players on the PGA Tour going back to the early 80s, I figured there was definitely some investigating to do. The first step, as with any data analysis, is getting the data into a standard and usable form. In reality, you'll find that this effort takes up most of your time in any project like this.

So before I can start looking at anything interesting, I need to do some scraping. This article will take you through that process.

UPDATE — 5/25/18 — I get way too many questions about whether the data is available, so I went back through, updated the code, and am currently scraping it every week. I'm not going to post the link here, but shoot me an email and I can share the links.

Step 1 — Downloading the HTML files

The usual process for scraping is to grab an html page, extract the data from that page, store that data, and repeat for however many web pages have the data you want. In this case, however, I wanted to do something different. Instead of grabbing the data from a web page and storing the extracted data right away, I wanted to store the html file itself as a first step, and only afterwards deal with getting the info out of that page.

The reasoning behind this was to avoid unnecessarily hitting pgatour.com's servers. Undoubtedly, when scraping, you'll run into errors in your code: missing data, oddly formatted data you didn't account for, or any other random errors that can happen. When that happens and you have to go back and grab the web page again, you're both wasting time by grabbing the same file over the internet and using up resources on the server's end. Neither is a good result.

So in my case here, I wrote code to download and save the html once, and then I can extract data as I please from those files without going over the internet. Now for the specifics.

On pgatour.com, the stats base page is located at pgatour.com/stats.html. This lands you on the overview page, but you'll notice at the top there are eight categories of stats: Off the Tee, Approach the Green, Around the Green, Putting, Scoring, Streaks, Money/Finishes, and Points/Rankings. Under each of these categories is a list of stats in a table. Clicking on any of those links gets you the current year's stats for all the qualifying players, and on the side you'll notice a dropdown where you can select the year you want the stat for. Our goal is to get the pages for each of those stats, for every year offered, and save each page in a directory named for the stat, with the year as the filename.

The url pattern when you're on a single stat is straightforward. For example, the url for current Driving Distance is http://www.pgatour.com/stats/stat.101.html, and the url for Driving Distance in 2015 is http://www.pgatour.com/stats/stat.101.2015.html. Simply injecting the year into the url after the stat id will get you what you need.
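In code, that's just a string stub with the stat id and year filled in (the same stub the full script below uses):

url_stub = "http://www.pgatour.com/stats/stat.%s.%s.html" #stat id, year
print url_stub % ('101', 2015) #http://www.pgatour.com/stats/stat.101.2015.html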

In order to get the different stats from each category page, we're going to loop over the categories, yank out the url and name for each stat, grab the current page for that stat, see which years the stat is offered for, generate the required urls, and loop over those urls saving each page. Reading the code should make this clearer.

The last issue with grabbing the html pages is how long it takes. In the end, we're talking about over 100 stats, with about 15-20 years of history each. At first, I wanted to play nice and not overwhelm the pgatour.com servers, but then I realized that pgatour.com can probably handle the load, since they need to deal with the constant refreshing people do when checking leaderboards at the end of a tournament. Thankfully, python's gevent library allows us to easily grab and save pages in parallel. After all that explanation, take a look at the code I used to save the files.

from gevent import monkey
monkey.patch_all() #patch blocking IO first so the downloads can actually run in parallel
import os
import urllib
import gevent
import requests
from bs4 import BeautifulSoup

url_stub = "http://www.pgatour.com/stats/stat.%s.%s.html" #stat id, year
category_url_stub = 'http://www.pgatour.com/stats/categories.%s.html'
category_labels = ['RPTS_INQ', 'ROTT_INQ', 'RAPP_INQ', 'RARG_INQ', 'RPUT_INQ', 'RSCR_INQ', 'RSTR_INQ', 'RMNY_INQ']
pga_tour_base_url = "http://www.pgatour.com"

def gather_pages(url, filename):
  print filename
  urllib.urlretrieve(url, filename)

def gather_html():
  stat_ids = []
  for category in category_labels:
    category_url = category_url_stub % category
    page = requests.get(category_url)
    html = BeautifulSoup(page.text.replace('\n', ''), 'html.parser')
    for table in html.find_all("div", class_="table-content"):
      for link in table.find_all("a"):
        stat_ids.append(link['href'].split('.')[1])
  starting_year = 2015 #grab this year's page for each stat to see which years it has info for
  for stat_id in stat_ids:
    url = url_stub % (stat_id, starting_year)
    page = requests.get(url)
    html = BeautifulSoup(page.text.replace('\n', ''), 'html.parser')
    stat = html.find("div", class_="parsys mainParsys").find('h3').text
    print stat
    directory = "stats_html/%s" % stat.replace('/', ' ') #replace '/' so it isn't treated as a path separator
    if not os.path.exists(directory):
      os.makedirs(directory)
    years = []
    for option in html.find("select", class_="statistics-details-select").find_all("option"):
      year = option['value']
      if year not in years:
        years.append(year)
    url_filenames = []
    for year in years:
      url = url_stub % (stat_id, year)
      filename = "%s/%s.html" % (directory, year)
      if not os.path.isfile(filename): #skip pages we've already downloaded
        url_filenames.append((url, filename))
    jobs = [gevent.spawn(gather_pages, pair[0], pair[1]) for pair in url_filenames] #spawn after the year loop so each stat's pages download in parallel
    gevent.joinall(jobs)

if __name__ == '__main__':
  gather_html()

Step 2 — Convert HTML to CSV

Now that I have the html files for every stat, I want to get the info from the tables in the html into a consumable csv format. Luckily, the html is very nicely formatted, so I can actually use the info. I saved all the html files in a directory called stats_html, and I want to mirror that folder structure in a top level directory I'm calling stats_csv.

The steps in this task are to 1) read in the files, 2) using Beautiful Soup, extract the headers for the table and then all of the data rows, and 3) write that info to a csv file. I'll go right to the code, since that's easiest to understand.


import os
import csv
from bs4 import BeautifulSoup

for folder in os.listdir("stats_html"):
  path = "stats_html/%s" % folder
  if os.path.isdir(path):
    for file in os.listdir(path):
      if file[0] == '.':
        continue #.DS_Store issues
      csv_lines = []
      file_path = path + "/" + file
      csv_dir = "stats_csv/" + folder
      if not os.path.exists(csv_dir):
        os.makedirs(csv_dir)
      csv_file_path = csv_dir + "/" + file.split('.')[0] + '.csv'
      print csv_file_path
      if os.path.isfile(csv_file_path): #skip if we've already done the conversion
        continue
      with open(file_path, 'r') as ff:
        f = ff.read()
        html = BeautifulSoup(f.replace('\n', ''), 'html.parser')
        table = html.find('table', class_='table-styled')
        headings = [t.text for t in table.find('thead').find_all('th')]
        csv_lines.append(headings)
        for tr in table.find('tbody').find_all('tr'):
          info = [td.text.replace(u'\xa0', u' ').strip() for td in tr.find_all('td')]
          csv_lines.append(info) #append inside the loop so every row makes it into the csv
        #write the array to csv
        with open(csv_file_path, 'wb') as csvfile:
          writer = csv.writer(csvfile, delimiter=',')
          for row in csv_lines:
            writer.writerow(row)

And that's it for the scraping! There are still a couple issues to deal with before you can actually use the data, but those issues depend on what you're trying to find out. The big example is extracting the important piece of info from each csv. Some of the stats are percentages, others are distances measured in yards, and some are measured in feet and inches (23'3" for example). Another issue is that the desired stat can sit in a different column of the csv file depending on the year the stat is from. But like I said, those issues aren't for an article on data scraping; we'll have to deal with them when looking at the data later.
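For a taste of that cleanup, here's a minimal sketch of converting those feet / inches strings into plain inches. The helper name and the assumed format are mine, not part of the scraper above:

def feet_inches_to_inches(value):
  #hypothetical helper: u'23\'3"' -> 279.0, assuming the format is always feet, apostrophe, inches, quote mark
  feet, inches = value.strip().strip('"').split("'")
  return float(feet) * 12 + float(inches)

print feet_inches_to_inches(u'23\'3"') #279.0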

What’s the Average Age of a Nobel Prize Winner?

tl;dr — Average age of a Nobel Prize winner is 59.14 years old.

There was a comment on HN the other day wondering about the average age of Nobel Prize winners. I did a quick search for lists of winners, and the Nobel Prize org's website actually has a page listing winners and their ages. The data's tucked into the html, but I figured that with some scraping and a little numerical work, I could do a quick analysis.

The first thing I did was download that html file, and store it locally. Sure I’m scraping data from an html file, but there’s no reason for me to hit the server every time I’m testing / adjusting my script. It’s important to realize that even though I’m dealing with a web page, I don’t have to actually use the internet to do the analysis. Downloading the page simplifies things on my end by not having to use the requests library, and also saves a few server hits on the other end.

The other thing it allows me to do is modify the html and put an id on a div tag, which helps me locate the data I want. After looking through the page, the div that contained all the data about the winners and their ages didn't have a class, id, or anything else identifiable. It was literally just a div tag, and when you're trying to automate data collection from a DOM, classes and ids are key. But since I downloaded the page, I was able to put an id on the div I needed to grab, and didn't have to mess around with maneuvering to it using parent tags.

The relevant info for each of the winners was structured pretty well within that div. In order to organize the information, I created a class for each of the prize winners, and input the data by looping through the html.

import unicodedata
from bs4 import BeautifulSoup

class Prize:
  def __init__(self, name, age, year, prize_type):
    self.name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore') #umlaut issues
    self.age = age
    self.year = year
    self.prize_type = prize_type

  def __str__(self):
    return self.name + ' won ' + str(self.prize_type) + ' at age ' + str(self.age) + ' in ' + str(self.year)

f = open('nobel_laureates_by_age.html', 'r')
html = BeautifulSoup(f.read(), 'html.parser')

winners = []
prize_types = set()
nobel_prize_string = "The Nobel Prize in "
for tag in html.find("div", id="nobel-age-info").children:
  #the winners are grouped under h3 age headers, so track the current age as we walk the children
  if tag.name == None:
    continue
  elif tag.name == 'h3':
    current_age = int(tag.text.split(" ")[-1]) #update the age
  elif tag.name == 'div':
    name = tag.find("h6").text #winner's name
    description = tag.find_all("p")[0].find("a").text #e.g. "The Nobel Prize in Physics 1915"
    year = int(description.split(' ')[-1])
    prize_type = ' '.join(description.split(' ')[0:-1])
    prize_types.add(prize_type)
    prize = Prize(name, current_age, year, prize_type)
    winners.append(prize)

From here, we want to get an average and a visualization of the ages of the winners for each prize.

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from scipy.stats import norm

all_prize_string = "All Prizes"
ts = list(prize_types)
ts.append(all_prize_string) #want to get all prizes too

print "Type, Number of Winners, Mean Age, Std Dev of Ages"
for prize_type in ts:
  ages = [p.age for p in winners if p.prize_type == prize_type or prize_type == all_prize_string]
  num_bins = ages[-1] - ages[0] #ages are already sorted, since the page lists winners by age
  fig = plt.figure()
  n, bins, patches = plt.hist(ages, num_bins, normed=1, facecolor='green', alpha=0.2)
  mean, std = norm.fit(ages) #norm.fit returns the mean and standard deviation, not the variance
  y = mlab.normpdf(bins, mean, std)
  plt.plot(bins, y, 'r--')
  plt.ylabel('Number of Winners')
  plt.xlabel('Age')
  plt.title(prize_type + '. Mean: ' + str(round(mean, 2)) + ', Std: ' + str(round(std, 2)))
  fig.savefig('nobel_hist_' + prize_type.lower().replace(' ', '_') + '.png', dpi=500, format='png')
  print prize_type + ', ' + str(len(ages)) + ', ' + str(round(mean, 2)) + ', ' + str(round(std, 2))

The code above prints out a little csv table for each of the prize types, and creates a histogram with a fitted normal distribution for each, as well as for everyone's ages regardless of prize type.

Somewhat grainy images of the fits are below

(Histograms: nobel_hist_all_prizes, nobel_hist_the_prize_in_economic_sciences, nobel_hist_the_nobel_prize_in_physiology_or_medicine, nobel_hist_the_nobel_prize_in_physics, nobel_hist_the_nobel_prize_in_chemistry, nobel_hist_the_nobel_prize_in__literature, nobel_hist_the_nobel_peace_prize)

Some Thoughts

The overall age distribution is impressively normal. The couple of outliers on the younger side are the 2014 Peace Prize winner Malala Yousafzai, and the 1915 Physics winner William Lawrence Bragg, who won jointly with his father for work with X-rays. Besides those winners, the rest seem pretty centered around the 60 year old mark.

There's a funny dip in the graph for the prize in Literature right around the mean: only one winner between the ages of 64 and 66. Funny, because the mean for that award is about 65.

Youngest winners for each:

Chemistry: 35
Literature: 42
Peace: 17
Physiology or Medicine: 32
Economics: 51
Physics: 25

Oldest winners for each:

Chemistry: 85
Literature: 88
Peace: 87
Physiology or Medicine: 87
Economics: 90
Physics: 88

Oldest winners seem to be around the same age, while the youngest winners differ by prize type. Kind of interesting, given that the prize for Economics wasn't started by Nobel in 1895 like the others, but rather in 1969. (Check out the wikipedia entry here.) The smaller number of winners could explain the youngest-winner outlier: once the award has been around longer, you'd expect someone younger than 51 to win. Using the fitted distribution, we can actually estimate the probability that a winner is younger than 51: about 2.5%.
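That estimate is just the normal CDF at age 51; a one-liner, with placeholder numbers standing in for the fitted Economics mean and standard deviation printed above:

from scipy.stats import norm

econ_mean, econ_std = 67.0, 8.0 #placeholders: plug in the fitted values printed for Economics
print norm.cdf(51, econ_mean, econ_std) #probability that a winner is younger than 51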

Another explanation that I’ve heard before is that sometimes prizes are won for contributions over time. They want to recognize a person for their contributions over their careers, but not necessarily their research in their winning year. That could easily push the average age up. Obviously the Nobel Foundation would refute that, but who knows.

Possible Continuations

NLP on the descriptions — Most of the winners have a little sentence below their name that talks about what they did to deserve the prize. Some processing on that text might be interesting, like seeing what the popular keywords are.

Deal with multiple people sharing the prize — The reason there are over 800 winners of the 6 prizes is that people share the prize. The links on the page go to a fuller description of the prize winner(s). For shared prizes, I might want to take the average age of the winners and only use that. I could also analyze how often the prize is shared as opposed to won outright. Maybe the percentage of shared awards has changed over time?

Check out the gist here. It requires that you download the html file and add the id to the div as mentioned above, and that you have the required libraries installed with pip.

Comments? Want further analysis? Want to yell at me for bad analysis? Let me know on twitter.

Weekly Fantasy Golf Results and a System of Linear Equations

I had an idea recently about how to use the "wisdom of the crowds" to forecast performance in weekly fantasy golf. I see the benefits of crowdsourcing all the time working with prediction markets at Inkling Markets, so I figured I could leverage that to get an edge when choosing my lineups. I'll get into the details of how I'm doing that in a later post, since that's worthy of its own, but I found a subproblem that I'd have to solve first to make it work.

While some of the sites offer player salaries in an easily digestible csv format, they're a little stingy when it comes to results. They only keep contest results for the last 30 days, and the only downloadable results they offer are from the past week's contests, which just show point totals for an entire lineup, not for individual golfers. And I need the individual players' points both for testing my forecasting algorithm and for making the actual forecasts. One way to get them would be to scrape the hole by hole data from pgatour.com and apply the site's scoring algorithm to each hole, but I've found a better, simpler, and much more elegant solution.

I can model the results csv file as a system of linear equations: each lineup is an equation, and each golfer's point total is an unknown. By converting the users' lineups into a coefficient matrix and the lineup totals into a vector, numpy can solve for the point total each golfer earned during the tournament.

Here’s an example row from that file, with the person’s username hidden.

"Rank","EntryId","EntryName","TimeRemaining","Points","Lineup"
263,85826104, "UNAME (1/100)", 0, 430.50, "(G) Ryan Palmer ,(G) Pat Perez ,(G) Kevin Kisner ,(G) Ben Martin ,(G) Shawn Stefani ,(G) John Peterson "

In order to solve, we need to create two data structures. The first is a num_lineups by num_players matrix of coefficients, where the value is 1 if the player was used in that lineup and 0 if he wasn't. The second is an array of the total points scored by each lineup.

The idea is that if we had an array of the points each player scored over the course of the tournament, taking the dot product of that array with the corresponding row of the coefficient matrix should reproduce the point total of that lineup.
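To make that concrete, here's a toy example with made-up numbers: three golfers and two lineups. We observe the coefficient matrix and the lineup totals; the golfer points are what we solve for.

import numpy as np

golfer_points = np.array([100, 50, 30]) #hypothetical point totals for three golfers
lineups = np.array([[1, 1, 0],  #lineup 1 used golfers 0 and 1
                    [0, 1, 1]]) #lineup 2 used golfers 1 and 2
print lineups.dot(golfer_points) #[150 80], the lineup totals the results csv reports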

Here’s the code to generate the coefficient matrix and the point array:

import csv

points_label = "Points"
lineup_label = "Lineup"

def lineup_names(lineup_field):
  #"(G) Ryan Palmer ,(G) Pat Perez ,..." -> ['Ryan Palmer', 'Pat Perez', ...]
  return [name.strip() for name in lineup_field.replace('(G)', '').split(',')]

player_list = [] #first pass: collect every golfer that appears in any lineup
with open('outcome.csv', 'rb') as csvfile:
  rows = csv.reader(csvfile)
  headers = rows.next()
  lineup_index = headers.index(lineup_label)
  for row in rows:
    for name in lineup_names(row[lineup_index]):
      if name not in player_list:
        player_list.append(name)
player_count = len(player_list)

player_coefficients = [] #second pass: one coefficient row and one point total per lineup
lineup_points = []
with open('outcome.csv', 'rb') as csvfile:
  rows = csv.reader(csvfile)
  headers = rows.next()
  points_index = headers.index(points_label)
  lineup_index = headers.index(lineup_label)
  for row in rows:
    lineup_points.append(float(row[points_index]))
    lineup_players = [0] * player_count
    for name in lineup_names(row[lineup_index]):
      lineup_players[player_list.index(name)] = 1
    player_coefficients.append(lineup_players) #append after marking all the lineup's golfers

From here, all we need to do is convert those arrays into numpy arrays, and run numpy’s linear algebra least squares algorithm to get the solution array!

import numpy as np

coefficient_matrix = np.array(player_coefficients)
point_array = np.array(lineup_points).transpose()
solution = np.linalg.lstsq(coefficient_matrix, point_array)
player_points = list(solution[0])

Initially, I tried to use numpy's solve routine, but looking at the docs I realized that solve only deals with square coefficient matrices (something I still remember from all those math classes). The lstsq method is the one for getting approximate, least squares solutions from rectangular matrices, as is the case here.

Printing out the solution yields the following for the top and bottom 10:

Chris Kirk: 126.0
Jason Bohn: 105.5
Brandt Snedeker: 102.0
Jordan Spieth: 101.5
Kevin Kisner: 99.0
George McNeill: 96.5
Pat Perez: 95.0
Adam Hadwin: 92.0
Ian Poulter: 87.5
Brian Harman: 87.0
...
Kenny Perry: 18.0
Jason Kokrak: 17.5
Jonas Blixt: 16.5
Corey Pavin: 16.0
Bo Van Pelt: 15.0
David Toms: 14.0
Brian Davis: 11.5
Tom Watson: 4.61852778244e-13
Tom Purtzer: 3.87245790989e-13
Scott Stallings: 2.84217094304e-13

Chris Kirk won, so him having the highest point total makes sense, and the three guys at the bottom all withdrew, so they should be at 0 points (those 1e-13 values are just floating point noise for zero). Unfortunate for the guys who didn't take them out of their lineups, but they gotta pay a little more attention!

The only issue with the final result is that I end up with point totals for just 117 players, when 133 teed it up at the beginning of the tournament. That means we're missing point totals for some of the guys. That being said, I'm going to assume those players weren't picked because they weren't likely to play well, so hopefully those 117 offer a good representation of the point totals. It could also easily be that draftkings only listed 117 players to choose from in this case. I'll need to investigate this week.

In the end, it took about 3 hours to write the code and write this post. It's fun little problems like this that remind you that programming is fun and has value. Being able to create a technical solution to something you're interested in is probably the best part of being a programmer. Check out the entire script in this gist.

Follow on twitter.

NBA Data Scraping — Game Data

In sports, as in most endeavors, playing on your home field is an advantage. The crowd, the lines of sight, the mascot (eh, kind of) all contribute to the home field advantage. But how valuable is it to play on your home field? That was the question that inspired me to look into using actual results and data to come up with a number that can quantify the effect of playing at home. First up, NBA.

Note: I’m separating this post into two posts, the first of which is this, the data scraping portion. Code for all this is here, written a while ago and will probably change while I do a little more research on these questions.

There are a ton of different end goals for scraping data from the internet. Maybe it's to get a csv file with all the information to load into excel. Maybe it's to put the data into a database for a webapp (like Rails or Django). Or maybe it's to run some analysis and generate some visuals to answer a question, as is the case here. There are endless languages, libraries, databases, and ORMs you can use. But the one thing that's consistent for all web scraping is that you need to figure out how the data is organized on the page, and how it gets there.

The first step in all of these is to go to what you think is the best source for the data, in this case nba.com. The first thing I noticed was that it has a stats-specific section, and if you click on a specific game, you get a nice table of all the player lines and the quarter by quarter results. Even better, the table seems to load after the rest of the page, meaning there's probably some sort of AJAX call loading the data. Jackpot for scrapers.

Digging with Chrome's developer tools, I saw some requests going out to the site with ids for the game, league, and date. After trying a few values, I was able to come up with a pattern for their urls. Also interesting is that http://stats.nba.com/stats/scoreboard/ lets you know which parameters you're missing in order to get the data you're looking for. Thanks nba.com. I'll leave the actual pattern they use as an exercise to the reader, either by playing around with the requests or by looking at the code I provide here.

Next comes parsing the resulting JSON. Using a chrome extension that formats JSON, I went through and deciphered the object so I'd know where the correct data was. In this case, there are bunches of headers and data arrays to figure out, but all the information is there in one query, which makes it really nice to deal with. Again, if you want to see what one of these results looks like, I'll leave it to you. Though I will say that using parameters of LeagueID=00, DayOffset=0 and GameDate=01/01/2015 is a decent starting point.
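If you'd rather poke at it from code than from the browser, a minimal sketch using those parameters would look something like this. The resultSets / headers / rowSet key names are what I found while deciphering the object, so treat them as assumptions to verify against a live response:

import requests

params = {'LeagueID': '00', 'DayOffset': '0', 'GameDate': '01/01/2015'}
response = requests.get('http://stats.nba.com/stats/scoreboard/', params=params)
data = response.json()
for result_set in data['resultSets']: #each result set pairs a header list with rows of values
  print result_set['name'], result_set['headers']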

Onto the code part of it. For this demo, I’m using Python (my personal favorite scripting language and the one I started out learning back in the day), with MongoDB, MongoEngine as an ORM, and gevent to help make it quicker.

First off, check out the gist here. To run the code, you’ll need to have gevent, mongoengine, and requests all installed with pip (and preferably using virtualenv). All of that is info for another article and overviews on how to do that are all over the web. If you’re trying to learn scraping, read through the code and try to understand what’s going on before reading on. Much better to learn that way I think.

A couple notes on the implementation. First is the number of gevent workers. I have 10 going in the code, but that value could probably be increased. Scraping is all about being respectful to the site you're getting the data from; you absolutely do not want to overwhelm their servers with requests, which can easily happen with concurrent workers. Something like nba.com is expected to get a lot of traffic, and the fact that it's rendering json instead of an actual html page makes it possible to have a few more workers at once. But don't overdo it.
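For reference, here's a minimal sketch of capping the concurrency with a gevent pool; the fetch_game function and game ids are hypothetical stand-ins for the real request-and-store logic in the gist:

from gevent import monkey
monkey.patch_all() #patch blocking sockets so the workers actually run concurrently
from gevent.pool import Pool

def fetch_game(game_id):
  pass #hypothetical stand-in for requesting and storing one game's json

game_ids = ['0021400001', '0021400002'] #hypothetical game ids
pool = Pool(10) #at most 10 requests in flight at a time
for game_id in game_ids:
  pool.spawn(fetch_game, game_id)
pool.join()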

Another thing to note is the handling of cases that shouldn't happen. Here there are two spots where a random event could cause data oddities, ones that should never happen. To deal with those, I like to put big logger warnings in, so after running the code I can search the log to see if one of those cases ever came up.
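The pattern itself is nothing fancy; something like this, where the specific check is a made-up stand-in for the two real ones:

import logging
logging.basicConfig(filename='scrape.log', level=logging.INFO)

line_scores = [] #hypothetical: the per-team rows parsed for one game
game_id = '0021400001' #hypothetical game id
if len(line_scores) != 2: #a game should always have exactly two team line scores
  logging.warning("DATA ODDITY: game %s has %d line scores", game_id, len(line_scores))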

Finally, I make sure to put the data in a database. For data this small, and really for any amount of data you can scrape, any type of database works. (With all the talk about "Big Data", it's important to think about how big data actually has to be to earn that name.) Again, we don't want to overload the NBA's servers, and by storing the data, we only have to run the scraping once. If you don't want to use a database, putting the data in a csv file with each row holding one game's data is also a valid storage method. You'd still only need to run the scraping once, and then you can play with the data all you want.

And that's it for the scraping. It wasn't so much scraping as hitting an api for the data, but it's all the same in the end once you've stored it all. Look for some analysis soon!

Twitter: @jack_schultz