As someone who likes writing and investigating data sets, and as a huge fan of golf (and writer of a golf blog, Golf on the Mind), when I realized that the PGA Tour website has a crap ton of stats about players going back to the early 80s, I figured there was definitely some investigating to do. The first step, as with any data analysis, is getting the data into a standard and usable form. In reality, that effort takes up most of your time in this sort of project.
So before I can start looking at anything interesting, I need to do some scraping. This article will take you through that process.
Step 1 — Downloading the HTML files
The usual process for scraping is: grab the html page, extract the data from that page, store that data, and repeat for however many web pages have the data you want. In this case, however, I wanted to do something different. Instead of grabbing the data from a web page and immediately storing it, I wanted to store the html file itself as a first step, and only afterwards deal with extracting the info from that page.
The reasoning behind this was to avoid unnecessarily hitting pgatour.com’s servers. Undoubtedly, when scraping, you’ll run into errors in your code: missing data, oddly formatted data you didn’t account for, or any other random errors that can happen. When that happens and you have to go back and grab the web page again, you’re both wasting time by grabbing the same file over the internet and using up resources on the server’s end. Neither is a good result.
So in my case here, I wrote code to download and save the html once, and then I can extract data as I please from those files without going over the internet. Now for the specifics.
On pgatour.com, the stats base page is located at pgatour.com/stats.html. This will land you at the overview page, and at the top you’ll notice eight categories of stats: Off the Tee, Approach the Green, Around the Green, Putting, Scoring, Streaks, Money/Finishes, and Points/Rankings. Under each of these categories is a list of stats in a table. Click on any of those links and you’ll get the current year’s stats for all the qualifying players. On the side, you’ll notice a dropdown where you can select the year you want the stat for. Our goal is to get the pages for each of those stats, for every year offered, and save each page in a directory named for the stat, with the year as the filename.
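To make that goal concrete, the saved files should end up laid out something like this (using Driving Distance as an example):

stats_html/Driving Distance/2015.html
stats_html/Driving Distance/2014.html
...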
The url pattern when you’re on a single stat is straightforward. For example, the url for current Driving Distance is http://www.pgatour.com/stats/stat.101.html, and the url for Driving Distance in 2015 is http://www.pgatour.com/stats/stat.101.2015.html. Simply injecting the year into the url after the stat id will get you what you need.
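In code, that pattern is just a string template. Here’s a quick sketch (the stat id 101 and the years come straight from the example urls above):

url_stub = 'http://www.pgatour.com/stats/stat.%s.%s.html'

# Driving Distance (stat id 101) in 2015:
print url_stub % ('101', 2015)
# http://www.pgatour.com/stats/stat.101.2015.html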
In order to get the different stats from each category page, we’re going to loop over the categories, yank out the url and name for each stat, grab the current page, see which years the stat is offered for, generate the required urls, and loop over those urls saving each page. Reading the code should make this clearer.
The last issue with grabbing the html pages is how long it takes. In the end, we’re talking about over 100 stats, each with about 15-20 years of history. At first, I wanted to play nice and not overwhelm the pgatour.com servers, but then I realized that pgatour.com can probably handle the load, since they need to deal with the constant refreshing people do when checking leaderboards at the end of a tournament. Thankfully, Python’s gevent library allows us to easily grab and save pages in parallel. After all that explanation, take a look at the code I used to save the files.
import os
import urllib

import gevent
from gevent import monkey
monkey.patch_all()  # patch blocking I/O so gevent can actually download in parallel

import requests
from bs4 import BeautifulSoup

category_url_stub = 'http://www.pgatour.com/stats/categories.%s.html'
category_labels = ['RPTS_INQ', 'ROTT_INQ', 'RAPP_INQ', 'RARG_INQ', 'RPUT_INQ', 'RSCR_INQ', 'RSTR_INQ', 'RMNY_INQ']
url_stub = 'http://www.pgatour.com/stats/stat.%s.%s.html'
pga_tour_base_url = "http://www.pgatour.com"

def gather_pages(url, filename):
    print filename
    urllib.urlretrieve(url, filename)

def gather_html():
    stat_ids = []
    for category in category_labels:
        category_url = category_url_stub % (category)
        page = requests.get(category_url)
        html = BeautifulSoup(page.text.replace('\n', ''), 'html.parser')
        for table in html.find_all("div", class_="table-content"):
            for link in table.find_all("a"):
                stat_ids.append(link['href'].split('.')[1])
    starting_year = 2015  # grab the most recent page in order to see which years we have info for
    for stat_id in stat_ids:
        url = url_stub % (stat_id, starting_year)
        page = requests.get(url)
        html = BeautifulSoup(page.text.replace('\n', ''), 'html.parser')
        stat = html.find("div", class_="parsys mainParsys section").find('h3').text
        print stat
        directory = "stats_html/%s" % stat.replace('/', ' ')  # need to replace '/' to avoid it being treated as a path separator
        if not os.path.exists(directory):
            os.makedirs(directory)
        years = []
        for option in html.find("select", class_="statistics-details-select").find_all("option"):
            year = option['value']
            if year not in years:
                years.append(year)
        url_filenames = []
        for year in years:
            url = url_stub % (stat_id, year)
            filename = "%s/%s.html" % (directory, year)
            if not os.path.isfile(filename):  # this check saves time if you've already downloaded the page
                url_filenames.append((url, filename))
        jobs = [gevent.spawn(gather_pages, pair[0], pair[1]) for pair in url_filenames]
        gevent.joinall(jobs)

gather_html()  # kick everything off
Step 2 — Converting HTML to CSV
Now that I have the html files for every stat, I want to go through the process of getting the info from the tables in the html into a consumable csv format. Luckily, the html is very nicely formatted, so I can actually use the info. I saved all the html files in a directory called stats_html, and I basically want to recreate the same folder structure in a top-level directory I’m calling stats_csv.
The steps in this task are: 1) read in the files, 2) using BeautifulSoup, extract the headers for the table and then all of the data rows, and 3) write that info out as a csv file. I’ll go right to the code, since that’s easiest to understand here as well.
import os
import csv

from bs4 import BeautifulSoup

for folder in os.listdir("stats_html"):
    path = "stats_html/%s" % folder
    if os.path.isdir(path):
        for file in os.listdir(path):
            if file.startswith('.'):
                continue  # skip .DS_Store and other hidden files
            csv_lines = []
            file_path = path + "/" + file
            csv_dir = "stats_csv/" + folder
            if not os.path.exists(csv_dir):
                os.makedirs(csv_dir)
            csv_file_path = csv_dir + "/" + file.split('.')[0] + '.csv'
            print csv_file_path
            if os.path.isfile(csv_file_path):  # pass if we've already done the conversion
                continue
            with open(file_path, 'r') as ff:
                f = ff.read()
            html = BeautifulSoup(f.replace('\n', ''), 'html.parser')
            table = html.find('table', class_='table-styled')
            headings = [t.text for t in table.find('thead').find_all('td')]
            csv_lines.append(headings)
            for tr in table.find('tbody').find_all('tr'):
                info = [td.text.replace(u'\xa0', u' ').strip() for td in tr.find_all('td')]
                csv_lines.append(info)
            # write the array of rows to csv
            with open(csv_file_path, 'wb') as csvfile:
                writer = csv.writer(csvfile, delimiter=',')
                for row in csv_lines:
                    writer.writerow(row)
And that’s it for the scraping! There are still a couple of issues to sort out before you can actually use the data, but those depend on what you’re trying to find out. The big example is pulling the important piece of info out of each csv. Some of the stats are percentages, others are distances measured in yards, and others are measured in feet and inches (23’3″ for example). Another issue is that the desired stat can sit in a different column of the csv file depending on the year the stat is from. But like I said, those issues aren’t for an article on data scraping; we’ll deal with them when looking at the data later.
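As a taste of that cleanup, here’s a minimal sketch of the kind of helper we’ll eventually need. The function name parse_stat_value and the choice to convert feet/inches measurements into decimal feet are my own assumptions for illustration, not part of the scraping code above:

import re

def parse_stat_value(raw):
    # hypothetical helper: normalize a stat string from the csv into a float
    raw = raw.strip().replace('%', '').replace(',', '')
    # handle feet/inches measurements like 23'3"
    feet_inches = re.match(r"^(\d+)'\s*(\d+)\"?$", raw)
    if feet_inches:
        return int(feet_inches.group(1)) + int(feet_inches.group(2)) / 12.0
    # percentages and yardages are already plain numbers
    return float(raw)

print parse_stat_value('23\'3"')  # 23.25
print parse_stat_value('290.1')   # 290.1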