As someone who likes writing and investigating data sets, and as a huge fan of golf (and writer of a golf blog, Golf on the Mind), when I realized that the PGA Tour website has a crap ton of stats about its players going back to the early 80s, I figured there was definitely some investigating to do. The first step, as with any data analysis, is getting the data into a standard, usable form. In reality, you'll find this effort takes up most of your time if you do this sort of thing.
So before I can start looking at anything interesting, I need to do some scraping. This article will take you through that process.
UPDATE — 5/25/18 — I get way too many questions about whether the data is available, so I went back through and updated the code, and I'm currently scraping it every week. I'm not going to post the link here, but shoot me an email and I can go ahead and share the links.
Step 1 — Downloading the HTML files
The usual process for scraping is to grab an HTML page, extract the data from it, and store that data, repeating for however many pages hold the data you want. In this case, however, I wanted to do something different. Instead of grabbing the data from a web page and storing only that data, I wanted to store the HTML file itself as a first step. Only afterwards would I deal with getting the info out of those pages.
The reasoning behind this was to avoid unnecessarily hitting pgatour.com's servers. Undoubtedly, when scraping, you'll run into errors in your code: missing data, oddly formatted data you didn't account for, or any other random errors that can happen. When this happens and you have to go back and grab the web page again, you're both wasting time grabbing the same file over the internet and using up resources on the server's end. Neither is a good result.
So in my case here, I wrote code to download and save the html once, and then I can extract data as I please from those files without going over the internet. Now for the specifics.
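The download-once idea can be factored into a tiny helper that only hits the network for files you don't already have. Here's a minimal sketch; `fetch_cached` and the injected `fetch` callable are names I'm using for illustration, not part of the original code:

```python
import os

def fetch_cached(url, filename, fetch):
    """Return the page body for url, downloading it at most once.

    `fetch` is any callable that takes a url and returns the page text;
    in practice it would wrap something like requests.get(url).text.
    """
    if not os.path.isfile(filename):
        # only go over the internet if we've never saved this page
        body = fetch(url)
        with open(filename, "w") as f:
            f.write(body)
    with open(filename) as f:
        return f.read()
```

On the second call with the same filename, `fetch` is never invoked, so re-running a buggy extraction step costs nothing on the server's end.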
On pgatour.com, the stats base page is located at pgatour.com/stats.html. This lands you on the overview page, but at the top you'll notice there are eight categories of stats: Off the Tee, Approach the Green, Around the Green, Putting, Scoring, Streaks, Money/Finishes, and Points/Rankings. Under each of these categories is a list of stats in a table. Click on any of those links and you'll get the current year's stats for all the qualifying players. On the side, you'll notice a dropdown where you can select the year you want the stat for. Our goal is to get the pages for each of those stats, for every year offered, and save each page in a directory named for the stat, with the year as the filename.
The url pattern when you're on a single stat is straightforward. For example, the url for current Driving Distance is http://www.pgatour.com/stats/stat.101.html, and the url for Driving Distance in 2015 is http://www.pgatour.com/stats/stat.101.2015.html. Simply injecting the year into the url after the stat id will get you what you need.
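That pattern can be captured in a small helper. This is a sketch; `stat_url` is a name I'm using for illustration, not something from the original script:

```python
URL_STUB = "http://www.pgatour.com/stats/stat.%s.%s.html"  # stat id, year

def stat_url(stat_id, year=None):
    """Build the stats url for a stat id, optionally pinned to a year.

    With no year, pgatour.com serves the current season at stat.<id>.html.
    """
    if year is None:
        return "http://www.pgatour.com/stats/stat.%s.html" % stat_id
    return URL_STUB % (stat_id, year)
```

So `stat_url("101")` gives the current Driving Distance page, and `stat_url("101", 2015)` gives the 2015 one.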
In order to get the different stats from the category pages, we're going to loop over the categories, pull out the url and name for each stat, grab the current page, see which years the stat is offered for, generate the required urls, and loop over those urls, saving each page. Reading the code should make this clearer.
The last issue with grabbing the html pages is how long it takes. In the end, we're talking about over 100 stats, each with about 15-20 years of history. At first, I wanted to play nice and not overwhelm the pgatour.com servers, but then I realized that pgatour.com can probably handle the load, since they need to deal with the constant refreshing people do when checking leaderboards at the end of a tournament. Thankfully, Python's gevent library allows us to easily grab and save pages in parallel. After all that explanation, take a look at the code I used to save the files.
```python
from gevent import monkey
monkey.patch_all()  # patch sockets so gevent can run the downloads in parallel

import os
import urllib
import requests
import gevent
from bs4 import BeautifulSoup

url_stub = "http://www.pgatour.com/stats/stat.%s.%s.html"  # stat id, year
category_url_stub = 'http://www.pgatour.com/stats/categories.%s.html'
category_labels = ['RPTS_INQ', 'ROTT_INQ', 'RAPP_INQ', 'RARG_INQ',
                   'RPUT_INQ', 'RSCR_INQ', 'RSTR_INQ', 'RMNY_INQ']
pga_tour_base_url = "http://www.pgatour.com"

def gather_pages(url, filename):
    print filename
    urllib.urlretrieve(url, filename)

def gather_html():
    stat_ids = []
    for category in category_labels:
        category_url = category_url_stub % (category)
        page = requests.get(category_url)
        html = BeautifulSoup(page.text.replace('\n', ''), 'html.parser')
        for table in html.find_all("div", class_="table-content"):
            for link in table.find_all("a"):
                stat_ids.append(link['href'].split('.')[1])
    starting_year = 2015  # grab a page in order to see which years we have info for
    for stat_id in stat_ids:
        url = url_stub % (stat_id, starting_year)
        page = requests.get(url)
        html = BeautifulSoup(page.text.replace('\n', ''), 'html.parser')
        stat = html.find("div", class_="parsys mainParsys").find('h3').text
        print stat
        directory = "stats_html/%s" % stat.replace('/', ' ')  # avoid '/' in directory names
        if not os.path.exists(directory):
            os.makedirs(directory)
        years = []
        for option in html.find("select", class_="statistics-details-select").find_all("option"):
            year = option['value']
            if year not in years:
                years.append(year)
        url_filenames = []
        for year in years:
            url = url_stub % (stat_id, year)
            filename = "%s/%s.html" % (directory, year)
            if not os.path.isfile(filename):  # this check saves time if you've already downloaded the page
                url_filenames.append((url, filename))
        jobs = [gevent.spawn(gather_pages, pair[0], pair[1]) for pair in url_filenames]
        gevent.joinall(jobs)
```
Step 2 — Convert HTML to CSV
Now that I have the html files for every stat, I want to go through the process of getting the info from the tables in the html, into a consumable csv format. Luckily, the html is very nicely formatted so I can actually use the info. I saved all the html files in a directory called stats_html, and I basically want to create the same folder structure in a top level directory I’m calling stats_csv.
The steps in this task are to 1) read in the files, 2) use Beautiful Soup to extract the table headers and all of the data rows, and 3) write that info out as a csv file. I'll go right to the code since that's easiest to understand here as well.
```python
import os
import csv
from bs4 import BeautifulSoup

for folder in os.listdir("stats_html"):
    path = "stats_html/%s" % folder
    if os.path.isdir(path):
        for file in os.listdir(path):
            if file[0] == '.':
                continue  # skip .DS_Store and other hidden files
            csv_lines = []
            file_path = path + "/" + file
            csv_dir = "stats_csv/" + folder
            if not os.path.exists(csv_dir):
                os.makedirs(csv_dir)
            csv_file_path = csv_dir + "/" + file.split('.')[0] + '.csv'
            print csv_file_path
            if os.path.isfile(csv_file_path):  # pass if we've already done the conversion
                continue
            with open(file_path, 'r') as ff:
                f = ff.read()
            html = BeautifulSoup(f.replace('\n', ''), 'html.parser')
            table = html.find('table', class_='table-styled')
            headings = [t.text for t in table.find('thead').find_all('th')]
            csv_lines.append(headings)
            for tr in table.find('tbody').find_all('tr'):
                info = [td.text.replace(u'\xa0', u' ').strip() for td in tr.find_all('td')]
                csv_lines.append(info)
            # write the rows out as csv
            with open(csv_file_path, 'wb') as csvfile:
                writer = csv.writer(csvfile, delimiter=',')
                for row in csv_lines:
                    writer.writerow(row)
```
And that's it for the scraping! There are still a couple of issues before you can actually use the data, but those issues depend on what you're trying to find out. The big example is extracting the important piece of info from each csv. Some of the stats are percentages, others are distances measured in yards, and there are also stats measured in feet and inches (23'3" for example). Another issue is that sometimes the desired stat sits in a different column of the csv file depending on the year. But like I said, those issues aren't for an article on data scraping; we'll have to deal with them when looking at the data later.
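As a taste of the cleanup involved, here's a sketch of a parser for those mixed value formats: percentages, plain numbers, and feet/inches strings like 23'3". The function name and the choice to normalize distances to inches are mine, not something from the post:

```python
def parse_stat_value(raw):
    """Convert a raw stat string into a number.

    Handles three formats that show up in the csv files:
      "61.2%"  -> 61.2   (percentage)
      "290.5"  -> 290.5  (plain number, e.g. yards)
      "23'3\"" -> 279.0  (feet/inches, normalized to inches)
    """
    raw = raw.strip().replace(",", "")  # drop thousands separators, e.g. "1,234,567"
    if raw.endswith("%"):
        return float(raw[:-1])
    if "'" in raw:
        feet, inches = raw.rstrip('"').split("'")
        return float(feet) * 12 + float(inches)
    return float(raw)
```

The column-shifting-by-year problem can't be solved this locally; that needs a per-stat mapping of header names to columns, which is exactly the kind of work the later analysis posts deal with.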
Very interesting, do you happen to have the resulting csv files you could post?
Yeah I have them. Actually, I need to go and rescrape the last few years: 1) so I can get the end of the 2016 and start of the 2017 stats, and 2) because pgatour.com had an error in posting roughly 2014-2016 for some reason. Can't remember if I mentioned that in the post; they fixed it quickly, but only after I finished this post.
I’ll look up the best way to share the csv, whether in a zip file or somewhere online, but also if you have specific stats you want to look at, I can send those right to you.
Just looking for tournaments, cuts, top 10s, wins, majors, scoring, avg driving dist, avg fairways hit per round, greens in reg, avg putts per round, and earnings for each year back to 2000.
Thanks a bunch
Sure, might take a little bit but I can get that over to you. Also curious what you’re working on with those, shoot me an email if you want to discuss. Or from that contact page that turns into an email conversation.
Jack, I’d be interested in getting a copy of those as well if possible
I have been trying to get this to work for a few days but keep on running into some hiccups. Would you mind posting the zipped file somewhere?
Jack, having some issues scraping using the code. Getting lots of different errors like AttributeError: 'module' object has no attribute 'makdirs'. Any way you can email me the csv file? Working on a class project for my MBA. Thanks.
Feel free to delete these comments, Jack, I was trying to get you a screenshot of the 404 error I got. It may have been on my end, sorry!
Pingback: Python, Postgres, SQLAlchemy, and PGA Tour Stats | Big-Ish Data
Hi Jack,
Full disclosure, I've never used Python before. I just installed an IDE and the gevent library. When I run the code I get an error at line 12 saying NameError: name 'requests' is not defined. Any idea what this could be?
Thanks, this is an awesome post btw.
requests is another Python library, so you'll need import requests. You will also need from bs4 import BeautifulSoup
Pingback: PGA Tour Stat Analytics Part 1 — Are the Strokes Gained PGA Tour stats correlated to scoring average? | Golf On The Mind
Jack,
are you aware of any stroke-level data? Particularly interested in distance to pin before and after each putt.
Thanks for the post, this is a good start for me.
Haven't looked to see if they supply that sort of data. I'd assume it's there, since they show a graphic of the specific shots, but I haven't checked. I can take a look, though, to make sure and see how hard it is to get.
getting an error on this line: "url = url_stub % (stat_id, starting_year)" any ideas? Seems like we are passing 2 arguments in and it is only expecting one
Good catch, I didn’t include the creation of the url_stub variable.
url_stub = "http://www.pgatour.com/stats/stat.%s.%s.html" #stat id, year
I updated the code here to make sure it’s there.
Is the PGA data site still working for everyone? I’m getting a forbidden error when trying to hit it. Thanks!
Are you able to get the code to run? I am able to access the site, but nothing is pulled down when running the function. I receive the same error I've seen elsewhere: AttributeError: 'NoneType' object has no attribute 'find'. It's not finding anything, as nothing is pulling down.
I can access the site directly though.
great post, but it would be good if you included the libraries you need to import in the code. might seem like a simple thing, but for beginners it can be confusing and seems to be causing a lot of the errors people comment about.
haven't tried the code yet, but it seems like you need:
import requests
from bs4 import BeautifulSoup
import os
and in general, a bit more detail on the code would make this an excellent resource for beginners. but maybe that's not the purpose, and the intention is that people go read the documentation for the libraries first
First, nice name and contact email! I don't know why that's required for someone to post a comment, so I'll try to get rid of needing that before a comment.
And you're right that some of this is confusing. These types of posts are written for different audiences, and I guess this one is for people who want general tips on scraping, how to structure the code, and how to get the data quickly without wasting time. I'd like to do more on that front, where I poke around and see how other people code and whether they do something that would benefit me. But writing a post for beginners that starts from the beginning is something I should do, so it can hopefully help. Thanks for this.
Hi Jack,
Do you have the html or csv data files? And if yes, where can I download them?
Any chance I can get the data from you for a data science project? Please and thanks.
Best,
Would it be possible to get the CSV files with this data? Don’t have a background with Python so I’m a little lost.
Anyone else getting an "AttributeError: 'NoneType' object has no attribute 'find'" error on this line: stat = html.find("div", class_="parsys mainParsys section").find('h3').text?
Finally was able to check for the fix. Remove the "section" class from the string and it should go through. PGA Tour must have changed the html somewhat.