linear regression | Big-Ish Data

First off, I admit, that’s probably the most boring title for a blog post ever. It gets a negative value on the clickbait scale that is generally unseen in the modern, “every click equals dollars” era that we live in. On the other hand, it tells you exactly what this article is about — predicting scoring average using stats.

In this article, I’ll go through getting the data from the database, cleaning that data for use, and then running a linear regression in order to generate coefficients for each of the stats to generate scoring average predictions. Oh, and some analysis and commentary at the end!

Shameless shoutout to my other blog, Golf on the Mind. Check it out and subscribe to the newsletter / twitter / instagram if you’re into golf at all. Or ignore, and keep reading for some code!

Here’s a pic of a golf course to get you in the mood.

Getting the data

Last time if you remember, I spent all this effort taking the csv stat files, and putting the information into a database. Start there if you haven’t read that post yet. It’ll show how I grabbed the stats and formatted them.

Now that you’re back in the present we need to create a query that gets the stats for the players for a specific year. An example row in a CSV file of the data would be something like:

player_id, player_name, stat_1_value, stat_2_value, … , stat_n_value

for stats 1 to n where n (the number of stats), and the which stats themselves (driving distance, greens in regulation, etc.) vary depending on inputs.

Now let me say, I am not an expert in writing sql queries. And since people on the internet loooove to dole out hate in comments sections, I’m just going to say that there’s probably a better way of writing this query. Feel free to let me know and I can throw an edit in here, but this query works just fine.

select players.id,
  players.name,
  max(case when stat_lines.stat_id=330 then stat_lines.raw else null end) as putting_average,
  max(case when stat_lines.stat_id=157 then stat_lines.raw else null end) as driving_distance,
  max(case when stat_lines.stat_id=250 then stat_lines.raw else null end) as gir,
  max(case when stat_lines.stat_id=156 then stat_lines.raw else null end) as driving_accuracy,
  max(case when stat_lines.stat_id=382 then stat_lines.raw else null end) as scoring_average
from players
  join stat_lines on stat_lines.player_id = players.id
  join stats on stat_lines.stat_id=stats.id
where stat_lines.year=2012 and (stats.id=157 or stats.id=330 or stats.id=382 or stats.id=250 or stats.id=156) and stat_lines.raw is not null
group by players.name,players.id;

High level overview time! We’re selecting player id, and player name, along with their stats for putting average, driving distance, greens in regulation, driving accuracy and scoring average for the year 2012. In order to get the right stats, we need to know the stat id for the stats.

One more thing. This query is funky, and I probably could have designed the schema differently to make this prettier. For example, I could have just gone with one table, stat_lines, with fields for player_name and stat_name (along with all the current fields) and then the sql would be very simple. But there are other applications to keep in mind. What if you wanted to display all stats by a player? Or all of a players stats for a certain year? With the way I have the schema set up, those queries are simple and logical. For this specific case, I’ll deal with the complexity.

Loading the Data

That query above is great, but it’s not going to cut it if I have to specify what the year, and the stat ids in that string every time I run the script. Gotta be dynamic here.

Continue reading →

Big-Ish Data

Writings about data

Tag Archives: linear regression

Predicting PGA Tour Scoring Average from Statistics Using Linear Regression

Getting the data

Loading the Data