linear equations | Big-Ish Data

I had an idea recently about how to use the “wisdom of the crowds” in forecasting performance for weekly fantasy golf. I see the benefits of crowdsourcing all the time working with prediction markets at Inkling Markets, so I figured I could leverage that to get an edge when choosing my lineups. I’ll get into the details of how I’m doing that in a later post since that’s worthy on it’s own, but I found a subproblem that I’d have to solve in order to make it work.

While some of the sites offer player salaries in an easily digestible csv format, they’re a little stingy when it comes to results. They only keep contest results for the last 30 days, and the only downloadable results they offer are the results from contests in the past week, which just show point totals for an entire lineup, not the point totals for individual golfers. And I need those player’s points for testing my forecasting algorithm, as well as using them in making the actual forecasts. One way to figure this out is to scrape the hole by hole data from pgatour.com and apply the site’s scoring algorithm to each of those holes, but I’ve found a better, simpler, and much more elegant solution.

I can model the results csv file as a system of linear equations, and by converting the different user’s lineups into a matrix and vector those equations, numpy can solve for the point totals that each golfer earned during the tournament.

Here’s an example row from that file, with the person’s username hidden.

"Rank","EntryId","EntryName","TimeRemaining","Points","Lineup"
263,85826104, "UNAME (1/100)", 0, 430.50, "(G) Ryan Palmer ,(G) Pat Perez ,(G) Kevin Kisner ,(G) Ben Martin ,(G) Shawn Stefani ,(G) John Peterson "

In order to solve, we need to create two data structures. The first is a num_lineups by num_players matrix of coefficients, where the value is a 1 if the player was used in that lineup, and 0 if he wasn’t. The second is an array of total points scored by that lineup.

The idea is that if we had an array of the points the players scored over the course of the tournament, we should be able to take the dot product of that and the corresponding row in the coefficient matrix to generate the point total of that lineup.

Here’s the code to generate the coefficient matrix and the point array:

points_label = "Points"
lineup_label = "Lineup"
player_coefficients = []
lineup_points = []
with open('outcome.csv', 'rb') as csvfile:
  rows = csv.reader(csvfile)
  headers = rows.next()
  points_index = headers.index(points_label)
  lineup_index = headers.index(lineup_label)
  for row in rows:
    points = float(row[points_index])
    lineup_points.append(points)
    names = [name.strip() for name in row[lineup_index].replace('(G)','').split(',')]
    lineup_players = [0] * player_count
    for name in names:
      lineup_players[player_list.index(name)] = 1
      player_coefficients.append(lineup_players)

From here, all we need to do is convert those arrays into numpy arrays, and run numpy’s linear algebra least squares algorithm to get the solution array!

coefficient_matrix = np.array(player_coefficients)
point_array = np.array(lineup_points).transpose()
solution = np.linalg.lstsq(coefficient_matrix, point_array)
player_points = list(solution[0])

Initially, I tried to use the numpy’s solve algorithm, but by looking at the docs realized that solve dealt with square coefficient matrices (something that I still remember from all those math classes). The lstsqrs method is used to get approximate results from rectangular matrices as is the case here.

Printing out the results yields the following results for the top and bottom 10:

Chris Kirk: 126.0
Jason Bohn: 105.5
Brandt Snedeker: 102.0
Jordan Spieth: 101.5
Kevin Kisner: 99.0
George McNeill: 96.5
Pat Perez: 95.0
Adam Hadwin: 92.0
Ian Poulter: 87.5
Brian Harman: 87.0
...
Kenny Perry: 18.0
Jason Kokrak: 17.5
Jonas Blixt: 16.5
Corey Pavin: 16.0
Bo Van Pelt: 15.0
David Toms: 14.0
Brian Davis: 11.5
Tom Watson: 4.61852778244e-13
Tom Purtzer: 3.87245790989e-13
Scott Stallings: 2.84217094304e-13

Chris Kirk won so him having the highest point total makes sense, and the three guys at the bottom with zero points all withdrew so they should be at 0 points. Unfortunate for the guys who didn’t take them out of their lineups, but they gotta pay a little more attention!

Only issue with the final result is that I only end up with point totals from 117 players, when 133 teed it up at the beginning of the tournament. That means that we’re missing point totals from some of the guys. That being said, I’m going to assume that those players weren’t picked because they weren’t likely to play well, so hopefully that 117 offer a good representation of point totals. Also, could easily be that draftkings in this case only listed 117 players to choose from. I’ll need to investigate this week.

In the end, it took about 3 hours to write the code and write this post. It’s fun little problems like this that really remind you that programming is fun and has value. Being able to create technical solution to something you’re interested in is probably the best part about being a programmer. Check out the entire script in this gist.

Follow on twitter.

Big-Ish Data

Writings about data

Tag Archives: linear equations

Weekly Fantasy Golf Results and a System of Linear Equations