Optimizing a Daily Fantasy Sports NBA lineup — Knapsack, NumPy, and Giannis

Pic from when I was sitting courtside on Oct 24th, 2018. If you zoom in a little, you can see Giannis about to set a screen, while Embiid watches from the other side of the court, ready to help out if the screen is successful. Both players were in the optimal lineup that night.

Opener

In the data world, when you're looking for a project or an interesting problem, sports almost always gives you one. In a lot of what I write, I talk about getting the data, because the sports leagues rarely, if ever, give it away. That gets a little repetitive, so I wanted to change it up to something interesting that happens after you get the data: how to optimize a lineup for NBA Daily Fantasy Sports (DFS).

Before continuing, I'll say that this isn't about me making money from betting. In the past season I made lineups for some of the nights, but realized quickly that in order to win, you really need to know a ton about the sport. I love watching the NBA in the winter, love watching the Bucks, but don't follow all the other teams closely enough compared to other players. Still, I found it worth it to keep gathering the data during the regular season, and found it most interesting to figure out who would have been in the best lineup each night, and then look back at the highlights to see why a certain player did well.

Because of that, I took the code I used, refactored it some, and wrote this up to show what I did to get to the point where I can calculate the best lineup.

Knapsacking

This optimization is generally categorized as a Knapsack problem. The wiki page for the Knapsack Problem defines it as follows:

"Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible."

Or, in my own words: if you're stealing items of known value, but are only able to carry a certain amount of weight, how do you figure out which items to take?

This DFS problem, though, is slightly different from the standard Knapsack problem, which makes it much more interesting.

FanDuel Rules

The DFS site I'm using for this is FanDuel, one of the two main Daily Fantasy Sports sites. Their rules for the NBA are that every player is assigned a position: Point Guard (PG), Shooting Guard (SG), Small Forward (SF), Power Forward (PF), or Center (C). A lineup has 2 PGs, 2 SGs, 2 SFs, 2 PFs, and 1 C. Each player is given a salary, and the combined salary of the players in your lineup must not exceed $60,000. For reference, to give a sense of the salary distribution, and of what you'll see in the final solution for the best lineup of October 24th, 2018: MVP Giannis Antetokounmpo had a salary of $11,700, while ear blower and member of the Lakers Meme-Team Lance Stephenson had a salary of $3,900. This data comes in csv files that we can download from FanDuel.
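To make the rules concrete, here's a minimal sketch of them as data plus a validity check (the names here are mine, not anything FanDuel provides):

# roster rules as data: position -> number of slots, plus the salary cap
ROSTER_SLOTS = {'PG': 2, 'SG': 2, 'SF': 2, 'PF': 2, 'C': 1}
SALARY_CAP = 60000

def is_valid_lineup(players):
    # players: list of (name, position, salary) tuples
    counts = {}
    for name, pos, sal in players:
        counts[pos] = counts.get(pos, 0) + 1
    total_salary = sum(sal for _, _, sal in players)
    return counts == ROSTER_SLOTS and total_salary <= SALARY_CAP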

The number of points a player gets depends on a bunch of stats for the night: positive points for things like actual points scored, rebounds, 3-pointers made, assists, and steals, and negative points for things like turnovers. This data comes from nba.com, which I scraped and loaded into postgres.
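As a rough sketch, the scoring calculation looks something like the function below. The multipliers are from memory and only illustrative; check FanDuel's scoring rules for the real values.

# illustrative FanDuel-style scoring; treat these multipliers as placeholders
def fd_points(stat_line):
    return (stat_line['points']
            + 1.2 * stat_line['rebounds']
            + 1.5 * stat_line['assists']
            + 3.0 * stat_line['blocks']
            + 3.0 * stat_line['steals']
            - 1.0 * stat_line['turnovers'])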

Data

Below is a screenshot of an example salary csv file that we can download. Note that this is for a different date than the example day I'm using; I didn't get the csv from FanDuel on that date and had to scrape it from somewhere else, but it still gives a good look at what the csv file contains. For our simple optimization, we only need the name, the position, and the salary of each player.

Second, we need the stat lines, which I use to calculate the number of points a player earned in a night. Below is a screenshot from stats.nba.com showing how a player did that night. I have a script that scrapes that data the next day and puts it into the db.

If you look at the data csv files in the repo, all I keep is the name, position, salary, and points earned. This is a post about optimization, not about data gathering. But if you're wondering, here's the query I used to get the data. I have players, positions, stat_lines, games, and some other tables; a lot of work goes into keeping all this data synced up.

select p.id as pid,
       p.fd_name as name,
       sl.fd_positions as pos,
       sl.fd_salary as sal,
       sl.fd_points as pts
from stat_lines sl
join games g on sl.game_id = g.id
join players p on p.id = sl.player_id
where g.date = '2018-10-24'
  and sl.fd_salary is not null
order by sal desc

Code

Here's the link to all the code on github. In it, you'll find the csv files of data, three separate scripts to run the three different optimization methods, and three Jupyter notebooks that walk through a simplified example of each.

Continuing

In the rest of the post, I'll go through three slightly different solutions to the problem. The first, basic solution uses plain Python data structures and is pretty slow. The second, brisk solution uses libraries like Pandas and NumPy to speed up the calculation quite a bit. The final, fast solution goes beyond the second, ditching most Python structures and using matrices to improve the speed an impressive amount.

In all cases, I made simple Jupyter notebooks that go through how each solution combines positions, which hopefully gives a little interactive example of the differences better than words alone can. When you go through them, you'll see at the bottom that they all return the same answer: which players are best, what their combined salary is, and what their point total is.

I talk about salaries, points, and indexes a lot. Salaries are the combined salaries of the players in a group, points are the combined points of the players in a group, and indexes are the indexes from the csv file or the pandas dataframe that represent which players are in a group. (We could use player names instead of indexes.) Don't confuse these with the indexes into the numpy arrays / matrices that are needed to find which groupings are the best; to keep the two apart, I'll refer to the indexes of the players as the player indexes. Also, I sometimes mix salary and cost, so if you see either of those words, they refer to the same thing.

If you have any questions, want clarification, or find mistakes, get in contact. Also I have twitter if you feel like looking at how little I tweet.

Basic solution

Time to talk about the solutions. At the start of the basic solution, there are effectively two parts to the problem. The first is combining players of the same position; from the FD rules, for example, we need two PGs together. The goal is to return, for each input salary, the combination of same-position players with the most combined points whose combined salary is less than that input salary.

Said a different way: for each salary we test, we double loop through the same initial position array and find the best-scoring combination whose combined salary is less than the salary we're testing against.
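As a minimal sketch of that first part (it mirrors the shape of the real code, but isn't a copy of it), with players stored as (salary, points, player_index) tuples sorted by salary:

# players: list of (salary, points, player_index) for one position, sorted by salary
def best_pair(players, test_salary):
    best = (0, 0, [])  # (combined salary, combined points, player indexes)
    for i, (sal1, pts1, ind1) in enumerate(players):
        for sal2, pts2, ind2 in players[i + 1:]:
            if sal1 + sal2 > test_salary:
                break  # sorted by salary, so no later pairing can fit
            if pts1 + pts2 > best[1]:
                best = (sal1 + sal2, pts1 + pts2, [ind1, ind2])
    return best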

The second part deals with combining those returned groups together. Say we have the information about the best two-PG groups and the best two-SG groups. Again, for each input salary, we return the best combination of players below that salary. This is pretty much identical to the first part, the only difference being that we don't start with two groups of the same players: loop through the possible salary values, double loop through the two position arrays, and find the players with the max points whose summed salaries are less than the salary value we're testing.

There's a lot of code in the solution, so I'll only post a little of it, taken from the Jupyter notebook I created to demonstrate further. Click that, go through the lines of example code, and look at how the double loops create the combinations; if you're reading this, it's worth it. To get a full look, here's the link directly to the file on github.

# pgs and sgs have the format [(salary, points, [player_inds...]), ...],
# where salary is the combined cost of the players in player_inds and points
# is the sum of their points. Both lists are sorted by ascending salary.

def combine_groups(pgs, sgs, test_salary):
    max_found_points = 0
    top_sal, top_points, top_players = 0, 0, []
    for g1 in pgs:
        for g2 in sgs:
            if g1[0] + g2[0] > test_salary:
                break  # sgs is in ascending salary order, so nothing later fits
            points = g1[1] + g2[1]
            if points > max_found_points:
                max_found_points = points
                top_players = g1[2] + g2[2]  # combining the two lists of player indexes
                top_points = points
                top_sal = g1[0] + g2[0]
    return (top_sal, top_points, top_players)

# example: combine_groups(pgs, sgs, 45000) returns a new tuple of the same
# format (salary, points, [inds]), the best combo of players in pgs and sgs
# whose total salary isn't greater than the test salary

Here's a slow gif of it running where you can see the time it takes to do the combinations. At the end, it prints out the names and info for the winners in the lineup. I also use cProfile and pstats to time the script, and to show where it's being slow. This run took a tiny bit under 50 seconds (as you'll see from the timing logs), so don't think you'll have to sit there and wait for minutes.
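For reference, the profiling pattern is roughly this; main() here is a hypothetical stand-in for the script's entry point.

# profile the run and print the ten worst offenders by cumulative time
import cProfile
import pstats

cProfile.run('main()', 'profile_output')  # main() is a hypothetical entry point
stats = pstats.Stats('profile_output')
stats.sort_stats('cumulative').print_stats(10)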

Brisk solution

After completing the first, most basic solution, it was time to move forward and write the solution that removes some of those loops by using numpy arrays.
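As a hedged sketch of the idea (not the repo's exact code): because the groups are sorted by salary, the inner loop can be replaced with a precomputed running best plus a binary search.

# sketch: replace the inner loop with a running best and np.searchsorted.
# pgs and sgs are the same (salary, points, [player_inds]) lists as before.
import numpy as np

def combine_groups_np(pgs, sgs, test_salary):
    sgs_sal = np.array([g[0] for g in sgs])
    sgs_pts = np.array([g[1] for g in sgs])
    # best_idx[i] = index of the best-scoring group among sgs[:i + 1]
    best_idx = np.zeros(len(sgs), dtype=int)
    for i in range(1, len(sgs)):
        best_idx[i] = i if sgs_pts[i] > sgs_pts[best_idx[i - 1]] else best_idx[i - 1]
    top = (0, 0, [])
    for sal, pts, inds in pgs:
        # rightmost sg group that still fits under the test salary
        i = np.searchsorted(sgs_sal, test_salary - sal, side='right') - 1
        if i >= 0:
            g2 = sgs[best_idx[i]]
            if pts + g2[1] > top[1]:
                top = (sal + g2[0], pts + g2[1], inds + g2[2])
    return top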


Defense Matters in the NBA, Apparently

Interesting article here from arXiv titled Finding Common Characteristics Among NBA Playoff Teams: A Machine Learning Approach.

Everyone should skim the paper, because it'll show you that these academic papers aren't overwhelming, and a lot of the time they're good at showing the steps that go into tackling a machine learning problem. This one, for example, goes over the basics of decision trees, pruning decision trees, and more high-powered tree methods. Great progression.

As for what they found: apparently the opponents' stats are the most important determinants of whether a team makes the playoffs, with opponent field goal percentage and opponent points per game leading the way. Play good defense, and it doesn't matter how many points you score yourselves.

My only comment is that if you're sitting out there thinking "well, NBA teams don't play defense anyway, blah blah blah," you're wrong. Defense is probably the most impressive aspect of a game to watch, especially in the playoffs going on right now.

Importance of variables is a really interesting topic in machine learning, sports especially. Knowing which variables matter can help someone in charge figure out who to draft (Moneyball style) or possibly what aspects to focus on during games or practice. Considering I have a bunch of PGA Tour data, maybe figuring out which stats are important for golfers should be something to focus on here…
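To give a flavor of what checking variable importance looks like in practice, here's a tiny scikit-learn sketch. The columns and numbers are made-up stand-ins for the paper's 44 variables, not their data.

# toy variable-importance check; the data here is fabricated for illustration
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

team_stats = pd.DataFrame({
    'opp_fg_pct': [0.44, 0.47, 0.45, 0.48, 0.43, 0.46],
    'opp_ppg':    [98.2, 104.5, 101.1, 106.0, 97.5, 103.3],
    'ppg':        [103.0, 99.5, 105.2, 100.1, 104.4, 98.9],
})
made_playoffs = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(random_state=0).fit(team_stats, made_playoffs)
for name, imp in sorted(zip(team_stats.columns, tree.feature_importances_),
                        key=lambda t: -t[1]):
    print(name, round(imp, 3))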

Funny, they say their "dataset was quite large", so they only provided a sample of the data. 30 teams * 15 years * 44 variables = 19,800 data points. Too big to fit in a table in a paper, sure, but I don't think I agree that it's quite large. More like big-ish. Fits in perfectly here.

NBA Data Scraping — Game Data

In sports, as in most endeavors, playing on your home field is an advantage. The crowd, the lines of sight, the mascot (eh, kind of) all contribute to the home field advantage. But how valuable is it to play on your home field? That was the question that inspired me to look into using actual results and data to come up with a number that can quantify the effect of playing at home. First up, NBA.

Note: I'm separating this into two posts, the first of which is this, the data scraping portion. Code for all of this is here; it was written a while ago and will probably change as I do a little more research on these questions.

There are a ton of different end goals for scraping data from the internet. Maybe it's to get a csv file with all the information to load into Excel. Maybe it's to fill a database for a webapp (like Rails or Django). Or maybe it's to run some analysis and generate some visuals to answer a question, as is the case here. There are endless languages, libraries, databases, and ORMs you can use. But the one thing that's consistent for all web scraping is that you need to figure out how the data is organized on the page, and how it gets there.

The first step in all of these is to go to what you think of as the best source for the data, in this case nba.com. The first thing I noticed was that it has a stats-specific section, and if you click on a specific game to get the stats, you get a nice table of all the player lines and the quarter by quarter results. Even better, the table seems to load after the page, meaning there's probably some sort of AJAX call fetching the data. Jackpot for scrapers.

Digging with Chrome's developer tools, I see some requests going out to the site with ids for the game, league, and date. After trying a few values, I was able to come up with a pattern for their urls. It's also interesting that http://stats.nba.com/stats/scoreboard/ lets you know which parameters you're missing in order to get the data you're looking for. Thanks, nba.com. I'll leave the actual pattern they use as an exercise for the reader, by either playing around with the endpoint or looking at the code I provide here.

Next comes parsing the resulting JSON. Using a Chrome extension that formats JSON, I went through and deciphered the object so I'd know where the correct data lives. In this case, there are bunches of headers and data arrays to figure out, but all the information is there in one query, which makes it really nice to deal with. Again, if you want to see what one of these results looks like, I'll leave it to you to look. Though I will say that using parameters of LeagueID=00, DayOffset=0 and GameDate=01/01/2015 is a decent starting point.
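If you want to poke at it yourself, something like this is a reasonable starting point. The User-Agent header is my assumption; stats.nba.com tends to reject requests that don't look like they come from a browser.

# hit the scoreboard endpoint with the parameters mentioned above
import requests

params = {'LeagueID': '00', 'DayOffset': '0', 'GameDate': '01/01/2015'}
headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: mimic a browser
resp = requests.get('http://stats.nba.com/stats/scoreboard/',
                    params=params, headers=headers)
payload = resp.json()
# each result set has parallel 'headers' and 'rowSet' arrays to line up
for result_set in payload['resultSets']:
    print(result_set['name'], len(result_set['rowSet']))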

Onto the code part of it. For this demo, I'm using Python (my personal favorite scripting language, and the one I started out learning back in the day), with MongoDB, MongoEngine as an ORM, and gevent to help make it quicker.

First off, check out the gist here. To run the code, you'll need gevent, mongoengine, and requests installed with pip (preferably in a virtualenv). All of that is info for another article, and overviews of how to do it are all over the web. If you're trying to learn scraping, read through the code and try to understand what's going on before reading on. It's much better to learn that way, I think.

A couple notes on the implementation. First is the number of gevent workers. I have 10 going in the code, but that value could probably increase. Scraping is all about being respectful to the site you're getting the data from; you absolutely do not want to overwhelm their servers with requests, which can easily happen with concurrent workers. Something like nba.com is expected to get a lot of traffic, and the fact that it's rendering json instead of an actual html page makes it possible to run a few more workers at once. But don't overdo it.
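The worker pattern looks roughly like this. It's a sketch, not the gist's exact code, and scrape_date is a hypothetical helper.

# cap concurrency with a gevent pool so we stay polite to the server
from gevent import monkey
monkey.patch_all()  # patch sockets so requests cooperates with gevent
from gevent.pool import Pool
import requests

def scrape_date(game_date):
    params = {'LeagueID': '00', 'DayOffset': '0', 'GameDate': game_date}
    resp = requests.get('http://stats.nba.com/stats/scoreboard/', params=params)
    return game_date, resp.json()

pool = Pool(10)  # 10 workers, per the note above
dates = ['01/01/2015', '01/02/2015', '01/03/2015']
for game_date, payload in pool.imap_unordered(scrape_date, dates):
    print(game_date, 'fetched')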

Another thing to note is the handling of cases that shouldn't happen. In this code, there are two spots where a random event could cause data oddities, ones that should never happen. To deal with those, I like to put big logger warnings in, so I can search the logs after running the code to see if one of those cases actually happened.
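The pattern is just this; the game id and the empty-rows check are hypothetical examples of an invariant.

# yell loudly in the logs when an invariant breaks, but keep going
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

def check_rows(game_id, row_set):
    # a case that should never happen: a game with no stat rows
    if not row_set:
        logger.warning('NO ROWS FOR GAME %s -- INVESTIGATE', game_id)

check_rows('0021400491', [])  # hypothetical game id; triggers the warning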

Finally, I make sure to put the data in a database. For data this small, and really for any amount of data you can scrape, any type of database works. (With all the talk about "Big Data", it's important to think about how big data has to be to qualify for that name.) Again, we don't want to overload the NBA's servers, and by storing the data, we only have to run the scrape once. If you don't want to use a database, putting the data in a csv file with one row per game is also a valid storage method; you'd still only need to run the scrape once, and then you can play with the data all you want.
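On the MongoEngine side, a document model might look like the sketch below. The field names are my stand-ins, not the gist's exact schema.

# minimal document model for a scraped game
import mongoengine as me

me.connect('nba')  # assumes a local MongoDB with a database named 'nba'

class Game(me.Document):
    game_id = me.StringField(primary_key=True)
    date = me.DateTimeField()
    home_team = me.StringField()
    away_team = me.StringField()
    home_score = me.IntField()
    away_score = me.IntField()

# with game_id as the primary key, re-running the scrape overwrites instead of duplicating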

And that's it for the scraping. It wasn't so much scraping as it was hitting an api for the data, but it's all the same in the end once you've stored it all. Look for some analysis soon!

Twitter: @jack_schultz