Apache Airflow Part 2 — Connections, Hooks, reading and writing to Postgres, and XComs

In part 1, we went through basic DAGs that read, logged, and wrote to custom files, and got an overall sense of file locations and places in Airflow. A lot of the work was getting Airflow running locally, and then at the end of the post, a quick start in having it do work.

In part 2 here, we’re going to start doing some reads and writes to a database, and show how tasks can run together in a directed, acyclic manner. Even though both tasks will work with a single database, you can think of stretching them out to other situations.

Again, this post will assume you’re going through and writing this code on your own with some copy paste. If you’d rather use what I have written, check out the GitHub repo here.

Creating database

This post is going to go through and write to postgres. We already created a database for Airflow itself to use, but we want to leave that alone.

So before we get to any of the python code, go and create the new database, add a new user with password, and then create the dts (short name for datetimes, since that’s all we’re doing here) table.

bigishdata=> create table dts (id serial primary key, run_time timestamp, execution_time timestamp);
CREATE TABLE
bigishdata=> \dt
            List of relations
 Schema | Name  | Type  |     Owner
--------+-------+-------+----------------
 public | dts   | table | bigishdatauser
(1 row)

bigishdata=# \d dts
                                          Table "public.dts"
     Column     |            Type             | Collation | Nullable |             Default
----------------+-----------------------------+-----------+----------+---------------------------------
 id             | integer                     |           | not null | nextval('dts_id_seq'::regclass)
 run_time       | timestamp without time zone |           |          |
 execution_time | timestamp without time zone |           |          |
Indexes:
    "dts_pkey" PRIMARY KEY, btree (id)

Adding the Connection

Connection is a well-named term you’ll see all over in Airflow speak. Connections are defined as “[t]he connection information to external systems”, which could mean usernames, passwords, ports, etc. for when you want to connect to things like databases, AWS, Google Cloud, or various data lakes or warehouses. Anything that requires information to connect to, you’ll be able to put that information in a Connection.

With airflow webserver running, go to the UI, find the Admin dropdown on the top navbar, and click Connections. Like example DAGs, you’ll see many default Connections, which are really great to see what information is needed for those connections, and also to see what connections are available and what platforms you can move data to and from.

Take the values for the database we’re using here — the (local)host, schema (meaning database name), login (username), password, port — and put that into the form shown below. At the top, you’ll see Conn Id, and in that input create a name for the connection. This name is clearly important, and you’ll see that we use that in order to say which Connection we want.

[Screenshot: the Connection form in the Airflow UI]

When you save this, you can go to the Airflow database, find the connection table, and see the values you inputted in that form. You’ll also probably see that your password is there in plain text. For this post, I’m not going to talk about encrypting it, but you’re able to do that, and should, of course.

One more thing to look at is the source code for the Connection model, form, and view. It’s a Flask app! Reading the source is a great way to get a much better understanding of something like adding information for a connection.

Hooks

In order to use the information in a Connection, we use what is called a Hook. A Hook takes the information in the Connection, and hooks you up with the service that you created the Connection with. Another nicely named term.


Apache Airflow Part 1 — Introduction, setup, and writing data to files

When searching for “Python ETL pipeline frameworks”, you’ll see tons of posts about all of the different solutions and products available, where people throw around terms and small explanations of them.

When you go through those articles, the one you will see over and over is Apache Airflow. It’s defined on Wikipedia as a “platform created by community to programmatically author, schedule and monitor workflows”. I’d call Airflow big, well used, and worth the effort to learn, because experience with a running Airflow environment really does help with tons of data work at any scale.

It isn’t quick to get started with the necessary understanding, but I’ve found that once getting over the initial hump, knowing Airflow is well worth it for the amount of use cases.

Searching for Airflow tutorials, I found that most posts are super basic to get you started, and then leave you hanging, not knowing where to go next. That’s good for a start, but doesn’t go as far as what I want to see written.

They also seem to stray by talking about things like Operators or Plugins as if everyone knows about them. When starting out, I’d want a process that starts with the basics and only gets to the more advanced topics once there’s a good background. The goal of this series is to get you over that initial hump.

To combat that, this will be a series that starts basic like the other tutorials, but by the end we will have gone through all the steps of creating an Apache Airflow project: getting it to run locally, writing to files, using different hooks, connections, and operators to write to different data stores, and writing custom plugins that can then be used for larger, specific tasks.

These posts talk through getting Airflow set up and running locally, and are written as if you’re starting out and going to write your own code alongside the examples. If you’d rather have the code first, not write it yourself, and focus on getting Airflow running, go ahead and clone the full repo for the whole series here, and get to the point where you can run the tasks and see the outputs. That’s a big enough part on its own.

Twitter, even though I rarely tweet.

Part 1 Overview

I know I said that so many of the intro posts are intros only, and I’ll admit right away that this is an intro post as well. If you’ve gone through this before, skip this and go to part 2 (to be posted soon), or even skip to part 3, which will be much more about what a larger implementation looks and works like. I did feel it was worth it to write this first part to get everyone on the same page before I go further in the next posts. Starting with more technical work isn’t good practice if not everyone is there to begin with.

Here in part 1, we’re going to talk through getting Airflow set up and running locally, and create a very basic single task: writing dates to a file. Seems like two quick parts, but going through the fuller process in small steps will lead to better understanding.

As always, get in contact if you think something I wrote is wrong, and I’ll edit and make the fix.

Get Airflow Running

Open up a new terminal session and pwd. You’ll find you’re in the base directory for your user. As with all python projects, we’re going to want an environment in order to have everything packaged up. I’ll use virtualenv. With the following commands, I’ll have that set up, install airflow, and get the airflow config set.

jds:~ jackschultz$ pwd
/Users/jackschultz
jds:~ jackschultz$ mkdir venvs
jds:~ jackschultz$ virtualenv -p python3 venvs/bidaf # Stands for Bigish Data Airflow. In some of the screenshots it's a different value. Go and ignore that.
.....
jds:~ jackschultz$ source venvs/bidaf/bin/activate
(pmaf) jds:airflow jackschultz$ pip install 'apache-airflow[postgres]' # the postgres extra is needed for the Airflow db
.....
(pmaf) jds:airflow jackschultz$ mkdir ~/airflow && cd ~/airflow
(pmaf) jds:airflow jackschultz$ airflow version
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
**whatever the current version is**
(pmaf) jds:airflow jackschultz$ ls

The reason we’re in ~/airflow is because that’s the default value of the AIRFLOW_HOME env variable. If you don’t want to be in the base directory, you can export AIRFLOW_HOME=~/dev/bigishdata/airflow and then use that as the directory. Do that if you’d like, but make sure that export AIRFLOW_HOME line is in ~/.bash_profile or ~/.bashrc or ~/.zshrc or whatever ~/.*rc file you use for your terminal, because we’re going to be using a bunch of tabs and want to make sure they all have the same AIRFLOW_HOME. If you’re not going to use ~/airflow as the home, you’re going to have problems unless you’re always exporting this env var.
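If you’re not sure which home a given terminal tab will end up using, a few lines of Python can tell you. This is only a sketch of the fallback described above (the function name resolve_airflow_home is mine, not something Airflow provides), so treat it as a quick sanity check rather than how Airflow resolves the value internally.

```python
import os
from pathlib import Path


def resolve_airflow_home(environ=os.environ):
    """Return the directory this shell would hand to Airflow as its home."""
    # Fall back to ~/airflow when AIRFLOW_HOME has not been exported.
    raw = environ.get("AIRFLOW_HOME", "~/airflow")
    return Path(raw).expanduser()


print(resolve_airflow_home())
```

Run it in each tab before starting the webserver or scheduler; if the tabs print different paths, the export line is missing from the rc file that tab loads.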

I’m kind of making a big deal about AIRFLOW_HOME , but that’s because it caused me some problems when I started. For example, some of the screenshots will show different directories. This is because I played around with this for a while before settling on a final set up.

Airflow needs a database where it will store all the information about the tasks — when they were run, the statuses, the amount and a ton of other information you’re going to see — and it defaults to sqlite. That’s quick and easy to get going, but I’d say go right to postgres. In order to change that default, we need to go to the config file that the airflow version command created.

First though, create a database (I call it airflow), a user (airflowuser), and a password for that user (airflowpassword). Search for examples of how to create databases and users elsewhere.

Above, when you called airflow version, a config file was created: $AIRFLOW_HOME/airflow.cfg. With the database created, build its connection url and use it to replace the default sql_alchemy_conn value:

sql_alchemy_conn = postgresql+psycopg2://airflowuser:airflowpassword@localhost:5432/airflow
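It’s easy to get one piece of that url wrong, so here is a small sketch that splits it back into its parts using only the standard library. The helper name describe_conn is made up for this example; it doesn’t talk to the database, it only shows which part of the string is the user, password, host, port, and database name.

```python
from urllib.parse import urlsplit


def describe_conn(url):
    """Split a SQLAlchemy-style database url into its named parts."""
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,               # dialect+driver, e.g. postgresql+psycopg2
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),   # path is "/airflow", so drop the slash
    }


info = describe_conn(
    "postgresql+psycopg2://airflowuser:airflowpassword@localhost:5432/airflow"
)
print(info)
```

If the printed database or user isn’t what you created, fix the url before running the next command.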

Back to the command line, and run:

(bidaf) jds:airflow jackschultz$ airflow initdb

And then back to the postgres console command line, describe the tables and see the following:

[Screenshot: the tables created by airflow initdb, listed in the postgres console]

With this, you can see some of the complexity in Airflow. Seeing this shows Airflow is set up. If you’re going through this series, you probably won’t understand the tables yet; by the end of the series you’ll know a lot about the relations.

Go again back to the command line and run:

(pmaf) jds:airflow jackschultz$ airflow webserver --port 8080

and see:

[Screenshot: terminal output of the airflow webserver starting up]

Then go to localhost:8080 and the admin screen, which is the highly touted UI. Like the table names, the UI will look more than a little complex at the start, but very understandable with experience.

[Screenshot: the Airflow admin screen]

Simple DAG — Directed Acyclic Graph

In terms of terminology, you’ll see the abbreviation DAG all the time. A DAG is a way to explain which tasks are run and in which order. A task refers to what will actually be run.

Looking at the admin UI, you can see the example DAGs that come with Airflow to get started. When writing DAGs, you’ll probably go through many of those to see how they’re set up and what’s required to have them run. Don’t feel bad about that; these DAG examples are fantastic to use.

Below is the full file we’ll have running. Look through it a little, as you’ll probably understand some of what’s going on. When you get to the bottom, keep reading and I’ll go through what it’s like when writing this.

# AIRFLOW_HOME/dags/write_to_file.py

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import datetime as dt

filename = 'self_logs/dt.txt'

def write_to_file(**kwargs):
    kwarg_filename = kwargs['filename']
    execution_time = kwargs['ts']
    dtn = dt.datetime.now()
    with open(kwarg_filename, 'a') as f:  # 'a' means to append
        f.write(f'{dtn}, {execution_time}\n')

default_args = {
    'owner': 'airflow',
    'retries': 0
}
 
dag = DAG('writing_to_file',
          default_args=default_args,
          start_date=dt.datetime.now(),
          schedule_interval=dt.timedelta(seconds=10)
          )

write_to_file_operator = PythonOperator(task_id='write_to_file',
                                        python_callable=write_to_file,
                                        provide_context=True,
                                        op_kwargs={'filename': filename}, dag=dag)
 
write_to_file_operator

Start at the bottom and see the write_to_file_operator variable and how it’s an instance of PythonOperator.

An Operator is a class that “determines what actually gets done by a task”. A PythonOperator, when run, will run the python code that comes from the python_callable function. A BashOperator will run a bash command. There are tons of Operators that are open source that perform multiple tasks. You’ll see a few examples of these in the series, and also by the end will have written your own. For now, just go with the definition above about how Operators have the code for what the task does.

One thing about Operators you’ll see in examples is how most of them take keyword arguments to say what to do. I’m not really a fan of that because it makes it seem like Airflow is only based on configs, which is one of the things I want to avoid with these tasks. I want to have the code written and not fully rely on cryptic Operators.

For now though, in the PythonOperator kwargs, you’ll see some things. First check the  python_callable, which is the function the Operator will call. Here it’s write_to_file that’s written above. Next, check the provide_context which we set to True. This flag says to give the callables information about the execution of the DAG. You also see  op_kwargs which will be passed to the python_callable. With this, we’re telling the function where to write the date.

As for the task_id, it is the name that will show up in the tree / graph view in the webserver. It can be whatever name you want, and there’s some consideration with making sure versions of that are correct, but for now, I’m keeping that name the same as the python_callable.

Going up to the callable itself, you’ll first see the filename that we’re going to write to. That comes from the op_kwargs in the PythonOperator instantiation. You then see two timestamps. The first is the execution time, which Airflow passes in as ts: the timestamp the scheduler assigns to that run of the task. When running this DAG and looking at the values, you’ll see that the time has a certain number of microseconds, and that consecutive values are always 10 seconds apart. The second timestamp, which is when the code actually runs, will be a varying number of seconds after the execution time. This is because of the work it takes to get the code running. Keep this in mind when using timestamps in operators in the future. The rest of the function writes the two timestamps to that file.
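Since the callable is a plain python function, you can also call it by hand to see how the kwargs fit together, without the scheduler involved. The sketch below is not something Airflow does for you: the context dict is a hand-built stand-in holding only ts (with a value copied from one of my logs), and the file goes to a temp directory so it doesn’t touch self_logs/.

```python
import datetime as dt
import os
import tempfile


def write_to_file(**kwargs):
    # Same body as the callable in the DAG file.
    kwarg_filename = kwargs['filename']
    execution_time = kwargs['ts']
    dtn = dt.datetime.now()
    with open(kwarg_filename, 'a') as f:  # 'a' means to append
        f.write(f'{dtn}, {execution_time}\n')


# Hand-built stand-ins for what arrives at run time:
context = {'ts': '2020-03-29T20:46:45.041274+00:00'}  # normally supplied via provide_context=True
op_kwargs = {'filename': os.path.join(tempfile.mkdtemp(), 'dt.txt')}

# The callable receives both sets of values as one bag of keyword arguments.
write_to_file(**context, **op_kwargs)

with open(op_kwargs['filename']) as f:
    lines = f.read().splitlines()
print(lines)
```

The printed line has the wall-clock time first and the ts value second, which is the same shape you’ll later see in self_logs/dt.txt.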

Run First DAG

With the DAG file created, we want to run it and see what’s going on with the output.

First step is to open a new tab in the terminal, activate the venv, make sure you have the correct value for AIRFLOW_HOME, and run

(pmaf) jds:airflow jackschultz$ airflow scheduler

Go back to the browser and the admin page, and you’ll see the writing_to_file name in the DAG column, which means the webserver found that new DAG file with the name writing_to_file which we gave.

Click on the link for writing_to_file, which should take you to http://localhost:8080/admin/airflow/tree?dag_id=writing_to_file, and you should see this.

[Screenshot: the Tree View for the writing_to_file DAG]

This is the Tree View, and you can see the one operator is write_to_file which is the task_id we gave the PythonOperator.

Go to the upper left and click the ‘Off’ button to switch it to ‘On’ and get the task running. To watch this, go to the terminal and watch the scheduler start to throw out logs from the scheduling every 10 seconds. Go to the browser and reload the tree view page and you’ll see red marks because of failures.

[Screenshot: the Tree View showing failed runs in red]

The DAG is running, but why is it failing?

Debugging with logs

We can get to testing in the future, but for now, we’re going to debug using the logs.

In order to see what the issue is, go to logs/writing_to_file/write_to_file in the Finder and watch new folders get created every 10 seconds, one for each run of the task. Go ahead and view the log and you’ll see that there’s an error being sent.
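If you’d rather not click through folders, here is a sketch that finds the newest log file for a task. It assumes the layout I saw locally, one folder per run under logs/dag_id/task_id with the .log files inside, and the helper name latest_task_log is mine; adjust the glob if your version lays the logs out differently.

```python
from pathlib import Path


def latest_task_log(airflow_home, dag_id, task_id):
    """Return the most recently modified .log file for a task, or None."""
    task_dir = Path(airflow_home) / "logs" / dag_id / task_id
    # Assumed layout: one folder per run, each holding that run's log files.
    logs = sorted(task_dir.glob("*/*.log"), key=lambda p: p.stat().st_mtime)
    return logs[-1] if logs else None


log = latest_task_log(Path.home() / "airflow", "writing_to_file", "write_to_file")
print(log if log else "no logs yet")
```

Open the printed path in an editor, or swap the print for log.read_text() to dump the error straight into the terminal.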

[Screenshot: the task log showing the error]

Turns out that in line 10, we’re trying to write to a file in a directory that doesn’t exist, because we haven’t created the self_logs/ directory. Go to another terminal and mkdir self_logs/. With the scheduler still running, go back to the log directory and watch for new logs for newly executed tasks.
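An alternative to the manual mkdir, which is not what I did here, is to have the callable create the folder itself. This is a minimal sketch of that variation; everything except the os.makedirs lines is the same function as before.

```python
import datetime as dt
import os


def write_to_file(**kwargs):
    kwarg_filename = kwargs['filename']
    execution_time = kwargs['ts']
    # Create the parent folder (self_logs/ in this post) if it is missing.
    parent = os.path.dirname(kwarg_filename)
    if parent:
        os.makedirs(parent, exist_ok=True)
    dtn = dt.datetime.now()
    with open(kwarg_filename, 'a') as f:  # 'a' means to append
        f.write(f'{dtn}, {execution_time}\n')
```

The trade-off is that a typo in the filename now silently creates a new folder instead of failing loudly in the logs, so for learning how to debug, the failing version is the more useful one.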

[Screenshot: the task log from a successful run]

Much better and correct looking log where we can see it going through.

Finally, go to self_logs/dt.txt and watch the datetimes come through. (And you can see I was writing this).

[Screenshot: self_logs/dt.txt with the written datetimes]

One last step, and this is in terms of logging. When running code, many times you’ll want to print log statements, and in Airflow, printed values go to those log files. To show this, go back to the python_callable and add the following print line just before the file write:

print('Times to be written to file:', dtn, execution_time)

Save the file, and go back to watch the logs come in. What you’ll see is this line being added:

[2020-03-29 15:46:59,416] {logging_mixin.py:112} INFO - Times to be written to file: 2020-03-29 15:46:59.415603 2020-03-29T20:46:45.041274+00:00

This shows two things. First is that you can print and log to find errors in initial local development, and second, shows that code updates will be run on each execution of the task. The scheduler picked up the added line. In some web frameworks for example, if you change the code, you might have to restart your local server to have the changes be included. Here, we don’t have to, and those values will come in the logs.

Summary

If you got this far, you’re set up with a first DAG that writes to a file. We showed the steps to get Airflow running locally, and then got up and going with a basic self-written task and watched the activity.

That doesn’t sound like a lot, but with how big Airflow is, going from nothing to an initial set up, no matter how small, is a big part of the battle.

In Part 2 of this series, we’re going to take these tasks, hook them up to a different database, write the datetimes there, and have another task in the DAG format the time that was written. With that, you’ll be much more comfortable with being able to connect to services anywhere.

 

Optimizing a Daily Fantasy Sports NBA lineup — Knapsack, NumPy, and Giannis

Pic when I was sitting courtside on Oct 24th, 2018. If you zoom in a little, you can see Giannis about to make a screen, while Embiid watches from the other side of the court to help out if the screen is successful. Both players were in the optimal lineup that night.

Opener

In the data world, when looking for projects or an interesting problem, sports almost always gives you the opportunity. For a lot of what I write, I talk about getting the data because the sports leagues rarely if ever give the data away. That gets a little repetitive, so I wanted to change it up to something interesting that’s done after you get the data, like how to optimize a lineup for NBA Daily Fantasy Sports (DFS).

Before continuing, I’ll say that this isn’t about me making money from betting. In the past season I made lineups for some of the nights, but realized quickly that in order to win, you really need to know a ton about the sport. I love watching the NBA in the winter, love watching the Bucks, but don’t follow all other teams close to enough compared to others. Still, I found it worth it to keep getting the data during the regular season and found it most interesting to find out who would have been in the best lineup that night, and then look back at the highlights to see why a certain player did well.

Because of that, I took the code I used, refactored it some, and wrote this up to show what I did to get to the point where I can calculate the best lineup.

Knapsacking

This optimization is generally categorized as a Knapsack problem. The wiki page for the Knapsack Problem defines it as follows:

“””Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.”””

Or, self described – If you’re stealing items of known value, but only are able to carry a certain amount of weight, how do you figure out which items to take?
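To make the standard version concrete before getting to the DFS twist, here is a minimal sketch of the classic 0/1 knapsack, where each item can be taken at most once. It isn’t the code from my repo, and the weights and values are made-up numbers just to have something to run.

```python
def knapsack(items, limit):
    """0/1 knapsack. items is a list of (weight, value) pairs.

    Returns (best_value, chosen_indexes) for total weight <= limit.
    """
    # best[w] holds the best (value, chosen indexes) using at most weight w.
    best = [(0, [])] * (limit + 1)
    for i, (weight, value) in enumerate(items):
        # Walk the weights downward so each item is used at most once.
        for w in range(limit, weight - 1, -1):
            candidate = best[w - weight][0] + value
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - weight][1] + [i])
    return best[limit]


items = [(3, 4), (4, 5), (2, 3)]  # made-up (weight, value) pairs
print(knapsack(items, 5))
```

In the DFS setting, salary plays the role of weight and fantasy points play the role of value.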

This DFS problem though is slightly different than the standard Knapsack problem, and makes it much more interesting.

FanDuel Rules

The DFS site I’m using for this is FanDuel, one of the two main Daily Fantasy Sports sites. Their rules for the NBA are that all players are assigned a position: Point Guard (PG), Shooting Guard (SG), Small Forward (SF), Power Forward (PF), or Center (C). A lineup will have 2 PGs, 2 SGs, 2 SFs, 2 PFs, and 1 C. Each player is given a salary, and the combined salary of the players in your lineup must not be above $60,000. For reference, to give a sense of salary distribution, and what you’ll see in the final solution for the best lineup of October 24th, 2018, MVP Giannis Antetokounmpo had a salary of $11,700, and ear blower and member of the Lakers Meme-Team, Lance Stephenson, had a salary of $3,900. This data was given in csv files from FanDuel that we can download.
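Those rules are what make this different from a plain knapsack, so here they are as a small check. This is a sketch and not code from the repo: the function name and the (name, position, salary) tuple shape are my own choices for the example, and the players in the usage below are placeholders, not real data.

```python
from collections import Counter

SALARY_CAP = 60000
REQUIRED = {"PG": 2, "SG": 2, "SF": 2, "PF": 2, "C": 1}


def is_valid_lineup(lineup):
    """lineup is a list of (name, position, salary) tuples."""
    counts = Counter(pos for _, pos, _ in lineup)
    total_salary = sum(sal for _, _, sal in lineup)
    # Exactly the required count per position, and at or under the cap.
    return dict(counts) == REQUIRED and total_salary <= SALARY_CAP


# Placeholder players: nine slots at $6,000 each, $54,000 total.
lineup = [(f"{pos}{n}", pos, 6000)
          for pos, needed in REQUIRED.items()
          for n in range(needed)]
print(is_valid_lineup(lineup))
```

Any optimizer for this problem has to return a lineup that passes a check like this, which is why the solutions below build up the lineup position by position.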

The amount of points a player gets depends on a bunch of stats for the night, positive points for things like actual points, rebounds, 3-pointers made, assists, steals, and negative points for things like turnovers. This data comes from nba.com which I scraped and loaded into postgres.

Data

Below is a screenshot of what an example salary csv file that we can download looks like. Note that this is for a different date than the example day I’m using. I didn’t get the csv from FanDuel on that date, I had to scrape it from somewhere else, but it’s still important to give a look of what the csv file looks like. For our simple optimization, we only need the name, the position, and the salary of all the players.

Secondly, we need the stat lines which I then use to calculate the number of points a player got in a night. Below is a screenshot from stats.nba.com where it shows how a player did that night. I have a script that scrapes that data the next day and puts that into the db.

If you look at the data csv files in the repo, all I have here is the name, position, salary, and points earned. This is a post about optimization, not about data gathering. If you’re wondering a little, here’s the query I used to get the data. I have players, positions, stat_lines, games, and some other tables. A lot of work goes into getting all this data synced up.

select p.id as pid, p.fd_name as name, sl.fd_positions as pos, sl.fd_salary as sal, sl.fd_points as pts
from stat_lines sl
join games g on sl.game_id = g.id
join players p on p.id = sl.player_id
where g.date = '2018-10-24' and sl.fd_salary is not null
order by sal desc

Code

Here’s the link to all the code on github. In it, you’ll find the csv files of data, three separate scripts to run the three different optimization methods, and three files for the Jupyter notebooks to look at a simplified example of the code.

Continuing

In the rest of the post, I’ll go through the three slightly different solutions for the problem. The first uses basic python elements and is pretty slow. The second brisk solution uses libraries like Pandas and NumPy to speed up the calculation quite a bit. The final fast solution goes beyond the second, ignoring most python structures, and uses matrices to improve the quickness an impressive amount.

In all cases, I made simple Jupyter files that go through how they each combine positions which hopefully give a little interactive example of the differences to try to show it more than words can do. In each case, when you go through them, you’ll see at the bottom they all return the same answer of what are the best players, what their combined salary is, and what their point totals are.

I talk about salaries, points, and indexes a lot. Salaries are the combined salaries of the players in a group, points are the combined points of the players in a group, and indexes are the indexes from the csv file or the pandas dataframe which represent which players are in a group. Instead of indexes, we could use their names. Don’t get this confused when I talk about the indexes in the numpy arrays / matrixes that are needed to find which groupings are the best. To keep the index talk apart, I’ll refer to the indexes of the players as the player indexes. Also, I sometimes mix salary and cost, so if you see either of those words, they refer to the same thing.

If you have any questions, want clarification, or find mistakes, get in contact. Also I have twitter if you feel like looking at how little I tweet.

Basic solution

Time to talk about the solutions. There are effectively two parts to the problem in the basic solution. The first is combining players of the same position together. From the FD rules, we need two PGs together. The goal of this is to return, for each salary as an input, the combination of players of the same position who have the most combined points with a combined salary that isn’t above the inputted salary.

Said a different way, for each salary of the test, we want to double loop through the same initial position array, and find the most successful combination where the combined salary is less than the salary we’re testing against.
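As a rough sketch of that first part for a single test salary, here is the pairing written with itertools.combinations, which stands in for the double loop and avoids pairing a player with himself. It’s a simplified illustration and not the code from the repo; the (salary, points, index) tuples and the numbers in them are made up.

```python
from itertools import combinations


def best_pair(players, test_salary):
    """players is a list of (salary, points, player_index) tuples.

    Returns (combined salary, combined points, [inds]) or None if no pair fits.
    """
    best = None
    for (s1, p1, i1), (s2, p2, i2) in combinations(players, 2):
        if s1 + s2 > test_salary:
            continue  # this pair costs more than the salary we're testing
        if best is None or p1 + p2 > best[1]:
            best = (s1 + s2, p1 + p2, [i1, i2])
    return best


pgs = [(5000, 30.0, 0), (7000, 45.5, 1), (9000, 50.0, 2)]  # made-up point guards
print(best_pair(pgs, 14000))
```

Running this for every test salary gives the list of best same-position pairs that the second part then combines across positions.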

The second part deals with combining each of those returned values together. Say we have the information about the best two PGs and the best two SGs. Again, for each salary as input, it returns the best combination of players below that salary. This is pretty much identical to what I said about the first part, with the only difference being that we didn’t start with two groups of the same players. Loop through the possible values of the salary possibilities, double loop through the arrays of positions, find the players who have the max points where the sum of their salaries is less than the salary value we’re testing.

There’s a lot of code in the solution, so I’ll post only a little, which was taken from the Jupyter file I created to further demonstrate. Click that, go through the lines of example code, and look at the double loops and see how the combinations are created. If you’re reading this, it’s worth it. To get a full look, here’s the link directly to the file on github.

# pgs and sgs have the format [(salary, points, [inds...]), ...], sorted by salary,
# where salary is the combined cost of the players with inds in inds, and points is the sum of their points.

def best_combo(pgs, sgs, test_salary=45000):  # 45000 is an example test salary
    best = None  # stays None if no combination fits under test_salary
    max_found_points = 0
    for g1 in pgs:
        for g2 in sgs:
            if g1[0] + g2[0] > test_salary:
                break  # sgs is in sorted salary order, so every later g2 costs even more
            points = g1[1] + g2[1]
            if points > max_found_points:
                max_found_points = points
                # (combined salary, combined points, combined list of player inds)
                best = (g1[0] + g2[0], points, g1[2] + g2[2])
    return best

# The returned value is a new tuple of the same format (salary, points, [inds]),
# the best combo of players in pgs and sgs who don't have a total salary
# sum greater than the test salary.

Here’s a slow gif of it running where you can see the time it takes to do the combinations. In the end, it prints out the names and info for the winners in the lineup. I also use cProfile and pstats to time the script, and also show where it’s being slow. This run took a tiny bit under 50 seconds to run (as you’ll see from the timing logs) so don’t think you’ll have to sit there and wait for minutes.

Brisk solution

After completing the first, most basic solution, it was time to move forward and write the solution which removes some of those loops by using numpy arrays.


Using Clustering Algorithms to Analyze Golf Shots from the U.S. Open

Cluster analysis can be considered one of the pillars of machine learning, and yet it’s one that’s difficult to talk about.

First off, it’s difficult to find specific use cases for clustering, other than pretty pictures. When looking through the wiki page on clustering, we’re told one of the uses is market research, where analysts use surveys to group together customers for market segmentation. That sounds great in theory, but the results don’t end with specific numbers telling the researchers what to do. Second, in so many cases, the hardest part of data science projects, or tutorials, is finding real world data that have the different results you want to show. In this case, I’m incredibly lucky.

I have a golf background, and on the U.S. Open’s website, they have these interactive graphs that show where each ball was located after each stroke for every player. If you click around, you can see who hit what shot, how far the shot went and how far remains between the ball and the hole. For cluster analysis, we’re going to use the location. To check out how I got the data, look and read here.

Shinnecock Hills, the host of the 2018 U.S. Open last week, has a few parts of the course where balls roll to collection areas into groups, or, ya know, clusters. Here are the specific shots our clustered data is coming from.

Hole 10, Round 1, Off the Tee

The description that the USGA gives hole number 10 is

The player faces a decision from the tee: hit a shot of about 220 yards to a plateau, leaving a relatively level lie, or drive it over the hill. Distance control is critical on the approach shot, whether from 180 yards or so to a green on a similar plateau, or with a shorter club at the bottom of the hill or, more dauntingly, part of the way down the hill. The approach is typically downwind, to a green with a closely mown area behind it.

First I’ll say, always hit driver off the tee. Look at the cluster! If you get it down the hill you’ll be in the fairway! In the vast, vast majority of the time, it’s better to be closer to the hole. Golf tips aside, when I first saw this graph, it popped out as a great example to use as a clustering example.

Shift command 4 if you want selective screenshots

When looking at this picture, the dots represent where the players hit their tee shots on hole 10 in the first round, and the colors show how many strokes it took them to finish the hole in relation to par. For this, we’re ignoring the final score and only looking at the shots themselves.

Hole 10, Round 1, Approaching the green

One data set isn’t good enough to demonstrate the differences of the algorithms, and I wanted to find an example of a green with collection areas that would make approach shots group together. Little did I know, the 10th green, the same hole as the one above showing the drives, is the best example out there. If you’re short, it rolls back to you. If you’re long, it rolls away. You gotta be sure to hit the green. You can see that here.

So this will be a second example of data for all the algorithms.

Algorithms themselves

This time, in this blog post, I’m only looking for results, not going through the algorithms themselves. There are other tutorials online talking about them, but for now at least, we’re only getting little introductions to the algorithms and thoughts.

Instead, I use the Scikit-Learn implementations of the algorithms. Scikit-Learn offers plenty of clustering algorithms, which I could spend hours using and writing about, but for this post, the ones I chose are K Means, DBSCAN, Mean Shift, and Agglomerative Clustering.

Other Notes

Before going into the algorithms, here are a few notes on what to expect.

  • Elevation is key as to why there are clusters. If you look around the other holes, you won’t see nearly as much spread or clustering in the shot results. Now, if we had elevation as a data point as well, then we could really do some great cluster analyses.
  • The X and Y values on the sides of the graphs represent yards from the hole, which is located at the (0,0) location. If you look at the first post, I show that if you measure the hypotenuse using those X and Y numbers, you’ll have the yardage to the pin.
  • This isn’t a vast data set. We have 156 points in the two data sets because that’s how many players there were in the tournament.
  • If you’re wondering which part took the longest, it was writing the matplotlib code to automatically create figures with multiple plots for different input variables, and have them all show up at once. Presentation is key, and that took tons of time.
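The hypotenuse note above is quick to check in code. This is a minimal sketch where the function name and the sample coordinates are made up, with the hole sitting at (0, 0) like in the graphs.

```python
import math


def yards_to_pin(x, y):
    """Distance in yards from a shot at (x, y) to the hole at (0, 0)."""
    # math.hypot computes sqrt(x**2 + y**2), the hypotenuse of the
    # right triangle formed by the X and Y offsets.
    return math.hypot(x, y)


# Made-up example: a ball 30 yards over and 40 yards back.
print(yards_to_pin(30, 40))  # 50.0
```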

Code

I put all the code and data on Github here, so if you want to see what’s going on behind the scenes and what it took to do the analysis, look there.

Questions, comments, concerns, notes, thoughts, etc: contact, twitter, and golf twitter if you’re interested in that too. Ok, algorithm time.

K Means

I’m starting with K Means because this was the clustering algorithm I was first introduced to, and one that I had to write myself during a machine learning class in college.
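For a feel of what writing it yourself involves, here is a minimal pure-Python sketch of the standard K Means loop (Lloyd's algorithm). It is not the Scikit-Learn implementation and not the code from this post; the function name, the fixed iteration count, and the sample points are all assumptions.

```python
import math
import random


def k_means(points, k, iterations=20, seed=0):
    """Cluster 2D points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start with k distinct points picked at random as the centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave a centroid alone if it lost every point
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Made-up points: one group near (0, 0), one near (100, 100).
shots = [(0, 0), (1, 0), (0, 1), (100, 100), (101, 100), (100, 101)]
labels, centroids = k_means(shots, k=2)
print(labels)
```

A real implementation would stop once the labels stop changing rather than always running a fixed number of iterations.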


U.S. Open Data — Gathering and Understanding the Data from 2018 Shinnecock

After losing in a playoff to make it out of the local qualifying for the 2018 US Open at Shinnecock, I’m stuck at my apartment watching everyone struggle, wondering how much I’d be struggling if I was there myself.

Besides on TV, the US Open website offers another way to follow what players are doing. As shown here, they very generously give us information on everyone’s shots on different holes. We’re able to see where people hit the ball, on which shot, and what their resulting score was on the hole. For example, why in the world did Tony Finau, currently the second-longest hitter on tour, hit it short off the first tee and leave himself 230 yards to the hole, where he made bogey?

Why didn’t Tony rip D?

One of the cool things these images show is the groupings of all the shots on a hole, like the tee shots here. And when I see very specific and interactive data like we have here, I know it comes from somewhere that I’m able to see myself. So I figured I should grab that data and do some cluster analysis on different holes to see if there are certain spots that players like to hit it.

Here, I’ll go through the data we have, what the values and the numbers mean, and also the code I wrote to eat up the data and display the graphs. Once I have this part going, I’ll be able to perform further analysis on most things that come to mind.

Any questions, comments, concerns, trash talking, get in touch: twitter, contact.

Current Posts

Using Clustering Algorithms to Analyze Golf Shots

Finding the data

First step was to search for where the data for the hole insights page was coming from. As always, open the dev tools, click on the network tab, and find what’s getting called with a pretty name.

Alert!

The file itself is quite dense and has all the information, which is really cool! It has IDs for all the players, all the shots they have on the hole, which include the starting distance from the flag and the ending distance to the flag.

First off, we’re given a list of Ps, meaning an array of player information, like this:

...
{u'FN': u'Justin', u'ID': u'33448', u'IsA': False, u'LN': u'Thomas', u'Nat': u'USA', u'SN': u'THOMAS'},
{u'FN': u'Dustin', u'ID': u'30925', u'IsA': False, u'LN': u'Johnson', u'Nat': u'USA', u'SN': u'JOHNSON D'},
{u'FN': u'Tiger', u'ID': u'08793', u'IsA': False, u'LN': u'Woods', u'Nat': u'USA', u'SN': u'WOODS'},
...

It looks like we have first name, player’s ID, whether or not they’re an amateur, last name, nationality, scoreboard name. The important part of this information is the ID, where we’ll be able to match players to shots.
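Since the ID is what ties players to shots, a lookup keyed by ID is the natural first step. This is only a sketch: the `players` list copies the three entries shown above, while `players_by_id` and `full_name` are names I made up, and the shape of the shot records isn't assumed at all.

```python
# The three player entries shown above, written as plain Python 3 dicts.
players = [
    {"FN": "Justin", "ID": "33448", "IsA": False, "LN": "Thomas", "Nat": "USA", "SN": "THOMAS"},
    {"FN": "Dustin", "ID": "30925", "IsA": False, "LN": "Johnson", "Nat": "USA", "SN": "JOHNSON D"},
    {"FN": "Tiger", "ID": "08793", "IsA": False, "LN": "Woods", "Nat": "USA", "SN": "WOODS"},
]

# Index by ID so a shot's player ID can be turned into a player quickly.
# IDs stay as strings: "08793" would lose its leading zero as an int.
players_by_id = {player["ID"]: player for player in players}


def full_name(player_id):
    """Return 'First Last' for a player ID (hypothetical helper)."""
    player = players_by_id[player_id]
    return "{} {}".format(player["FN"], player["LN"])


print(full_name("08793"))  # Tiger Woods
```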

Next, we’re given a few stats on the hole for the day:


How to Build Your Own Blockchain Part 4.2 — Ethereum Proof of Work Difficulty Explained

We’re back at it in the Proof of Work difficulty spectrum, this time going through how Ethereum’s difficulty changes over time. This is part 4.2 of the part 4 series, where part 4.1 was about Bitcoin’s PoW difficulty, and the following 4.3 will be about jbc’s PoW difficulty.

TL;DR

To calculate the difficulty for the next Ethereum block, you calculate the time it took to mine the previous block, and if that time difference was greater than the goal time, then the difficulty goes down to make mining the next block quicker. If it was less than the goal time, then the difficulty goes up to make mining the next block slower.

There are three parts to determining the new difficulty: offset, which determines the standard amount of change from one difficulty to the next; sign, which determines if the difficulty should go up or down; and bomb, which adds on extra difficulty depending on the block’s number.

These numbers are calculated slightly differently for the different forks, Frontier, Homestead, and Metropolis, but the overall formula for calculating the next difficulty is

target = parent.difficulty + (offset * sign) + bomb
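To make the three parts concrete, here is a sketch of that formula using the Homestead rules as I understand them from EIP-2. The function and argument names are mine, the numbers in the example are made up, and the sketch leaves out the minimum difficulty floor. Note that in Homestead the sign can be larger than 1 in magnitude, so it also scales the change.

```python
def homestead_difficulty(parent_difficulty, parent_timestamp, timestamp, number):
    """Sketch of the Homestead difficulty for a new block (assumed names)."""
    # offset: the standard step size, 1/2048th of the parent's difficulty.
    offset = parent_difficulty // 2048
    # sign: positive for fast blocks (under 10 seconds), zero or negative
    # for slower ones, and never lower than -99.
    sign = max(1 - (timestamp - parent_timestamp) // 10, -99)
    # bomb: doubles every 100,000 blocks, and is 0 before block 200,000.
    period = number // 100000
    bomb = 2 ** (period - 2) if period >= 2 else 0
    return parent_difficulty + offset * sign + bomb


# Made-up numbers: a block mined 5 seconds after its parent goes up,
# one mined 25 seconds after goes down.
print(homestead_difficulty(2048000, 1000, 1005, 150000))  # 2049000
print(homestead_difficulty(2048000, 1000, 1025, 150000))  # 2047000
```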

Pre notes

For the following code examples, this will be the class of the block.

class Block():
  def __init__(self, number, timestamp, difficulty, uncles=None):
    self.number = number
    self.timestamp = timestamp
    self.difficulty = difficulty
    self.uncles = uncles

The data I use to show the code is correct was grabbed from Etherscan.


How to Build Your Own Blockchain Part 4.1 — Bitcoin Proof of Work Difficulty Explained

If you’re wondering why this is part 4.1 instead of part 4, and why I’m not talking about continuing to build the local jbc, it’s because explaining Bitcoin’s Proof of Work difficulty at a somewhat lower level takes a lot of space. So unlike what this title says, this post in part 4 is not how to build a blockchain. It’s about how an existing blockchain is built.

My main goal of the part 4 post was to have one section on the Bitcoin PoW, the next on Ethereum’s PoW, and finally talk about how jbc is going to run and validate proof of work. After writing all of part 1 to explain Bitcoin’s PoW difficulty, I realized it wasn’t going to fit in a single section. People, me included, tend to get bored in the middle of reading a long post and don’t finish.

So part 4.1 will be going through Bitcoin’s PoW difficulty calculations. Part 4.2 will be going through Ethereum’s PoW calculations. And then part 4.3 will be me deciding how I want the jbc PoW to be as well as doing time calculations to see how long the mining will take.

The sections of this post are:

  1. Calculate Target from Bits
  2. Determining if a Hash is less than the Target
  3. Calculating Difficulty
  4. How and when block difficulty is updated
  5. Full code
  6. Final Questions

TL;DR

The overall term of difficulty refers to how much work has to be done for a node to find a hash that is smaller than the target. There is one value stored in a block that talks about difficulty — bits. In order to calculate the target value that the hash, when converted to a hex value, has to be less than, we use the bits field and run it through an equation that returns the target. We then use the target to calculate difficulty, where difficulty is only a number for a human to understand how difficult the proof of work is for that block.
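As a preview, here is a minimal sketch of those two calculations using the standard compact bits encoding. The function names are mine, and the example bits value 0x1b0404cb is the commonly cited one from the Bitcoin wiki, not the block used later in this post.

```python
def bits_to_target(bits):
    """Expand the compact 4-byte bits field into the full target."""
    exponent = bits >> 24          # first byte: length of the target in bytes
    coefficient = bits & 0xFFFFFF  # last three bytes: the leading digits
    return coefficient * 2 ** (8 * (exponent - 3))


# The easiest target allowed, which is defined as difficulty 1.
MAX_TARGET = bits_to_target(0x1D00FFFF)


def target_to_difficulty(target):
    """How many times harder this target is than the easiest one."""
    return MAX_TARGET / target


def hash_is_valid(block_hash_hex, target):
    """A hash is valid when, read as a number, it is below the target."""
    return int(block_hash_hex, 16) < target


target = bits_to_target(0x1B0404CB)
print(hex(target))
print(target_to_difficulty(target))  # about 16307.42
```

The smaller the target, the fewer hashes fall under it, so the difficulty number goes up.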

If you read on, I go through how the blockchain determines what target number the mined block’s hash needs to be less than to be valid, and how that target is calculated.

Calculate Target from Bits

In order to go through Bitcoin’s PoW, I need to use the values on actual blocks and explain the calculations, so a reader can verify all this code themselves. To start, I’m going to grab a random block number to work with and go through the calculations using that.

>>> import random
>>> random.randint(0, 493928)
111388

Block number 111388 it is! Back in time to March of 2011 we go.
