Python, Postgres, SQLAlchemy, and PGA Tour Stats

A little ago, I wrote an article about scraping a bunch of PGA Tour stats. The end result of that was writing those stats out into CSV files. While this was suitable for that task of gathering the stats, let’s face it, you’re probably going to want to put those into some database to allow for easier querying, or possibly integrate it into to web app in the future. There are a bunch of different reasons for wanting this, so I’m going to go through the process I took to put all the data in the CSV files into the database.

Adding players to the database

First step is to fire up postgres! I’m not going to cover starting postgres since there’s so much good content about it, for example, this super good tutorial here by Digital Ocean. I created at database called ‘pgatour’, created a user named ‘pgatour_user’ with password ‘pgatour_user_password’, logged in, and created the first table, Player.

pgatour=# create table players (
  id serial PRIMARY KEY,
  name varchar (255) NOT NULL
);

Ok, now, as a test, I’m going to add myself into the database from the psql command line.

pgatour=# insert into players (name) values ('Jack Schultz')

Note that since id is serial, we don’t need to insert that value, just the name. Alas, I am not on the PGA Tour, so I’m going to need to delete myself.

pgatour=# select * from players;
id | name
----+--------------
1 | Jack Schultz
(1 row)

pgatour=# delete from players where name = 'Jack Schultz';
DELETE 1
pgatour=# select * from players;
id | name
----+------
(0 rows)

Looking good. Now onto the Python side.

Python SQL Alchemy Interface

Now that the Players table in the database is set up, we’re going to want to be able to modify the contents of it in Python.

To do this, I’m going to use SQL Alchemy ORM here. It’s pretty standard ORM for Python meaning nobody is going to turn their head if you choose this. Just a warning, it’s kind of ugly.

Side note, I use Rails for work for the past 2 years, and I absolutely love ActiveRecord, the built in driver Rails has. It’s incredibly nice and intuitive, and lets you interface with data without ever having to know about databases or the underlying queries. Anyway, back to Python.

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String

engine = create_engine('postgresql://pgatour_user:pgatour_user_password@localhost:5432/pgatour') #note that the pgatour_user has password pgatour_user_password
Session = sessionmaker(bind=engine)
session = Session()

Base = declarative_base()

class Player(Base):
 __tablename__ = 'players'
 name = Column('name', String, primary_key=True)
p = Player(name='Jack Schultz')
session.add(p)
session.commit()
session.close() #for good measure

Oof, talk about verbose. If you go back to your psql command line, you should see:

pgatour=> select * from players;
id | name
----+--------------
2 | Jack Schultz
(1 row)

As I said before, I’m not on the PGA Tour, and we want to delete my row from the db. I’m omitting all the imports and the class definition of player.

session.query(Player).filter_by(name='Jack Schultz').delete()
session.commit()

#or if I wanted to blow everyone away

session.query(Player).delete()
session.commit()

You can go back to your psql session and select * and verify that I’m no longer on the PGA Tour. 😦

Adding actual PGA Tour players

Time to make a move and actually add players from those csv files that I have. Goal for this is to loop through each sub folder in my stats folder, and then each csv file that associates with a year. For each player, check if that name has been added, and if it hasn’t, create the player. Code can do the talking here.

import os
import csv

players = set()
def add_players_from_file(filepath):
  with open(filepath, 'rb') as csvfile:
  reader = csv.reader(csvfile)
  for row in reader:
    player_name = row[2]
    #some players have an astrisk at the end of their name
    #want to remove this for player insertion
    if len(player_name) > 0 and player_name[-1] == "*":
      player_name = player_name[0:-2]
      players.add(player_name)

for dirs, files in os.walk('stats_csv'):
  for dir in dirs:
    for files in os.walk("stats_csv/%s" % dir):
      for file in files:
        filepath = "stats_csv/%s/%s" % (dir, file)
        add_players_from_file(filepath)

for player_name in players:
  if session.query(Player).filter_by(name=player_name).count() == 0:
    p = Player(name=player_name)
    session.add(p)

session.commit()

And selecting count(*) from players, we see that there are 6876 players now in the database! Boom.

Adding Stats

Now that I have the players in the db, I want to add the stat objects, that in this case are just the names of the directories in that os.walk(‘stats_csv’) command. So this should be simple as well. Remember to create the table in the database first as well! Again, in this code snippet, I’m omitting the imports.

for subdir, dirs, files in os.walk('stats_csv'):
  for dir in dirs:
    if session.query(Stat).filter_by(name=dir).count() == 0:
      s = Stat(name=dir)
      session.add(s)

session.commit()

And now we have 456 stats in the db. Making progress! In case you’re wondering, I’m writing the code and this post as I go. I figure it gives the wording a more real flavor, and it also allows me to note any issues I had to people trying to follow along.

Ok, now for the stat lines

Stat Lines … and time to do some thinking

If you take a look at a couple of the stat CSV rows, you’ll see something like this.

19,T18,Brendon de Jonge,35,11.11,389,35

T7,T10,William McGirt,34,"1' 11""",320,618.917

First off, shoutout to BdJ and how poorly he’s been performing on my fantasy golf team this year! Second, note that all these stats have different values. Some are floats, some are integers, some are feet and inches that need to get converted to something before going into the db as probably an integer. For now though, I’m going to just add a column to the stat line object for the raw string value. I’ll deal with the conversion later.

pgatour=# create table stat_lines (
  id serial PRIMARY KEY,
  player_id INT references PLAYERS(ID),
  stat_id INT references STATS(ID),
  raw varchar(255) NOT NULL,
  event INT
);

Now for the Python side. Since we’re adding foreign keys to the existing SQL Alchemy defined tables, I’ll show the complete configuration of models.

class Player(Base):
  __tablename__ = 'players'
  id = Column('id', Integer, primary_key=True)
  name = Column('name', String)
  stat_lines = relationship("StatLine")

class Stat(Base):
  __tablename__ = 'stats'
  id = Column('id', Integer, primary_key=True)
  name = Column('name', String)
  stat_lines = relationship("StatLine")

class StatLine(Base):
  __tablename__ = 'stat_lines'
  id = Column('id', Integer, primary_key=True)
  player_id = Column('player_id', Integer, ForeignKey("players.id"))
  player = relationship('Player')
  stat_id = Column('stat_id', Integer, ForeignKey("stats.id"))
  stat = relationship('Stat')
  raw = Column('raw', String)
  events = Column('events', Integer)
  year = Column('year', Integer)

And now for the database seeding code. When you’re running this code, you can follow along and select count(*) from stat_lines; from a db console and watch the number of stat lines grow! But notice that it will update in chunks, because that’s only when you call session.commit().

def acknowledge_or_create_stat_line(row, stat, year):
  player_name = row[2]
  if len(player_name) > 0 and player_name[-1] == "*":
    player_name = player_name[0:-2]
  player = session.query(Player).filter_by(name=player_name).first()
  stat_line = session.query(StatLine).filter_by(player=player, stat=stat, year=year).first()
  if not stat_line:
    try:
      events = int(row[3])
    except ValueError:
      events = 0
    raw = row[4]
    stat_line = StatLine(player=player, stat=stat, year=year, events=events, raw=raw)
    session.add(stat_line)

def process_file(filename, stat, year):
  with open(filename, 'rb') as csvfile:
    reader = csv.reader(csvfile)
    next(reader) #headers
    for row in reader:
      if len(row) >= 5:
        acknowledge_or_create_stat_line(row, stat, year)
        session.commit()
  return filename

for subdir, dirs, files in os.walk('stats_csv'):
  for dir in dirs:
    stat = session.query(Stat).filter_by(name=dir).first()
    for subdir, dirs, files in os.walk("stats_csv/%s" % dir):
      for file in files:
        year = int(file[0:-4]) #chopping off the csv
        filepath = "stats_csv/%s/%s" % (dir, file)
        process_file(filepath, stat, year)

And there you have it! And by “there you have it!” I mean, “there you have it if you want to wait like a day and a half to loop through every file and add the info to the database.” Which is something I don’t want to do. Parallel processing time! Using the multiprocessing library, and a couple modifications to the above code, I got the db seeding to go a little faster.

def acknowledge_or_create_stat_line(data, stat, year):
  for row in data:
    if len(row) >= 5:
      player_name = row[2]
      if len(player_name) > 0 and player_name[-1] == "*":
        player_name = player_name[0:-2]
      player = session.query(Player).filter_by(name=player_name).first()
      stat_line = session.query(StatLine).filter_by(player=player, stat=stat, year=year).first()
      if not stat_line:
        try:
          events = int(row[3])
        except ValueError:
          events = 0
        raw = row[4]
        stat_line = StatLine(player=player, stat=stat, year=year, events=events, raw=raw)
        session.add(stat_line)

def process_file(filename, stat, year):
  with open(filename, 'rb') as csvfile:
    reader = csv.reader(csvfile)
    next(reader) #headers
    stat_count = session.query(StatLine).filter_by(stat=stat, year=year).count()
    data = list(reader) #only do this because I know reader is about 200. Bigger data sets can have issues!
    file_stat_count = len(data)
    print "%s, stat_count: %s, file_stat_count: %s" % (filename, stat_count, file_stat_count)
    if stat_count != file_stat_count:
      acknowledge_or_create_stat_line(data, stat, year)
      session.commit()
  return filename

from multiprocessing import Pool
pool = Pool()

for subdir, dirs, files in os.walk('stats_csv'):
  for dir in dirs:
    stat = session.query(Stat).filter_by(name=dir).first()
    for subdir, dirs, files in os.walk("stats_csv/%s" % dir):
      for file in files:
        year = int(file[0:-4]) #chopping off the csv
        filepath = "stats_csv/%s/%s" % (dir, file)
        pool.apply_async(process_file, [filepath, stat, year])

pool.close()
pool.join()

When running this script, you can check out Activity Monitor and see the many Python and Postgres processes chugging away. You’ll also notice how everything else on your computer slows to a halt.

While you’re waiting, I suggest you just plug your power cord in, set your computer to never shut off when connected to that power, and go play golf! Or check out the golf commentary I write over at Golf on the Mind. Processing will be done when you get back.

Final Verification

After a day or so of running, I ended up with 1,840,460 stat_lines in the database. Unfortunately, I only have the raw string in the db, rather than the int or float depending on the type of stat. Not ideal if you’re trying to do math on the stats, what we really want is to be able to query on those fields.

As a quick check, let’s graph Phil Mickelson’s driving distance over the span of his career.

phil = session.query(Player).filter_by(name='Phil Mickelson').first()
stat = session.query(Stat).filter_by(name='Driving Distance').first()
stat_lines = session.query(StatLine).filter_by(player=phil, stat=stat).order_by("year")
for stat_line in stat_lines:
  print "%s: %s" % (stat_line.year, stat_line.raw)

And I get as output:

1993: 269.2
1994: 273.7
1995: 270.7
1996: 280.4
1997: 284.1
1998: 283.4
1999: 285.7
2000: 288.7
2001: 293.9
2002: 288.8
2003: 306.0
2004: 295.4
2005: 300.0
2006: 300.7
2007: 298.1
2008: 295.7
2009: 300.1
2010: 299.1
2011: 299.8
2012: 294.4
2013: 287.9
2014: 299.3
2015: 299.3
2016: 299.3

Phil either got super jacked later in his career, or that’s just how good the equipment got, but the data looks correct. Another thing to note is how Phil has the same average his last three years on tour. Why is that? Because the data on pgatour.com is incorrect for a few of these stats. Gotta get in touch with them to see what the deal is.

What’s next?

ORM comparison! Never thought you’d see an exclamation point after that sentence did you? I figure next time, when I deal with parsing those strings into their respective values, I can evaluate the different Python ORMs at the same time.

You all should follow on twitter @jack_schultz if you want more. Also, if you have data needs for any reason, get in contact!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s