Tuesday, July 27, 2010

A first look at NFL PBP data

I will use this blog to present, discuss, and archive some of my thoughts on research into sports, particularly baseball, basketball, and football.

Thanks to Brian Burke of Advanced NFL Stats, we now have freely available play-by-play data for NFL seasons 2002-2009. I've taken the 2008 data set and added columns that capture other characteristics of each play: pass/rush plays, fumbles, types of penalties, types of scores, even run direction, pass location, and the intended receiver on complete/incomplete/intercepted passes. I was able to do this thanks to the valuable comments left by contributors at Burke's website.

Here are the original columns in the 2008 spreadsheet:

gameid (example: 20090201_PIT@ARI)
qtr
min (minutes left of regulation, so counts down from 60)
sec
off (who's on offense)
def (who's on defense)
down
togo (yards to go)
description (description of the play)
offscore
defscore (offscore and defscore switch after a change of possession)
season
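
For anyone who would rather script this than work in Excel, here is a minimal Python sketch of how the gameid and clock columns could be unpacked. It assumes gameid always has the form date_AWAY@HOME (reading the team before the @ as the visitor is my assumption, not something the data set documents) and that sec holds the seconds within the current minute. The function names are made up for illustration.

```python
from datetime import date, datetime


def parse_gameid(gameid):
    """Split a gameid like '20090201_PIT@ARI' into (game date, away, home).

    Assumption: the team before '@' is the visitor and the team after it
    is the home side.
    """
    day, matchup = gameid.split("_", 1)
    away, home = matchup.split("@", 1)
    return datetime.strptime(day, "%Y%m%d").date(), away, home


def seconds_remaining(minute, sec):
    """Combine the min and sec columns into seconds left in regulation.

    Assumption: min counts down from 60 and sec is the seconds part of
    the same clock reading.
    """
    return minute * 60 + sec


print(parse_gameid("20090201_PIT@ARI"))
print(seconds_remaining(30, 18))
```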

description is the key attribute here: Excel formulas can pull all sorts of information out of it. To see how much information description contains, check out this example:

(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN. Super Bowl Record longest interception return yards. Penalty on ARZ-E.Brown Face Mask (15 Yards) declined. The Replay Assistant challenged the runner broke the plane ruling and the play was Upheld.
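
To give a feel for what can be pulled out of a line like this, here is a rough Python sketch using regular expressions. It is not the set of Excel formulas used for the spreadsheet, only an illustration of the same idea. The patterns assume names are written as Initial.Lastname and that pass locations read "short" or "deep" followed by "left", "middle", or "right", which is what the example shows; real descriptions will have variations these patterns miss. The dictionary keys are my own labels.

```python
import re

# Assumed name format: initial(s), a dot, then a capitalized last name.
NAME = r"[A-Z][A-Za-z]*\.[A-Z][A-Za-z'\-]+"

PASS_RE = re.compile(
    rf"(?P<passer>{NAME}) pass (?:incomplete )?"
    r"(?P<location>(?:short|deep) (?:left|middle|right))"
)
INTENDED_RE = re.compile(rf"intended for ({NAME})")
INTERCEPT_RE = re.compile(rf"INTERCEPTED by ({NAME})")


def _first_group(pattern, text):
    """Return the first captured group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


def parse_pass(description):
    """Pull passer, location, and receiver details out of a pass play."""
    match = PASS_RE.search(description)
    return {
        "passer": match.group("passer") if match else None,
        "pass_location": match.group("location") if match else None,
        "intended_receiver": _first_group(INTENDED_RE, description),
        "intercepted_by": _first_group(INTERCEPT_RE, description),
    }


example = (
    "(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin "
    "INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN."
)
print(parse_pass(example))
```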

Anyway, I've added columns to the 2008 spreadsheet. Here are the attributes I extracted from description, using the formulas that were posted in the comments along with some of my own and some that I modified:

play type (example: pass)
play subtype (example: incomplete)
play call (example: rush, when play type is fumble)
yards gained
fumble?
fumble result (either fum Recov or fum Lost)
penalty?
penalty type
penalty decision
challenge
challenge decision
nullified TD (if TD was reversed because of challenge)
clean description (without formations, time left, etc.)
description w/o reversed plays
score type
passer/runner (who threw the ball or who rushed the ball)
run direction (right end, right tackle, etc.)
pass location (deep middle, deep left, etc.)
pass complete to
intended receiver on incomplete pass
intended receiver on interception
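
As an illustration of the "clean description" attribute above, here is a small Python sketch that strips the leading parenthesized tags (clock, formation) from a description and returns them separately. It assumes that every leading group in parentheses is a tag of that kind, which holds for the example shown earlier but is not verified against the full data set. The function name is hypothetical.

```python
import re

# One or more parenthesized groups at the very start of the description.
LEADING_TAGS = re.compile(r"^(?:\s*\([^)]*\))+\s*")
TAG = re.compile(r"\(([^)]*)\)")


def split_prefix(description):
    """Return (list of leading tags, description without those tags)."""
    match = LEADING_TAGS.match(description)
    if not match:
        return [], description.strip()
    tags = TAG.findall(match.group(0))
    return tags, description[match.end():].strip()


tags, clean = split_prefix(
    "(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin"
)
print(tags)
print(clean)
```

With the tags separated out, a check such as `"Shotgun" in tags` gives a simple formation flag.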

I haven't cross-checked all of the data to see if the formulas are 100% correct (the PBP data itself might have a few errors, missing plays, incorrect entries, etc.), but I do know that the ones highlighted in red have produced faulty results. Still, with this spreadsheet, you can answer all sorts of questions if you are comfortable with filters and pivot tables (they're very easy to learn). For instance, you can answer simple questions such as how often teams pass vs. rush for any combination of down, yards to go, and yard line. If you dig even deeper, you can group by team as well (the spreadsheet needs to be fixed up a bit first) and see which teams chose to pass on the highest % of plays or chose to rush on the highest % of plays. Run direction and pass location are very useful attributes too; for example, you can work out how run directions were distributed across NFL games in 2008. Maybe we can even find out whether fumbles occur more often on rushes up the middle, or whether fewer fumbles occur in the shotgun formation than in other formations.
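
The pass vs. rush question can also be sketched outside of Excel. The Python below is a minimal stand-in for a pivot table: it groups plays by whatever columns you name and reports the share of pass plays in each group. The rows are made-up toy data, not values from the 2008 spreadsheet, and the key play_type is my own name for the "play type" column.

```python
from collections import defaultdict


def pass_share(plays, keys=("down",)):
    """Share of pass plays among pass + rush plays, grouped by `keys`."""
    counts = defaultdict(lambda: {"pass": 0, "rush": 0})
    for play in plays:
        kind = play.get("play_type")
        if kind not in ("pass", "rush"):
            continue  # skip punts, kicks, penalties, and so on
        group = tuple(play[key] for key in keys)
        counts[group][kind] += 1
    return {
        group: c["pass"] / (c["pass"] + c["rush"])
        for group, c in counts.items()
    }


# Toy rows for illustration only.
plays = [
    {"off": "PIT", "down": 1, "togo": 10, "play_type": "rush"},
    {"off": "PIT", "down": 3, "togo": 8, "play_type": "pass"},
    {"off": "ARI", "down": 1, "togo": 10, "play_type": "pass"},
    {"off": "ARI", "down": 3, "togo": 2, "play_type": "rush"},
    {"off": "ARI", "down": 3, "togo": 7, "play_type": "pass"},
    {"off": "ARI", "down": 4, "togo": 5, "play_type": "punt"},
]

print(pass_share(plays))                  # grouped by down
print(pass_share(plays, keys=("off",)))   # grouped by offense
```

Passing `keys=("down", "togo")` gives the finer breakdown by down and distance, at the cost of smaller groups.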

I played around with pivot tables and filtering in Excel, and came up with some preliminary graphs. Nothing substantial or revolutionary here, but definitely interesting to look at and a good first step toward digging further into this data. In my next post, I'll take a look at some interesting graphs from aggregating the data (of course, only for the 2008 season).
