I’ve been playing around with 2010 Gameday data all night (courtesy of Niv, thanks as always!). I downloaded the RMySQL package for R to help run the show, but unfortunately the only documentation I can find is this. I’m sure it’s got everything I need, but for some reason it’s all arranged in alphabetical order. So as someone just getting started, I have to guess which command is relevant to where I am in the process (connecting, importing data, running a query, saving output) and go from there. Certainly doable, but not exactly easy.
For a bit of a break, I figured I’d jump over here to talk about what I’m up to. The first step in the process is calculating an AVG and SLG for each spot on the field by batted ball type. Using the XY data from Gameday and these park adjustments (I’m starting with the 2008 figures from here, even though I’m using 2008-2010 data, and will tweak them as necessary as I become more familiar with the data), I’m coming up with a complete list of all balls hit into play and the rate at which each possible outcome occurs for that location on the field.
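To make that first step concrete, here is a minimal sketch of the per-spot tally, written in Python purely for illustration (the same logic ports to R). Everything named here is an assumption rather than anything from Gameday itself: the `(x, y, bb_type, outcome)` tuple layout, the outcome labels, and the idea of defining a "spot" as a square grid cell of width `cell_size`. Park adjustments are left out entirely, and every ball in play is treated as an at-bat.

```python
from collections import defaultdict

# Total bases per outcome (hypothetical labels); anything else counts as 0 bases.
TOTAL_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def rates_by_location(balls, cell_size=10.0):
    """Return {(cell_x, cell_y, bb_type): (n, avg, slg)}.

    balls     -- iterable of (x, y, bb_type, outcome) tuples
    cell_size -- width of a square grid cell, in the same units as x and y
    """
    # Per key: [balls in play, hits, total bases]
    counts = defaultdict(lambda: [0, 0, 0])
    for x, y, bb_type, outcome in balls:
        key = (int(x // cell_size), int(y // cell_size), bb_type)
        bases = TOTAL_BASES.get(outcome, 0)
        tally = counts[key]
        tally[0] += 1
        tally[1] += 1 if bases > 0 else 0
        tally[2] += bases
    # AVG = hits / balls in play, SLG = total bases / balls in play
    return {k: (n, h / n, tb / n) for k, (n, h, tb) in counts.items()}


# Made-up sample data, only to show the shape of the input and output.
sample = [
    (101, 152, "line_drive", "single"),
    (105, 158, "line_drive", "out"),
    (103, 151, "line_drive", "double"),
    (108, 155, "line_drive", "out"),
    (50, 60, "ground_ball", "out"),
]
rates = rates_by_location(sample)
```

Each value in `rates` is a `(count, AVG, SLG)` tuple for one spot and batted ball type. Keeping the count around matters, since a cell with only a handful of balls says very little on its own.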
Once I get all that in, I’ll have to run a loop to smooth out the data: for each spot/type, I’ll look at all the balls hit within an X-foot radius (weighting balls on the periphery of that area less than ones at the center) and calculate a weighted average of all of the relevant stats.
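The smoothing step might look something like the sketch below, again in Python just for illustration. The weighting scheme is an assumption: a linear falloff from 1 at the center to 0 at the edge of the radius. Any kernel that shrinks with distance would fit the description equally well. The function name and the `(x, y, value)` layout are hypothetical, and the list is assumed to be already filtered to a single batted ball type.

```python
import math


def smoothed_value(cx, cy, points, radius):
    """Distance-weighted average of the values near (cx, cy).

    points -- iterable of (x, y, value), e.g. value = total bases for one ball
    radius -- balls at or beyond this distance from the center are ignored
    Returns None when no ball falls inside the radius.
    """
    weight_sum = 0.0
    value_sum = 0.0
    for x, y, value in points:
        distance = math.hypot(x - cx, y - cy)
        if distance >= radius:
            continue
        # Assumed kernel: weight 1 at the center, falling linearly to 0 at the edge.
        weight = 1.0 - distance / radius
        weight_sum += weight
        value_sum += weight * value
    return value_sum / weight_sum if weight_sum > 0 else None


# Made-up example: two balls inside a 10-unit radius, one well outside it.
nearby = [(0, 0, 1.0), (5, 0, 0.0), (20, 0, 4.0)]
smoothed = smoothed_value(0, 0, nearby, radius=10)
```

Passing total bases as the value gives a smoothed SLG for that spot, and passing a 1/0 hit indicator gives a smoothed AVG. Looping this over every spot and type is the slow, brute-force version; it should be fine for a first pass.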
Now, there are a million and a half reasons why this isn’t optimal (data quality, adjustment factors, player positioning, the way different parks play, etc.), but it’s a reasonable start. This is the best data I have available for now, and I think at the very least I can use it to create some fun maps in R (something I’ve been wanting to do for some time now).