I have a really big post coming, but I’m driving to Cleveland tonight and there’s no chance I’ll be able to finish it. So here is a little digression from the middle of it all that pretty much stands on its own. I hope to post the rest sometime this weekend.
Chad wondered: can we look at luck on individual plays or does it only make sense in aggregate? We discussed this over brunch for a while (our fiancées were thrilled), and I think the answer is yes, you can look at individual plays. In my original post, I didn’t specify what the odds were that the shortstop got to the ball, but in my head I was imagining a play that would be a hit maybe 90%-95% of the time. Now is that 5%-10% unlucky or just unlikely but inevitable? How about this play? That’s a home run 99.999% of the time, and I think we can all agree the guy who hit that was unlucky.
After a lot of thought, I realized that any cutoff is an arbitrary cutoff and in reality we should only be talking about degrees of luck rather than bucketing everything into categories. From a batter’s perspective, every hit is lucky and every out is unlucky. It’s a bit like win probability added insofar as every hit increases the chances of a win and every (nonproductive) out decreases the chances of a win.
A ball in play which is a hit 50% of the time that is turned into an out is unlucky, and that same ball when it is a hit is equally lucky. A ball which is a hit 99% of the time that actually results in a hit is lucky, but just a little tiny bit lucky. A ball which is a hit 99% of the time but is turned into an out is about as unlucky as you could be. Getting 6 hits on 10 balls in play with an expected hit total of 3.6 is lucky, but not as lucky as getting 7 or 8 or 9 or 10 hits on those same balls; 5 hits in that situation is still somewhat lucky, but relatively less lucky than what actually happened. And if you’ve got a 30% chance of a ball in play turning into a hit and it does, that’s equivalently lucky to getting two hits on balls with a 54.8% chance each of turning into hits or getting three hits on balls with a 66.9% chance each of turning into hits (using the binomial distribution: 30% ≈ 54.8% * 54.8% ≈ 66.9% * 66.9% * 66.9%).
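Here is a quick sketch of that arithmetic in Python. The function names are my own, and the second function assumes (purely for illustration) that all 10 balls in play share the same 36% hit probability, which is only one of many ways to get an expected total of 3.6.

```python
from math import comb


def equivalent_probability(p_single: float, n_balls: int) -> float:
    """Per-ball hit probability such that n_balls hits in a row are
    exactly as likely as one hit with probability p_single."""
    return p_single ** (1.0 / n_balls)


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial probability of k or more hits on n balls in play,
    assuming every ball has the same hit probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # One hit at 30% versus two or three hits in a row.
    for n in (2, 3):
        print(n, round(equivalent_probability(0.30, n), 3))

    # Illustrative assumption: 10 balls in play at 36% each (expected 3.6 hits).
    for k in range(5, 11):
        print(k, prob_at_least(k, 10, 0.36))
```

The smaller the tail probability for a given hit total, the luckier that total is, which is the "degrees of luck" idea rather than a yes/no cutoff.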
I had promised to have something by today. This is something. But the real something will be up soon, promise!
Reply to Niv’s comment at some point. Some point meaning before Wednesday. Sleep now. Chicago this weekend.
I don’t have much left in me tonight, so not much thought involved in this one. Niv and I are heading to Phoenix in early November, and that’s obviously a big driving force to get some stuff done. In time for the trip, I would like to accomplish the following:
- Register the chosen domain name (this is really like step 0 and probably should have been done already)
- Have a website template up for what I want the site to look like, with probably one hitter and one pitcher as examples (so that I don’t have to worry too much yet about how to do all the back end stuff)
- Have a very good understanding of what I can do with the data
- Be reasonably close to having all of the necessary equations and statistics down
If I can accomplish all that in time I think it will be a pretty successful trip. No need to have a finished product, but I at least need to be 90% of the way there in my mind and have some sort of a product available to help explain everything.
One of my econ professors from Northwestern, Jeff Ely (Econ 380-1 Winter quarter 2003 – I have no idea how I remembered the course number but I do), has a pretty fun blog that I highly recommend if you’re into that kind of thing. Strangely enough, this post from earlier today is very much in the spirit of what I’ve been thinking about recently.
Now there is certainly some role for luck in golf, and it could potentially even be calculated: it would be “lucky” for a tournament winner if his competitors finished with worse scores than they would have been expected to finish with (taking into account weather, injury, etc.) entering the tournament.
However, on an individual level I cannot see there being a great deal of luck over the course of an entire tournament. Luck would involve shots ending up in better position than would be expected given the velocity/spin/trajectory which the golfer put on the ball. Examples of this would be getting a perfect lie when hitting it into the rough, hitting the pin instead of racing past the hole, or bouncing the ball over a bunker.
Luck is certainly not the residual of some estimate of overall skill. Over the course of a tournament, some players are going to outperform their ability while others will underperform – that’s just what happens over a 72 hole sample of golf. But assuming no flukey balls or silly penalties for not realizing you’re in a bunker, it’s pretty likely that the guy who wins the tournament actually played the best golf over the course of the tournament.
So is drawing a favorable sampling distribution luck? I don’t think so – you get the outcome you deserve given your own personal contributions over that time period. To bring this back to baseball, I’d say that a Yuniesky Betancourt line drive up the middle for a single is not luck – it’s just a low-probability outcome of him showing considerable skill. If the shortstop makes a diving stop and throws him out on the same ball that’s in fact bad luck for Betancourt, despite the very low ex-ante probability of him getting a hit in the first place.
I’ll certainly take a look through that paper at some point, but just at first glance I think I have to agree with my trusted professor (and not just because he gave me an A).
This isn’t a fun week – in a contracts class from 8-4 every day and then it’s into the office for another few hours to do a ton of work that got thrown at me because the people who are really responsible for it don’t know how to work with huge data sets. So there won’t be as much baseball work to get done this week as I would want, but I still have to keep up the blog posts! And obviously sitting in a classroom for 8 hours will give my mind ample time to wander and contemplate some things for this project.
For instance, today I found myself thinking more about what a true talent level actually means – basically an extension of the second paragraph of this post. As I said there, players have many different talents and trying to collect all of that information in a single statistic is going to be problematic. In what proportions do you mix these different talents? You could just assume that all hitters face pitchers with totally average pitches that they throw at the same rate as the major league average and have handedness in line with the actual breakdown in MLB by innings pitched. But is that meaningful? A left-handed hitter who struggles against lefties may face a ton of LOOGYs and therefore face a higher percentage of lefties than normal – but then again he might get lifted for a pinch hitter in that situation or sit more often with a lefty on the mound and therefore face fewer than normal.
Regardless, there is no doubt that I’d rather be thinking about this stuff than the lifestyle of a Federal contract. Which is pretty good motivation to keep thinking about it even further.
Pitch-f/x is the awesome system that MLB uses to document every single statistic about the movement of a particular pitch. However, it took quite some time to find one centralized location that provides solid documentation for all of the fields provided by the system (including ones that have been added in recent years). So for the benefit of myself and others, here it is:
- type: ball/strike/ball in play
- id: pitch ID for within that particular game
- x: horizontal location at which the ball crosses the plate in pixels
- y: vertical location at which the ball crosses the plate in pixels
- des: outcome of the pitch (18 total)
- sv_id: date and time of pitch, in YYMMDD_hhmmss
- start_speed: speed of pitch as it is 50 feet from home plate in mph
- end_speed: speed of pitch as it crosses the plate in mph
- sz_top: top of the strike zone for the particular batter in feet
- sz_bot: bottom of the strike zone for the particular batter in feet
- pfx_x: horizontal break of the pitch in inches
- pfx_z: vertical break of the pitch in inches
- px: horizontal location at which the ball crosses the plate in feet
- pz: vertical location at which the ball crosses the plate in feet
- x0: horizontal release point in feet
- y0: release point on C-2B axis (this is fixed at 50 feet)
- z0: vertical release point in feet
- vx0: horizontal velocity in feet/second
- vy0: velocity in the home plate-second base plane in feet/second (there has to be a better term for this)
- vz0: vertical velocity in feet/second
- ax: horizontal acceleration in feet/second/second
- ay: acceleration in the home plate-second base plane in feet/second/second
- az: vertical acceleration in feet/second/second
- break_y: distance from home plate in feet at which the ball deviates most from a straight line path from release point to the point at which the ball crosses the plate
- break_angle: angle at which the ball breaks – from the catcher’s perspective, a break angle of 0 indicates no horizontal break, a positive angle indicates a ball that breaks from right to left (away from left-handed hitters) while a negative angle indicates a ball that breaks from left to right (into left-handed hitters)
- break_length: largest deviation from a straight line between the release point and the point at which the ball crosses the plate in inches
- pitch_type: type of pitch thrown
- type_confidence: confidence in the pitch_type classification by MLB on a scale from 0 to 1.5
- zone: pitch location in one of 14 zones around home plate as defined by MLB
- nasty: MLB-defined rating of quality of pitch, from 1 to 100
- spin_dir: from the catcher’s perspective, the angle (from 0 to 359.999) between the pole around which the ball is rotating and the positive x-axis; a ball with a spin_dir of 0 has no horizontal movement and some amount of top spin while a ball with a spin_dir of 180 has no horizontal movement and back spin
- spin_rate: revolutions per minute of the ball
The two most helpful sites in compiling this information were from Alan Nathan and Fast Balls.
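To see how the release-point, velocity and acceleration fields fit together, here is a rough sketch that projects a pitch forward to the plate under constant acceleration. Everything in it is my own illustration, not MLB's code: the function name is made up, the sample numbers are invented, and I am assuming that y is measured from the plate toward the mound (so vy0 is negative) and that the plate crossing is taken at the front of home plate, about 1.417 feet.

```python
from math import sqrt

# Assumption: the plate crossing is measured at the front of home plate.
PLATE_FRONT_Y = 1.417  # feet


def plate_crossing(x0, y0, z0, vx0, vy0, vz0, ax, ay, az, y_plate=PLATE_FRONT_Y):
    """Return (flight_time, px, pz) for a constant-acceleration trajectory.

    Solves y0 + vy0*t + 0.5*ay*t^2 = y_plate for the first crossing,
    then evaluates the horizontal and vertical positions at that time.
    """
    if ay == 0:
        t = (y_plate - y0) / vy0
    else:
        disc = vy0**2 - 2.0 * ay * (y0 - y_plate)
        t = (-vy0 - sqrt(disc)) / ay
    px = x0 + vx0 * t + 0.5 * ax * t * t
    pz = z0 + vz0 * t + 0.5 * az * t * t
    return t, px, pz


if __name__ == "__main__":
    # Invented values in the units listed above (feet, feet/second, feet/second/second).
    t, px, pz = plate_crossing(
        x0=-2.0, y0=50.0, z0=6.0,
        vx0=6.0, vy0=-130.0, vz0=-5.0,
        ax=-10.0, ay=25.0, az=-20.0,
    )
    print(f"flight time {t:.3f} s, px {px:.2f} ft, pz {pz:.2f} ft")
```

If those assumptions hold, the px and pz this returns should land close to the px and pz fields reported for the same pitch, which makes for a handy sanity check on the data.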
This post by Dave Cameron at FanGraphs does a pretty good job at pointing out the difficulties that arise when trying to separate talent from results. Trevor Cahill has certainly had a very strong year by “traditional” measures (ERA, WHIP), but in other ways (GB%, K%) has performed worse than even Justin Masterson. I’m not very concerned about whether the Cy Young vote should be based on wins and losses or K-rate or SIERA (although if I did have a vote, it would probably be for Cliff Lee despite witnessing this in person). It’s not the perception of talent that I care about, but rather actual talent. More precisely, the probability that a particular player has talent level X on a given day (see here for a very good background). Using all of the wonderful information available from MLB.com, it is possible to begin stripping away park and luck and quality of opposition and come closer to some distribution of true talent level for that player.
However, of course a player doesn’t have just one talent – a pitcher might have great fastball velocity but no command, or a wonderful change but a curve that gets pounded. On the other end of the pitch, a hitter might crush high and tight fastballs from righties but couldn’t hit a curve from a lefty to save his life (and in the case of switch-hitters, they likely behave like two totally different hitters depending on the side of the plate they are on). My goal is to use what we do know based on what has happened in the past to construct a better measure of player talent.
So we’ve brought back A90F (that’s what all the kids were calling it back in our July 2008 heyday) in order to document our experience trying to make it in the baseball world. There is a lot of interesting stuff going on, and it’s my goal to have some sort of an update every day. Regardless of how things turn out in the end, I’m very excited for this journey and hope to share the experience!