Always tell me the odds: June 2011

Tuesday, June 28, 2011

Babies

So I am getting to the age where a bunch of my friends are having babies. One thing I love to do is turn their pregnancy into a game for me. That is, I love baby pools. Most baby pools require a number of inputs:
1. Gender
2. Date (day and time)
3. Height
4. Weight

Placing is determined by how far off your guesses are in these 4 categories. Without looking at any data, I would guess these are related. For example, if a mother is late, it only makes sense to guess a larger height and weight. Besides guessing on these 4 categories, another piece of information that is just as important, if not the most important, for success in a baby pool is keeping track of what others are guessing. Baby pools are like The Price Is Right: it is definitely to your advantage to guess last.

Guessing the gender without any prior knowledge is essentially like flipping an unfair coin.
They (Wikipedia) say the gender ratio of boys to girls is about 105:100. So, ignoring the guesses of other players in the game, the better bet on gender will always be boy. The real question is: can we make a better prediction based on information about the couple? This is a tricky question, and doing it justice will require some very specific data on the father's family tree. Hopefully I can find something eventually and we can return to this topic. For now I am going to focus on predicting the due date.

So the one website I found that had data on due dates is here. Essentially, based on numerous sources the distribution of pregnancies looks something like this:

It is pretty clear that pregnancy length is not normally distributed, and assuming a usual 40 week due date, about 62% of babies are born on or after their due date. Also, apparently the probability of having a child exactly at 40 weeks is less than 5%. With this plot and the 5% fact in mind, we might expect to see a probability distribution for single days that looks something like this:

Shifting the plot over so that we can think of the plot as days before or after the magic 40 week number we have:

This plot is maybe hard to see, but based on this data it is unlikely the "due date" is even the most probable day for the baby to be born. Weird! Anyway, back to the baby pool. Speaking like an economist, we would have baby pool efficiency if the distribution of guesses looked just like the distribution above. I'm guessing this almost never happens. For example, let's say the distribution of guesses looked something like this:

The approach I would take would be to pick a date with the largest positive difference between the true birth distribution and the baby pool distribution over a short interval. In this case, it looks like guessing about 4-7 days late gives the best chance to win the date portion of the pool.
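Here is a minimal sketch of that approach in Python. Every number in it is made up for illustration: `true_pct` and `guess_pct` are hypothetical stand-ins for the birth distribution and the pool's guesses (in whole percentage points per day), and the function name `best_window` and the 4-day window are my own assumptions, not something taken from real data.

```python
# Hypothetical illustration only: none of these numbers come from real data.
# Days are counted relative to the 40-week due date (negative = early).
days = list(range(-7, 11))

# Assumed share of births on each day, in whole percentage points.
true_pct = [2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 9, 9, 8, 6, 4, 2]

# Assumed share of pool guesses on each day, bunched around the due date.
guess_pct = [1, 2, 3, 5, 7, 9, 11, 14, 11, 9, 8, 5, 4, 4, 3, 3, 1, 0]


def best_window(days, true_pct, guess_pct, width=4):
    """Return (first_day, last_day, edge) for the stretch of `width` days
    where births outweigh guesses by the largest total margin."""
    edge = [t - g for t, g in zip(true_pct, guess_pct)]
    starts = range(len(days) - width + 1)
    best = max(starts, key=lambda i: sum(edge[i:i + width]))
    return days[best], days[best + width - 1], sum(edge[best:best + width])


first_day, last_day, margin = best_window(days, true_pct, guess_pct)
print(f"Best window: day {first_day} to day {last_day} (edge of {margin} points)")
```

With these made-up inputs the window lands on 4 to 7 days late, but that is only because I built the numbers to look like the plots above. Real guesses from a real pool would go in place of `guess_pct`.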

At this point I want to make a formal complaint about the lack of publicly available birth data. It probably exists out there, but key websites that host health and demographic data, like the CDC, Census, and WHO, in my opinion need to get their act together. I can say hands down it is easier to research the most obscure baseball topic than a topic that is central to so many people's lives. So if any of you statisticians, nurses, or public health people out there know of some good data, I would love to hear from you!

Wednesday, June 22, 2011

Twins get 8 hits in a row! Crazy right?

So last night the Twins led off the game with 8 hits in a row. What is the probability of that happening?

The easiest thing to do is take the batting averages of our first 8 players and multiply them together. I am not about to take the easiest route... But I will take the second easiest route! If I average each hitter's batting average with Madison Bumgarner's batting average against (.264 before last night, .284 after last night), I am at least acknowledging that a hitter's chance of success is not independent of who is pitching.

Name	Real Average	Chance of hit against Bumgarner
B. Revere cf	0.269	0.2665
A. Casilla 2b	0.268	0.266
J. Mauer c	0.22	0.242
M. Cuddyer rf	0.28	0.272
D. Young lf	0.254	0.259
D. Valencia 3b	0.221	0.2425
L. Hughes 1b	0.264	0.264
T. Nishioka ss	0.2	0.232

So,

The probability the Twins lead off the game with 8 hits is 0.000018. On the flip side, the probability Madison Bumgarner throws a no-hitter last night is 0.00055.
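For anyone who wants to check the 8-hit number, here is a short Python sketch of the calculation described above: average each hitter's batting average with Bumgarner's .264 batting average against, then multiply the eight results together. The variable and function names are my own, and the sketch assumes the eight at-bats are independent, just like the hand calculation does.

```python
from math import prod

# Bumgarner's batting average against before the game, from the post.
BUMGARNER_AVG_AGAINST = 0.264

# Season batting averages of the first 8 Twins hitters, from the table.
batting_avgs = {
    "B. Revere": 0.269,
    "A. Casilla": 0.268,
    "J. Mauer": 0.220,
    "M. Cuddyer": 0.280,
    "D. Young": 0.254,
    "D. Valencia": 0.221,
    "L. Hughes": 0.264,
    "T. Nishioka": 0.200,
}


def chance_of_hit(avg, pitcher_avg=BUMGARNER_AVG_AGAINST):
    """Simple average of the hitter's average and the pitcher's average against."""
    return (avg + pitcher_avg) / 2


# Treat the 8 at-bats as independent and multiply the chances together.
p_eight_hits = prod(chance_of_hit(avg) for avg in batting_avgs.values())
print(f"P(8 straight hits) = {p_eight_hits:.8f}")  # roughly 0.000018
```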

In other words, before the game started, Madison Bumgarner was over 30 times more likely to throw a no-hitter than the Twins were to start the game with 8 straight hits.

So how rare was this feat? Well, the argument is exactly the same as the three-birthday problem in the Star Tribune last year. If we forget about walks, the average number of real at-bats in a game is about 34. That means each of the first 27 at-bats is followed by at least 7 more at-bats, so each one could be the start of an 8-hit streak. Each team plays 162 games a year, so a team has roughly 27*162 = 4374 at-bats that could be the start of an 8-hit streak. Now, nearly all teams are better at hitting than the Twins, but let's just stick with their probability. That means the expected number of 8-hit streaks for one team in one year is around:
0.00001795009 * 4374 = 0.08

If a team were as bad at hitting as the Twins and faced Madison Bumgarner every game, we would only expect to see it pull off this feat about once every 12.5 years! My guess is that if someone researched this, they would find 8 consecutive hits in a game happens more often than every 12 years, but let's not sell the Twins short... That was a pretty cool accomplishment.
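The back-of-the-envelope arithmetic can be redone in a few lines of Python. This is only a sketch under the same assumptions as above: 27 possible streak starts per game, a 162-game schedule, and the same 8-hit probability for every one of those starts. The constant names are mine.

```python
P_STREAK = 0.00001795009   # chance of 8 straight hits, from the calculation above
STARTS_PER_GAME = 27       # at-bats with at least 7 more at-bats behind them
GAMES_PER_SEASON = 162

# Number of at-bats in one 162-game schedule that could start a streak.
chances = STARTS_PER_GAME * GAMES_PER_SEASON

# Expected number of 8-hit streaks over that schedule.
expected_streaks = P_STREAK * chances

# Average wait, in seasons, between streaks at that rate.
years_between = 1 / expected_streaks

print(f"Possible streak starts: {chances}")
print(f"Expected streaks per season: {expected_streaks:.4f}")
print(f"Seasons between streaks: {years_between:.1f}")
```

Without rounding 0.0785 up to 0.08, the wait comes out a bit longer than 12.5 years (closer to 12.7), which doesn't change the story.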

Also, I promise the next blog won't be about baseball.

Tuesday, June 21, 2011

Vector Ball

So some of you may have read the book Moneyball. If you haven't, there is a movie of the book coming out this fall with Brad Pitt and Jonah Hill. The book provides anecdotal evidence that using statistics (not the ones you see on the back of a baseball card) can help in picking commonly overlooked but valuable players. Michael Lewis, the author, shows how the Oakland A's found value in signing players with high on-base percentages and good power numbers. This book was written ten years ago, and a lot has changed in the game since then. Teams don't give up prospects like they used to. More importantly, the market has shifted to value on-base percentage (maybe even overshifted).

My Beef
Since the book came out, everyone and their brother has invented new metrics to evaluate players. I don't pay much attention to these, and I'm sure they work fine for evaluating major and minor league players. The problem is that evaluating professional players is relatively straightforward, especially in the major leagues. Take Adrian Gonzalez, for example: the Red Sox signed him because they knew he had a ton of power to left field. They went after this guy for years, and now that he is there, sure enough, he is crushing a bunch of doubles. Teams know everything about these guys, so being a "sabermetrician", which is what baseball mathematicians/statisticians like to be called, is an easy job in my opinion.

If I were a GM, I would put much more of my resources into finding talent. I admit teams do spend a lot on this already, but they should do more. Once a guy is on your roster, making a decision on a single player is very manageable. You have a good idea how good all the players in your system are and a relatively good idea what they are worth to other teams. What is very hard is getting the best players onto your team through the draft... But how?

I'm glad you asked. The answer:

This is something I have been thinking about for years. Granted it might be as realistic as inventing a time machine, but hear me out.

It is a pretty well-known fact among both baseball people and sabermetricians that line drives are a good thing. The Hardball Times claims 75% of line drives end in hits. In major league games they can track the path of the ball off the bat using cameras and keep a statistic on line-drive percentage (LD%). Here are the line-drive percentages of the Twins players with at least 200 at-bats last year.

Name	LD%
Joe Mauer	24.20%
Justin Morneau	22.00%
Jim Thome	21.40%
Orlando Hudson	20.60%
Jason Kubel	19.20%
Danny Valencia	18.80%
Denard Span	18.00%
Michael Cuddyer	17.30%
J.J. Hardy	16.90%
Delmon Young	15.50%
Nick Punto	15.10%

The way I think about this is: who is most likely to give you a decent at-bat? Joe Mauer, Justin Morneau, and Jim Thome are at the top. Delmon Young and Nick Punto are at the bottom. Delmon Young actually had a pretty good year in terms of traditional statistics, but seriously, how many times have we seen him ground out to the pitcher or fly out to shallow center (usually on the first pitch)?

The Delmon Young example, though, brings up a huge problem in drafting high school and even college players. With the exception of the top few picks, it is impossible for scouts to get many looks at the "draftable" players. They end up relying on misleading statistics (explained in Moneyball). How can they tell the difference between two players with similar numbers (and of course similar speed, arm strength, etc.)? My guess is it is generally a gut feeling based on the few games they watched. What they really need to know is how many line drives the kid hit. He could have had a very lucky or unlucky year in terms of hits falling, but someone who hits a bunch of line drives will end up succeeding in time.

That's why I think all baseballs should have a tracking device in them that can store the speed and direction the ball travels when it leaves the bat. I call this the vector ball. I realize this post is getting pretty long, so maybe I will revisit the value of the vector ball in later posts. Also, I must give credit for the vector ball logo to my talented wife... Pretty cool, huh?

Saturday, June 18, 2011

Welcome

Welcome all to my new blog. I am a Ph.D. student in statistics and wish to talk about issues "everyone" cares about but may have a difficult time discussing in an objective way. When I say issues "everyone" cares about, I mostly mean issues I care about, but I think there will be plenty of overlap. I have an interest in sports, investing, home improvement, Pixar movies, cool gadgets, political races, and wildlife, to name a few. I am planning to find a few interesting topics, research some actual data, and hopefully provide some meaningful commentary.

The name of the blog, "alwaystellmetheodds", may not be familiar to everyone, but if you enjoy the Star Wars movies as much as I do, you know Han Solo's famous quote, "Never tell me the odds." He says this in The Empire Strikes Back as he is being chased by Imperial Star Destroyers and Starfighters and is about to enter an asteroid field. He is warned by C-3PO that "the possibility of successfully navigating an asteroid field is approximately 3,720 to 1." Here is the clip (the famous quote is around the 1:55 mark of the video):

My interpretation is that Han Solo, all-knowing, didn't need to be told "the possibility of successfully navigating an asteroid field is approximately 3,720 to 1." He made his decision to enter the asteroid field because he knew his odds of survival were better entering the asteroid field than staying in the open and being fired on by all the Imperial ships. Really, Han Solo made a calculation and an informed statistical decision to enter the asteroid field.

Anyway, these are the sorts of important topics I will be discussing on my blog.