Always tell me the odds: 2011

Friday, August 19, 2011

QB draft analysis

The 2004 NFL draft produced star quarterbacks Eli Manning, Philip Rivers, and Ben Roethlisberger, all chosen in the top 11 picks. Combined, these three have gone on to 6 Pro Bowls, 3 championships, and a 184-97 record as starters. This sort of success is hard to ignore when trying to build a championship NFL team. In the years following (2005-2007) the quarterbacks taken in the top 10 were Alex Smith, Vince Young, Matt Leinart, and JaMarcus Russell. Pretty much all of them busts. This year a whopping 4 quarterbacks were taken in the top 12 picks. All of them, especially Cam Newton and Christian Ponder, were ranked nowhere near as high on talent as they were drafted. Thus, a hypothesis is formed. I think teams mistook the wild success of the 2004 quarterback class as evidence that quarterbacks are worth more to a franchise, and thus that rankings in the overall draft class can be ignored.

To test this hypothesis I looked at historical draft results prior to 2004 so that each player has had plenty of time to prove whether his draft spot was warranted. Before sharing the data, I want to talk a little about the difficulty in quantifying success for a football player. When discussing a baseball player's contribution, the argument is easy because he alone is standing at the plate when the pitch comes. Sure there is still randomness, but his success at the plate has very little to do with his teammates. Football is a completely different story. Quarterbacks and running backs depend on the offensive line to block. Wide receivers depend on the quarterback to throw an accurate pass. Some things have been done to try to better evaluate a player, such as the new Total Quarterback Rating, but this still doesn't really get at how good the offensive line is, or how good the WRs are at catching the ball.

In my opinion, especially at the very top of the draft, the goal should be to draft someone who is a sure-thing contributor. I could argue there are some positions more valuable than others, but at the end of the day when you invest as much money as is required to sign a top pick, you want to be fairly confident that the money is not going to go to waste. Of course the hope is that they will go on to be the best at their positions and make it to countless Pro Bowls, but picking someone like JaMarcus Russell can set a franchise back for years. So, to keep things simple, let's compare the success rates of quarterbacks and offensive linemen drafted in the top 20 picks of the draft. To make this comparison somewhat fair, let's first consider a player successful if he has started at least 80 games. Next, let's look at whether he has ever made a Pro Bowl. Finally, let's consider a player elite if he has made at least four Pro Bowls. Here are the results for offensive linemen taken in the top 20 from 1994-2003:

OL	Pos	Successful	Pro Bowl	Elite
Bernard Williams	OT	0	0	0
Wayne Gandy	OT	1	0	0
Aaron Taylor	OT	0	0	0
Todd Steussie	OT	1	1	0
Tony Boselli	OT	1	1	0
Ruben Brown	G	1	1	1
Jonathan Ogden	OT	1	1	1
Willie Anderson	OT	1	1	1
Orlando Pace	OT	1	1	1
Walter Jones	OT	1	1	1
Chris Naeole	G	1	0	0
Kyle Turley	OT	1	1	0
Tra Thomas	OT	1	1	0
John Tait	OT	1	1	0
Damien Woody	C	1	1	0
Matt Stinchcomb	OT	0	0	0
Luke Petitgout	OT	1	0	0
Stockar McDougle	OT	0	0	0
Leonard Davis	OT	1	1	0
Kenyatta Walker	OT	0	0	0
Steve Hutchinson	G	1	1	1
Jeff Backus	OT	1	0	0
Mike Williams	OT	0	0	0
Bryant McKinnie	OT	1	1	0
Levi Jones	OT	1	0	0
Jordan Gross	OT	1	1	0
George Foster	OT	0	0	0

Here are the results of QB's taken in the top 20 from 1994-2003:

QB	Successful	Pro Bowl	Elite
Heath Shuler	0	0	0
Trent Dilfer	1	1	0
Steve McNair	1	1	0
Kerry Collins	1	1	0
Peyton Manning	1	1	1
Ryan Leaf	0	0	0
Tim Couch	0	0	0
Donovan McNabb	1	1	1
Akili Smith	0	0	0
Daunte Culpepper	1	1	0
Cade McNown	0	0	0
Chad Pennington	1	0	0
Michael Vick	1	1	1
David Carr	0	0	0
Joey Harrington	0	0	0
Carson Palmer	1	1	0
Byron Leftwich	0	0	0
Kyle Boller	0	0	0

Some of these busts are tough luck stories, like torn ACLs. Others were not so much unlucky as stupid, like 15 failed drug tests for Bernard Williams or the Ryan Leaf fiasco. Some of the busts went on to successful careers elsewhere, like Heath Shuler, who is currently a congressman from North Carolina. In all cases, though, neither the drafting team nor any other NFL team got five good years out of a projected top player.

So what does the statistician say about all this data? Well, 50% of the quarterbacks have gone on to successful careers while 74% of the linemen went on to successful careers. This difference wasn't large enough to be statistically significant, but if I kept collecting data from earlier years, my guess is that it would have been. 44% of QBs made at least one Pro Bowl compared to 56% of linemen. Finally, 17% of QBs compared to 22% of linemen went on to elite careers. When you think about it, with 3 Pro Bowl spots for QBs and 2-3 spots total for both left and right tackle, a QB if anything has a better chance of making the Pro Bowl than any offensive tackle. Thus, my analysis doesn't prove my theory, but it certainly doesn't disprove it. If one of my dozens of readers feels motivated to continue collecting data prior to 1994, I think we should be able to show strong enough evidence to support my theory about success rates in linemen. Another thing I think would be interesting to see is the actual talent rank compared to draft spot. For example, before the draft Cam Newton was ranked somewhere in the ballpark of the 30th best player. I would love to see the success rate for players that were drafted well above or below where they should have been, but I haven't been able to find that data.
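For anyone who wants to check the significance claim, here is one way to do it in R. This is only a sketch, and not necessarily the test used above: it assumes Fisher's exact test on the 2x2 table of counts tallied from the Successful columns (9 of 18 quarterbacks, 20 of 27 linemen). The variable names are made up for the example.

```r
# Counts tallied from the Successful columns of the two tables above
successes <- c(QB = 9, OL = 20)
drafted   <- c(QB = 18, OL = 27)

# Success rates in percent
print(round(100 * successes / drafted))

# 2x2 table: one row per position, columns = successful / not successful
counts <- cbind(success = successes, bust = drafted - successes)
print(counts)

# Fisher's exact test (an assumed choice) for a difference in success rates
result <- fisher.test(counts)
print(result$p.value)
```

With samples this small the p-value stays above the usual 0.05 cutoff, which is consistent with the "not statistically significant" remark.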

In summary, common sense says it is easier to project the success of almost any other position compared to the quarterback. Speed, vertical jump, size, intelligence, etc. are much more predictive for other positions than for quarterback. Three of the top quarterbacks in the league, Drew Brees, Tom Brady, and Aaron Rodgers, were all left out of the top 20. That is why the Christian Ponder pick still absolutely kills me. I think we needed a left tackle and a cornerback just as badly as we needed a quarterback. Sure, he still might succeed in the NFL, but he is a prime candidate for a bust considering our aging offensive line. As an analogy, the Vikings bought this monitor,

for this computer

They should have bought a better computer first.

As always, thanks for reading. It might be a while before my next post. I am slowly trying to prove mutual funds are a waste of resources. In the meantime I might write some quick blurbs on statistics in the news. Also, I think it would be cool to have guest authors, especially if you have something interesting to share or theories you think need testing. Let me know.

Friday, July 15, 2011

Real Lottery and Blackjack

I promised to take requests on blog posts, so this post will be on Ann's comment on the lottery raising its prices. First of all my opinion on the lottery is that it is a tax on the mathematically challenged. My chances of winning the jackpot are about the same whether I purchase a ticket or not.

When the jackpot is large enough ($323 million), the lottery reaches its "break-even point", the point where the expected return matches the amount invested. That is, if you bet $2, you could expect to win $2 back. The way the expected return is calculated is:

Return = P(winning $3) * 3 + P(winning $4) * 4 + ... + P(winning jackpot) * jackpot total

The "break-even point" needs to be put in quotations, however, because it is far from breaking even. When the jackpot gets large, the probability of multiple winners increases. Also, you can plan on giving about 40% of those winnings back in taxes. Finally, the posted jackpot is paid over 30 years, so comparing the value of the $2 invested in the ticket to the value of a payout spread over 30 years is not a fair comparison. Basically, even after paying for operating expenses (advertisements, lottery retailers, bureaucrats, etc.) the government still keeps a fixed portion of ticket sales as profit. Plus they get a major kickback on the taxes that come back from the big winners. If the winner plans to choose the cash option, the expected return is also negative. Basically, even in the best conditions this is a losing game. Other people have spent much more time than I am willing to on the expected returns of the lottery. This one seems ok. I have bought one lottery ticket in my life (the day I turned 18) and don't plan to buy another one ever. As far as the implications of raising the price, my guess is the expected return per dollar will stay roughly the same. This means you can expect to lose twice as much if you buy the same number of tickets.

As you can tell I have very little interest in lottery tickets. I can say, however, that I have had an ok time playing blackjack at the casinos. Individually, the chances of winning money at the casino are much higher than with lottery tickets. You should still definitely plan to lose money when you go to the casino, but at least at the casino you can think of your losses as entertainment.

Now, I can count the number of times I have gone to the casino on one hand (alright maybe two hands), but one thing I do know is you should plan to bring a decent amount of money if you want to play for even an hour. When I went with my freshman corridor my friend and I each brought $20. He played four hands, and I played eight hands before being wiped out. So we were left with drinking the free sodas for about 3 hours until the rest of the group was ready to head back to campus.

Unless you count cards (which will get you thrown out of the casino), playing blackjack perfectly by the book gives you about a 49.5% chance of winning. Most players aren't good enough to play perfectly by the book, so their chances of winning are around 40%. So, if you want to play blackjack for an hour, how much money should you bring to the casino to give yourself a 95% chance that you won't be cleaned out after an hour? To answer this, I simulated going to the casino 100,000 times and recorded the lowest point my winnings hit over a one hour period. I assume 5 people at a table, which works out to about 70 hands per hour, and that you make only the minimum $5 bet per hand. The Money Needed column is how much money you should bring to be 95% sure you can play for the amount of time specified in the Hours Played column.

	Good Player (49.5% of winning)		Average Player (40% of winning)
Hours Played	Money Needed	Prob of Winning	Money Needed	Prob of Winning
1	80	0.417	140	0.034
2	120	0.42	240	0.007
3	150	0.415	330	0.001
4	175	0.409	420	0
10	290	0.381	920	0
100	1120	0.198	7705	0
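A simulation along these lines can be sketched in R. This is my reading of the setup described above, not the original code, and it leans on some loud assumptions: every hand is an independent win or loss of one flat $5 bet (no pushes, doubles, splits, or bonus payouts for blackjack), "Prob of Winning" means finishing the session ahead, and the player is allowed to keep playing even after dipping below the starting bankroll. The function name and arguments are made up, and it uses 10,000 sessions rather than 100,000 to keep it quick.

```r
# Sketch of a bankroll simulation under simplified assumptions:
# every hand is an independent win or loss of one flat bet.
simulate_sessions <- function(p_win, hours, n_sims = 10000,
                              hands_per_hour = 70, bet = 5) {
  n_hands <- hours * hands_per_hour
  lowest <- numeric(n_sims)  # worst running total in each session
  final <- numeric(n_sims)   # running total at the end of each session
  for (s in seq_len(n_sims)) {
    outcomes <- sample(c(bet, -bet), n_hands, replace = TRUE,
                       prob = c(p_win, 1 - p_win))
    running <- cumsum(outcomes)
    lowest[s] <- min(running)
    final[s] <- running[n_hands]
  }
  list(
    # bankroll that covers the worst dip in 95% of simulated sessions
    money_needed = -unname(quantile(lowest, 0.05)),
    # share of sessions that finish ahead
    prob_winning = mean(final > 0)
  )
}

set.seed(2011)
print(simulate_sessions(p_win = 0.495, hours = 1))  # "good player"
print(simulate_sessions(p_win = 0.40, hours = 1))   # "average player"
```

Under these assumptions the one hour results should land in the same ballpark as the first row of the table, though the exact dollar figures depend on rounding and on how the original simulation treated things like pushes.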

If you don't know if you are a 49.5% or 40% player, you are likely a 40% player. So if you are an average player you can get 3 hours of entertainment and be quite confident it won't cost you more than $330! You should expect to lose about $210. For about that same price I could be catered in the Champions Club at a Twins game, but I wouldn't have a 0.1% chance of making money. I think I would still choose the Champions Club.

Wednesday, July 6, 2011

Timberwolves lottery curse

The inspiration to start writing this blog came a few days before the NBA draft lottery. I woke up to see the Twins had the worst record in all of baseball, the Timberwolves had finished with the worst record in the NBA, and the Vikings were planning to start a rookie quarterback on an already mediocre team. I thought this might be the year we end up with the first pick in the NFL, MLB, and NBA drafts. I started writing an email to some of my friends with calculations on all three events happening this year, and thought this is ridiculous. If I am going to make this calculation I should at least share it with the rest of the world.

Anyway, since then the Timberwolves ended up getting the second pick and the Twins have at least shown signs of sporadic improvement. So instead of that post, I will focus on what many have said is a "curse" for the Timberwolves. That is, since the team began in 1989 the Timberwolves have never once moved up to a better draft position than they were supposed to. Since we have ended up with exactly the draft position we were seeded to have on a few occasions, we can't say we always drop. And the sports writers' point that our streak continued this year is completely trivial. Of course the streak continued, we had the top seed, dummy!

But, that still doesn't explain 22 years of draft history without moving up once. After accounting for the Joe Smith debacle, the years we didn't have a lottery pick because we actually made the playoffs, and the one lottery pick we got via trade (Ricky Ricky Ricky), we have had a total of 16 lottery picks.

Lottery History
Since 1990 the number of teams eligible for the lottery has increased from 11 to 14. Teams are assigned some probability of winning based on their regular season record. Once the first 3 spots have been determined, there is no longer any randomness. The team with the top seed that doesn't yet have a pick automatically receives the 4th pick. This continues until all 14 teams are assigned a draft spot. Therefore, if a team doesn't get a top three pick they automatically move down or stay the same. Also, not all teams in the top three picks will necessarily move up either. When you think about it that way, the probability of moving up is not that great for any team. Here are the probabilities of moving up from all 14 seeds in the 2011 draft.

Seed	Probability of moving up
1	0
2	0.199
3	0.313
4	0.378
5	0.291
6	0.215
7	0.151
8	0.1
9	0.061
10	0.04
11	0.029
12	0.025
13	0.022
14	0.018

From this, we can see it certainly isn't equally likely to move up or down. Your chances of moving up are also very dependent on what your seed is. At this point, I am going to wave my hand and ask you to trust me on these calculations for the Timberwolves draft history. I will include a technical section at the end if you wish to verify my work.

Year	Seed	Probability of Moving Up	Pick
1990	5	0.3329221	6
1991	7	0.23970483	7
1992	1	0	3
1993	2	0.1515	5
1994	3	0.3281	4
1995	3	0.36	5
1996	5	0.2589	5
1997	no lottery	0
1998	no lottery	0
1999	5	0.2949	6
2000	no lottery	0
2001	no lottery	0
2002	no lottery	0
2003	no lottery	0
2004	no lottery	0
2005	14	0.0181	14
2006	6	0.1828	6
2007	7	0.183	7
2008	3	0.2804	3
2009	2 (From Washington)	0.178	5
2009	5	0.2549	6
2010	2	0.199	4
2011	1	0	2

These probabilities come from this site (sort of), but whoever did the site clearly has the probabilities wrong from 1990-1994. Anyone know why?......write a comment! From 1994-2009 I checked his or her work and they seem ok. The moral of the story is,

1. Each year it is more likely the Timberwolves will not move up in the draft.
2. They have been getting unlucky, but not that unlucky. The probability that the T-wolves would never move up in these 16 lotteries is about 2.2%. Britt's great uncle Fritz was struck by lightning twice (Probability 0.0000000001625%)... Seriously. I would say crazier things have happened than the Timberwolves failing to move up for 16 drafts.
3. Never moving up is a bummer, but we should have only expected to move up in about 3 drafts by now.
4. If we hadn't finished dead last we would have had a much higher chance of moving up in the draft.
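The 2.2% and "about 3 drafts" figures follow directly from the Probability of Moving Up column in the table above, assuming the lotteries are independent of each other. A quick sketch in R (the vector name p_up is mine; the values are copied from the 16 table rows that have a seed):

```r
# Probability of moving up in each of the 16 lotteries, in table order
p_up <- c(0.3329221, 0.23970483, 0, 0.1515, 0.3281, 0.36, 0.2589, 0.2949,
          0.0181, 0.1828, 0.183, 0.2804, 0.178, 0.2549, 0.199, 0)

# Chance of never moving up: every lottery has to "fail" to move up
p_never <- prod(1 - p_up)

# Expected number of lotteries with a move up
expected_moves <- sum(p_up)

print(round(p_never, 3))
print(round(expected_moves, 2))
```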

Anyway, thanks for reading. Most of you can safely stop reading at this point unless you care about checking my work.

Technical Portion

Let D1 be a random variable defined by the team that wins draft position 1, D2 the team that wins position 2, and so on. Let t be in {1,2,...,14} if there are 14 eligible teams in the draft. Let b_t be the number of balls seed t has in the pot. Then, for example

P(D1=1) = b_1/ sum_{t=1}^14 b_t
P(D2=1) = P(D2=1|D1=2)P(D1=2) + P(D2=1|D1=3)P(D1=3) +...+ P(D2=1|D1=14)P(D1=14)
P(D3=1)= P(D3=1|D1=2,D2=3)P(D2=3|D1=2)P(D1=2)+ ...+P(D3=1|D1=13,D2=14)P(D2=14|D1=13)P(D1=13)
This needs to be summed over all D1 and D2 pairs that are not equal to each other. I won't spend a ton of time writing this out, because if you can follow this you should be able to see what it looks like via these two R functions.

balls2=c(250,199,156,119,88,63,43,28,17,11,8,7,6,5)

##calculates probability of getting second pick for any seed
calc=function(seed){
  set=(1:length(balls2))[-seed]
  pballs=rep(NA,length(balls2))
  ##b_i/(total-b_i) for every other seed i that could win the first pick
  for(i in set){
    pballs[i]=balls2[i]/(sum(balls2)-balls2[i])
  }
  ##multiplying by b_seed/total gives the sum of P(D1=i)*P(D2=seed|D1=i)
  ans=sum(pballs[set])*(balls2[seed]/sum(balls2))
  return(ans)
}

##calculates probability of getting third pick for any seed
calc2=function(seed){
  set=(1:length(balls2))[-seed]
  pballs2=matrix(NA,length(balls2),length(balls2))
  for(i in set){
    for(j in set){
      ##seed j wins pick 1, seed i wins pick 2, then our seed wins pick 3
      pballs2[i,j]=(balls2[seed]/(sum(balls2)-balls2[i]-balls2[j]))*(balls2[i]/(sum(balls2)-balls2[j]))*(balls2[j]/sum(balls2))
    }
  }
  ##the same team cannot win both of the first two picks
  diag(pballs2)=NA
  ans=sum(pballs2[!is.na(pballs2)])
  return(ans)
}

You can check your results on the wikipedia draft lottery page.
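As a cross-check on the table of move-up probabilities for the 2011 draft earlier in this post, here is a self-contained sketch that computes the chance of landing each of the first three picks for every seed and then adds up the picks that count as "moving up". It assumes the lottery behaves like drawing teams without replacement, weighted by ball counts, which is the same model as the formulas above. The names pick_probs, probs, and move_up are made up for this example.

```r
balls <- c(250, 199, 156, 119, 88, 63, 43, 28, 17, 11, 8, 7, 6, 5)

# Probability that each seed lands pick 1, pick 2, and pick 3
pick_probs <- function(balls) {
  n <- length(balls)
  total <- sum(balls)
  p1 <- balls / total
  p2 <- numeric(n)
  p3 <- numeric(n)
  for (s in seq_len(n)) {
    for (i in setdiff(seq_len(n), s)) {
      # seed i wins pick 1, then seed s wins pick 2
      p2[s] <- p2[s] + p1[i] * balls[s] / (total - balls[i])
      for (j in setdiff(seq_len(n), c(s, i))) {
        # seed i wins pick 1, seed j wins pick 2, then seed s wins pick 3
        p3[s] <- p3[s] + p1[i] * balls[j] / (total - balls[i]) *
          balls[s] / (total - balls[i] - balls[j])
      }
    }
  }
  cbind(p1, p2, p3)
}

probs <- pick_probs(balls)

# A seed moves up only by winning a pick better than its seed, and only
# the first three picks are drawn at random
move_up <- sapply(seq_along(balls), function(s) {
  if (s == 1) 0 else sum(probs[s, seq_len(min(s - 1, 3))])
})

print(round(move_up, 3))
```

The top seed can never move up, seeds 2 and 3 can only move up by winning a pick above their own seed, and everyone else needs any of the top three picks.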

Tuesday, June 28, 2011

Babies

So I am getting to the age where a bunch of my friends are having babies. One thing I love to do is turn their pregnancy into a game for me. That is, I love baby pools. Most baby pools require a number of inputs:
1. Gender
2. Date (day and time)
3. Height
4. Weight

Placing is determined by how far off your guesses are on these 4 categories. Without looking at any data, I would guess these have to be related. For example, if a mother is late it only makes sense to guess a larger height and weight. Besides guessing on these 4 categories, another key piece of information, maybe the most important one for success in a baby pool, is keeping track of what others are doing. Baby pools are like The Price Is Right: it is definitely to your advantage to guess last.

Guessing the gender without any prior knowledge is essentially like flipping an unfair coin. They (Wikipedia) say the gender ratio of boys to girls is about 105:100. So ignoring the guesses of other players in the game, the better bet on gender will always be boy. The real question is: can we make a better prediction based on knowing information about the couple? This is a tricky question, and to do it justice will require some very specific data on the father's family tree. Hopefully I can find something eventually and we can return to this topic. For now I am going to focus on predicting the due date.

So the one website I found that had data on due dates is here. Essentially, based on numerous sources the distribution of pregnancies looks something like this:

It is pretty clear that pregnancy length is not normally distributed, and assuming a usual 40 week due date, about 62% of babies are born on or after their due date. Also, apparently the probability of having a child exactly at 40 weeks is less than 5%. With this plot and the 5% fact in mind, we might expect to see a probability distribution for single days that looks something like this:

Shifting the plot over so that we can think of the plot as days before or after the magic 40 week number we have:

This plot is maybe hard to see, but based on this data it is unlikely the "due date" is even the most probable day for the baby to be born. Weird! Anyway, back to the baby pool. Speaking like an economist, we would have baby pool efficiency if the distribution of guesses looked just like the distribution above. I'm guessing this almost never happens. For example, let's say the distribution of guesses looked something like this:

The approach I would take would be to pick a date with the largest positive difference between the true birth distribution and the baby pool distribution over a short interval. In this case it looks like guessing about 4-7 days late gives the best chance to win the date portion of the pool.

At this point I want to make a formal complaint about the lack of birth data that is publicly available. It probably exists out there, but key websites that host health and demographic data like CDC, Census, and WHO in my opinion need to get their act together. I can say hands down it is easier to research the most obscure baseball topic than a topic that is central to so many people's lives. So if any of you statisticians, nurses, or public health people out there know of some good data I would love to hear from you!

Wednesday, June 22, 2011

Twins get 8 hits in a row! Crazy right?

So last night the Twins led off the game with 8 hits in a row. What is the probability of that happening?

The easiest thing to do is take the batting averages of our first 8 players and multiply them together. I am not about to take the easiest route.....But I will take the second easiest route! If I take the average of each hitter's batting average and Madison Bumgarner's batting average against (.264 before last night, .284 after last night), I am at least acknowledging that a hitter's chance of success is not independent of who is pitching.

Name	Real Average	Chance of hit against Bumgarner
B. Revere cf	0.269	0.2665
A. Casilla 2b	0.268	0.266
J. Mauer c	0.22	0.242
M. Cuddyer rf	0.28	0.272
D. Young lf	0.254	0.259
D. Valencia 3b	0.221	0.2425
L. Hughes 1b	0.264	0.264
T. Nishioka ss	0.2	0.232

So,

The probability the Twins lead off the game with 8 hits is 0.000018. On the flip side, the probability Madison Bumgarner throws a no-hitter last night is 0.00055.

In other words, before the game started Madison Bumgarner was over 30 times more likely to throw a no-hitter than the Twins were to start the game with 8 straight hits.

So how rare was this feat? Well, the argument is exactly the same as the three birthday problem in the Star Trib last year. If we forget about walks, the average number of real at-bats in a game is about 34. That means the first 27 at-bats are each followed by at least 7 more at-bats. Each team plays 162 games a year, so for any one team there are roughly 27*162 = 4374 at-bats that could be the start of an 8-hit streak. Now, nearly all teams are better at hitting than the Twins, but let's just stick with their probability. That means the expected number of 8-hit streaks in one year for a single team is around:
0.00001795009 * 4374 = 0.08

If every team were as bad at hitting as the Twins and Madison Bumgarner pitched every game, we would only expect a given team to pull off this feat about once every 12.5 years! My guess is that if someone researched this, they would find that 8 consecutive hits in a game happens somewhere in the majors more often than every 12 years, but let's not sell the Twins short.....That was a pretty cool accomplishment.
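The arithmetic in this post is easy to reproduce. Here is a short R sketch that uses the batting averages from the table and the same shortcut of averaging each hitter's average with Bumgarner's .264 batting average against. The variable names are mine, and it keeps the post's simplifying assumption that at-bats are independent.

```r
# Season batting averages of the first eight Twins hitters (from the table)
avg <- c(Revere = 0.269, Casilla = 0.268, Mauer = 0.220, Cuddyer = 0.280,
         Young = 0.254, Valencia = 0.221, Hughes = 0.264, Nishioka = 0.200)

# Bumgarner's batting average against before the game
baa <- 0.264

# Chance of a hit in each matchup: average of hitter and pitcher numbers
p_hit <- (avg + baa) / 2

# Probability that all eight at-bats end in hits
p_streak <- prod(p_hit)

# At-bats in a 162 game season that could start an 8-hit streak
starts <- 27 * 162
expected_streaks <- p_streak * starts

print(p_hit)
print(p_streak)
print(expected_streaks)
```

The product comes out to roughly 0.000018, matching the figure above, and multiplying by the 4374 possible starting at-bats gives an expected count a little under 0.08.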

Also, I promise the next blog won't be about baseball.

Tuesday, June 21, 2011

Vector Ball

So some of you may have read the book Moneyball. If you haven't, there is a movie of the book coming out this fall with Brad Pitt and Jonah Hill. The book provides anecdotal evidence that using statistics (not the ones you see on the back of a baseball card) can help in picking commonly overlooked, but valuable players. Michael Lewis, the author, shows how the Oakland A's found value in signing players with high on-base percentages and good power numbers. This book was written nearly ten years ago, and a lot has changed in the game since then. Teams don't give up prospects like they used to. More importantly, the market has shifted to value on-base percentage (maybe even overshifted).

My Beef
Since the book came out, everyone and their brother has invented new metrics to evaluate a player. I don't pay much attention to these, and I'm sure they work fine in evaluating major and minor league players. The problem is, evaluating professional players is relatively straightforward, especially in the major leagues. Take Adrian Gonzalez, for example. The Red Sox signed him because they knew he had a ton of power to left field. They went after this guy for years, and now that he is there, sure enough he is crushing a bunch of doubles. They know everything about these guys, so being a "sabermetrician", which is what baseball mathematicians/statisticians like to be called, is an easy job in my opinion.

If I were a GM I would place much more of my resources in finding talent. I admit, they do spend a lot on this already, but they should do more. Once a guy is on your roster making a decision on a single player is very manageable. You have a good idea how good all your players are in your system and a relatively good idea what they are worth to other teams. What is very hard is getting the best players on your team through the draft......But how?

I'm glad you asked. The answer:

This is something I have been thinking about for years. Granted it might be as realistic as inventing a time machine, but hear me out.

It is a pretty well known fact among both baseball people and sabermetricians that line drives are a good thing. The Hardball Times claims 75% of line drives end in hits. In major league games they can track the path of the ball off the bat using cameras, and keep a statistic on line drive percentage (LD%). Here are the line drive percentages of the Twins players with at least 200 at-bats last year.

Name	LD%
Joe Mauer	24.20%
Justin Morneau	22.00%
Jim Thome	21.40%
Orlando Hudson	20.60%
Jason Kubel	19.20%
Danny Valencia	18.80%
Denard Span	18.00%
Michael Cuddyer	17.30%
J.J. Hardy	16.90%
Delmon Young	15.50%
Nick Punto	15.10%

The way I think about this is, who is most likely to give you a decent at-bat? Joe Mauer, Justin Morneau, and Jim Thome are at the top. Delmon Young and Nick Punto are at the bottom. Delmon Young actually had a pretty good year in terms of traditional statistics, but seriously, how many times have we seen him ground out to the pitcher or fly out to shallow center (usually on the first pitch)?

The Delmon Young example, though, brings up a huge problem in drafting high school and even college players. With the exception of the top few picks, it is impossible for scouts to get many looks at the "draftable" players. They end up relying on misleading statistics (explained in Moneyball). How can they tell the difference between two players with similar numbers (and of course speed, arm strength, etc.)? My guess is it is generally a gut feeling based on the few games they watched. What they really need to know is how many line drives the kid hit. He could have had a very lucky or unlucky year in terms of hits falling, but someone who hits a bunch of line drives will end up succeeding in time.

That's why I think all baseballs should have a tracking device in them that can store the speed and direction a ball travels when it leaves the bat. I call this the vector ball. I realize this post is getting pretty long, so maybe I will revisit the value of the vector ball in later posts. Also, I must give credit for the vector ball logo to my talented wife.....Pretty cool huh?

Saturday, June 18, 2011

Welcome

Welcome all to my new blog. I am a Ph.D. student in statistics and wish to talk about issues "everyone" cares about, but may have a difficult time discussing in an objective way. When I say issues "everyone" cares about, I mostly mean issues I care about, but I think there will be plenty of overlap. I have an interest in sports, investing, home improvement, Pixar movies, cool gadgets, political races, and wildlife, to name a few. I am planning to find a few interesting topics, research some actual data, and hopefully provide some meaningful commentary.

The name of the blog "alwaystellmetheodds" may not be familiar to everyone, but if you enjoy the Star Wars movies as much as I do, you know Han Solo's famous quote "Never tell me the odds." He says this in The Empire Strikes Back as he is being chased by Imperial Star Destroyers and Starfighters and is about to enter an asteroid field. He is warned by C-3PO that "the possibility of successfully navigating an asteroid field is approximately 3,720 to 1." Here is the clip (the famous quote is around the 1:55 mark of the video):

My interpretation is that Han Solo, all knowing, didn't need to be told "the possibility of successfully navigating an asteroid field is approximately 3,720 to 1." He made his decision to enter the asteroid field because he knew his odds of survival were better entering the asteroid field than staying in the open and being fired on by all the Imperial ships. Really, Han Solo made a calculation and made an informed statistical decision to enter the asteroid field.

Anyway, these are the sorts of important topics I will be discussing on my blog.