So I am getting to the age where a bunch of my friends are having babies. One thing I love to do is turn their pregnancy into a game for me. That is, I love baby pools. Most baby pools require a number of inputs:
1. Gender
2. Date (day and time)
3. Height
4. Weight
Placing is determined by how far off your guesses are on these 4 categories. Without looking at any data, I would guess these have to be related. For example, if a mother is late, it only makes sense to guess a larger height and weight. Besides guessing on these 4 categories, another key piece of information, just as important if not the most important for success in a baby pool, is keeping track of what others are doing. Baby pools are like The Price Is Right: it is definitely to your advantage to guess last.
Guessing the gender without any prior knowledge is essentially like flipping an unfair coin.
They (Wikipedia) say the gender ratio of boys to girls is about 105:100. So ignoring the guesses of other players in the game, the better bet on gender will always be boy. The real question is: can we make a better prediction based on information about the couple? This is a tricky question, and doing it justice will require some very specific data on the father's family tree. Hopefully I can find something eventually and we can return to this topic. For now I am going to focus on predicting the due date.
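As a quick aside before moving on, here is how unfair that coin actually is. This is just the arithmetic on the 105:100 ratio quoted above, written out in Python; the variable names are my own.

```python
# Turn the roughly 105:100 boy-to-girl ratio into a probability.
boys, girls = 105, 100
p_boy = boys / (boys + girls)

print(f"P(boy) = {p_boy:.3f}")  # a bit over 51%
```

So the edge from always guessing boy is real but small.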
So the one website I found that had data on due dates is here. Essentially, based on numerous sources the distribution of pregnancies looks something like this:
It is pretty clear that pregnancy length is not normally distributed, and assuming the usual 40-week due date, about 62% of babies are born on or after their due date. Also, apparently the probability of having a child exactly at 40 weeks is less than 5%. With this plot and the 5% fact in mind, we might expect to see a probability distribution for single days that looks something like this:
Shifting the plot over so that we can think of it as days before or after the magic 40-week number, we have:
This plot is maybe hard to see, but based on this data it is unlikely the "due date" is even the most probable day for the baby to be born. Weird! Anyway, back to the baby pool. Speaking like an economist, we would have baby pool efficiency if the distribution of guesses looked just like the distribution above. I'm guessing this almost never happens. For example, let's say the distribution of guesses looked something like this:
The approach I would take would be to pick a date with the largest positive difference between the true birth distribution and the baby pool distribution over a short interval. In this case, it looks like guessing about 4-7 days late gives the best chance to win the date portion of the pool.
At this point I want to make a formal complaint about the lack of birth data that is publicly available. It probably exists out there, but key websites that host health and demographic data, like the CDC, Census, and WHO, in my opinion need to get their act together. I can say hands down it is easier to research the most obscure baseball topic than a topic that is central to so many people's lives. So if any of you statisticians, nurses, or public health people out there know of some good data, I would love to hear from you!
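For anyone who wants to try this on their own pool, here is a rough Python sketch of the idea. It is only one way to read "largest positive difference over a short interval": it sums (true probability minus share of pool guesses) over a sliding window of consecutive days and keeps the best window. The function name `best_guess_window`, the window length of 3 days, and every number in the example are made up for illustration; they are not the data behind the plots above.

```python
def best_guess_window(true_probs, pool_guesses, window=3):
    """Find the run of consecutive days where the true birth probability
    most exceeds the share of guesses already placed in the pool.

    true_probs   -- dict: day offset from the due date -> probability
                    (assumed to cover consecutive days)
    pool_guesses -- list of day offsets other players have already picked
    Returns (start_day, score) for the best window.
    """
    days = sorted(true_probs)
    n = len(pool_guesses)
    # Share of existing guesses on each day (0 if nobody picked that day).
    pool_share = {d: (pool_guesses.count(d) / n if n else 0.0) for d in days}
    diff = [true_probs[d] - pool_share[d] for d in days]

    best_start, best_score = None, float("-inf")
    for i in range(len(days) - window + 1):
        score = sum(diff[i:i + window])
        if score > best_score:
            best_start, best_score = days[i], score
    return best_start, best_score


# Hypothetical numbers, NOT real birth data.
true_probs = {-3: 0.05, -2: 0.06, -1: 0.07, 0: 0.08, 1: 0.10,
              2: 0.12, 3: 0.14, 4: 0.14, 5: 0.12, 6: 0.12}
pool_guesses = [-2, -1, 0, 0, 0, 1, 1, 2]  # everyone clusters near the due date

start, score = best_guess_window(true_probs, pool_guesses, window=3)
print(f"Best window starts {start} days from the due date (edge {score:.2f})")
```

With guesses bunched around the due date like this, the sketch pushes you toward the late side, which is the same intuition as above. A wider or narrower window will change the answer, so treat that parameter as a judgment call.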





