## Saturday, January 24, 2009

### First-Day Statistics

Here's a demonstration that I was deliriously happy to cook up for the first day of my current statistics class. I think it worked extremely well when I first used it (actually as the second thing we do, immediately after looking at a professional research journal for its statistical notation).

Sampling a Deck of Cards: Let's act as a scientific researcher, and say that somehow we've encountered a standard deck of cards for the first time, and know practically nothing about it. We'd like to get a general idea of the contents of the deck, and for starters we'll estimate the average value (mean) of all the cards. Unfortunately, our research budget doesn't give us time to inspect the whole deck; we only have time to look at a random sample of just 4 cards.

Now, as an aside, let's cheat a bit and think about the structure of a deck of cards (not that our researcher would know any of this). For our purposes we'll let A=1, numbers 2-10 count face value, J=11, Q=12, K=13. We know that this population has size N=52; if you think about it you can derive that the actual mean is μ=7; and I'll just come out and tell the class that I already calculated the standard deviation as σ=3.74. (Again, our researcher probably wouldn't know any of this in advance.)
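If you'd rather check those population numbers than take my word for them, here is a minimal sketch in Python. The variable names are my own choices for illustration; the only assumption is the card-value convention stated above. For a uniform spread of the values 1 through 13, the variance works out to (13² - 1)/12 = 14, so σ = √14 ≈ 3.74.

```python
from statistics import mean, pstdev

# Card values under the convention above: A=1, 2-10 at face value,
# J=11, Q=12, K=13, with four suits of each rank.
deck = [rank for rank in range(1, 14) for _suit in range(4)]

N = len(deck)          # population size
mu = mean(deck)        # population mean
sigma = pstdev(deck)   # population standard deviation (divides by N, not N-1)

print(N, mu, round(sigma, 2))
```

Note that `pstdev` is the population version; the sample version `stdev` divides by N-1 and would give a slightly larger number.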

So granted that we wouldn't really know what μ is, what we're about to do is take a random sample and construct a standard 95% confidence interval for the most likely values it could be. In our case we'll be taking a sample of size n=4, calculating the average (sample mean, here denoted x'), and constructing our confidence interval. As a further aside, I'll point out that a 95% confidence level can be simplified into what we call a z-score, approximately z=2.

At this point I shuffle the deck, draw the top 4 cards, and look at them.

We take the values of the four cards and average them (for example, the last time I did this I got cards ranked 7, 3, 5, and 4; sample mean x' = 19/4 = 4.75). Then I explain that constructing a confidence interval usually involves taking our sample statistic and adding/subtracting some margin of error, thus: μ ≈ x'±E (again, x' is the "sample mean"; E is the "margin of error"). Then we turn to the formula card for the course and look up, near the end of the course, the fact that for us E = z*σ/√n. We substitute that into our formula and obtain μ ≈ x'±z*σ/√n.

So at this point we know the value of everything on the right side of the estimation, and substitute it all in and simplify (the sample mean x', z=2, σ=3.74, and n=4, all above). The arithmetic here is pretty simple, in this example:

μ ≈ x' ± z*σ/√n
= 4.75 ± 2*3.74/√4
= 4.75 ± 2*3.74/2
= 4.75 ± 3.74
= 1.01 to 8.49
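The same arithmetic can be sketched as a small Python function. The function name `confidence_interval` and its parameters are hypothetical, chosen just for this illustration; it assumes σ is known in advance and uses the rounded z=2 from the demonstration.

```python
from math import sqrt

def confidence_interval(sample, sigma, z=2.0):
    """Return (low, high) for x' ± z*sigma/sqrt(n), with sigma assumed known."""
    n = len(sample)
    x_bar = sum(sample) / n          # sample mean, written x' in the text
    margin = z * sigma / sqrt(n)     # margin of error E
    return x_bar - margin, x_bar + margin

# The sample from the example: cards ranked 7, 3, 5, and 4.
low, high = confidence_interval([7, 3, 5, 4], sigma=3.74, z=2.0)
print(f"{low:.2f} to {high:.2f}")   # prints "1.01 to 8.49"
```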

So, there's our confidence interval in this case (95% CI: 1.01 to 8.49). Our researcher's interpretation of that: "There is a 95% chance that the mean value of the entire deck of cards is somewhere between 1.01 and 8.49". That's a pretty good, concentrated estimation for μ on the part of our researcher. And in this case we can step back and ask the question: Is the population mean value actually captured in this interval? Yes (based on our previous cheat), we do in fact know that μ=7, so our researcher has successfully captured where μ is with a sample of only 4 cards out of an entire deck.

That usually goes over quite well in my introductory statistics class.

Backstage -- The Ways In Which I Am Lying: Look, I'm always happy to dramatically simplify a concept if it gets the idea across (in this case, the overall process of inferential statistics, the ultimate goal of my course, as treated in the very first hour of class). Let's be upfront about what I've done here.

The primary thing that I'm abusing is that this formula for margin-of-error, and hence the confidence interval, is usually only valid if the sampling distribution follows a normal curve. There are two ways to obtain that: either (a) the original population is normally distributed, or (b) the sample size is large, triggering the Central Limit Theorem to turn our sampling distribution normal anyway.

Neither of those conditions applies here. The deck of cards has a uniform distribution, not normal (4 cards each in all the ranks A to K). And obviously our sample size n=4, necessary to make the demonstration digestible in the available time, is not remotely a "large enough" sample size for the CLT. But granted that the deck of cards has a uniform distribution, that does help the sampling distribution become "normal-like" a bit faster than it would for some wack-ass massively skewed population, so the example is still going to work out for us most of the time (see more below).

At the same time, ironically enough, I also have too large a sample size, as a proportion of the overall population, for the usual margin-of-error formula. Here I'm sampling 4/52 = 7.69% of the population, and if that's more than around 5%, technically we're supposed to use a more complicated formula that corrects for that. Or we could legitimately avoid that if we were sampling with replacement, but we're not doing that, either (re-shuffling the deck after each single card draw is a real drag).
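For the curious, here is a rough sketch of what that correction does to the numbers. I'm assuming the "more complicated formula" is the standard finite population correction, which multiplies the margin of error by √((N-n)/(N-1)); the function name and parameters below are hypothetical, made up for this illustration.

```python
from math import sqrt

def margin_of_error(sigma, n, z=2.0, N=None):
    """Margin of error z*sigma/sqrt(n); if a population size N is given,
    apply the finite population correction sqrt((N - n) / (N - 1))."""
    margin = z * sigma / sqrt(n)
    if N is not None:
        margin *= sqrt((N - n) / (N - 1))
    return margin

plain = margin_of_error(3.74, n=4)            # formula used in the demonstration
corrected = margin_of_error(3.74, n=4, N=52)  # 4 of 52 cards, no replacement
print(round(plain, 2), round(corrected, 2))   # prints "3.74 3.63"
```

So the corrected interval would be a little narrower than the one shown to the class, which fits with the uncorrected interval being on the generous side.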

However, even without those technical guarantees, everything does in fact work out for us in this particular example anyway. I wrote a computer program to exhaustively evaluate all the possible samples of size 4 from a deck of cards, and the result is this: what I'm calling a 95% confidence interval above will actually catch our population mean over 95.7% of the time; so if anything the "cheat" here is that we know the interval has more of a chance of catching μ than we're really admitting.
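The original program isn't reproduced here, but a brute-force check along the same lines might look like the following sketch. It is my own reconstruction under the demonstration's assumptions (σ=3.74, z=2, n=4, sampling without replacement), not the actual code behind the figure quoted above.

```python
from itertools import combinations
from math import sqrt

deck = [rank for rank in range(1, 14) for _suit in range(4)]
MU, SIGMA, Z, n = 7, 3.74, 2.0, 4

margin = Z * SIGMA / sqrt(n)

caught = total = 0
# combinations() works on positions, so the four cards of each rank are
# treated as distinct and every 4-card hand is counted exactly once.
for hand in combinations(deck, n):
    x_bar = sum(hand) / n
    total += 1
    if x_bar - margin <= MU <= x_bar + margin:
        caught += 1

print(total, caught / total)
```

There are only 270,725 possible hands, so the loop finishes quickly; the second number printed is the fraction of hands whose interval contains μ.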

Another thing that may be obvious is that we're assuming we know the population standard deviation σ in advance, but that's a pretty standard instructional warm-up before dealing with the more realistic case of unknown σ. And of course I've approximated the z-score for a 95% CI as z=2, when more accurately it's z=1.960 -- but you'll notice above that using z=2 magically cancels with the factor √n = √4 = 2 in the denominator of our formula, thus nicely abbreviating the number-crunching.
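As a quick check on that z-score, here is a small sketch using Python's standard library (`statistics.NormalDist`, available from Python 3.8). It also shows the nominal coverage you get from rounding up to z=2, assuming a truly normal sampling distribution.

```python
from statistics import NormalDist

# A two-sided 95% interval leaves 2.5% in each tail, so we want the
# 97.5th percentile of the standard normal distribution.
z_exact = NormalDist().inv_cdf(0.975)
print(round(z_exact, 3))

# Nominal two-sided coverage implied by rounding up to z = 2.
coverage_z2 = NormalDist().cdf(2) - NormalDist().cdf(-2)
print(round(coverage_z2, 4))
```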

The other thing that might happen when you run this demonstration is there's a possibility of generating an interval with a negative endpoint (even while catching μ inside), which would be ugly and might warrant some grief from certain students (e.g., if x'=3.5, then the interval is -0.24 to 7.24). Nonetheless, the numerical examination shows that there's a 94.8% chance of getting what I'd call a "good result" for the presentation -- both catching μ and avoiding any negative endpoint.
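One way to separate "caught μ" from "good result" is to tally hands by their card total and then classify each total. This is a sketch under the same assumptions as the demonstration (σ=3.74, z=2, n=4); the names are illustrative only.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

deck = [rank for rank in range(1, 14) for _suit in range(4)]
MU, SIGMA, Z, n = 7, 3.74, 2.0, 4
margin = Z * SIGMA / sqrt(n)

# Tally how many 4-card hands produce each possible total.
hands_by_sum = Counter(sum(hand) for hand in combinations(deck, n))
total = sum(hands_by_sum.values())

caught = good = 0
for card_sum, count in hands_by_sum.items():
    x_bar = card_sum / n
    low, high = x_bar - margin, x_bar + margin
    if low <= MU <= high:
        caught += count
        if low >= 0:          # interval has no negative endpoint
            good += count

print(caught / total, good / total)
```

With these settings the only hands that catch μ but dip below zero are those totaling 14 (x'=3.5, as in the example above): a total of 13 misses μ, and a total of 15 already has a positive lower endpoint.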

At first I considered a sample size of n=3, which would shorten the card-drawing part of the demonstration; this still results in a 95.4% chance (again by exhaustive enumeration) of catching μ in the resulting interval. Alternatively, you might consider n=5, which guarantees that any interval catching μ has no negative endpoint. In both those cases you lose the cancellation with the z-score, so there would be more calculator number-crunching involved if you did it that way.
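To compare sample sizes without looping over every hand, one option is to count hands by their total with a small dynamic program. This is a sketch of a swapped-in technique, not the exhaustive program described above; `sum_counts` and `demo_stats` are hypothetical names, and σ=3.74 and z=2 are assumed as before.

```python
from math import sqrt

def sum_counts(deck, n):
    """Return a list where entry s is the number of n-card subsets totaling s."""
    max_sum = sum(sorted(deck)[-n:])
    counts = [[0] * (max_sum + 1) for _ in range(n + 1)]
    counts[0][0] = 1
    for value in deck:
        # Update larger subset sizes first so each card is used at most once.
        for k in range(n, 0, -1):
            for s in range(max_sum, value - 1, -1):
                counts[k][s] += counts[k - 1][s - value]
    return counts[n]

def demo_stats(n, mu=7, sigma=3.74, z=2.0):
    """Return (fraction catching mu, fraction catching mu with no negative endpoint)."""
    deck = [rank for rank in range(1, 14) for _suit in range(4)]
    margin = z * sigma / sqrt(n)
    caught = good = total = 0
    for s, count in enumerate(sum_counts(deck, n)):
        total += count
        low, high = s / n - margin, s / n + margin
        if low <= mu <= high:
            caught += count
            if low >= 0:
                good += count
    return caught / total, good / total

for n in (3, 4, 5):
    coverage, good = demo_stats(n)
    print(n, round(coverage, 3), round(good, 3))
```

For n=5 the two fractions come out identical, since any sample mean close enough to catch μ=7 is already larger than the margin of error.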

Finally, I know that someone could technically dispute my interpretation of what a confidence interval means above as being incompatible with the frequentist interpretation of probability. But I've decided to emphasize this version in my classes, because it's at least comprehensible to both me and my students. I figure you can call me a Bayesian and we'll call it a day.