Friday, 27 July 2018

a student of the t-test

Statistical tests can seem like magic, but many of them work on the same underlying principle. The starting point is usually some null hypothesis (which you hope to refute):

  • you have some (possibly conceptual) population P of individuals, from which you can take a sample
  • there is some statistic S of interest for your problem (the mean, the standard deviation, ...)
  • you have a null hypothesis that your population P has the same value of the statistic as some reference population (for example, null hypothesis: the mean size of my treated population is the same as the mean size of the untreated population)

A statistical test of this hypothesis can work as follows:

  • the statistic S will have a corresponding sampling distribution X
    • this is the distribution of the statistic S if you measure it over many samples (for example, if you measure the mean of a sample, some samples will have a low mean – you were unlucky enough to pick all small members – some a high mean – you just happened to pick all large members – but most will have a mean close to the true population mean)
    • if you assume the population has some specific distribution (eg a normal distribution), you can often calculate this distribution X
  • when you look at your actual experimental sample, you see where it lies within this sampling distribution
    • the sampling distribution X has some critical value, x_crit, dependent on your desired significance α; only a small proportion of the distribution lies beyond x_crit
    • your experimental sample has a value of the statistic S, let’s call it x_obs; if x_obs > x_crit, it lies in that very small proportion of the distribution; this is deemed to be sufficiently unlikely to have happened by chance, and it is more likely that the null hypothesis doesn’t hold; so you reject the null hypothesis at the 1 − α confidence level (see the sketch below)
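To make the recipe concrete, here is a minimal Python sketch of the decision step. The standard normal used as the stand-in sampling distribution and the observed value of 2.3 are purely illustrative assumptions, and the test is shown one-sided; the real sampling distribution depends on the statistic and the test.

    # A minimal sketch of the general recipe above (not any specific test).
    # Assume, purely for illustration, that the sampling distribution X of the
    # statistic S under the null hypothesis is a standard normal distribution.
    from scipy import stats

    alpha = 0.05                          # desired significance level
    x_crit = stats.norm.ppf(1 - alpha)    # critical value: P(X > x_crit) = alpha
    x_obs = 2.3                           # hypothetical observed value of the statistic

    if x_obs > x_crit:
        print(f"x_obs = {x_obs:.2f} > x_crit = {x_crit:.2f}: reject the null hypothesis")
    else:
        print(f"x_obs = {x_obs:.2f} <= x_crit = {x_crit:.2f}: fail to reject the null hypothesis")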

This is all a bit abstract, so let’s look in detail at how it works for the Student’s t-test.  (This test was invented by the statistician William Sealy Gosset, who published under the pseudonym Student, hence the name.)  One of the nice things about having access to a programming language is that we can look at actual samples and distributions, to get a clearer intuition of what is going on.  I’ve used Python to draw all the charts below.

Let’s assume we have the following conceptual setup. We have a population of items, with a population mean μ and population variance σ². From this we draw a sample of n items. We calculate a statistic of that sample (for example, the sample mean x̄ or the sample variance s²). We then draw another sample (either assuming we are drawing ‘with replacement’, or assuming that the population is large enough that this doesn’t matter), and calculate its relevant statistic. We do this r times, so we have r values of the statistic, giving us a distribution of that statistic, which we show in a histogram.  Here is the process for a population (of circles) with a uniform distribution of sizes (radius of the circles) between 0 and 300, and hence a mean size of 150.
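Here is a minimal Python sketch of that sampling process. The sample size of 10, the number of histogram bins and the random seed are my own illustrative choices; the original plotting code is not shown in the post.

    # Sampling distribution of the mean for a population of circles whose radii
    # are uniformly distributed between 0 and 300 (population mean 150).
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n = 10        # items per sample
    r = 3000      # number of samples

    # draw r samples of n radii each, and take the mean of each sample
    sample_means = rng.uniform(0, 300, size=(r, n)).mean(axis=1)

    plt.hist(sample_means, bins=50)
    plt.xlabel('sample mean radius')
    plt.ylabel('count')
    plt.title(f'sampling distribution of the mean ({r} samples of size {n})')
    plt.show()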


Even with 3000 samples, that histogram looks a bit ragged.  Mathematically, the distribution we want is the one we get in the limit as the number of samples tends to infinity.  But here we are programming, so we want a number of samples that is ‘big enough’.

The chart below uses a population with a standard normal distribution (mean zero, standard deviation 1), and shows the sampling distribution that results from calculating the mean of the samples.  We can see that by the time we have 100,000 samples, the histogram is pretty smooth.


So in what follows, we use 100,000 samples in order to get a good view of the underlying distribution.  And the population will always have a normal distribution.

In the above example, there were 10 items per sample.  How does the size of the sample (that is, the number of items in each sample, not the number of different samples) affect the distribution?

The chart below takes 100,000 samples of different sizes (3, 5, 10, 20), and shows the sampling distributions of (top) the means and (middle) the standard deviations of those samples.  The bottom chart is a scatter plot of the (mean, sd) for each sample.
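A sketch of how such a chart could be generated is below; the binning, layout and styling are guesses rather than the original code.

    # For each sample size, draw 100,000 samples from a standard normal
    # population and record the mean and standard deviation of each sample.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    r = 100_000                      # number of samples
    sample_sizes = [3, 5, 10, 20]    # items per sample

    fig, (ax_mean, ax_sd, ax_scatter) = plt.subplots(3, 1, figsize=(6, 10))
    for n in sample_sizes:
        samples = rng.standard_normal(size=(r, n))
        means = samples.mean(axis=1)
        sds = samples.std(axis=1, ddof=1)          # sample standard deviation
        ax_mean.hist(means, bins=100, histtype='step', label=f'n={n}')
        ax_sd.hist(sds, bins=100, histtype='step', label=f'n={n}')
        ax_scatter.scatter(means, sds, s=1, alpha=0.1)

    ax_mean.set_xlabel('sample mean'); ax_mean.legend()
    ax_sd.set_xlabel('sample standard deviation'); ax_sd.legend()
    ax_scatter.set_xlabel('sample mean'); ax_scatter.set_ylabel('sample sd')
    plt.tight_layout()
    plt.show()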


So we can see a clear effect of the size of samples drawn from the normal distribution:
  • for larger samples, the sample means are more closely distributed around the population mean (of 0) – larger samples give a better estimate of the underlying population mean
  • for larger samples, the sample standard deviations are more closely and more symmetrically distributed around the population std dev (of 1) – larger samples give a better estimate of the underlying population s.d.
The distribution of means varies with the underlying distribution (its mean and standard deviation) as well as the size of samples taken. We can reduce this effect by calculating a different statistic, the t-statistic, rather than the sample mean x̄:

t = (x̄ − μ) / (s / √n)

The chart below shows how the distribution of the t-statistic varies with sample size.
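As a small worked example, here is the t-statistic computed directly from that formula for a single sample; the data values and the null-hypothesis mean μ = 0 are made up for illustration.

    # The t-statistic for a single sample, written out directly.
    import numpy as np

    sample = np.array([1.2, -0.3, 0.8, 1.9, 0.1, -0.5, 1.1, 0.4, 0.9, 1.5])
    mu = 0.0                        # population mean under the null hypothesis
    n = len(sample)
    x_bar = sample.mean()
    s = sample.std(ddof=1)          # sample standard deviation

    t = (x_bar - mu) / (s / np.sqrt(n))
    print(f"t = {t:.3f}")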


In the limit that the number of samples tends to infinity, and where the underlying population has a normal distribution with a mean of μ, this is the ‘t-distribution with n − 1 degrees of freedom’. Overlaying the plots above with the t-distribution shows a good fit.
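A sketch of how such an overlay could be produced: simulate the t-statistic over many samples from a standard normal population, then draw the theoretical density from scipy.stats.t on top. The sample sizes and random seed are illustrative choices.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    r, mu = 100_000, 0.0

    for n in [3, 5, 10, 20]:
        samples = rng.normal(loc=mu, scale=1.0, size=(r, n))
        t_vals = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
        plt.hist(t_vals, bins=200, range=(-6, 6), density=True,
                 histtype='step', label=f'simulated, n={n}')
        xs = np.linspace(-6, 6, 400)
        plt.plot(xs, stats.t.pdf(xs, df=n - 1), label=f't-distribution, {n - 1} d.o.f.')

    plt.xlabel('t')
    plt.legend()
    plt.show()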


The t-distribution does depend on the sample size, but not as extremely as the distribution of means:

Note that the underlying distribution being sampled is normal with a mean of μ, but the sd is not specified. To check whether this is important, the chart below shows the sampling distribution with a sample size of 10, from normal distributions with a variety of sds:


But what if we have the population mean wrong? That is, what if we assume that our samples are drawn from a population with a mean of μ (the μ used in the t-statistic), but they are actually drawn from a population with a different mean?  The chart below shows the experimental sampling distribution, compared to the theoretical t-distribution:


So, if the underlying distribution is normal with the assumed mean, we get a good match, and if it isn’t, we don’t. This is the basis of the t-test.

  • First, define t_crit to be the value of t such that the area under the sampling distribution curve outside t_crit is α (α is typically 0.05, or 0.01).
  • Calculate t_obs for your sample.  The probability of it falling outside t_crit if the null hypothesis holds is α, a small value.  The test says that if this happens, it is more likely that the null hypothesis does not hold than that you were unlucky, so reject the null hypothesis (with confidence 1 − α, 95% or 99% in the cases above).  A small code sketch of this procedure appears after the list of cases below.
  • There are four cases, illustrated in the chart below for three different sample sizes:
    • Your population has a normal distribution with mean μ
      • The t-statistic of your sample falls inside t_crit (with high probability 1 − α).  You correctly fail to reject the null hypothesis: a true negative.
      • The t-statistic of your sample falls outside t_crit (with low probability α).  You incorrectly reject the null hypothesis: a false positive.
    • Your population has a normal distribution, but with a mean different from μ
      • The t-statistic of your sample falls inside t_crit. You incorrectly fail to reject the null hypothesis: a false negative.
      • The t-statistic of your sample falls outside t_crit.  You correctly reject the null hypothesis: a true positive.
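Here is a minimal sketch of that procedure, written one-sided for simplicity; the sample data and the assumed population mean μ are made up for illustration. The last line shows scipy’s built-in (two-sided) version of the same test.

    import numpy as np
    from scipy import stats

    sample = np.array([0.9, 1.4, 0.2, 1.1, 0.7, 1.8, 0.5, 1.2, 0.3, 1.0])
    mu = 0.0              # population mean under the null hypothesis
    alpha = 0.05          # significance level

    n = len(sample)
    t_obs = (sample.mean() - mu) / (sample.std(ddof=1) / np.sqrt(n))
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)      # one-sided critical value

    if t_obs > t_crit:
        print(f"t_obs = {t_obs:.2f} > t_crit = {t_crit:.2f}: reject the null hypothesis")
    else:
        print(f"t_obs = {t_obs:.2f} <= t_crit = {t_crit:.2f}: fail to reject the null hypothesis")

    # scipy provides the same test directly (two-sided by default):
    print(stats.ttest_1samp(sample, popmean=mu))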

Note that there is a large proportion of false negatives (red areas in the chart above).  You may have only a small chance of incorrectly rejecting the null hypothesis when it holds (high confidence), but may still have a large chance of incorrectly failing to reject it when it is false (low statistical power).  You can reduce this second error by increasing your sample size, as shown in the chart below (notice how the red area reduces as the sample size increases).
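A rough way to see this numerically is to simulate the power directly: the fraction of samples, drawn from a population whose true mean differs from the assumed μ, for which the test rejects the null hypothesis. The true mean of 0.5 and the sample sizes below are my own illustrative assumptions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    mu, true_mean, alpha, r = 0.0, 0.5, 0.05, 100_000

    for n in [3, 5, 10, 20, 50]:
        samples = rng.normal(loc=true_mean, scale=1.0, size=(r, n))
        t_vals = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
        t_crit = stats.t.ppf(1 - alpha, df=n - 1)
        power = (t_vals > t_crit).mean()        # fraction of correct rejections
        print(f"n = {n:3d}: power = {power:.2f}")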


So that is the detail of the t-test assuming a normal distribution, but the same underlying philosophy holds for other tests and other distributions: a given observation is unlikely if the null hypothesis holds, assuming some properties of the sampling distribution (here, that the underlying population is normal), so reject the null hypothesis with high confidence.

But what about the t-test if the sample is drawn from a non-normal distribution?  Well, it doesn’t work, because the calculation of t_crit is derived from the t-distribution, which assumes an underlying normal population distribution.
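One way to explore this for yourself is to repeat the simulation with a non-normal population and compare the resulting t-statistics with the t-distribution the test assumes. The exponential population and sample size below are illustrative choices, not taken from the post.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    r, n = 100_000, 10
    mu = 1.0        # true mean of an exponential distribution with scale 1

    samples = rng.exponential(scale=1.0, size=(r, n))
    t_vals = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

    plt.hist(t_vals, bins=200, range=(-6, 6), density=True,
             histtype='step', label='simulated t, exponential population')
    xs = np.linspace(-6, 6, 400)
    plt.plot(xs, stats.t.pdf(xs, df=n - 1), label=f't-distribution, {n - 1} d.o.f.')
    plt.xlabel('t')
    plt.legend()
    plt.show()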


2 comments:

  1. A nice exposition. One point to clarify is that the t distribution doesn't arise only when the population is normal; it's only the distribution of sample means drawn from that population which needs to be. This is good because the distribution of sample means often is (approximately) normal even when the population isn't because of the central limit theorem.
