Friday 27 July 2018

a student of the t-test

Statistical tests can seem like magic, but many of them work on the same underlying principle.  The starting point is usually some null hypothesis (which you hope to refute):

  • you have some (possibly conceptual) population \(P\) of individuals, from which you can take a sample
  • there is some statistic \(S\) of interest for your problem (the mean, the standard deviation, ...)
  • you have a null hypothesis that your population \(P\) has the same value of the statistic as some reference population (for example, null hypothesis: the mean size of my treated population is the same as the mean size of the untreated population)

A statistical test of this hypothesis can work as follows:

  • the statistic \(S\) will have as a corresponding sampling distribution \(X\)
    • this is the distribution of the original statistic \(S\) if you measure it over many samples (for example, if you measure the mean of a sample, some samples will have a low mean – you were unlucky enough to pick all small members, sometimes a high mean – you just happened to pick all large members, but mostly a mean close to the true population mean)
    • if you assume the population has some specific distribution (eg a normal distribution), you can often calculate this distribution \(X\)
  • when you look at your actual experimental sample, you see where it lies within this sampling distribution
    • the sampling distribution \(X\) has some critical value, \(x_{crit}\), dependent on your desired significance \(\alpha\); only a small proportion of the distribution lies beyond \(x_{crit}\)
    • your experimental sample has a value of the statistic \(S\), let’s call it \(x_{obs}\); if \(x_{obs} > x_{crit}\), it lies in that very small proportion of the distribution; this is deemed to be sufficiently unlikely to have happened by chance, and it is more likely that the null hypothesis doesn’t hold; so you reject the null hypothesis at the \(1-\alpha\) confidence level 

This is all a bit abstract, so let’s look in detail at how it works for the Student’s \(t\)-test.  (This test was invented by the statistician William Sealy Gosset, who published under the pseudonym Student, hence the name.)  One of the nice things about having access to a programming language is that we can look at actual samples and distributions, to get a clearer intuition of what is going on.  I’ve used Python to draw all the charts below.

Let’s assume we have the following conceptual setup. We have a population of items, with a population mean \(\mu\) and population variance \(\sigma^2\). From this we draw a sample of \(n\) items. We calculate a statistic of that sample (for example, the sample mean \(\bar{x}\) or the sample variance \(s^2\)). We then draw another sample (either assuming we are drawing ‘with replacement’, or assuming that the population is large enough that this doesn’t matter), and calculate its relevant statistic. We do this \(r\) times, so we have \(r\) values of the statistic, giving us a distribution of that statistic, which we show in a histogram.  Here is the process for a population (of circles) with a uniform distribution of sizes (radius of the circles) between 0 and 300, and hence a mean size of 150.
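The sampling process described above can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the code behind the charts: the helper name `sample_means`, the choice of 10 items per sample, and the fixed seed are all assumptions made for the example.

```python
import random
import statistics

def sample_means(draw_item, n, r, seed=0):
    """Draw r samples of n items each and return the r sample means."""
    rng = random.Random(seed)
    return [statistics.fmean(draw_item(rng) for _ in range(n)) for _ in range(r)]

# Population of circle radii, uniform between 0 and 300 (true mean 150).
# n = 10 items per sample is an assumption; r = 3000 samples matches the text.
means = sample_means(lambda rng: rng.uniform(0, 300), n=10, r=3000)

print(f"mean of sample means: {statistics.fmean(means):.1f}")
print(f"sd of sample means:   {statistics.stdev(means):.1f}")
```

The list `means` is what gets binned into the histogram: most values sit near 150, with fewer and fewer as you move away from it.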


Even with 3000 samples, that histogram looks a bit ragged.  Mathematically, the distribution we want is the one we get in the limit as the number of samples tends to infinity.  But here we are programming, so we want a number of samples that is ‘big enough’.

The chart below uses a population with a standard normal distribution (mean zero, standard deviation 1), and shows the sampling distribution that results from calculating the mean of the samples.  We can see that by the time we have 100,000 samples, the histogram is pretty smooth.


So in what follows, we use 100,000 samples in order to get a good view of the underlying distribution.  And the population will always be a normal distribution.

In the above example, there were 10 items per sample.  How does the size of the sample (that is, the number of items in each sample, not the number of different samples) affect the distribution?

The chart below takes 100,000 samples of different sizes (3, 5, 10, 20), and shows the sampling distributions of (top) the means and (middle) the standard deviations of those samples.  The bottom chart is a scatter plot of the (mean, sd) for each sample.


So we can see a clear effect of the size of samples drawn from the normal distribution:
  • for larger samples, the sample means are more closely distributed around the population mean (of 0) – larger samples give a better estimate of the underlying population mean
  • for larger samples, the sample standard deviations are more closely and more symmetrically distributed around the population std dev (of 1) – larger samples give a better estimate of the underlying population s.d.
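Both effects can be checked numerically as well as visually. The sketch below assumes a standard normal population and uses 10,000 samples per size rather than 100,000, purely to keep a pure-Python run short; the function name `simulate` is made up for the example. For a normal population the spread of the sample means should be close to \(\sigma/\sqrt{n}\), which is printed alongside for comparison.

```python
import math
import random
import statistics

def simulate(n, r, seed=0):
    """Return (means, sds) for r samples of n items from a standard normal."""
    rng = random.Random(seed)
    means, sds = [], []
    for _ in range(r):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        means.append(statistics.fmean(sample))
        sds.append(statistics.stdev(sample))
    return means, sds

R = 10_000  # fewer than the 100,000 used for the charts, to keep the run short
spread = {}
for n in (3, 5, 10, 20):
    means, sds = simulate(n, R)
    # Spread of the sample means, and spread of the sample standard deviations.
    spread[n] = (statistics.stdev(means), statistics.stdev(sds))
    print(f"n={n:2d}  sd of means={spread[n][0]:.3f}  "
          f"(1/sqrt(n)={1 / math.sqrt(n):.3f})  sd of sds={spread[n][1]:.3f}")
```

Both columns shrink as the sample size grows, which is the narrowing visible in the histograms.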
The distribution of means varies with the underlying distribution (its mean and standard deviation) as well as the size of samples taken. We can reduce this effect by calculating a different statistic, the \(t\)-statistic, rather than the sample mean \(\bar{x}\).
$$ t = \frac{\bar{x}-\mu}{s / \sqrt{n}}$$
The chart below shows how the distribution of the \(t\)-statistic varies with sample size.
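As a concrete instance of the formula, here is a small sketch that computes the \(t\)-statistic of a single sample. The function name `t_statistic` and the example numbers are illustrative only.

```python
import math
import statistics

def t_statistic(sample, mu):
    """t = (x_bar - mu) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # uses the n - 1 denominator
    return (x_bar - mu) / (s / math.sqrt(n))

t = t_statistic([1, 2, 3, 4, 5], mu=2)
print(round(t, 4))  # 1.4142
```

Here \(\bar{x} = 3\), \(s^2 = 2.5\) and \(n = 5\), so \(t = 1/\sqrt{0.5} = \sqrt{2}\).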


In the limit as the number of samples tends to infinity, and provided the underlying population has a normal distribution with a mean of \(\mu\), the sampling distribution of the \(t\)-statistic is the ‘\(t\)-distribution with \(n-1\) degrees of freedom’. Overlaying the plots above with the \(t\)-distribution shows a good fit.
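The same comparison can be made with numbers instead of an overlay. The sketch below writes out the standard density of the \(t\)-distribution using `math.gamma`, and compares it with the fraction of simulated \(t\)-statistics falling in a few bins. The names `t_pdf` and `simulate_t`, the bin width, and the use of 10,000 samples are choices made for this example; comparing a bin's density with the curve at the bin's midpoint is only approximate.

```python
import math
import random
import statistics

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def simulate_t(n, r, mu=0.0, sigma=1.0, seed=0):
    """t-statistics of r samples of n items from a normal(mu, sigma) population."""
    rng = random.Random(seed)
    ts = []
    for _ in range(r):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        se = statistics.stdev(sample) / math.sqrt(n)
        ts.append((statistics.fmean(sample) - mu) / se)
    return ts

n, r, width = 10, 10_000, 0.5
ts = simulate_t(n, r)
for lo in (-2.0, -1.0, 0.0, 1.0):
    # Histogram density in [lo, lo + width) versus the curve at the bin midpoint.
    observed = sum(lo <= t < lo + width for t in ts) / (r * width)
    expected = t_pdf(lo + width / 2, n - 1)
    print(f"[{lo:+.1f}, {lo + width:+.1f})  simulated={observed:.3f}  t-pdf={expected:.3f}")
```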


The \(t\)-distribution does depend on the sample size, but not as extremely as the distribution of means:

Note that the underlying distribution being sampled is normal with a mean of \(\mu\), but the sd is not specified. To check whether this is important, the chart below shows the sampling distribution with a sample size of 10, from normal distributions with a variety of sds:
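One rough numerical check of this is to see how often \(|t|\) exceeds a fixed threshold for populations with different sds. The sketch below uses 2.262, the tabulated two-sided 5% critical value for 9 degrees of freedom (sample size 10); the particular sds, the seeds, and the helper name `tail_fraction` are assumptions made for the example. If the sd does not matter, each fraction should come out near 0.05.

```python
import math
import random
import statistics

T_CRIT_DF9 = 2.262  # tabulated two-sided 5% critical value for 9 degrees of freedom

def tail_fraction(sigma, n=10, r=10_000, mu=0.0, seed=0):
    """Fraction of simulated |t| values beyond T_CRIT_DF9 for a normal(mu, sigma) population."""
    rng = random.Random(seed)
    beyond = 0
    for _ in range(r):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        t = (statistics.fmean(sample) - mu) / (statistics.stdev(sample) / math.sqrt(n))
        beyond += abs(t) > T_CRIT_DF9
    return beyond / r

# A different seed per sd, so the three runs use independent random samples.
fractions = {sigma: tail_fraction(sigma, seed=i) for i, sigma in enumerate((0.1, 1.0, 10.0))}
for sigma, frac in fractions.items():
    print(f"sd={sigma:5.1f}  fraction with |t| > {T_CRIT_DF9}: {frac:.3f}")
```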


But what if we have the population mean wrong? That is, what if we assume that our samples are drawn from a population with a mean of \(\mu\) (the \(\mu\) used in the \(t\)-statistic), but it is actually drawn from a population with a different mean?  The chart below shows the experimental sampling distribution, compared to the theoretical \(t\)-distribution:


So, if the underlying distribution is normal with the assumed mean, we get a good match, and if it isn’t, we don’t. This is the basis of the \(t\)-test.

  • First, define \(t_{crit}\) to be the value for \(t\) such that the area under the sampling distribution curve outside \(t_{crit}\) is \(\alpha\) (\(\alpha\) is typically 0.05, or 0.01).
  • Calculate \(t_{obs}\) of your sample.  The probability of it falling outside \(t_{crit}\) if the null hypothesis holds is \(\alpha\), a small value.  The test says that if this happens, it is more likely that the null hypothesis does not hold than that you were unlucky, so reject the null hypothesis (with confidence \(1-\alpha\), 95% or 99% in the cases above).
  • There are four cases, illustrated in the chart below for three different sample sizes:
    • Your population has a normal distribution with mean \(\mu\)
      • The \(t\)-statistic of your sample falls inside \(t_{crit}\) (with high probability \(1-\alpha\)).  You correctly fail to reject the null hypothesis: a true negative.
      • The \(t\)-statistic of your sample falls outside \(t_{crit}\) (with low probability \(\alpha\)).  You incorrectly reject the null hypothesis: a false positive.
    • Your population has a normal distribution, but with a mean different from \(\mu\)
      • The \(t\)-statistic of your sample falls inside \(t_{crit}\). You incorrectly fail to reject the null hypothesis: a false negative.
      • The \(t\)-statistic of your sample falls outside \(t_{crit}\).  You correctly reject the null hypothesis: a true positive.
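The four cases can be counted in a simulation. This is a sketch under stated assumptions rather than the code behind the chart: it uses a two-sided test, estimates \(t_{crit}\) empirically as the \(1-\alpha\) quantile of \(|t|\) under the null hypothesis (instead of looking it up from the \(t\)-distribution), and picks a true mean of 0.5 for the ‘null hypothesis is false’ population purely for illustration.

```python
import math
import random
import statistics

def t_statistic(sample, mu):
    """t = (x_bar - mu) / (s / sqrt(n))."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.fmean(sample) - mu) / se

def simulate_t(true_mean, mu, n, r, rng):
    """t-statistics (against assumed mean mu) for r samples from normal(true_mean, 1)."""
    return [t_statistic([rng.gauss(true_mean, 1) for _ in range(n)], mu) for _ in range(r)]

rng = random.Random(1)
n, r, alpha, mu = 10, 10_000, 0.05, 0.0

# Estimate the two-sided t_crit empirically: the (1 - alpha) quantile of |t| under the null.
null_ts = sorted(abs(t) for t in simulate_t(mu, mu, n, r, rng))
t_crit = null_ts[int((1 - alpha) * r)]

# Fresh samples: one population where the null holds, one where it does not.
# The alternative mean of 0.5 is an arbitrary illustrative choice.
holds = simulate_t(mu, mu, n, r, rng)
fails = simulate_t(0.5, mu, n, r, rng)

false_positive = sum(abs(t) > t_crit for t in holds) / r
false_negative = sum(abs(t) <= t_crit for t in fails) / r

print(f"t_crit ~ {t_crit:.2f}")
print(f"null holds: true negative {1 - false_positive:.3f}, false positive {false_positive:.3f}")
print(f"null fails: false negative {false_negative:.3f}, true positive {1 - false_negative:.3f}")
```

The false positive rate should land near \(\alpha\) by construction; the false negative rate depends entirely on how far the true mean is from \(\mu\) and on the sample size.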

Note that there is a large proportion of false negatives (red areas in the chart above).  You may have only a small chance of incorrectly rejecting the null hypothesis when it holds (high confidence), but may still have a large chance of incorrectly failing to reject it when it is false (low statistical power).  You can reduce this second error by increasing your sample size, as shown in the chart below (notice how the red area reduces as the sample size increases).
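The effect of sample size on the false negative rate can be estimated the same way. In this sketch the sample sizes, the true mean of 0.5, the 4,000 samples per case, and the empirical estimate of \(t_{crit}\) are all assumptions chosen for illustration; the exact rates will differ for other choices, but the downward trend should not.

```python
import math
import random
import statistics

def simulate_abs_t(true_mean, mu, n, r, rng):
    """|t| (against assumed mean mu) for r samples of n items from normal(true_mean, 1)."""
    values = []
    for _ in range(r):
        sample = [rng.gauss(true_mean, 1) for _ in range(n)]
        se = statistics.stdev(sample) / math.sqrt(n)
        values.append(abs(statistics.fmean(sample) - mu) / se)
    return values

rng = random.Random(2)
r, alpha, mu, true_mean = 4_000, 0.05, 0.0, 0.5  # true_mean is an illustrative choice

false_negative = {}
for n in (5, 10, 20, 40):
    # Empirical two-sided critical value from the null distribution for this n.
    t_crit = sorted(simulate_abs_t(mu, mu, n, r, rng))[int((1 - alpha) * r)]
    # Samples from the shifted population that still fall inside t_crit are false negatives.
    misses = sum(t <= t_crit for t in simulate_abs_t(true_mean, mu, n, r, rng))
    false_negative[n] = misses / r
    print(f"n={n:2d}  t_crit={t_crit:.2f}  false negative rate={false_negative[n]:.2f}")
```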


So that is the detail for the \(t\)-test assuming a normal distribution, but the same underlying philosophy holds for other tests and other distributions: a given observation is unlikely if the null hypothesis holds, assuming some properties of the sampling distribution (here that it is normal), so reject the null hypothesis with high confidence.

But what about the \(t\)-test if the sample is drawn from a non-normal distribution?  Well, it doesn’t work, because the calculation of \(t_{crit}\) is derived from the \(t\)-distribution, which assumes an underlying normal population distribution.


2 comments:

  1. A nice exposition. One point to clarify is that the t distribution doesn't arise only when the population is normal; it's only the distribution of sample means drawn from that population which needs to be. This is good because the distribution of sample means often is (approximately) normal even when the population isn't because of the central limit theorem.
