The p-value is extensively reported throughout the medical literature although its use is highly debated and not always well understood. While it is much less commonly referred to in science media articles, statistical significance testing does play an important role in interpreting research results and this post is dedicated to a (hopefully) simple explanation of what the p-value is, its uses and limitations.
So what is the p-value and why do we use it?
The p-value is a statistical measure used to assess whether an observed result differs meaningfully from a reference value that represents no change. That’s a textbook-style definition, but like most statistics it is much easier to set the definition aside and focus on an example of how it is used.
In our example let’s say we are interested in looking at whether there is a link between exposure to asbestos (a fibre commonly used as insulation in old (and not-so-old) buildings) and the development of lung cancer. In science experiments you start by outlining two scenarios: a ‘null’ hypothesis (remember it simply: the null hypothesis usually means no change) and an ‘alternative’ hypothesis (the alternative hypothesis is the idea or question that you are interested in testing). For our example the null hypothesis is that there is no association between asbestos and lung cancer (no change), and the alternative hypothesis is that there is an association between asbestos and lung cancer.
Imagine that in our hypothetical study we found that people exposed to asbestos were two times more likely to develop lung cancer than people who were not exposed. How can we then be sure that this is a real finding (even if other possible contributing factors, or ‘confounders’, such as smoking have been taken into account)? How can we be sure that this result didn’t happen purely by chance? To explain the role of chance, I’ll employ the familiar example of a coin toss. We know that there is a 50:50 chance (or 0.5 probability) that when you toss a coin, the head-side will land face up. In practice, however, if you toss a coin ten times, heads may come up, purely by chance, seven out of ten times; if you toss a coin 100 times, the number of heads will tend to come much closer to 50 (half).
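The coin-toss point is easy to check with a quick simulation. This is my own illustration in plain Python (not from the original post): with only ten tosses the proportion of heads can stray well away from half, but with many tosses it settles close to 0.5.

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

def proportion_heads(n_tosses):
    """Simulate n_tosses fair coin flips and return the fraction of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# Few tosses: chance can push the result far from 0.5.
# Many tosses: the proportion settles close to 0.5.
for n in (10, 100, 10_000):
    print(n, proportion_heads(n))
```

Running this a few times (with different seeds) shows the ten-toss result jumping around while the 10,000-toss result barely moves from 0.5.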
Assessing whether a finding is ‘real’ or whether it could have occurred by chance is where the p-value comes into play. The p-value is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true (i.e. there is no association between asbestos exposure and lung cancer). A small p-value suggests that the result is unlikely to be due to chance alone, and this provides some evidence that the null hypothesis is incorrect. Conversely, a large p-value indicates that chance could well be playing a role and that there may really be no association between asbestos exposure and lung cancer. This may seem black and white: a small p-value equals a real finding; a large p-value equals no real finding. So what is all the fuss about?
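To make that definition concrete, we can compute an actual p-value for the coin example. The sketch below (plain Python, stdlib only, my own illustration) asks: if the coin really is fair (the null hypothesis), what is the probability of seeing seven or more heads in ten tosses?

```python
from math import comb

def p_at_least(heads, tosses):
    """Probability of `heads` or more heads in `tosses` flips of a fair coin."""
    total = 2 ** tosses  # number of equally likely outcome sequences
    return sum(comb(tosses, k) for k in range(heads, tosses + 1)) / total

# Seven heads in ten tosses: how surprising is that under a fair coin?
print(p_at_least(7, 10))  # 0.171875
```

A p-value of about 0.17 means seven-or-more heads happens roughly one time in six with a fair coin, so this result provides no real evidence that the coin is biased.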
The grey zone
There are three main limitations or criticisms of the p-value:
- It is common practice in medical research to use a cut-off significance level of 0.05. Again using our example, this means that if our hypothetical study described above had a p-value of 0.049 the finding would be ‘statistically significant’ and we might conclude that exposure to asbestos is associated with lung cancer. However, if this same study had a p-value just 0.002 higher (0.051) it would then fall into the category of being ‘not statistically significant’ and we may therefore falsely claim that the evidence does not indicate that exposure to asbestos is linked to lung cancer.
- The p-value depends on sample size. In our example let us assume that exposure to asbestos is linked to lung cancer (current evidence suggests that this very much is the case); if we did an experiment to test this and used only a small number of people (a small sample size), the p-value may be larger than 0.05. This means that even though the alternative hypothesis is true, due to the small sample size the p-value is larger than the cut-off for statistical significance and we may falsely conclude that the null hypothesis is true (that there is no association between exposure to asbestos and the development of lung cancer). If we were then to re-run exactly the same experiment, but this time with a larger number of people, the p-value would likely be smaller, possibly indicating that the results are now statistically significant.
- Another criticism of the p-value is again related to the cut-off significance level of 0.05. This cut-off means that when the null hypothesis is actually true, about 5% of tests will still produce a ‘statistically significant’ result purely by chance. In other words, up to 1 in 20 tests could indicate a result is statistically significant even when it’s not.
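Both of the last two limitations can be demonstrated with a short simulation. The sketch below (plain Python, stdlib only, my own illustration using a simple z-test on simulated measurements rather than the asbestos study) shows first that the same real effect can be ‘missed’ with a small sample but detected with a large one, and second that when the null hypothesis is true, roughly 5% of tests still come out ‘significant’ at the 0.05 cut-off.

```python
import math
import random

random.seed(1)  # fixed seed so the illustration is repeatable

def two_sided_p(sample):
    """Two-sided z-test p-value for 'true mean is 0', assuming unit variance."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for standard normal Z

# 1) Sample size: the same real effect (true mean 0.3, not 0) tends to give
#    a large p-value with few observations and a tiny one with many.
effect = 0.3
small = [random.gauss(effect, 1) for _ in range(10)]
large = [random.gauss(effect, 1) for _ in range(500)]
print(two_sided_p(small))  # often above 0.05: a real effect, missed
print(two_sided_p(large))  # tiny: the same effect, now 'significant'

# 2) False positives: with the null hypothesis TRUE (mean really is 0),
#    about 5% of tests are still 'significant' at the 0.05 cut-off.
trials = 2000
false_positives = sum(
    two_sided_p([random.gauss(0, 1) for _ in range(30)]) < 0.05
    for _ in range(trials)
)
print(false_positives / trials)  # close to 0.05, i.e. about 1 in 20
```

The second part is worth dwelling on: nothing is ‘wrong’ with any individual test; a 1-in-20 false positive rate is built into the 0.05 cut-off itself.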
Why use it at all?
The p-value along with other statistical tests (such as confidence intervals – more about those later) can be a useful tool to help researchers and others interpret research results. Medical research plays an important role in health policies, doctor advice or practices and general public health advice. Before research is translated into policy or practice changes, it is important to understand if a study result might be due to chance or if it could be a real finding. It is also important to keep in mind the limitations of the p-value and to assess these in terms of the study design.
I drafted this post during a weekly writing group. When I briefly outlined what the post was about, two of my fellow writing group members immediately pointed me to an amusingly but aptly named article on the p-value and its limitations. For those who are interested in reading more, the article is called ‘The Earth Is Round (p < .05)’ and is by Jacob Cohen (PDF).