[Note: This post was partly inspired by a talk given at the NTEN ResearchED event held at Huntington School, York on 3rd May, 2014. Also note that I am not a statistician – which might appear obvious!]
Many of us in the educational community are at last coming around to the realisation that research does have something to offer. We read it all the time on social media and have all witnessed discussions that seem to go on for days about which research is best or how to be critical about things we’ve read. There is an assumption that if research supports our particular view then we have permission to take the high ground and shoot down all those who refuse to accept the ‘evidence’. Of course, we need evidence (that’s the whole point) but we also need to be critical of it.
A great deal of educational research is positivist; like psychology, educational research often assumes that outcomes can be measured using scientific principles. Anyone who is familiar with academic papers in psychology will have noticed that there are an awful lot of numbers and bizarre equations involved. The scientific method is one of hypothesis testing – to paraphrase Richard Feynman, the first thing you do is make a guess and then you test the guess by conducting an experiment. If your experiment doesn’t support your guess then your guess is wrong.
In psychology (and many social sciences) what we are looking for is statistical significance, the nuts and bolts of which are dependent upon statistical tests. The main criterion we use in order to establish significance is something called a p value (or probability value). Psychologists often set the threshold for significance at 5% and represent this using the statement p≤0.05.
What does p≤0.05 actually mean?
This is actually quite straightforward, despite the looks on my students’ faces when presented with it: “Maths! In psychology… nooooo.”
All it means is that, if the null hypothesis were true (i.e. the manipulation of the independent variable made no real difference), there would be a 5% (or less) chance of getting results at least as extreme as the ones we observed. Strictly speaking, it is not the chance that the results were due to something other than the independent variable, although it is often described that way. The 0.05 threshold is fairly arbitrary but there is a general consensus that it is a good place to start. We could set it higher but this would make it more likely that we accept our hypothesis based on a false positive (a Type 1 error), or we could set it lower – but then we are more likely to reject our hypothesis and retain our null hypothesis when, in fact, there was a real effect (a Type 2 error). So, sometimes that which is true is actually false and that which is false is actually true (cue Robin Thicke).
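For anyone who prefers to see the idea rather than read about it, here is a minimal sketch of one way a p value can be estimated: a permutation test, which repeatedly shuffles the scores between two groups and counts how often chance alone produces a difference at least as big as the one observed. The function name, the number of shuffles and the test scores are all made up for illustration; this is not the test any particular study used.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate a two-sided p value for the difference between two group means.

    Assumption: under the null hypothesis the group labels are interchangeable,
    so shuffling them shows what differences chance alone would produce.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            at_least_as_extreme += 1

    # Proportion of shuffles that matched or beat the observed difference
    return at_least_as_extreme / n_shuffles


# Hypothetical test scores, invented purely for this example
control = [52, 55, 49, 61, 58, 50, 47, 56]
intervention = [60, 63, 57, 66, 59, 62, 55, 64]

p = permutation_p_value(control, intervention)
print(f"Estimated p value: {p:.4f}")
print("Significant at p≤0.05" if p <= 0.05 else "Not significant at p≤0.05")
```

The returned number is the proportion of shuffles in which random relabelling did at least as well as the real grouping, which is the sense in which a small p value counts as evidence against the null hypothesis.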
P values are a hot topic at the moment, with many suggesting that effect size might be a better measure to use (there are problems here as well). Nevertheless, while p values remain so influential, we need to be mindful that errors do occur. More worrying, perhaps (if less common), is the phenomenon of p-hacking. P-hacking involves manipulating the data or the analysis until the p value dips below the threshold, thus creating a positive result. So a researcher might remove the data that prevents the threshold from being reached, for example by removing all the outliers, sometimes under the ruse that there was something ‘wrong’ with these results. P-hacking (and other such dubious practices) is often uncovered due to the inability to replicate the results – so be wary of single studies (especially if they are a few years old) with no recent studies to support them.
So, to claim that true and false (or right and wrong) are absolute in research is perhaps to misunderstand the workings of the scientific method as it applies to real people. Other factors such as bias, demand characteristics and individual differences can blur the lines even further. This is perhaps the reason for the oft-used line ‘research suggests’, because there is always the probability (however small) that a result we took to be statistically significant arose by chance.
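To show why selective outlier removal is a problem, here is a rough simulation sketch. Both groups are drawn from the same distribution, so any ‘significant’ result is a false positive by construction. The ‘honest’ analysis tests the full data once; the ‘hacked’ analysis also tries dropping the lowest or highest few scores from each group and keeps whichever p value is smallest. All names and settings are my own assumptions, and the p value uses a large-sample normal approximation rather than a proper t test, so treat the numbers as illustrative only.

```python
import random
from itertools import product
from math import sqrt
from statistics import NormalDist, mean, variance


def approx_p_value(a, b):
    """Two-sided p value for a difference in means (large-sample z approximation)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def trimmed_versions(scores, k=2):
    """The full sample first, then versions with the k lowest or k highest dropped."""
    ordered = sorted(scores)
    return [ordered, ordered[k:], ordered[:-k]]


def false_positive_rates(n_experiments=500, n_per_group=40, alpha=0.05, seed=1):
    """Return (honest_rate, hacked_rate) when there is no real effect at all."""
    rng = random.Random(seed)
    honest_hits = 0
    hacked_hits = 0
    for _ in range(n_experiments):
        # Same mean and spread for both groups: the null hypothesis is true.
        a = [rng.gauss(50, 10) for _ in range(n_per_group)]
        b = [rng.gauss(50, 10) for _ in range(n_per_group)]

        # Every combination of trimming choices; the first uses the full data.
        p_values = [
            approx_p_value(x, y)
            for x, y in product(trimmed_versions(a), trimmed_versions(b))
        ]
        honest_hits += int(p_values[0] <= alpha)   # one pre-planned test
        hacked_hits += int(min(p_values) <= alpha)  # report the best-looking test
    return honest_hits / n_experiments, hacked_hits / n_experiments


honest_rate, hacked_rate = false_positive_rates()
print(f"Honest false positive rate: {honest_rate:.3f}")
print(f"Hacked false positive rate: {hacked_rate:.3f}")
```

Because the hacked analysis always includes the honest test among its options, its false positive rate can never be lower than the honest one, and in practice the extra chances to ‘find’ an effect should push it higher.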