Being an adult can be tough. We get burdened with all kinds of responsibilities and we’re expected to know what to do. I am probably not the only one who sometimes thinks: “Who told me to adult? I can’t adult!” Recently, I became a PhD student. I think that doing science is one of the coolest things you can get paid to do. Yet I am often madly confused, and all I can think is:
Who told me to science? I don’t know how to science?!
The main instigator of my confusion is my continuously growing awareness that a lot of my ideas about how to conduct science are wrong. Few things are as frustrating as discovering that what you thought was basic knowledge turns out to be demonstrably false. I will share some of the misconceptions I struggle(d) with.
The insignificance of p-values
Science is about many things, but it is certainly about evidence; this is often where statistics comes in. Of all the statistical metrics, the p-value is certainly the most (ab)used. It quantifies the amount of evidence we have. It tells us whether or not a finding is due to chance. It tells us which hypothesis is more likely to be true. If we find a ‘statistically significant difference’ we can refute the null hypothesis and accept the alternative hypothesis. And finally, because we use p = 0.05 as a cut-off point, only 5% of the significant findings will be false positives.
The (not so) funny thing is that all of these statements are false. The simple fact is that p-values cannot quantify evidence for or against a hypothesis. This is frustrating because this is how we want to use p-values. However, there is not a single metric in classical statistics which can quantify the likelihood of one hypothesis over another. Another frightening notion is that much more than 5% of significant findings are false positives (see this and this).
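To see how the share of false positives among significant findings can end up well above 5%, here is a minimal sketch in Python. The numbers are assumptions chosen purely for illustration, not figures from the sources linked above: 1 in 10 tested hypotheses describes a real effect, studies have 80% power, and the cut-off is 0.05. The function name is my own.

```python
def false_positive_share(prior_true, power, alpha=0.05):
    """Share of significant results that are false positives.

    prior_true: assumed fraction of tested hypotheses where the effect is real
    power:      assumed chance of a significant result when the effect is real
    alpha:      significance cut-off, i.e. the chance of a significant
                result when there is no effect
    """
    false_pos = alpha * (1 - prior_true)  # no effect, yet significant
    true_pos = power * prior_true         # real effect, and significant
    return false_pos / (false_pos + true_pos)


# Hypothetical scenario: 1 in 10 tested effects is real, 80% power.
share = false_positive_share(prior_true=0.10, power=0.80)
print(f"{share:.0%} of significant findings are false positives")
```

Under these assumed numbers roughly a third of the significant findings are false positives. The 5% only describes how often a study without a real effect comes out significant; it says nothing about how many of the significant studies had no real effect.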
Confidence in confidence intervals?
There are of course many metrics other than p-values, for example the confidence interval. A 95% confidence interval is commonly thought to give us an interval of which we can be 95% confident that it includes the true value. Again, this common interpretation is incorrect. It even has a name: the Fundamental Confidence Fallacy. Other typical fallacies include the belief that the width of the interval conveys something about the accuracy of the measurement (the Precision Fallacy), or that values inside the interval are more likely than those outside of the interval (the Likelihood Fallacy). How common these misconceptions are was highlighted by a study which found that only 3% of researchers correctly interpreted confidence intervals, while 74% agreed with three or more incorrect interpretations.
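What the 95% does refer to is the procedure: if we repeated the same study many times, about 95% of the intervals computed this way would contain the true value. The sketch below simulates this. Everything in it is an assumption for illustration: normally distributed data, a true mean of 100, a known standard deviation of 15, and 25 observations per study.

```python
import random
from statistics import NormalDist, mean


def coverage(true_mean=100.0, sigma=15.0, n=25, n_studies=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)    # about 1.96
    half_width = z * sigma / n ** 0.5  # sigma is treated as known here
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = mean(sample)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / n_studies


print(coverage())  # close to 0.95 in the long run
```

Note what the simulation does not show: it says nothing about how confident we can be in any single interval once it has been computed, which is exactly the step the Fundamental Confidence Fallacy takes.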
Effect sizes and correlations
What about effect sizes and correlations? Certainly, they must be informative?! Yes, they can be. However, just as with p-values and confidence intervals, the correct interpretation and use of effect sizes and correlations can differ substantially from common practice. For example, a correlation estimate in a study with around 20 participants is often so unreliable that a correlation of r = 0.40 might just as well be .07 or .65. To reliably estimate a correlation you will need hundreds of participants, while most studies use fewer than 50. Additionally, there is the misconception that the size of an effect or correlation also tells you something about the size of the evidence.
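One way to get a feel for this uncertainty is to compute an approximate interval around an observed correlation at different sample sizes. The sketch below uses the Fisher z-transformation, which is my choice for illustration and not necessarily the method behind the numbers quoted above; it assumes roughly bivariate normal data, and the function name is made up.

```python
import math
from statistics import NormalDist


def correlation_interval(r, n, level=0.95):
    """Approximate interval for a correlation via the Fisher z-transformation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z = math.atanh(r)          # transform r to the z scale
    se = 1 / math.sqrt(n - 3)  # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# The same observed r = 0.40 at three sample sizes.
for n in (20, 50, 250):
    lo, hi = correlation_interval(0.40, n)
    print(f"n = {n:3d}: {lo:.2f} to {hi:.2f}")
```

With 20 participants the interval stretches from about zero to a large correlation; it only becomes reasonably narrow once the sample runs into the hundreds.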
Explorative versus confirmatory studies
Several decades ago, researchers already argued for a clear distinction between explorative and confirmatory evidence. It is common practice to explore a dataset to see if there are any unexpected but interesting findings. The trouble starts when you attempt to do a statistical significance test to see if the interesting finding is ‘real’. The validity and interpretation of a p-value depend on the sampling plan; without a pre-established sampling plan it becomes impossible to meaningfully interpret a p-value. As such, a ‘surprise finding’ should always be backed up by a replication study which has a pre-determined plan for sampling and analysis. Only such a study provides us with confirmatory evidence.
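A small simulation can show why the sampling plan matters. In the sketch below there is never a real effect. One researcher tests once at a fixed sample size of 100; the other peeks after every 10 observations and stops as soon as the result is significant. The setup (normal data with a known standard deviation of 1, a two-sided z-test, the specific sample sizes) is assumed for illustration only.

```python
import random
from statistics import NormalDist


def false_positive_rate(peek_every, n_max=100, n_studies=2000,
                        alpha=0.05, seed=1):
    """Share of null-effect studies that reach significance at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_studies):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # the null is true: mean 0, sd 1
            if n % peek_every == 0:       # time for a look at the data
                z = total / n ** 0.5      # z statistic with known sd = 1
                if abs(z) > z_crit:
                    false_positives += 1
                    break                 # stop collecting and 'publish'
    return false_positives / n_studies


print("fixed n = 100:   ", false_positive_rate(peek_every=100))
print("peeking every 10:", false_positive_rate(peek_every=10))
```

With a fixed plan the false positive rate stays near the nominal 5%, while peeking pushes it far above that, even though each individual test uses the same 0.05 cut-off.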
Although replications are extremely important for cumulative knowledge building, they are not yet common practice. What is more, when replications are done the results are often not that positive. Recently, the massive Reproducibility Project finished with well-powered replications of 100 published psychology studies. Only 39% of the effects could be replicated and the mean effect size was substantially lower than in the original studies. Does that mean that the remaining 61% are false positives? Not necessarily, but this project highlights the importance of not relying on a single study to draw any conclusion.
We’ve seen that many common statistical measures are not what they appear to be. Should we stop using p-values altogether? Some do argue this and say that Bayesian statistics is the better alternative. Others argue that we should simply be much more careful but that we can still meaningfully use classical statistics. Surely, we should move towards making pre-registration the standard. Additionally, we should perhaps ‘slow down science’ and replicate a finding several times before we are satisfied with the amount and quality of the evidence.
At the end of the day, I still know little about how to science. That is why I am glad that I am not alone; I have already learned so much from researchers such as Eric-Jan Wagenmakers, Daniel Lakens, Richard Morey, and many others. Furthermore, there is you, the reader of this blog. How do you think we should and should not do science?