Being an adult can be tough. We get burdened with all kinds of responsibilities and we’re expected to know what to do. I am probably not the only one who sometimes thinks: “Who told me to adult? I can’t adult!”. Recently, I became a PhD student. I think that doing science is one of the coolest things you can get paid to do. Yet, often I am madly confused and all I can think is:
Who told me to science? I don’t know how to science?!
The main instigator of my confusion is my continuously growing awareness that a lot of my ideas about how to conduct science are wrong. Few things are as frustrating as discovering that what you thought was basic knowledge turns out to be demonstrably false. I will share some of the misconceptions I struggle(d) with.
The insignificance of p-values
Science is about many things but it is certainly about evidence; this is often where statistics comes in. Of all the statistical metrics, the p-value is certainly the most (ab)used. It quantifies the amount of evidence we have. It tells us whether or not a finding is due to chance. It tells us which hypothesis is more likely to be true. If we find a ‘statistical difference’ we can refute the null hypothesis and accept the alternative hypothesis. And finally, because we use p = 0.05 as a cut-off point only 5% of the significant findings will be false positives.
The (not so) funny thing is that all of these statements are false. The simple fact is that p-values cannot quantify evidence for or against a hypothesis. This is frustrating because this is how we want to use p-values. However, there is not a single metric in classical statistics which can quantify the likelihood of one hypothesis over another. Another frightening notion is that much more than 5% of significant findings are false positives (see this and this).
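To see why far more than 5% of significant findings can be false positives, here is a minimal simulation sketch. Every number in it is an illustrative assumption rather than a value from the studies linked above: 10% of tested hypotheses are taken to be true, each study is a one-sample z-test with 20 observations, and a real effect is half a standard deviation.

```python
import math
import random
from statistics import NormalDist


def simulate_fdr(n_studies=20000, prior_true=0.10, effect=0.5, n=20,
                 alpha=0.05, seed=1):
    """Share of significant results that are false positives.

    Assumed setup (hypothetical): a fraction `prior_true` of studies test a
    real effect of size `effect` (in SD units) with `n` observations each.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    false_pos = true_pos = 0
    for _ in range(n_studies):
        real = rng.random() < prior_true
        # With known SD 1, the z statistic is distributed N(effect*sqrt(n), 1).
        shift = effect * math.sqrt(n) if real else 0.0
        z = rng.gauss(shift, 1.0)
        if abs(z) > z_crit:
            if real:
                true_pos += 1
            else:
                false_pos += 1
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    print(f"False discovery rate: {simulate_fdr():.2f}")
```

Under these assumptions only about 5% of the null studies come out significant, yet because true effects are rare and power is modest, a much larger share of all significant results are false. Raising `prior_true` or `n` brings that share down.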
Confidence in confidence intervals?
There are of course many metrics other than p-values, for example the confidence interval. A 95% confidence interval is commonly thought to give us an interval of which we can be 95% confident that it includes the true value. Again, this common interpretation is incorrect. It even has a name: the Fundamental Confidence Fallacy. Other typical fallacies include the belief that the width of the interval conveys something about the accuracy of the measurement (the Precision Fallacy), or that values inside the interval are more likely than those outside of the interval (the Likelihood Fallacy). How common these misconceptions are was highlighted by a study which found that only 3% of researchers correctly interpreted confidence intervals, while 74% agreed with three or more incorrect interpretations.
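In the usual frequentist reading, the 95% describes the long-run behaviour of the procedure that produces intervals, not any single interval. The sketch below illustrates that with a simulation. It assumes normally distributed data with a known standard deviation, so a simple z-interval can be used; the true mean, the sample size and the function name `coverage` are made up for illustration.

```python
import random
from statistics import NormalDist, fmean


def coverage(true_mean=10.0, sigma=2.0, n=25, reps=5000, level=0.95, seed=7):
    """Fraction of repeated experiments whose z-interval contains true_mean.

    Assumes sigma is known, which is rarely the case in practice.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5  # identical for every experiment here
    hits = 0
    for _ in range(reps):
        sample_mean = fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print(f"Share of intervals containing the true mean: {coverage():.3f}")
```

Across many repetitions roughly 95% of the intervals contain the true mean. Any one interval either contains it or does not, which is why the statement "I am 95% confident about this interval" does not follow.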
Effect sizes and correlations
What about effect sizes and correlations? Certainly, they must be informative?! Yes, they can be. However, just as with p-values and confidence intervals, the correct interpretation and use of effect sizes and correlations can differ substantially from common practice. For example, a correlation estimate in a study with 20ish participants is often so unreliable that a correlation of r = 0.40 might just as well be .07 or .65. To reliably estimate a correlation you will need hundreds of participants, while most studies use fewer than 50. Additionally, there is the misconception that the size of an effect or correlation also tells you something about the size of the evidence.
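How much a correlation estimate can wobble is easy to check for yourself. The sketch below uses the Fisher z-transformation to get an approximate confidence interval for a correlation. This is one common textbook method and an assumption on my part: the numbers quoted above may come from a different approach, so the exact bounds need not match. The function name `correlation_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Approximate CI for a Pearson correlation via the Fisher z-transform.

    Assumes bivariate normal data; the approximation needs n > 3.
    """
    if n <= 3:
        raise ValueError("n must be larger than 3")
    z = math.atanh(r)                 # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error on z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


if __name__ == "__main__":
    for n in (20, 50, 500):
        lo, hi = correlation_ci(0.40, n)
        print(f"n={n:>3}: r=0.40, approx. 95% CI [{lo:.2f}, {hi:.2f}]")
```

With a couple of dozen participants the interval around r = 0.40 spans most of the range from "no relationship" to "strong relationship"; it only becomes reasonably narrow with hundreds of participants.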
Explorative versus Confirmatory studies
A clear distinction between explorative and confirmatory evidence was already argued for several decades ago. It is common practice to explore a dataset to see if there are any unexpected but interesting findings. The trouble starts when you attempt to do a statistical significance test to see if the interesting finding is ‘real’. The validity and interpretation of a p-value depend on the sampling plan; without a pre-established sampling plan it becomes impossible to meaningfully interpret a p-value. As such, a ‘surprise finding’ should always be backed up by a replication study which has a pre-determined plan for sampling and analysis. Only such a study provides us with confirmatory evidence.
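One concrete way a missing sampling plan distorts p-values is optional stopping: checking the data repeatedly and stopping as soon as p < .05. The sketch below is my own illustration of that single scenario, not an analysis from the literature mentioned here. It assumes the null hypothesis is true, a z-test with known standard deviation, and a hypothetical researcher who peeks after every 10 observations up to a maximum of 100.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def study_is_significant(rng, peek, max_n=100, look_every=10, alpha=0.05):
    """Run one study under a true null; return True if it ends 'significant'."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)          # null is true: mean 0, SD 1
        at_look = peek and n % look_every == 0
        if at_look or n == max_n:
            z = total / math.sqrt(n)          # equals sample mean * sqrt(n)
            if two_sided_p(z) < alpha:
                return True                   # stop and report the finding
    return False


def false_positive_rate(peek, sims=4000, seed=3):
    rng = random.Random(seed)
    return sum(study_is_significant(rng, peek) for _ in range(sims)) / sims


if __name__ == "__main__":
    print(f"Fixed n = 100:     {false_positive_rate(peek=False):.3f}")
    print(f"Peeking every 10:  {false_positive_rate(peek=True):.3f}")
```

With a fixed sample size the false positive rate stays near the nominal 5%. With peeking it rises well above that, even though each individual test looks perfectly ordinary.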
Although replications are extremely important for cumulative knowledge building, they are not yet common practice. What is more, when replications are done the results are often not that positive. Recently, the massive Reproducibility Project finished with well-powered replications of 100 published psychology studies. Only 39% of the effects could be replicated and the mean effect size was substantially lower than in the original studies. Does that mean that the remaining 61% are false positives? Not necessarily, but this project highlights the importance of not relying on a single study to draw any conclusion.
We’ve seen that many common statistical measures are not what they appear to be. Should we stop using p-values altogether? Some do argue this and say that Bayesian statistics is the better alternative. Others argue that we should simply be much more careful but that we can still meaningfully use classical statistics. Surely, we should move towards making pre-registration the standard. Additionally, we should perhaps ‘slow down science’ and replicate a finding several times before we are satisfied with the amount and quality of the evidence.
At the end of the day, I still know little about how to science. That is why I am glad that I am not alone; I have already learned so much from researchers such as Eric-Jan Wagenmakers, Daniel Lakens, Richard Morey, and many others. Furthermore, there is you, the reader of this blog. How do you think we should and should not do science?
They also underlie Bayesian analysis. As you delve further into statistics they pop up all over the place. There are some gotchas, as Alexander alludes to in his reply. If you take the likelihood approach in its strong form (the SLP), it implies that the experiment is not relevant to the outcome and you can ignore the data collection mechanism. (speaking casually) Even the person who proposed the SLP (Birnbaum) eventually rejected that, and much of Bayesianism. As the name suggests, there are weak and strong forms and statisticians are still working on this.
We differ on that being nonsense. It makes perfect sense to me. The critiques I’ve seen tend to conflate deduction with induction and add side conditions that Fisher didn’t. If you take the argument as an inductive (hence fallible) argument, it makes sense. It seems nonsensical if you limit yourself to deduction. We also shouldn’t forget that Fisher placed emphasis on repeatability to verify the tentative induction, as well as on additivity (no interaction). The latter becomes very important when dealing with non-random samples and k-way designs.
Fisher says a lot of nonsense about p-values. One such example, “The level of significance in such cases fulfills the conditions of a measure of the rational grounds for the disbelief [in the null hypothesis] it engenders.” (Fisher, 1959; Statistical Methods and Scientific Inference)
This is strictly not possible, as pointed out in dozens (hundreds?) of papers. Fisher’s argument is based on intuitions and, when pressed, he could never substantiate this idea without appealing to “public knowledge” or other equally fuzzy concepts (Fisher, 1955, Statistical Methods and Scientific Induction).
I agree with you that this sentence in the second p-values paragraph, “However, there is not a single metric in classical statistics which can quantify the likelihood of one hypothesis over another.” (quite ironically) ignores the existence of likelihood ratios. Likelihoods, however, are not in general one to one with p-values. The classic example is that of the binomial vs negative binomial sampling, which I’m sure you’re familiar with. The likelihood ratio is invariant to sampling plan, whereas the p-value is demonstrably not.
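For readers who have not seen the binomial versus negative binomial example, here is a small sketch. The data (9 successes and 3 failures), the null value θ = 0.5 and the alternative θ = 0.75 are illustrative choices, and the function names are hypothetical. The same data are analysed under two sampling plans: "stop after 12 trials" and "stop at the 3rd failure".

```python
from math import comb, isclose


def binomial_p(k, n, theta=0.5):
    """One-sided p-value P(X >= k) when the number of trials n is fixed."""
    return sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
               for i in range(k, n + 1))


def negbinomial_p(k, r, theta=0.5):
    """One-sided p-value P(at least k successes before the r-th failure)."""
    below = sum(comb(i + r - 1, i) * theta**i * (1 - theta)**r
                for i in range(k))
    return 1 - below


def lik_binomial(theta, k, n):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)


def lik_negbinomial(theta, k, r):
    return comb(k + r - 1, k) * theta**k * (1 - theta)**r


if __name__ == "__main__":
    k, failures = 9, 3
    n = k + failures
    print(f"p (fixed n = 12):        {binomial_p(k, n):.4f}")            # 0.0730
    print(f"p (stop at 3rd failure): {negbinomial_p(k, failures):.4f}")  # 0.0327
    lr_b = lik_binomial(0.75, k, n) / lik_binomial(0.5, k, n)
    lr_nb = lik_negbinomial(0.75, k, failures) / lik_negbinomial(0.5, k, failures)
    print(f"LR 0.75 vs 0.5: {lr_b:.3f} and {lr_nb:.3f}, equal: {isclose(lr_b, lr_nb)}")
```

The p-value lands on different sides of .05 depending on the stopping rule, while the likelihood ratio is identical under both plans because the combinatorial constants cancel.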
Thanks for your comment. Interesting discussion on likelihood ratios! I haven’t dealt with those or even seen them being used in a long while now, but they indeed give you the likelihood of one hypothesis over another.
This makes me curious: why are likelihood ratios only used for specific instances? It seems like it would be a good measure for many common tests as well since model comparison is often exactly what we want to do. I guess you could even apply Bayes’ theorem and get posterior odds?
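For two simple hypotheses, the step from a likelihood ratio to posterior odds is the odds form of Bayes' theorem. As a sketch, with H_0 and H_1 denoting the two hypotheses and D the data (notation introduced here for illustration):

```latex
\[
\underbrace{\frac{P(H_1 \mid D)}{P(H_0 \mid D)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(D \mid H_1)}{P(D \mid H_0)}}_{\text{likelihood ratio}}
\times
\underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}}
\]
```

The likelihood ratio tells you how much the data should shift your odds, but the posterior odds still depend on the prior odds you start with.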
Either way, I’m going to be reading up on the rationale and use of likelihood ratios, thanks for the suggestions.
I wouldn’t call R.A.F. (a fellow of the royal stat society) a geneticist, but he was outstanding in most places he worked. Box started as a chemist. Statistics has all sorts of cross-over stars.
However, if you want to learn about p-values specifically, Fisher is quite explicit about what they are. That level of teaching typically occurs in (math) stat courses, not in applied courses. Having worked with biologists, psychologists and engineers, I find that they commonly have rather strange notions of p-values, which they attribute to their teachers.
Look again at the second paragraph on p-values, and think about calibration, which commonly requires monotonicity, and about likelihood ratios, which are pretty good at comparing the likelihood of two “hypotheses”, and are, oddly enough, also one-to-one with p-values.
Good luck on your Ph.D.!
1. It might help to find out what a p-value actually is before you critique them. The data are fixed and have a weight of one. The p-value tells you how well your assumed description (hypothesis) fits the data. That’s all. No probability involved, other than the simple Laplacian definition. It helps to get your statistics from statisticians, and your psychology from psychologists.
2. Why would you expect only 5% of significant results to be false? Do some simulations to build your intuition and then try it for real with, say, sub-samples or bootstrap samples.
3. Confidence is in the procedure, not in a particular outcome.
4. Constant effect sizes for humans are a bit of a fantasy. Very little has the same effect on everyone.
You say, “It helps to get your statistics from statisticians” as if they are privileged in knowing and developing statistical concepts. You might recall that some of the biggest strides in statistics were made by non-statisticians. Ronald Fisher was a geneticist, Harold Jeffreys was a physicist, Gosset was a chemist, Gibbs was an engineer/physicist, etc etc etc.
Thanks for your comment!
Note that I don’t critique p-values but only highlighted some misconceptions which I used to hold. That said, I do believe that p-values are often not ideal for what we want to know.
About the false discovery rate: An alpha level of .05 is generally thought to imply that you only get a Type 1 error (false positive) 5% of the time. However, in practice the false discovery rate is actually much higher as it depends on many other factors (prevalence, power, sensitivity, etc.). Although most of this does not depend on p-values I do think it’s problematic. I prefer my discoveries to be true, not false ;).
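The dependence on prevalence and power can be made explicit with a standard identity. As a sketch, write \pi for the share of tested hypotheses in which a real effect exists (prevalence), 1 - \beta for power and \alpha for the significance level. The symbols and the example values are illustrative assumptions, not numbers from the post.

```latex
\[
\mathrm{FDR}
= \frac{\alpha\,(1-\pi)}{\alpha\,(1-\pi) + (1-\beta)\,\pi}
\]

% Example with assumed values: alpha = 0.05, pi = 0.10, power = 0.80
\[
\mathrm{FDR}
= \frac{0.05 \times 0.90}{0.05 \times 0.90 + 0.80 \times 0.10}
= \frac{0.045}{0.125}
= 0.36
\]
```

So even with an alpha level of .05 and good power, more than a third of the significant results would be false in this scenario, simply because real effects are assumed to be rare.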
The “95% confidence” refers to the procedure, not the interval; absolutely. That’s also why the interpretation of a single CI is so tricky, and many argue it might not be informative at all.
Again, thanks for your comment! I think we agree on basically every point 🙂
Thanks for your summary of this discussion!
I think the best way to get better science is to begin at the very start: give students a feel for good and bad statistics and the opportunity to choose between orthodox and Bayesian statistics. Show them that the methods of science are open to discussion and don’t have to be accepted as they are! Most students want to argue and discuss, but when it comes to statistics or methodology they just have to swallow the material without getting a hint of what lies behind it. Maybe the best approach is to integrate these discussions into the seminars at the universities!
That’s my opinion as a student