Report likelihoods not p-values: FAQ

This page answers frequently asked questions about the Report likelihoods, not p-values proposal for experimental science.

(Note: This page is a personal opinion page.)

What does this proposal entail?

Let’s say you have a coin and you don’t know whether it’s biased or not. You flip it six times and it comes up HHHHHT.

To report a p-value, you have to first declare which experiment you were doing — were you flipping it six times no matter what you saw and counting the number of heads, or were you flipping it until it came up tails and seeing how long it took? Then you have to declare a “null hypothesis,” such as “the coin is fair.” Only then can you get a p-value, which in this case, is either 0.11 (if you were going to toss the coin 6 times regardless) or 0.03 (if you were going to toss until it came up tails). The p-value of 0.11 means “if the null hypothesis were true, then data with at least as many H values as the observed data would only occur 11% of the time, if the declared experiment were repeated many many times.”
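
For concreteness, here is where those two numbers come from under the fair-coin null hypothesis, counting outcomes at least as extreme as the five observed heads:

\[
\Pr(\text{at least 5 heads in 6 tosses}) = \frac{\binom{6}{5} + \binom{6}{6}}{2^6} = \frac{7}{64} \approx 0.11,
\qquad
\Pr(\text{at least 5 heads before the first tail}) = \left(\tfrac{1}{2}\right)^5 = \frac{1}{32} \approx 0.03.
\]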

To report a likelihood, you don’t have to do any of that “declare your experiment” stuff, and you don’t have to single out one special hypothesis. You just pick a whole bunch of hypotheses that seem plausible, such as the set of hypotheses \(H_b\) = “the coin has a bias of \(b\) towards heads” for \(b\) between 0% and 100%. Then you look at the actual data, and report how likely that data is according to each hypothesis. In this example, that yields a graph which looks like this:

[Graph: the likelihood \(\mathcal L(e \mid H_b)\) of the data \(e =\) HHHHHT, plotted as a function of the bias \(b\) from 0% to 100%.]

This graph says that HHHHHT is about 1.56% likely under the hypothesis \(H_{0.5}\) saying that the coin is fair, about 5.93% likely under the hypothesis \(H_{0.75}\) that the coin comes up heads 75% of the time, and only 0.17% likely under the hypothesis \(H_{0.3}\) that the coin comes up heads only 30% of the time.

That’s all you have to do. You don’t need to make any arbitrary choice about which experiment you were going to run. You don’t need to ask yourself what you “would have seen” in other cases. You just look at the actual data, and report how likely each hypothesis in your hypothesis class said that data should be.

(If you want to compare how well the evidence supports one hypothesis or another, you just use the graph to get a likelihood ratio between any two hypotheses. For example, this graph reports that the data HHHHHT supports the hypothesis \(H_{0.75}\) over \(H_{0.5}\) at odds of \(\frac{0.0593}{0.0156}\) = 3.8 to 1.)
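
For readers who want to check these numbers, here is a minimal Python sketch (not the original graphing code) that computes the likelihood of the exact sequence HHHHHT under each bias hypothesis:

```python
# Likelihood of the exact sequence HHHHHT (5 heads, 1 tail) under a coin with bias b.
def likelihood(b, heads=5, tails=1):
    return b**heads * (1 - b)**tails

for b in (0.3, 0.5, 0.75):
    print(f"L(HHHHHT | H_{b}) = {likelihood(b):.4f}")
# H_0.3  -> 0.0017 (about 0.17%)
# H_0.5  -> 0.0156 (about 1.56%)
# H_0.75 -> 0.0593 (about 5.93%)

# Likelihood ratio for H_0.75 over H_0.5:
print(round(likelihood(0.75) / likelihood(0.5), 1))   # about 3.8
```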

For more of an explanation, see Report likelihoods, not p-values.

Why would reporting likelihoods be a good idea?

Experimental scientists reporting likelihoods instead of p-values would likely help address many problems facing modern science, including p-hacking, the vanishing effect size problem, and publication bias.

It would also make it easier for scientists to combine the results from multiple studies, and it would make it much much easier to conduct meta-analyses.

It would also make scientific statistics more intuitive, and easier to understand.

Likelihood functions are a Bayesian tool. Aren’t Bayesian statistics subjective? Shouldn’t science be objective?

Likelihood functions are purely objective. In fact, there’s only one degree of freedom in a likelihood function, and that’s the choice of hypothesis class. This choice is no more arbitrary than the choice of a “null hypothesis” in standard statistics, and indeed, it’s significantly less arbitrary (you can pick a large class of hypotheses, rather than just one; and none of them needs to be singled out as subjectively “special”).

This is in stark contrast with p-values, which require that you pick an “experimental design” in advance, or that you talk about what data you “could have seen” if the experiment turned out differently. Likelihood functions only depend on the hypothesis class that you’re considering, and the data that you actually saw. (This is one of the reasons why likelihood functions would solve p-hacking.)

Likelihood functions are often used by Bayesian statisticians, and Bayesian statisticians do indeed use subjective probabilities, which has led some people to believe that reporting likelihood functions would somehow allow hated subjectivity to seep into the hallowed halls of science.

However, it’s the priors that are subjective in Bayesian statistics, not likelihood functions. In fact, according to the laws of probability theory, likelihood functions are precisely that-which-is-left-over when you factor out all subjective beliefs from an observation of evidence. In other words, probability theory tells us that likelihoods are the best summary there is for capturing the objective evidence that a piece of data provides (assuming your goal is to help make people’s beliefs more accurate).

How would reporting likelihoods solve p-hacking?

P-values depend on what experiment the experimenter says they had in mind. For example, if the data is HHHHHT and the experimenter says “I was planning to flip it six times and count the number of Hs” then the p-value (for the fair coin hypothesis) is 0.11, which is not “significant.” If instead the experimenter says “I was planning to flip it until I got a T” then the p-value is 0.03, which is “significant.” Experimenters can (and do!) misuse or abuse this degree of freedom to make their results appear more significant than they actually are. This is known as “p-hacking.”

In fact, when running complicated experiments, this can (and does!) happen to honest well-meaning researchers. Some experimenters are dishonest, and many others simply lack the time and patience to understand the subtleties of good experimental design. We don’t need to put that burden on experimenters. We don’t need to use statistical tools that depend on which experiment the experimenter had in mind. We can instead report the likelihood that each hypothesis assigned to the actual data.

Likelihood functions don’t have this “experiment” degree of freedom. They don’t care what experiment you thought you were doing. They only care about the data you actually saw. To use likelihood functions correctly, all you have to do is look at stuff and then not lie about what you saw. Given the set of hypotheses you want to report likelihoods for, the likelihood function is completely determined by the data.

But what if the experimenter tries to game the rules by choosing how much data to collect?

That’s a problem if you’re reporting p-values, but it’s not a problem if you’re reporting likelihood functions.

Let’s say there’s a coin that you think is fair, that I think might be biased 55% towards heads. If you’re right, then every toss is going to (in expected value) provide more evidence for “fair” than “biased.” But sometimes (rarely), even if the coin is fair, you will flip it and it will generate a sequence that supports the “bias” hypothesis more than the “fair” hypothesis.

How often will this happen? It depends on how exactly you ask the question. If you can flip the coin at most 300 times, then there’s about a 1.4% chance that at some point the sequence generated will support the hypothesis “the coin is biased 55% towards heads” 20x more than it supports the hypothesis “the coin is fair.” (You can verify this yourself, and tweak the parameters, using this code.)
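
The linked code isn’t reproduced here, but a rough sketch of that kind of simulation, using the parameters quoted above (at most 300 flips, a 20x support threshold), might look like this:

```python
import math
import random

def ever_supports_bias(max_flips=300, threshold=20.0):
    """Flip a genuinely fair coin up to max_flips times; return True if at any point
    the running likelihood ratio favors "biased 55% towards heads" over "fair"
    by at least the given threshold."""
    log_ratio = 0.0   # running log of L(biased 0.55) / L(fair)
    for _ in range(max_flips):
        if random.random() < 0.5:                 # heads
            log_ratio += math.log(0.55 / 0.5)
        else:                                     # tails
            log_ratio += math.log(0.45 / 0.5)
        if log_ratio >= math.log(threshold):
            return True
    return False

trials = 20_000   # a modest trial count; increase for a tighter estimate
hits = sum(ever_supports_bias() for _ in range(trials))
print(f"fraction of fair-coin runs that ever reach 20x support for 'biased': {hits / trials:.2%}")
# This should land in the neighborhood of the ~1.4% figure quoted above.
```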

This is an objective fact about coin tosses. If you look at a sequence of Hs and Ts generated by a fair coin, then some tiny fraction of the time, after some number \(n\) of flips, it will support the “biased 55% towards heads” hypothesis 20x more than it supports the “fair” hypothesis. This is true no matter how or why you decided to look at those \(n\) coin flips. It’s true if you were always planning to look at \(n\) coin flips since the day you were born. It’s true if each coin flip costs $1 to look at, so you decided to only look until the evidence supported one hypothesis at least 20x better than the other. It’s true if you have a heavy personal desire to see the coin come up biased, and were planning to keep flipping until the evidence supports “bias” 20x more than it supports “fair”. It doesn’t matter why you looked at the sequence of Hs and Ts. The amount by which it supports “biased” vs “fair” is objective. If the coin really is fair, then the more you flip it the more the evidence will push towards “fair.” It will only support “bias” a small unlucky fraction of the time, and that fraction is completely independent of your thoughts and intentions.

Likelihoods are objective. They don’t depend on your state of mind.

P-values, on the other hand, run into some difficulties. A p-value is about a single hypothesis (such as “fair”) in isolation. If the coin is fair, then all sequences of coin tosses of a given length are equally likely, so you need something more than the data in order to decide whether the data is “significant evidence” about fairness one way or the other. Which means you have to choose a “reference class” of ways the coin “could have come up.” Which means you need to tell us which experiment you “intended” to run. And down the rabbit hole we go.

The p-value you report depends on how many coin tosses you say you were going to look at. If you lie about where you intended to stop, the p-value breaks. If you’re out in the field collecting data, and the data just subconsciously begins to feel overwhelming, and so you stop collecting evidence (or if the data just subconsciously feels insufficient and so you collect more), then the p-value breaks. How badly do p-values break? If you can toss the coin at most 300 times, then by choosing when to stop looking, you can get a p < 0.05 significant result 21% of the time, and that’s assuming you are required to look at at least 30 flips. If you’re allowed to use small sample sizes, the number is more like 25%. You can verify this yourself, and tweak the parameters, using this code.
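
Again, the linked code isn’t reproduced here, but here is a sketch of that kind of optional-stopping simulation. It assumes a simple two-sided z-test on the number of heads as the significance test; the exact percentage depends on which test is used, but it lands in the same ballpark as the figures above:

```python
import math
import random

def reaches_significance(max_flips=300, min_flips=30, z_crit=1.96):
    """Flip a genuinely fair coin, checking after every flip (from min_flips onward)
    whether a two-sided z-test on the number of heads gives p < 0.05, and stopping
    as soon as it does."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += random.random() < 0.5
        if n >= min_flips:
            z = (heads - n / 2) / math.sqrt(n / 4)
            if abs(z) > z_crit:
                return True
    return False

trials = 20_000
hits = sum(reaches_significance() for _ in range(trials))
print(f"fraction of fair coins declared 'significantly' biased: {hits / trials:.1%}")
# Roughly a fifth of runs reach "significance" at some point, as quoted above.
```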

It’s no wonder that p-values are so often misused! To use p-values correctly, an experimenter has to meticulously report their intentions about the experimental design before collecting data, and then has to adhere, utterly unfaltering, to that experimental design as the data comes in (even if it becomes clear that their experimental design was naive, and that there were crucial considerations that they failed to take into account). Using p-values correctly requires good intentions, constant vigilance, and inflexibility.

Contrast this with likelihood functions. Likelihood functions don’t depend on your intentions. If you start collecting data until it looks overwhelming and then stop, that’s great. If you start collecting data and it looks underwhelming so you keep collecting more, that’s great too. Every new piece of data you do collect will support the true hypothesis more than any other hypothesis, in expectation — that’s the whole point of collecting data. Likelihood functions don’t depend upon your state of mind.

What if the experimenter uses some other technique to bias the result?

They can’t. Or, at least, it’s a theorem of probability theory that they can’t. This law is known as conservation of expected evidence, and it says that for any hypothesis \(H\) and any piece of evidence \(e\), \(\mathbb P(H) = \mathbb P(H \mid e) \mathbb P(e) + \mathbb P(H \mid \lnot e) \mathbb P(\lnot e),\) where \(\mathbb P\) stands for my personal subjective probabilities.
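
Here is a tiny numeric check of that law for the coin example. The prior of 0.3 on the “biased” hypothesis is an arbitrary illustrative choice; any prior gives the same result:

```python
# Hypotheses: "fair" vs. "biased 0.55 towards heads". Evidence e = "the next flip is heads".
p_biased = 0.3                                   # my prior probability of "biased"
p_fair = 1 - p_biased

p_e = p_biased * 0.55 + p_fair * 0.5             # P(e), the probability of heads
p_not_e = 1 - p_e

# Bayes' rule: P(biased | e) = P(e | biased) * P(biased) / P(e)
p_biased_given_e = 0.55 * p_biased / p_e
p_biased_given_not_e = 0.45 * p_biased / p_not_e

# Conservation of expected evidence: the prior equals the expected posterior.
expected_posterior = p_biased_given_e * p_e + p_biased_given_not_e * p_not_e
print(p_biased, expected_posterior)              # both are 0.3 (up to floating-point rounding)
```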

Imagine that I’m going to take your likelihood function \(\mathcal L\) and blindly combine it with my personal beliefs using Bayes’ rule. The question is, can you use \(\mathcal L\) to manipulate my beliefs? The answer is clearly “yes” if you’re willing to lie about what data you saw. But what if you’re honestly reporting all the data you actually saw? Then can you manipulate my beliefs, perhaps by being strategic about what data you look at and how long you look at it?

Clearly, the answer to that question is “sort of.” If you have a fair coin, and you want to convince me it’s biased, and you toss it 10 times, and it (by sheer luck) comes up HHHHHHHHHH, then that’s a lot of evidence in favor of it being biased. But you can’t use the “hope the coin comes up heads 10 times in a row by sheer luck” strategy to reliably bias my beliefs; and if you try just flipping the coin 10 times and hoping to get lucky, then on average, you’re going to produce data that convinces me that the coin is fair. The real question is, can you bias my beliefs in expectation?

If the answer is “yes,” then there will be times when I should ignore \(\mathcal L\) even if you honestly reported what you saw. If the answer is “no,” then there will be no such times — for every \(e\) that would shift my beliefs heavily towards \(H\) (such that you could say “Aha! How naive! If you look at this data and see it is \(e\), then you will believe \(H\), just as I intended”), there is an equal and opposite chance of alternative data which would push my beliefs away from \(H.\) So, can you set up a data collection mechanism that pushes me towards \(H\) in expectation?

And the answer to that question is no, and this is a trivial theorem of probability theory. No matter what subjective belief state \(\mathbb P\) I use, if you honestly report the objective likelihood \(\mathcal L\) of the data you actually saw, and I update \(\mathbb P\) by multiplying it by \(\mathcal L\), there is no way (according to \(\mathbb P\)) for you to bias my probability of \(H\) on average — no matter how strategically you decide which data to look at or how long to look. For more on this theorem and its implications, see Conservation of Expected Evidence.

There’s a difference between metrics that can’t be exploited in theory and metrics that can’t be exploited in practice, and if a malicious experimenter really wanted to abuse likelihood functions, they could probably find some clever method. (At the least, they can always lie and make things up.) However, p-values aren’t even provably inexploitable — they’re so easy to exploit that sometimes well-meaning honest researchers exploit them by accident, and these exploits are already commonplace and harmful. When building better metrics, starting with metrics that are provably inexploitable is a good start.

What if you pick the wrong hypothesis class?

If you don’t report likelihoods for the hypotheses that someone cares about, then that person won’t find your likelihood function very helpful. The same problem exists when you report p-values (what if you pick the wrong null and alternative hypotheses?). Likelihood functions make the problem a little better, by making it easy to report how well the data supports a wide variety of hypotheses (instead of just ~2), but at the end of the day, there’s no substitute for the raw data.

Likelihoods are a summary of the data you saw. They’re a useful summary, especially if you report likelihoods for a broad set of plausible hypotheses. They’re a much better summary than many other alternatives, such as p-values. But they’re still a summary, and there’s just no substitute for the raw data.

How does reporting likelihoods help prevent publication bias?

When you’re reporting p-values, there’s a stark difference between p-values that favor the null hypothesis (which are deemed “insignificant”) and p-values that reject the null hypothesis (which are deemed “significant”). This “significance” occurs at arbitrary thresholds (e.g. p < 0.05), and significance is counted only in one direction (to be significant, you must reject the null hypothesis). Both these features contribute to publication bias: Journals only want to accept experiments that claim “significance” and reject the null hypothesis.

When you’re reporting likelihood functions, a 20 : 1 ratio is a 20 : 1 ratio is a 20 : 1 ratio. It doesn’t matter if your likelihood function is peaked near “the coin is fair” or whether it’s peaked near “the coin is biased 82% towards heads.” If the ratio between the likelihood of one hypothesis and the likelihood of another hypothesis is 20 : 1 then the data provides the same strength of evidence either way. Likelihood functions don’t single out one “null” hypothesis and incentivize people to only report data that pushes away from that null hypothesis; they just talk about the relationship between the data and all the interesting hypotheses.

Furthermore, there’s no arbitrary significance threshold for likelihood functions. If you didn’t have a ton of data, your likelihood function will be pretty spread out, but it won’t be useless. If you find \(5 : 1\) odds in favor of \(H_1\) over \(H_2\), and I independently find \(6 : 1\) odds in favor of \(H_1\) over \(H_2\), and our friend independently finds \(3 : 1\) odds in favor of \(H_1\) over \(H_2,\) then our studies as a whole constitute evidence that favors \(H_1\) over \(H_2\) by a factor of \(90 : 1\) — hardly insignificant! With likelihood ratios (and no arbitrary “significance” cutoffs) progress can be made in small steps.

Of course, this wouldn’t solve the problem of publication bias in full, not by a long shot. There would still be incentives to report cool and interesting results, and the scientific community might still ask for results to pass some sort of “significance” threshold before accepting them for publication. However, reporting likelihoods would be a good start.

How does reporting likelihoods help address vanishing effect sizes?

In a field where an effect does not actually exist, we will often observe an initial study that finds a very large effect, followed by a number of attempts at replication that find smaller and smaller and smaller effects (until someone postulates that the effect doesn’t exist, and does a meta-analysis to look for p-hacking and publication bias). This is known as the decline effect; see also The control group is out of control.

The decline effect is possible in part because p-values look only at whether the evidence says we should “accept” or “reject” a special null hypothesis, without any consideration for what that evidence says about the alternative hypotheses. Let’s say we have three studies, all of which reject the null hypothesis “the coin is fair.” The first study rejects the null hypothesis with a 95% confidence interval on the bias centered around 0.9 in favor of heads, but it was a small study and some of the experimenters were a bit sloppy. The second study is a bit bigger and a bit better organized, and rejects the null hypothesis with a 95% confidence interval centered around a bias of 0.62. The third study is high-powered, long-running, and rejects the null hypothesis with a 95% confidence interval centered around a bias of 0.511. It’s easy to say “look, three separate studies rejected the null hypothesis!”

But if you look at the likelihood functions, you’ll see that something very fishy is going on — none of the studies actually agree with each other! The effect sizes are incompatible. Likelihood functions make this phenomenon easy to detect, because they tell you how much the data supports all the relevant hypotheses (not just the null hypothesis). If you combine the three likelihood functions, you’ll see that none of the confidence intervals fare very well. Likelihood functions make it obvious when different studies contradict each other directly, which makes it much harder to summarize contradictory data down to “three studies rejected the null hypothesis”.
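
To make this concrete, here is a sketch with invented coin-flip counts chosen so that the three observed frequencies roughly match the studies described above. Multiplying the studies’ (relative) likelihood functions together shows that no single bias explains all three data sets:

```python
import math

# Invented (heads, total flips) counts with observed frequencies near 0.9, 0.62, and 0.511.
studies = [(18, 20), (124, 200), (2555, 5000)]
grid = [i / 1000 for i in range(1, 1000)]        # bias hypotheses 0.001 ... 0.999

def log_likelihood(heads, flips, b):
    return heads * math.log(b) + (flips - heads) * math.log(1 - b)

# For each study: how well does each bias do, relative to that study's own best-fit bias?
relative = []
for heads, flips in studies:
    lls = [log_likelihood(heads, flips, b) for b in grid]
    best = max(lls)
    relative.append([ll - best for ll in lls])

# Combining studies = multiplying likelihoods = adding log-likelihoods.
combined = [sum(vals) for vals in zip(*relative)]
print("best combined relative likelihood:", math.exp(max(combined)))
# With these invented counts this comes out around 1e-5: the best single bias fits the
# pooled data about 100,000 times worse than letting each study keep its own best-fit
# bias, i.e., the three studies directly contradict one another.
```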

What if I want to reject the null hypothesis without needing to have any particular alternative in mind?

Maybe you don’t want to report likelihoods for a large hypothesis class, because you are pretty sure you can’t generate a hypothesis class that contains the correct hypothesis. “I don’t want to have to make up a bunch of alternatives,” you protest, “I just want to show that the null hypothesis is wrong, in isolation.”

Fortunately for you, that’s possible using likelihood functions! The tool you’re looking for is the notion of strict confusion. A hypothesis \(H\) tells you how low its likelihood is expected to get on the data (if \(H\) is in fact true), and if the actual likelihood comes in far below that value, then you can be pretty confident that you’ve got the wrong hypothesis.

For example, let’s say that your one and only hypothesis is \(H_{0.9}\) = “the coin is biased 90% towards heads.” Now let’s say you flip the coin twenty times, and you see the sequence THTTHTTTHTTTTHTTTTTH. The log-likelihood that \(H_{0.9}\) expected to get on a sequence of 20 coin tosses was about −9.37 bits, for a likelihood score of about \(2^{-9.37} \approx 1.5 \cdot 10^{-3}\) on average. (According to \(H_{0.9},\) each coin toss is expected to carry \(0.9 \log_2(0.9) + 0.1 \log_2(0.1) \approx -0.469\) bits of log-likelihood, so after 20 coin tosses, \(H_{0.9}\) expects about \(20 \cdot 0.469 \approx 9.37\) bits of surprise. For more on why log-likelihood is a convenient tool for measuring “evidence” and “surprise,” see Bayes’ rule: log odds form.) The log-likelihood that \(H_{0.9}\) actually gets on that sequence is about −50.59 bits, for a likelihood score of about \(5.9 \cdot 10^{-16},\) which is about thirteen orders of magnitude less likely than expected. You don’t need to be clever enough to come up with an alternative hypothesis that explains the data in order to know that \(H_{0.9}\) is not the right hypothesis for you.
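
Here is a short script that reproduces those numbers:

```python
import math

sequence = "THTTHTTTHTTTTHTTTTTH"
heads = sequence.count("H")                       # 5 heads
tails = sequence.count("T")                       # 15 tails

# Expected log-likelihood per toss under H_0.9, if H_0.9 were actually true:
per_toss = 0.9 * math.log2(0.9) + 0.1 * math.log2(0.1)       # about -0.469 bits
print("expected log-likelihood:", 20 * per_toss)              # roughly -9.37 bits, as above

# Actual log-likelihood of the observed sequence under H_0.9:
actual = heads * math.log2(0.9) + tails * math.log2(0.1)
print("actual log-likelihood:", actual)                        # about -50.59 bits
print("actual likelihood score:", 2 ** actual)                 # about 5.9e-16
```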

In fact, likelihood functions make it easy to show that lots of different hypotheses are strictly confused — you don’t need to have a good hypothesis in your hypothesis class in order for reporting likelihood functions to be a useful service.

How does reporting likelihoods make it easier to combine multiple studies?

Want to combine two studies that reported likelihood functions? Easy! Just multiply the likelihood functions together. If the first study reported 10 : 1 odds in favor of “fair coin” over “biased 55% towards heads,” and the second study reported 12 : 1 odds in favor of “fair coin” over “biased 55% towards heads,” then the combined studies support the “fair coin” hypothesis over the “biased 55% towards heads” hypothesis at a likelihood ratio of 120 : 1.

Is it really that easy? Yes! That’s one of the benefits of using a representation of evidence backed by a large edifice of probability theory: likelihood functions are trivially easy to compose. You do have to ensure that the studies are independent first, because otherwise you’ll double-count the data. (If the combined likelihood ratios get really extreme, you should be suspicious about whether the studies were actually independent.) This isn’t exactly a new problem in experimental science; we can just add it to the list of reasons why replication studies had better be independent of the original study. Also, you can only multiply the likelihood functions together in places where both are defined: if one study doesn’t report the likelihood for a hypothesis that you care about, you might need access to the raw data in order to extend its likelihood function. But if the studies are independent and both report likelihood functions for the relevant hypotheses, then all you need to do is multiply.
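
As a sketch (with invented coin-flip counts standing in for two studies’ raw data), combining studies really is just pointwise multiplication over the shared grid of hypotheses, and the likelihood ratios compose accordingly:

```python
# Two hypothetical studies of the same coin, each reporting a likelihood function
# over the same grid of bias hypotheses.
grid = [i / 100 for i in range(1, 100)]                    # biases 0.01 ... 0.99

def coin_likelihoods(heads, tails):
    return [b**heads * (1 - b)**tails for b in grid]

study_1 = coin_likelihoods(heads=60, tails=40)             # invented counts
study_2 = coin_likelihoods(heads=130, tails=120)           # invented counts
combined = [l1 * l2 for l1, l2 in zip(study_1, study_2)]   # just multiply pointwise

def ratio(likelihoods, b1, b2):
    """Likelihood ratio between two bias hypotheses on the grid."""
    return likelihoods[grid.index(b1)] / likelihoods[grid.index(b2)]

r1 = ratio(study_1, 0.55, 0.5)
r2 = ratio(study_2, 0.55, 0.5)
print(r1 * r2)                        # product of the individual ratios ...
print(ratio(combined, 0.55, 0.5))     # ... equals the ratio of the combined study
```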

(Don’t try this with p-values. A p < 0.05 study and a p < 0.01 study don’t combine into anything remotely like a p < 0.0005 study.)

How does reporting likelihoods make it easier to conduct meta-analyses?

When studies report p-values, performing a meta-analysis is a complicated procedure that requires dozens of parameters to be finely tuned, and (lo and behold) bias somehow seeps in, and meta-analyses often find whatever the analyzer set out to find. When studies report likelihood functions, performing a meta-analysis is trivial and doesn’t depend on you to tune a dozen parameters. Just multiply all the likelihood functions together.

If you want to be extra virtuous, you can check for anomalies, such as one likelihood function that’s tightly peaked in a place that disagrees with all the other peaks. You can also check for strict confusion, to get a sense for how likely it is that the correct hypothesis is contained within the hypothesis class that you considered. But mostly, all you’ve got to do is multiply the likelihood functions together.

How does reporting likelihood functions make it easier to detect fishy studies?

With likelihood functions, it’s much easier to find the studies that don’t match up with each other — look for the likelihood function that has its peak in a different place than all the other peaks. That study deserves scrutiny: either those experimenters had something special going on in the background of their experiment, or something strange happened in their data collection and reporting process.

Furthermore, likelihoods combined with the notion of strict confusion make it easy to notice when something has gone seriously wrong. As per the above answers, you can combine multiple studies by multiplying their likelihood functions together. What happens if the likelihood function is super small everywhere? That means that either (a) some of the data is fishy, or (b) you haven’t considered the right hypothesis yet.

When you have considered the right hypothesis, it will have decently high likelihood under all the data. There’s only one real world underlying all our data, after all — it’s not like different experimenters are measuring different underlying universes. If you multiply all the likelihood functions together and all the hypotheses turn out looking wildly unlikely, then you’ve got some work to do — you haven’t yet considered the right hypothesis.

When reporting p-values, contradictory studies feel like the norm. Nobody even tries to make all the studies fit together, as if they were all measuring the same world. With likelihood functions, we could actually aspire towards a world where scientific studies on the same topic are all combined. A world where people try to find hypotheses that fit all the data at once, and where a single study’s data being out of place (and making all the hypotheses currently under consideration become strictly confused) is a big glaring “look over here!” signal. A world where it feels like studies are supposed to fit together, where if scientists haven’t been able to find a hypothesis that explains all the raw data, then they know they have their work cut out for them.

Whatever the right hypothesis is, it will almost surely not be strictly confused under the actual data. Of course, when you come up with a completely new hypothesis (such as “the coin most of us have been using is fair but study #317 accidentally used a different coin”) you’re going to need access to the raw data of some of the previous studies in order to extend their likelihood functions and see how well they do on this new hypothesis. As always, there’s just no substitute for raw data.

Why would this make statistics easier to do and understand?

p < 0.05 does not mean “the null hypothesis is less than 5% likely” (though that’s what young students of statistics often want it to mean). What it actually means is “given a particular experimental design (e.g., toss the coin 100 times and count the heads) and the data (e.g., the sequence of 100 coin flips), if the null hypothesis were true, then data that scores at least as extreme as the actual data on my chosen statistic (e.g., the number of heads) would only occur 5% of the time, if we repeated this experiment over and over and over.”

Why the complexity? Statistics is designed to keep subjective beliefs out of the hallowed halls of science. Your science paper shouldn’t be able to conclude “and, therefore, I personally believe that the coin is very likely to be biased, and I’d bet on that at 20 : 1 odds.” Still, much of this complexity is unnecessary. Likelihood functions achieve the same goal of objectivity, but without all the complexity.

\(\mathcal L_e(H) < 0.05\) also doesn’t mean “\(H\) is less than 5% likely”; it means “\(H\) assigned less than 0.05 probability to \(e\) happening.” The student still needs to learn to keep “probability of \(e\) given \(H\)” and “probability of \(H\) given \(e\)” distinctly separate in their heads. However, likelihood functions do have a simpler interpretation: \(\mathcal L_e(H)\) is the probability of the actual data \(e\) occurring if \(H\) were in fact true. No need to talk about experimental design, no need to choose a summary statistic, no need to talk about what “would have happened.” Just look at how much probability each hypothesis assigned to the actual data; that’s your likelihood function.

If you’re going to report p-values, you need to be meticulous in considering the complexities and subtleties of experiment design, on pain of creating p-values that are broken in non-obvious ways (thereby contributing to the replication crisis). When reading results, you need to take the experimenter’s intentions into account. None of this is necessary with likelihoods.

To understand \(\mathcal L_e(H),\) all you need to know is how likely \(e\) was according to \(H.\) Done.

Isn’t this just one additional possible tool in the toolbox? Why switch entirely away from p-values?

This may all sound too good to be true. Can one simple change really solve that many problems in modern science?

First of all, you can be assured that reporting likelihoods instead of p-values would not “solve” all the problems above, and it would surely not solve all problems with modern experimental science. Open access to raw data, preregistration of studies, a culture that rewards replication, and many other ideas are also crucial ingredients to a scientific community that zeroes in on truth.

However, reporting likelihoods would help solve lots of different problems in modern experimental science. This may come as a surprise. Aren’t likelihood functions just one more statistical technique, just another tool for the toolbox? Why should we think that one single tool can solve that many problems?

The reason lies in probability theory. According to the axioms of probability theory, there is only one good way to account for evidence when updating your beliefs, and that way is via likelihood functions. Any other method is subject to inconsistencies and pathologies, as per the coherence theorems of probability theory.

If you’re manipulating equations like \(2 + 2 = 4,\) and you’re using methods that may or may not let you throw in an extra 3 on the right hand side (depending on the arithmetician’s state of mind), then it’s no surprise that you’ll occasionally get yourself into trouble and deduce that \(2 + 2 = 7.\) The laws of arithmetic show that there is only one correct set of tools for manipulating equations if you want to avoid inconsistency.

Similarly, the laws of probability theory show that there is only one correct set of tools for manipulating uncertainty if you want to avoid inconsistency. According to those rules, the right way to represent evidence is through likelihood functions.

These laws (and a solid understanding of them) are younger than the experimental science community, and the statistical tools of that community predate a modern understanding of probability theory. Thus, it makes a lot of sense that the existing literature uses different tools. However, now that humanity does possess a solid understanding of probability theory, it should come as no surprise that many diverse pathologies in statistics can be cleaned up by switching to a policy of reporting likelihoods instead of p-values.

If it’s so great why aren’t we doing it already?

Probability theory (and a solid understanding of all that it implies) is younger than the experimental science community, and the statistical tools of that community predate a modern understanding of probability theory. In particular, modern statistical tools were designed in an attempt to keep subjective reasoning out of the hallowed halls of science. You shouldn’t be able to publish a scientific paper which concludes “and therefore, I personally believe that this coin is biased towards heads, and would bet on that at 20 : 1 odds.” Those aren’t the foundations upon which science can be built.

Likelihood functions are strongly associated with Bayesian statistics, and Bayesian statistical tools tend to manipulate subjective probabilities. Thus, it wasn’t entirely clear how to use tools such as likelihood functions without letting subjectivity bleed into science.

Nowadays, we have a better understanding of how to separate out subjective probabilities from objective claims, and it’s known that likelihood functions don’t carry any subjective baggage with them. In fact, they carry less subjective baggage than p-values do: A likelihood function depends only on the data that you actually saw, whereas p-values depend on your experimental design and your intentions.

There are good historical reasons why the existing scientific community is using p-values, but now that humanity does possess a solid theoretical understanding of probability theory (and how to factor subjective probabilities out from objective claims), it’s no surprise that a wide array of diverse problems in modern statistics can be cleaned up by reporting likelihoods instead of p-values.

Has this ever been tried?

No. Not yet. To our knowledge, most scientists haven’t even considered this proposal — and for good reason! There are a lot of big fish to fry when it comes to addressing the replication crisis, p-hacking, the problem of vanishing effect sizes, publication bias, and other problems facing science today. The scientific community at large is huge, decentralized, and has a lot of inertia. Most activists who are trying to shift it already have their hands full advocating for very important policies such as open access journals and pre-registration of trials. So it makes sense that nobody’s advocating hard for reporting likelihoods instead of p-values — yet.

Nevertheless, there are good reasons to believe that reporting likelihoods instead of p-values would help solve many of the issues in modern experimental science.