Report likelihoods not p-values: FAQ

This page an­swers fre­quently asked ques­tions about the Re­port like­li­hoods, not p-val­ues pro­posal for ex­per­i­men­tal sci­ence.

(Note: This page is a per­sonal opinion page.)

What does this pro­posal en­tail?

Let’s say you have a coin and you don’t know whether it’s bi­ased or not. You flip it six times and it comes up HHHHHT.

To report a p-value, you have to first declare which experiment you were doing: were you flipping it six times no matter what you saw and counting the number of heads, or were you flipping it until it came up tails and seeing how long it took? Then you have to declare a “null hypothesis,” such as “the coin is fair.” Only then can you get a p-value, which in this case is either 0.11 (if you were going to toss the coin 6 times regardless) or 0.03 (if you were going to toss until it came up tails). The p-value of 0.11 means “if the null hypothesis were true, then data that has at least as many H values as the observed data would only occur 11% of the time, if the declared experiment were repeated many many times.”
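As a rough check on those two numbers, here is a minimal Python sketch. It assumes both p-values are one-sided (results at least as head-heavy as the one observed), and the function names are made up for this illustration.

```python
from fractions import Fraction
from math import comb


def p_value_fixed_flips(n_flips: int, n_heads: int) -> Fraction:
    """P(at least n_heads heads in n_flips tosses) if the coin is fair."""
    favorable = sum(comb(n_flips, k) for k in range(n_heads, n_flips + 1))
    return Fraction(favorable, 2 ** n_flips)


def p_value_flip_until_tail(n_flips: int) -> Fraction:
    """P(the first tail shows up on toss n_flips or later) if the coin is fair."""
    # Needing at least n_flips tosses means the first n_flips - 1 were all heads.
    return Fraction(1, 2 ** (n_flips - 1))


data = "HHHHHT"
p_fixed = p_value_fixed_flips(len(data), data.count("H"))
p_until_tail = p_value_flip_until_tail(len(data))

print(f"flip six times regardless: p = {float(p_fixed):.3f}")
print(f"flip until the first tail: p = {float(p_until_tail):.3f}")
```

The same six tosses go into both functions; the only thing that differs is the declared experiment.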

To re­port a like­li­hood, you don’t have to do any of that “de­clare your ex­per­i­ment” stuff, and you don’t have to sin­gle out one spe­cial hy­poth­e­sis. You just pick a whole bunch of hy­pothe­ses that seem plau­si­ble, such as the set of hy­pothe­ses \(H_b\) = “the coin has a bias of \(b\) to­wards heads” for \(b\) be­tween 0% and 100%. Then you look at the ac­tual data, and re­port how likely that data is ac­cord­ing to each hy­poth­e­sis. In this ex­am­ple, that yields a graph which looks like this:


This graph says that HHHHHT is about 1.56% likely under the hypothesis \(H_{0.5}\) saying that the coin is fair, about 5.93% likely under the hypothesis \(H_{0.75}\) that the coin comes up heads 75% of the time, and only 0.17% likely under the hypothesis \(H_{0.3}\) that the coin comes up heads only 30% of the time.

That’s all you have to do. You don’t need to make any ar­bi­trary choice about which ex­per­i­ment you were go­ing to run. You don’t need to ask your­self what you “would have seen” in other cases. You just look at the ac­tual data, and re­port how likely each hy­poth­e­sis in your hy­poth­e­sis class said that data should be.

(If you want to com­pare how well the ev­i­dence sup­ports one hy­poth­e­sis or an­other, you just use the graph to get a like­li­hood ra­tio be­tween any two hy­pothe­ses. For ex­am­ple, this graph re­ports that the data HHHHHT sup­ports the hy­poth­e­sis \(H_{0.75}\) over \(H_{0.5}\) at odds of \(\frac{0.0593}{0.0156}\) = 3.8 to 1.)
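The likelihoods and the ratio quoted above can be reproduced in a few lines. This is a minimal Python sketch; the `likelihood` helper is a name chosen for this example, and it assumes independent tosses with a fixed probability of heads.

```python
def likelihood(sequence: str, bias: float) -> float:
    """Probability of this exact H/T sequence if P(heads) = bias on every toss."""
    heads = sequence.count("H")
    tails = sequence.count("T")
    return bias ** heads * (1 - bias) ** tails


data = "HHHHHT"
for b in (0.3, 0.5, 0.75):
    print(f"bias {b}: likelihood of {data} = {likelihood(data, b):.4f}")

# Likelihood ratio between two hypotheses on the same data.
ratio = likelihood(data, 0.75) / likelihood(data, 0.5)
print(f"H_0.75 over H_0.5: {ratio:.1f} to 1")
```

Note that nothing in the function asks how or why the experimenter decided to stop after six tosses.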

For more of an ex­pla­na­tion, see Re­port like­li­hoods, not p-val­ues.

Why would re­port­ing like­li­hoods be a good idea?

Ex­per­i­men­tal sci­en­tists re­port­ing like­li­hoods in­stead of p-val­ues would likely help ad­dress many prob­lems fac­ing mod­ern sci­ence, in­clud­ing p-hack­ing, the van­ish­ing effect size prob­lem, and pub­li­ca­tion bias.

It would also make it eas­ier for sci­en­tists to com­bine the re­sults from mul­ti­ple stud­ies, and it would make it much much eas­ier to con­duct meta-analy­ses.

It would also make sci­en­tific statis­tics more in­tu­itive, and eas­ier to un­der­stand.

Like­li­hood func­tions are a Bayesian tool. Aren’t Bayesian statis­tics sub­jec­tive? Shouldn’t sci­ence be ob­jec­tive?

Like­li­hood func­tions are purely ob­jec­tive. In fact, there’s only one de­gree of free­dom in a like­li­hood func­tion, and that’s the choice of hy­poth­e­sis class. This choice is no more ar­bi­trary than the choice of a “null hy­poth­e­sis” in stan­dard statis­tics, and in­deed, it’s sig­nifi­cantly less ar­bi­trary (you can pick a large class of hy­pothe­ses, rather than just one; and none of them needs to be sin­gled out as sub­jec­tively “spe­cial”).

This is in stark con­trast with p-val­ues, which re­quire that you pick an “ex­per­i­men­tal de­sign” in ad­vance, or that you talk about what data you “could have seen” if the ex­per­i­ment turned out differ­ently. Like­li­hood func­tions only de­pend on the hy­poth­e­sis class that you’re con­sid­er­ing, and the data that you ac­tu­ally saw. (This is one of the rea­sons why like­li­hood func­tions would solve p-hack­ing.)

Like­li­hood func­tions are of­ten used by Bayesian statis­ti­ci­ans, and Bayesian statis­ti­ci­ans do in­deed use sub­jec­tive prob­a­bil­ities, which has led some peo­ple to be­lieve that re­port­ing like­li­hood func­tions would some­how al­low hated sub­jec­tivity to seep into the hal­lowed halls of sci­ence.

How­ever, it’s the pri­ors that are sub­jec­tive in Bayesian statis­tics, not like­li­hood func­tions. In fact, ac­cord­ing to the laws of prob­a­bil­ity the­ory, like­li­hood func­tions are pre­cisely that-which-is-left-over when you fac­tor out all sub­jec­tive be­liefs from an ob­ser­va­tion of ev­i­dence. In other words, prob­a­bil­ity the­ory tells us that like­li­hoods are the best sum­mary there is for cap­tur­ing the ob­jec­tive ev­i­dence that a piece of data pro­vides (as­sum­ing your goal is to help make peo­ple’s be­liefs more ac­cu­rate).

How would re­port­ing like­li­hoods solve p-hack­ing?

P-values depend on what experiment the experimenter says they had in mind. For example, if the data is HHHHHT and the experimenter says “I was planning to flip it six times and count the number of Hs” then the p-value (for the fair coin hypothesis) is 0.11, which is not “significant.” If instead the experimenter says “I was planning to flip it until I got a T” then the p-value is 0.03, which is “significant.” Experimenters can (and do!) misuse or abuse this degree of freedom to make their results appear more significant than they actually are. This is known as “p-hacking.”

In fact, when run­ning com­pli­cated ex­per­i­ments, this can (and does!) hap­pen to hon­est well-mean­ing re­searchers. Some ex­per­i­menters are dishon­est, and many oth­ers sim­ply lack the time and pa­tience to un­der­stand the sub­tleties of good ex­per­i­men­tal de­sign. We don’t need to put that bur­den on ex­per­i­menters. We don’t need to use statis­ti­cal tools that de­pend on which ex­per­i­ment the ex­per­i­menter had in mind. We can in­stead re­port the like­li­hood that each hy­poth­e­sis as­signed to the ac­tual data.

Like­li­hood func­tions don’t have this “ex­per­i­ment” de­gree of free­dom. They don’t care what ex­per­i­ment you thought you were do­ing. They only care about the data you ac­tu­ally saw. To use like­li­hood func­tions cor­rectly, all you have to do is look at stuff and then not lie about what you saw. Given the set of hy­pothe­ses you want to re­port like­li­hoods for, the like­li­hood func­tion is com­pletely de­ter­mined by the data.

But what if the ex­per­i­menter tries to game the rules by choos­ing how much data to col­lect?

That’s a prob­lem if you’re re­port­ing p-val­ues, but it’s not a prob­lem if you’re re­port­ing like­li­hood func­tions.

Let’s say there’s a coin that you think is fair, that I think might be bi­ased 55% to­wards heads. If you’re right, then ev­ery toss is go­ing to (in ex­pected value) provide more ev­i­dence for “fair” than “bi­ased.” But some­times (rarely), even if the coin is fair, you will flip it and it will gen­er­ate a se­quence that sup­ports the “bias” hy­poth­e­sis more than the “fair” hy­poth­e­sis.

How of­ten will this hap­pen? It de­pends on how ex­actly you ask the ques­tion. If you can flip the coin at most 300 times, then there’s about a 1.4% chance that at some point the se­quence gen­er­ated will sup­port the hy­poth­e­sis “the coin is bi­ased 55% to­wards heads” 20x more than it sup­ports the hy­poth­e­sis “the coin is fair.” (You can ver­ify this your­self, and tweak the pa­ram­e­ters, us­ing this code.)
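A figure like this can be estimated by simulation. The sketch below is not the code linked above; it is a minimal Python illustration with hypothetical helper names (`hits_threshold`, `estimate_hit_rate`) and an arbitrary seed and trial count, so its estimate will only approximate the quoted percentage.

```python
import math
import random


def hits_threshold(max_flips: int, alt_bias: float, threshold: float,
                   rng: random.Random) -> bool:
    """Flip a fair coin up to max_flips times and report whether the likelihood
    ratio (alt_bias hypothesis over fair) ever reaches threshold on the way."""
    log_head = math.log(alt_bias / 0.5)        # log-ratio contributed by a head
    log_tail = math.log((1 - alt_bias) / 0.5)  # log-ratio contributed by a tail
    log_target = math.log(threshold)
    log_ratio = 0.0
    for _ in range(max_flips):
        log_ratio += log_head if rng.random() < 0.5 else log_tail
        if log_ratio >= log_target:
            return True  # an experimenter hunting for "bias" would stop here
    return False


def estimate_hit_rate(trials: int = 5000, max_flips: int = 300,
                      alt_bias: float = 0.55, threshold: float = 20.0,
                      seed: int = 0) -> float:
    """Monte Carlo estimate of how often a fair coin ever looks 20x 'biased'."""
    rng = random.Random(seed)
    hits = sum(hits_threshold(max_flips, alt_bias, threshold, rng)
               for _ in range(trials))
    return hits / trials


rate = estimate_hit_rate()
print(f"fraction of fair-coin runs that ever reach 20 : 1 for 'biased': {rate:.3f}")
```

However the stopping rule is chosen, this fraction cannot exceed 1 in 20: under the fair-coin hypothesis the likelihood ratio starts at 1 and has expected value 1 at every step, so it reaches 20 in at most a twentieth of runs.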

This is an objective fact about coin tosses. If you look at a sequence of Hs and Ts generated by a fair coin, then some tiny fraction of the time, after some number \(n\) of flips, it will support the “biased 55% towards heads” hypothesis 20x more than it supports the “fair” hypothesis. This is true no matter how or why you decided to look at those \(n\) coin flips. It’s true if you were always planning to look at \(n\) coin flips since the day you were born. It’s true if each coin flip costs $1 to look at, so you decided to only look until the evidence supported one hypothesis at least 20x better than the other. It’s true if you have a heavy personal desire to see the coin come up biased, and were planning to keep flipping until the evidence supports “bias” 20x more than it supports “fair”. It doesn’t matter why you looked at the sequence of Hs and Ts. The amount by which it supports “biased” vs “fair” is objective. If the coin really is fair, then the more you flip it the more the evidence will push towards “fair.” It will only support “bias” a small unlucky fraction of the time, and that fraction is completely independent from your thoughts and intentions.

Likelihoods are objective. They don’t depend on your state of mind.

P-values, on the other hand, run into some difficulties. A p-value is about a single hypothesis (such as “fair”) in isolation. If the coin is fair, then all sequences of coin tosses are equally likely, so you need something more than the data in order to decide whether the data is “significant evidence” about fairness one way or the other. Which means you have to choose a “reference class” of ways the coin “could have come up.” Which means you need to tell us which experiment you “intended” to run. And down the rabbit hole we go.

The p-value you report depends on how many coin tosses you say you were going to look at. If you lie about where you intended to stop, the p-value breaks.
If you’re out in the field collecting data, and the data just subconsciously begins to feel overwhelming, and so you stop collecting evidence (or if the data just subconsciously feels insufficient and so you collect more), then the p-value breaks.

How badly do p-values break? If you can toss the coin at most 300 times, then by choosing when to stop looking, you can get a p < 0.05 significant result 21% of the time, and that’s assuming you are required to look at at least 30 flips. If you’re allowed to use small sample sizes, the number is more like 25%. You can verify this yourself, and tweak the parameters, using this code.

It’s no wonder that p-values are so often misused! To use p-values correctly, an experimenter has to meticulously report their intentions about the experimental design before collecting data, and then has to hold unfalteringly to that experimental design as the data comes in (even if it becomes clear that their experimental design was naive, and that there were crucial considerations that they failed to take into account). Using p-values correctly requires good intentions, constant vigilance, and inflexibility.

Contrast this with likelihood functions. Likelihood functions don’t depend on your intentions. If you start collecting data until it looks overwhelming and then stop, that’s great. If you start collecting data and it looks underwhelming so you keep collecting more, that’s great too. Every new piece of data you do collect will support the true hypothesis more than any other hypothesis, in expectation; that’s the whole point of collecting data. Likelihood functions don’t depend upon your state of mind.

What if the experimenter uses some other technique to bias the result?

They can’t. Or, at least, it’s a theorem of probability theory that they can’t.
This law is known as conservation of expected evidence, and it says that for any hypothesis \(H\) and any piece of evidence \(e\), \(\mathbb P(H) = \mathbb P(H \mid e) \mathbb P(e) + \mathbb P(H \mid \lnot e) \mathbb P(\lnot e),\) where \(\mathbb P\) stands for my personal subjective probabilities.

Imagine that I’m going to take your likelihood function \(\mathcal L\) and blindly combine it with my personal beliefs using Bayes’ rule. The question is, can you use \(\mathcal L\) to manipulate my beliefs?

The answer is clearly “yes” if you’re willing to lie about what data you saw. But what if you’re honestly reporting all the data you actually saw? Then can you manipulate my beliefs, perhaps by being strategic about what data you look at and how long you look at it?

Clearly, the answer to that question is “sort of.” If you have a fair coin, and you want to convince me it’s biased, and you toss it 10 times, and it (by sheer luck) comes up HHHHHHHHHH, then that’s a lot of evidence in favor of it being biased. But you can’t use the “hope the coin comes up heads 10 times in a row by sheer luck” strategy to reliably bias my beliefs; and if you try just flipping the coin 10 times and hoping to get lucky, then on average, you’re going to produce data that convinces me that the coin is fair. The real question is, can you bias my beliefs in expectation?

If the answer is “yes,” then there will be times when I should ignore \(\mathcal L\) even if you honestly reported what you saw. If the answer is “no,” then there will be no such times: for every \(e\) that would shift my beliefs heavily towards \(H\) (such that you could say “Aha! How naive! If you look at this data and see it is \(e\), then you will believe \(H\), just as I intended”), there is an equal and opposite chance of alternative data which would push my beliefs away from \(H.\) So, can you set up a data collection mechanism that pushes me towards \(H\) in expectation?

And the answer to that question is no, and this is a trivial theorem of probability theory. No matter what subjective belief state \(\mathbb P\) I use, if you honestly report the objective likelihood \(\mathcal L\) of the data you actually saw, and I update \(\mathbb P\) by multiplying it by \(\mathcal L\), there is no way (according to \(\mathbb P\)) for you to bias my probability of \(H\) on average, no matter how strategically you decide which data to look at or how long to look. For more on this theorem and its implications, see Conservation of Expected Evidence.

There’s a difference between metrics that can’t be exploited in theory and metrics that can’t be exploited in practice, and if a malicious experimenter really wanted to abuse likelihood functions, they could probably find some clever method. (At the least, they can always lie and make things up.) However, p-values aren’t even provably inexploitable: they’re so easy to exploit that sometimes well-meaning honest researchers exploit them by accident, and these exploits are already commonplace and harmful. When building better metrics, starting with metrics that are provably inexploitable is a good start.

What if you pick the wrong hypothesis class?

If you don’t report likelihoods for the hypotheses that someone cares about, then that person won’t find your likelihood function very helpful. The same problem exists when you report p-values (what if you pick the wrong null and alternative hypotheses?).
Likelihood functions make the problem a little better, by making it easy to report how well the data supports a wide variety of hypotheses (instead of just ~2), but at the end of the day, there’s no substitute for the raw data.

Likelihoods are a summary of the data you saw. They’re a useful summary, especially if you report likelihoods for a broad set of plausible hypotheses. They’re a much better summary than many other alternatives, such as p-values. But they’re still a summary, and there’s just no substitute for the raw data.

How does reporting likelihoods help prevent publication bias?

When you’re reporting p-values, there’s a stark difference between p-values that favor the null hypothesis (which are deemed “insignificant”) and p-values that reject the null hypothesis (which are deemed “significant”). This “significance” occurs at arbitrary thresholds (e.g. p < 0.05), and significance is counted only in one direction (to be significant, you must reject the null hypothesis). Both these features contribute to publication bias: Journals only want to accept experiments that claim “significance” and reject the null hypothesis.

When you’re reporting likelihood functions, a 20 : 1 ratio is a 20 : 1 ratio is a 20 : 1 ratio. It doesn’t matter if your likelihood function is peaked near “the coin is fair” or whether it’s peaked near “the coin is biased 82% towards heads.” If the ratio between the likelihood of one hypothesis and the likelihood of another hypothesis is 20 : 1 then the data provides the same strength of evidence either way. Likelihood functions don’t single out one “null” hypothesis and incentivize people to only report data that pushes away from that null hypothesis; they just talk about the relationship between the data and all the interesting hypotheses.

Furthermore, there’s no arbitrary significance threshold for likelihood functions.
If you didn’t have a ton of data, your likelihood function will be pretty spread out, but it won’t be useless. If you find \(5 : 1\) odds in favor of \(H_1\) over \(H_2\), and I independently find \(6 : 1\) odds in favor of \(H_1\) over \(H_2\), and our friend independently finds \(3 : 1\) odds in favor of \(H_1\) over \(H_2,\) then our studies as a whole constitute evidence that favors \(H_1\) over \(H_2\) by a factor of \(90 : 1\), which is hardly insignificant! With likelihood ratios (and no arbitrary “significance” cutoffs) progress can be made in small steps.

Of course, this wouldn’t solve the problem of publication bias in full, not by a long shot. There would still be incentives to report cool and interesting results, and the scientific community might still ask for results to pass some sort of “significance” threshold before accepting them for publication. However, reporting likelihoods would be a good start.

How does reporting likelihoods help address vanishing effect sizes?

In a field where an effect does not actually exist, we will often observe an initial study that finds a very large effect, followed by a number of attempts at replication that find smaller and smaller and smaller effects (until someone postulates that the effect doesn’t exist, and does a meta-analysis to look for p-hacking and publication bias). This is known as the decline effect; see also The control group is out of control.

The decline effect is possible in part because p-values look only at whether the evidence says we should “accept” or “reject” a special null hypothesis, without any consideration for what that evidence says about the alternative hypotheses.
Let’s say we have three studies, all of which reject the null hypothesis “the coin is fair.” The first study rejects the null hypothesis with a 95% confidence interval of [0.7, 0.9] bias in favor of heads, but it was a small study and some of the experimenters were a bit sloppy. The second study is a bit bigger and a bit better organized, and rejects the null hypothesis with a 95% confidence interval of [0.53, 0.62]. The third study is high-powered, long-running, and rejects the null hypothesis with a 95% confidence interval of [0.503, 0.511]. It’s easy to say “look, three separate studies rejected the null hypothesis!”

But if you look at the likelihood functions, you’ll see that something very fishy is going on: none of the studies actually agree with each other! The effect sizes are incompatible.

Likelihood functions make this phenomenon easy to detect, because they tell you how much the data supports all the relevant hypotheses (not just the null hypothesis). If you combine the three likelihood functions, you’ll see that none of the confidence intervals fare very well. Likelihood functions make it obvious when different studies contradict each other directly, which makes it much harder to summarize contradictory data down to “three studies rejected the null hypothesis”.

What if I want to reject the null hypothesis without needing to have any particular alternative in mind?

Maybe you don’t want to report likelihoods for a large hypothesis class, because you are pretty sure you can’t generate a hypothesis class that contains the correct hypothesis. “I don’t want to have to make up a bunch of alternatives,” you protest, “I just want to show that the null hypothesis is wrong, in isolation.”

Fortunately for you, that’s possible using likelihood functions! The tool you’re looking for is the notion of strict confusion.
A hypothesis \(H\) will tell you how low its likelihood is supposed to get, and if its likelihood goes a lot lower than that value, then you can be pretty confident that you’ve got the wrong hypothesis.

For example, let’s say that your one and only hypothesis is \(H_{0.9}\) = “the coin is biased 90% towards heads.” Now let’s say you flip the coin twenty times, and you see the sequence THTTHTTTHTTTTHTTTTTH. The log-likelihood that \(H_{0.9}\) expected to get on a sequence of 20 coin tosses was about -9.38 bits, for a likelihood score of about \(2^{-9.38} \approx 1.5 \cdot 10^{-3},\) on average. (Note: According to \(H_{0.9},\) each coin toss carries \(0.9 \log_2(0.9) + 0.1 \log_2(0.1) \approx -0.469\) bits of evidence, so after 20 coin tosses, \(H_{0.9}\) expects about \(20 \cdot 0.469 \approx 9.38\) bits of surprise. For more on why log likelihood is a convenient tool for measuring “evidence” and “surprise,” see Bayes’ rule: log odds form.) The likelihood that \(H_{0.9}\) actually gets on that sequence is -50.59 bits, for a likelihood score of about \(5.9 \cdot 10^{-16},\) which is more than twelve orders of magnitude less likely than expected. You don’t need to be clever enough to come up with an alternative hypothesis that explains the data in order to know that \(H_{0.9}\) is not the right hypothesis for you.

In fact, likelihood functions make it easy to show that lots of different hypotheses are strictly confused; you don’t need to have a good hypothesis in your hypothesis class in order for reporting likelihood functions to be a useful service.

How does reporting likelihoods make it easier to combine multiple studies?

Want to combine two studies that reported likelihood functions? Easy! Just multiply the likelihood functions together.
If the first study reported 10 : 1 odds in favor of “fair coin” over “biased 55% towards heads,” and the second study reported 12 : 1 odds in favor of “fair coin” over “biased 55% towards heads,” then the combined studies support the “fair coin” hypothesis over the “biased 55% towards heads” hypothesis at a likelihood ratio of 120 : 1.

Is it really that easy? Yes! That’s one of the benefits of using a representation of evidence supported by a large edifice of probability theory: likelihoods are trivially easy to compose.

You have to ensure that the studies are independent first, because otherwise you’ll double-count the data. (If the combined likelihood ratios get really extreme, you should be suspicious about whether they were actually independent.) This isn’t exactly a new problem in experimental science; we can just add it to the list of reasons why replication studies had better be independent of the original study. Also, you can only multiply the likelihood functions together on places where they’re both defined: If one study doesn’t report the likelihood for a hypothesis that you care about, you might need access to the raw data in order to extend their likelihood function. But if the studies are independent and both report likelihood functions for the relevant hypotheses, then all you need to do is multiply.

(Don’t try this with p-values. A p < 0.05 study and a p < 0.01 study don’t combine into anything remotely like a p < 0.0005 study.)

How does reporting likelihoods make it easier to conduct meta-analyses?

When studies report p-values, performing a meta-analysis is a complicated procedure that requires dozens of parameters to be finely tuned, and (lo and behold) bias somehow seeps in, and meta-analyses often find whatever the analyzer set out to find. When studies report likelihood functions, performing a meta-analysis is trivial and doesn’t depend on you to tune a dozen parameters. Just multiply all the likelihood functions together.
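To make “just multiply” concrete, here is a minimal Python sketch. The head and tail counts for the two studies are invented for illustration, the helper names (`coin_likelihood`, `combine`) are not from any library, and the studies are assumed to be independent.

```python
from math import prod


def coin_likelihood(heads: int, tails: int, bias: float) -> float:
    """Probability of one specific sequence with these counts if P(heads) = bias."""
    return bias ** heads * (1 - bias) ** tails


def combine(*likelihood_functions: dict[float, float]) -> dict[float, float]:
    """Pointwise product, restricted to hypotheses that every study reported."""
    shared = set.intersection(*(set(f) for f in likelihood_functions))
    return {h: prod(f[h] for f in likelihood_functions) for h in sorted(shared)}


biases = [0.5, 0.55, 0.6]
# Hypothetical data: study A saw 52 heads and 48 tails, study B saw 105 and 95.
study_a = {b: coin_likelihood(52, 48, b) for b in biases}
study_b = {b: coin_likelihood(105, 95, b) for b in biases}
combined = combine(study_a, study_b)

# Pooling the raw tosses into one big study gives the same likelihood function.
pooled = {b: coin_likelihood(52 + 105, 48 + 95, b) for b in biases}

for b in biases:
    print(f"bias {b}: combined = {combined[b]:.3e}, pooled = {pooled[b]:.3e}")
print(f"fair over 55%-biased: {combined[0.5] / combined[0.55]:.1f} : 1")
```

The `combine` helper only keeps hypotheses that every study reported, which mirrors the caveat above about needing raw data to extend a likelihood function to a hypothesis a study left out.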
If you want to be extra virtuous, you can check for anomalies, such as one likelihood function that’s tightly peaked in a place that disagrees with all the other peaks. You can also check for strict confusion, to get a sense for how likely it is that the correct hypothesis is contained within the hypothesis class that you considered. But mostly, all you’ve got to do is multiply the likelihood functions together.

How does reporting likelihood functions make it easier to detect fishy studies?

With likelihood functions, it’s much easier to find the studies that don’t match up with each other: look for the likelihood function that has its peak in a different place than all the other peaks. That study deserves scrutiny: either those experimenters had something special going on in the background of their experiment, or something strange happened in their data collection and reporting process.

Furthermore, likelihoods combined with the notion of strict confusion make it easy to notice when something has gone seriously wrong. As per the above answers, you can combine multiple studies by multiplying their likelihood functions together. What happens if the likelihood function is super small everywhere? That means that either (a) some of the data is fishy, or (b) you haven’t considered the right hypothesis yet.

When you have considered the right hypothesis, it will have decently high likelihood under all the data. There’s only one real world underlying all our data, after all; it’s not like different experimenters are measuring different underlying universes. If you multiply all the likelihood functions together and all the hypotheses turn out looking wildly unlikely, then you’ve got some work to do: you haven’t yet considered the right hypothesis.

When reporting p-values, contradictory studies feel like the norm. Nobody even tries to make all the studies fit together, as if they were all measuring the same world.
With likelihood functions, we could actually aspire towards a world where scientific studies on the same topic are all combined. A world where people try to find hypotheses that fit all the data at once, and where a single study’s data being out of place (and making all the hypotheses currently under consideration become strictly confused) is a big glaring “look over here!” signal. A world where it feels like studies are supposed to fit together, where if scientists haven’t been able to find a hypothesis that explains all the raw data, then they know they have their work cut out for them.

Whatever the right hypothesis is, it will almost surely not be strictly confused under the actual data.

Of course, when you come up with a completely new hypothesis (such as “the coin most of us have been using is fair but study #317 accidentally used a different coin”) you’re going to need access to the raw data of some of the previous studies in order to extend their likelihood functions and see how well they do on this new hypothesis. As always, there’s just no substitute for raw data.

Why would this make statistics easier to do and understand?

p < 0.05 does not mean “the null hypothesis is less than 5% likely” (though that’s what young students of statistics often want it to mean). What p < 0.05 means is “given a particular experimental design (e.g., toss the coin 100 times and count the heads) and the data (e.g., the sequence of 100 coin flips), if the null hypothesis were true, then data at least as extreme as mine on my chosen statistic (e.g., the number of heads) would occur less than 5% of the time, if we repeated this experiment over and over and over.”

Why the complexity? Statistics is designed to keep subjective beliefs out of the hallowed halls of science. Your science paper shouldn’t be able to conclude “and, therefore, I personally believe that the coin is very likely to be biased, and I’d bet on that at 20 : 1 odds.” Still, much of this complexity is unnecessary.
Likelihood functions achieve the same goal of objectivity, but without all the complexity.

\(\mathcal L_e(H) < 0.05\) also doesn’t mean “\(H\) is less than 5% likely”, it means “\(H\) assigned less than 0.05 probability to \(e\) happening.” The student still needs to learn to keep “probability of \(e\) given \(H\)” and “probability of \(H\) given \(e\)” distinctly separate in their heads. However, likelihood functions do have a simpler interpretation: \(\mathcal L_e(H)\) is the probability of the actual data \(e\) occurring if \(H\) were in fact true. No need to talk about experimental design, no need to choose a summary statistic, no need to talk about what “would have happened.” Just look at how much probability each hypothesis assigned to the actual data; that’s your likelihood function.

If you’re going to report p-values, you need to be meticulous in considering the complexities and subtleties of experiment design, on pain of creating p-values that are broken in non-obvious ways (thereby contributing to the replication crisis). When reading results, you need to take the experimenter’s intentions into account. None of this is necessary with likelihoods.

To understand \(\mathcal L_e(H),\) all you need to know is how likely \(e\) was according to \(H.\) Done.

Isn’t this just one additional possible tool in the toolbox? Why switch entirely away from p-values?

This may all sound too good to be true. Can one simple change really solve that many problems in modern science?

First of all, you can be assured that reporting likelihoods instead of p-values would not “solve” all the problems above, and it would surely not solve all problems with modern experimental science. Open access to raw data, preregistration of studies, a culture that rewards replication, and many other ideas are also crucial ingredients to a scientific community that zeroes in on truth.
However, reporting likelihoods would help solve lots of different problems in modern experimental science. This may come as a surprise. Aren’t likelihood functions just one more statistical technique, just another tool for the toolbox? Why should we think that one single tool can solve that many problems?

The reason lies in probability theory. According to the axioms of probability theory, there is only one good way to account for evidence when updating your beliefs, and that way is via likelihood functions. Any other method is subject to inconsistencies and pathologies, as per the coherence theorems of probability theory.

If you’re manipulating equations like \(2 + 2 = 4,\) and you’re using methods that may or may not let you throw in an extra 3 on the right hand side (depending on the arithmetician’s state of mind), then it’s no surprise that you’ll occasionally get yourself into trouble and deduce that \(2 + 2 = 7.\) The laws of arithmetic show that there is only one correct set of tools for manipulating equations if you want to avoid inconsistency.

Similarly, the laws of prob­a­bil­ity the­ory show that there is only one cor­rect set of tools for ma­nipu­lat­ing un­cer­tainty if you want to avoid in­con­sis­tency. Ac­cord­ing to those rules, the right way to rep­re­sent ev­i­dence is through like­li­hood func­tions.

Th­ese laws (and a solid un­der­stand­ing of them) are younger than the ex­per­i­men­tal sci­ence com­mu­nity, and the statis­ti­cal tools of that com­mu­nity pre­date a mod­ern un­der­stand­ing of prob­a­bil­ity the­ory. Thus, it makes a lot of sense that the ex­ist­ing liter­a­ture uses differ­ent tools. How­ever, now that hu­man­ity does pos­sess a solid un­der­stand­ing of prob­a­bil­ity the­ory, it should come as no sur­prise that many di­verse patholo­gies in statis­tics can be cleaned up by switch­ing to a policy of re­port­ing like­li­hoods in­stead of p-val­ues.

If it’s so great why aren’t we do­ing it already?

Prob­a­bil­ity the­ory (and a solid un­der­stand­ing of all that it im­plies) is younger than the ex­per­i­men­tal sci­ence com­mu­nity, and the statis­ti­cal tools of that com­mu­nity pre­date a mod­ern un­der­stand­ing of prob­a­bil­ity the­ory. In par­tic­u­lar, mod­ern statis­ti­cal tools were de­signed in an at­tempt to keep sub­jec­tive rea­son­ing out of the hal­lowed halls of sci­ence. You shouldn’t be able to pub­lish a sci­en­tific pa­per which con­cludes “and there­fore, I per­son­ally be­lieve that this coin is bi­ased to­wards heads, and would bet on that at 20 : 1 odds.” Those aren’t the foun­da­tions upon which sci­ence can be built.

Like­li­hood func­tions are strongly as­so­ci­ated with Bayesian statis­tics, and Bayesian statis­ti­cal tools tend to ma­nipu­late sub­jec­tive prob­a­bil­ities. Thus, it wasn’t en­tirely clear how to use tools such as like­li­hood func­tions with­out let­ting sub­jec­tivity bleed into sci­ence.

Nowa­days, we have a bet­ter un­der­stand­ing of how to sep­a­rate out sub­jec­tive prob­a­bil­ities from ob­jec­tive claims, and it’s known that like­li­hood func­tions don’t carry any sub­jec­tive bag­gage with them. In fact, they carry less sub­jec­tive bag­gage than p-val­ues do: A like­li­hood func­tion de­pends only on the data that you ac­tu­ally saw, whereas p-val­ues de­pend on your ex­per­i­men­tal de­sign and your in­ten­tions.

There are good his­tor­i­cal rea­sons why the ex­ist­ing sci­en­tific com­mu­nity is us­ing p-val­ues, but now that hu­man­ity does pos­sess a solid the­o­ret­i­cal un­der­stand­ing of prob­a­bil­ity the­ory (and how to fac­tor sub­jec­tive prob­a­bil­ities out from ob­jec­tive claims), it’s no sur­prise that a wide ar­ray of di­verse prob­lems in mod­ern statis­tics can be cleaned up by re­port­ing like­li­hoods in­stead of p-val­ues.

Has this ever been tried?

No. Not yet. To our knowl­edge, most sci­en­tists haven’t even con­sid­ered this pro­posal — and for good rea­son! There are a lot of big fish to fry when it comes to ad­dress­ing the repli­ca­tion crisis, p-hack­ing, the prob­lem of van­ish­ing effect sizes, pub­li­ca­tion bias, and other prob­lems fac­ing sci­ence to­day. The sci­en­tific com­mu­nity at large is huge, de­cen­tral­ized, and has a lot of in­er­tia. Most ac­tivists who are try­ing to shift it already have their hands full ad­vo­cat­ing for very im­por­tant poli­cies such as open ac­cess jour­nals and pre-reg­is­tra­tion of tri­als. So it makes sense that no­body’s ad­vo­cat­ing hard for re­port­ing like­li­hoods in­stead of p-val­ues — yet.

Nev­er­the­less, there are good rea­sons to be­lieve that re­port­ing like­li­hoods in­stead of p-val­ues would help solve many of the is­sues in mod­ern ex­per­i­men­tal sci­ence.