Report likelihoods not p-values: FAQ

This page an­swers fre­quently asked ques­tions about the Re­port like­li­hoods, not p-val­ues pro­posal for ex­per­i­men­tal sci­ence.

(Note: This page is a per­sonal opinion page.)

What does this pro­posal en­tail?

Let’s say you have a coin and you don’t know whether it’s bi­ased or not. You flip it six times and it comes up HHHHHT.

To report a p-value, you have to first declare which experiment you were doing: were you flipping it six times no matter what you saw and counting the number of heads, or were you flipping it until it came up tails and seeing how long it took? Then you have to declare a “null hypothesis,” such as “the coin is fair.” Only then can you get a p-value, which in this case is either 0.11 (if you were going to toss the coin 6 times regardless) or 0.03 (if you were going to toss until it came up tails). The p-value of 0.11 means “if the null hypothesis were true, then data with at least as many H values as the observed data would only occur 11% of the time, if the declared experiment were repeated many many times.”
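The two p-values above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes one-sided p-values against the fair-coin null, which is the reading that matches the 0.11 and 0.03 figures.

```python
from math import comb

def p_value_fixed_flips(heads, flips):
    """P(at least `heads` heads in `flips` tosses of a fair coin)."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2**flips

def p_value_flip_until_tails(flips_needed):
    """P(needing at least `flips_needed` tosses to see the first tail).

    For a fair coin this happens exactly when the first
    `flips_needed - 1` tosses all come up heads.
    """
    return 0.5 ** (flips_needed - 1)

data = "HHHHHT"
# Same data, two declared experiments, two different p-values.
print(round(p_value_fixed_flips(data.count("H"), len(data)), 2))  # 0.11
print(round(p_value_flip_until_tails(len(data)), 2))              # 0.03
```

Note that both functions need to be told which experiment was intended; the data alone does not determine the answer.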

To re­port a like­li­hood, you don’t have to do any of that “de­clare your ex­per­i­ment” stuff, and you don’t have to sin­gle out one spe­cial hy­poth­e­sis. You just pick a whole bunch of hy­pothe­ses that seem plau­si­ble, such as the set of hy­pothe­ses \(H_b\) = “the coin has a bias of \(b\) to­wards heads” for \(b\) be­tween 0% and 100%. Then you look at the ac­tual data, and re­port how likely that data is ac­cord­ing to each hy­poth­e­sis. In this ex­am­ple, that yields a graph which looks like this:


This graph says that HHHHHT is about 1.56% likely under the hypothesis \(H_{0.5}\) saying that the coin is fair, about 5.93% likely under the hypothesis \(H_{0.75}\) that the coin comes up heads 75% of the time, and only 0.17% likely under the hypothesis \(H_{0.3}\) that the coin comes up heads only 30% of the time.

That’s all you have to do. You don’t need to make any ar­bi­trary choice about which ex­per­i­ment you were go­ing to run. You don’t need to ask your­self what you “would have seen” in other cases. You just look at the ac­tual data, and re­port how likely each hy­poth­e­sis in your hy­poth­e­sis class said that data should be.

(If you want to com­pare how well the ev­i­dence sup­ports one hy­poth­e­sis or an­other, you just use the graph to get a like­li­hood ra­tio be­tween any two hy­pothe­ses. For ex­am­ple, this graph re­ports that the data HHHHHT sup­ports the hy­poth­e­sis \(H_{0.75}\) over \(H_{0.5}\) at odds of \(\frac{0.0593}{0.0156}\) = 3.8 to 1.)
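As a rough sketch, the likelihoods and the likelihood ratio quoted above can be computed directly from the data. The helper name `likelihood` is an illustrative choice, and the code assumes independent tosses with a fixed bias \(b\) towards heads.

```python
def likelihood(data, bias):
    """Probability of the exact sequence `data` if P(heads) = bias."""
    heads = data.count("H")
    tails = data.count("T")
    return bias**heads * (1 - bias)**tails

data = "HHHHHT"
for b in (0.3, 0.5, 0.75):
    print(f"H_{b}: {likelihood(data, b):.4f}")
# H_0.3: 0.0017
# H_0.5: 0.0156
# H_0.75: 0.0593

# Likelihood ratio between two hypotheses.
ratio = likelihood(data, 0.75) / likelihood(data, 0.5)
print(f"{ratio:.1f} : 1")  # 3.8 : 1
```

Nothing in this calculation refers to the experiment the experimenter had in mind; the only inputs are the observed sequence and the hypothesis.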

For more of an ex­pla­na­tion, see Re­port like­li­hoods, not p-val­ues.

Why would re­port­ing like­li­hoods be a good idea?

Ex­per­i­men­tal sci­en­tists re­port­ing like­li­hoods in­stead of p-val­ues would likely help ad­dress many prob­lems fac­ing mod­ern sci­ence, in­clud­ing p-hack­ing, the van­ish­ing effect size prob­lem, and pub­li­ca­tion bias.

It would also make it eas­ier for sci­en­tists to com­bine the re­sults from mul­ti­ple stud­ies, and it would make it much much eas­ier to con­duct meta-analy­ses.

It would also make sci­en­tific statis­tics more in­tu­itive, and eas­ier to un­der­stand.

Like­li­hood func­tions are a Bayesian tool. Aren’t Bayesian statis­tics sub­jec­tive? Shouldn’t sci­ence be ob­jec­tive?

Like­li­hood func­tions are purely ob­jec­tive. In fact, there’s only one de­gree of free­dom in a like­li­hood func­tion, and that’s the choice of hy­poth­e­sis class. This choice is no more ar­bi­trary than the choice of a “null hy­poth­e­sis” in stan­dard statis­tics, and in­deed, it’s sig­nifi­cantly less ar­bi­trary (you can pick a large class of hy­pothe­ses, rather than just one; and none of them needs to be sin­gled out as sub­jec­tively “spe­cial”).

This is in stark con­trast with p-val­ues, which re­quire that you pick an “ex­per­i­men­tal de­sign” in ad­vance, or that you talk about what data you “could have seen” if the ex­per­i­ment turned out differ­ently. Like­li­hood func­tions only de­pend on the hy­poth­e­sis class that you’re con­sid­er­ing, and the data that you ac­tu­ally saw. (This is one of the rea­sons why like­li­hood func­tions would solve p-hack­ing.)

Like­li­hood func­tions are of­ten used by Bayesian statis­ti­ci­ans, and Bayesian statis­ti­ci­ans do in­deed use sub­jec­tive prob­a­bil­ities, which has led some peo­ple to be­lieve that re­port­ing like­li­hood func­tions would some­how al­low hated sub­jec­tivity to seep into the hal­lowed halls of sci­ence.

How­ever, it’s the pri­ors that are sub­jec­tive in Bayesian statis­tics, not like­li­hood func­tions. In fact, ac­cord­ing to the laws of prob­a­bil­ity the­ory, like­li­hood func­tions are pre­cisely that-which-is-left-over when you fac­tor out all sub­jec­tive be­liefs from an ob­ser­va­tion of ev­i­dence. In other words, prob­a­bil­ity the­ory tells us that like­li­hoods are the best sum­mary there is for cap­tur­ing the ob­jec­tive ev­i­dence that a piece of data pro­vides (as­sum­ing your goal is to help make peo­ple’s be­liefs more ac­cu­rate).

How would re­port­ing like­li­hoods solve p-hack­ing?

P-values depend on what experiment the experimenter says they had in mind. For example, if the data is HHHHHT and the experimenter says “I was planning to flip it six times and count the number of Hs” then the p-value (for the fair coin hypothesis) is 0.11, which is not “significant.” If instead the experimenter says “I was planning to flip it until I got a T” then the p-value is 0.03, which is “significant.” Experimenters can (and do!) misuse or abuse this degree of freedom to make their results appear more significant than they actually are. This is known as “p-hacking.”

In fact, when run­ning com­pli­cated ex­per­i­ments, this can (and does!) hap­pen to hon­est well-mean­ing re­searchers. Some ex­per­i­menters are dishon­est, and many oth­ers sim­ply lack the time and pa­tience to un­der­stand the sub­tleties of good ex­per­i­men­tal de­sign. We don’t need to put that bur­den on ex­per­i­menters. We don’t need to use statis­ti­cal tools that de­pend on which ex­per­i­ment the ex­per­i­menter had in mind. We can in­stead re­port the like­li­hood that each hy­poth­e­sis as­signed to the ac­tual data.

Like­li­hood func­tions don’t have this “ex­per­i­ment” de­gree of free­dom. They don’t care what ex­per­i­ment you thought you were do­ing. They only care about the data you ac­tu­ally saw. To use like­li­hood func­tions cor­rectly, all you have to do is look at stuff and then not lie about what you saw. Given the set of hy­pothe­ses you want to re­port like­li­hoods for, the like­li­hood func­tion is com­pletely de­ter­mined by the data.

But what if the ex­per­i­menter tries to game the rules by choos­ing how much data to col­lect?

That’s a prob­lem if you’re re­port­ing p-val­ues, but it’s not a prob­lem if you’re re­port­ing like­li­hood func­tions.

Let’s say there’s a coin that you think is fair, that I think might be bi­ased 55% to­wards heads. If you’re right, then ev­ery toss is go­ing to (in ex­pected value) provide more ev­i­dence for “fair” than “bi­ased.” But some­times (rarely), even if the coin is fair, you will flip it and it will gen­er­ate a se­quence that sup­ports the “bias” hy­poth­e­sis more than the “fair” hy­poth­e­sis.

How of­ten will this hap­pen? It de­pends on how ex­actly you ask the ques­tion. If you can flip the coin at most 300 times, then there’s about a 1.4% chance that at some point the se­quence gen­er­ated will sup­port the hy­poth­e­sis “the coin is bi­ased 55% to­wards heads” 20x more than it sup­ports the hy­poth­e­sis “the coin is fair.” (You can ver­ify this your­self, and tweak the pa­ram­e­ters, us­ing this code.)
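The code linked from the original page is not reproduced here. As a stand-in, the following sketch computes the same kind of quantity exactly rather than by simulation: it tracks the probability, under a fair coin, of every head count that has not yet crossed the threshold. The function name and parameters are illustrative assumptions, not the original code.

```python
def prob_ratio_ever_reaches(threshold=20.0, max_flips=300, bias=0.55):
    """Chance that a FAIR coin ever yields a likelihood ratio of at least
    `threshold` in favor of "P(heads) = bias" over "fair", checking
    after every flip, within `max_flips` flips.
    """
    alive = {0: 1.0}   # heads so far -> probability, threshold not yet crossed
    crossed = 0.0
    for n in range(1, max_flips + 1):
        nxt = {}
        for h, p in alive.items():
            nxt[h] = nxt.get(h, 0.0) + p / 2          # next flip is tails
            nxt[h + 1] = nxt.get(h + 1, 0.0) + p / 2  # next flip is heads
        for h in list(nxt):
            ratio = (bias / 0.5) ** h * ((1 - bias) / 0.5) ** (n - h)
            if ratio >= threshold:
                crossed += nxt.pop(h)   # stop counting this path once it crosses
        alive = nxt
    return crossed

print(f"{prob_ratio_ever_reaches():.3%}")
```

Under the fair-coin hypothesis the likelihood ratio is a martingale with expected value 1, so the result can never exceed 1/20 = 5% however many flips are allowed; tweaking `max_flips` shows how slowly it approaches that ceiling.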

This is an objective fact about coin tosses. If you look at a sequence of Hs and Ts generated by a fair coin, then some tiny fraction of the time, after some number \(n\) of flips, it will support the “biased 55% towards heads” hypothesis 20x more than it supports the “fair” hypothesis. This is true no matter how or why you decided to look at those \(n\) coin flips. It’s true if you were always planning to look at \(n\) coin flips since the day you were born. It’s true if each coin flip costs $1 to look at, so you decided to only look until the evidence supported one hypothesis at least 20x better than the other. It’s true if you have a heavy personal desire to see the coin come up biased, and were planning to keep flipping until the evidence supports “bias” 20x more than it supports “fair”. It doesn’t matter why you looked at the sequence of Hs and Ts. The amount by which it supports “biased” vs “fair” is objective. If the coin really is fair, then the more you flip it the more the evidence will push towards “fair.” It will only support “bias” a small unlucky fraction of the time, and that fraction is completely independent of your thoughts and intentions.

Like­li­hoods are ob­jec­tive. They don’t de­pend on your state of mind.

P-val­ues, on the other hand, run into some difficul­ties. A p-value is about a sin­gle hy­poth­e­sis (such as “fair”) in iso­la­tion. If the coin is fair, then all se­quences of coin tosses are equally likely, so you need some­thing more than the data in or­der to de­cide whether the data is “sig­nifi­cant ev­i­dence” about fair­ness one way or the other. Which means you have to choose a “refer­ence class” of ways the coin “could have come up.” Which means you need to tell us which ex­per­i­ment you “in­tended” to run. And down the rab­bit hole we go.

The p-value you report depends on how many coin tosses you say you were going to look at. If you lie about where you intended to stop, the p-value breaks. If you’re out in the field collecting data, and the data just subconsciously begins to feel overwhelming, and so you stop collecting evidence (or if the data just subconsciously feels insufficient and so you collect more) then the p-value breaks. How badly do p-values break? If you can toss the coin at most 300 times, then by choosing when to stop looking, you can get a p < 0.05 significant result 21% of the time, and that’s assuming you are required to look at at least 30 flips. If you’re allowed to use small sample sizes, the number is more like 25%. You can verify this yourself, and tweak the parameters, using this code.
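The code linked from the original page is not reproduced here. The sketch below is one way to estimate the effect of optional stopping, under stated assumptions: a fair coin, a two-sided exact binomial test against the fair-coin null, and a check after every flip from `min_flips` to `max_flips`. The exact percentage depends on those choices, so it need not match the figures above to the digit. All names are illustrative.

```python
from math import comb

def critical_count(n, alpha=0.05):
    """Smallest m with two-sided p-value 2 * P(X >= m) < alpha for n fair flips.

    Returns n + 1 when no outcome of n flips is significant.
    """
    tail = 0
    m = n + 1
    for k in range(n, -1, -1):
        tail += comb(n, k)              # exact count of sequences with >= k heads
        if 2 * tail >= alpha * 2**n:
            break
        m = k
    return m

def prob_significant_at_some_point(max_flips=300, min_flips=30, alpha=0.05):
    """Chance that a FAIR coin looks 'significant' at some check."""
    alive = {0: 1.0}   # heads so far -> probability, not yet stopped
    stopped = 0.0
    for n in range(1, max_flips + 1):
        nxt = {}
        for h, p in alive.items():
            nxt[h] = nxt.get(h, 0.0) + p / 2
            nxt[h + 1] = nxt.get(h + 1, 0.0) + p / 2
        if n >= min_flips:
            m = critical_count(n, alpha)
            for h in list(nxt):
                if h >= m or h <= n - m:   # too many heads or too many tails
                    stopped += nxt.pop(h)  # experimenter stops and reports
        alive = nxt
    return stopped

print(f"one look at 30 flips:  {prob_significant_at_some_point(30, 30):.1%}")
print(f"every look, 30 to 300: {prob_significant_at_some_point(300, 30):.1%}")
```

With a single pre-declared look the false-positive rate stays below the nominal 5%; letting the experimenter stop at the first significant look inflates it severalfold.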

It’s no won­der that p-val­ues are so of­ten mi­sused! To use p-val­ues cor­rectly, an ex­per­i­menter has to metic­u­lously re­port their in­ten­tions about the ex­per­i­men­tal de­sign be­fore col­lect­ing data, and then has to hold ut­terly un­fal­ter­ing to that ex­per­i­ment de­sign as the data comes in (even if it be­comes clear that their ex­per­i­men­tal de­sign was naive, and that there were cru­cial con­sid­er­a­tions that they failed to take into ac­count). Us­ing p-val­ues cor­rectly re­quires good in­ten­tions, con­stant vigilance, and in­flex­i­bil­ity.

Con­trast this with like­li­hood func­tions. Like­li­hood func­tions don’t de­pend on your in­ten­tions. If you start col­lect­ing data un­til it looks over­whelming and then stop, that’s great. If you start col­lect­ing data and it looks un­der­whelming so you keep col­lect­ing more, that’s great too. Every new piece of data you do col­lect will sup­port the true hy­poth­e­sis more than any other hy­poth­e­sis, in ex­pec­ta­tion — that’s the whole point of col­lect­ing data. Like­li­hood func­tions don’t de­pend upon your state of mind.

What if the ex­per­i­menter uses some other tech­nique to bias the re­sult?

They can’t. Or, at least, it’s a the­o­rem of prob­a­bil­ity the­ory that they can’t. This law is known as con­ser­va­tion of ex­pected ev­i­dence, and it says that for any hy­poth­e­sis \(H\) and any piece of ev­i­dence \(e\), \(\mathbb P(H) = \mathbb P(H \mid e) \mathbb P(e) + \mathbb P(H \mid \lnot e) \mathbb P(\lnot e),\) where \(\mathbb P\) stands for my per­sonal sub­jec­tive prob­a­bil­ities.

Imag­ine that I’m go­ing to take your like­li­hood func­tion \(\mathcal L\) and blindly com­bine it with my per­sonal be­liefs us­ing Bayes’ rule. The ques­tion is, can you use \(\mathcal L\) to ma­nipu­late my be­liefs? The an­swer is clearly “yes” if you’re will­ing to lie about what data you saw. But what if you’re hon­estly re­port­ing all the data you ac­tu­ally saw? Then can you ma­nipu­late my be­liefs, per­haps by be­ing strate­gic about what data you look at and how long you look at it?

Clearly, the an­swer to that ques­tion is “sort of.” If you have a fair coin, and you want to con­vince me it’s bi­ased, and you toss it 10 times, and it (by sheer luck) comes up HHHHHHHHHH, then that’s a lot of ev­i­dence in fa­vor of it be­ing bi­ased. But you can’t use the “hope the coin comes up heads 10 times in a row by sheer luck” strat­egy to re­li­ably bias my be­liefs; and if you try just flip­ping the coin 10 times and hop­ing to get lucky, then on av­er­age, you’re go­ing to pro­duce data that con­vinces me that the coin is fair. The real ques­tion is, can you bias my be­liefs in ex­pec­ta­tion?

If the an­swer is “yes,” then there will be times when I should ig­nore \(\mathcal L\) even if you hon­estly re­ported what you saw. If the an­swer is “no,” then there will be no such times — for ev­ery \(e\) that would shift my be­liefs heav­ily to­wards \(H\) (such that you could say “Aha! How naive! If you look at this data and see it is \(e\), then you will be­lieve \(H\), just as I in­tended”), there is an equal and op­po­site chance of al­ter­na­tive data which would push my be­liefs away from \(H.\) So, can you set up a data col­lec­tion mechanism that pushes me to­wards \(H\) in ex­pec­ta­tion?

And the an­swer to that ques­tion is no, and this is a triv­ial the­o­rem of prob­a­bil­ity the­ory. No mat­ter what sub­jec­tive be­lief state \(\mathbb P\) I use, if you hon­estly re­port the ob­jec­tive like­li­hood \(\mathcal L\) of the data you ac­tu­ally saw, and I up­date \(\mathbb P\) by mul­ti­ply­ing it by \(\mathcal L\), there is no way (ac­cord­ing to \(\mathbb P\)) for you to bias my prob­a­bil­ity of \(H\) on av­er­age — no mat­ter how strate­gi­cally you de­cide which data to look at or how long to look. For more on this the­o­rem and its im­pli­ca­tions, see Con­ser­va­tion of Ex­pected Ev­i­dence.
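For a single coin flip, the identity can be checked numerically. The prior below is a hypothetical belief state chosen only for illustration, and `posterior` is an illustrative helper rather than anything defined in the original proposal.

```python
def posterior(prior, likelihoods):
    """Bayes' rule: multiply the prior by the likelihood, then normalize.

    Returns the posterior and the normalizer, which is P(e).
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    prob_e = sum(unnormalized.values())
    return {h: v / prob_e for h, v in unnormalized.items()}, prob_e

prior = {"fair": 0.8, "biased": 0.2}       # hypothetical subjective prior
p_heads = {"fair": 0.5, "biased": 0.55}    # what each hypothesis predicts
p_tails = {h: 1 - p for h, p in p_heads.items()}

post_if_heads, prob_heads = posterior(prior, p_heads)
post_if_tails, prob_tails = posterior(prior, p_tails)

# Average the two possible posteriors, weighted by how likely each outcome is.
expected = {h: post_if_heads[h] * prob_heads + post_if_tails[h] * prob_tails
            for h in prior}
print(expected)   # matches the prior, up to floating-point rounding
```

Seeing heads nudges the belief towards “biased” and seeing tails nudges it towards “fair,” but the probability-weighted average of the two posteriors is exactly the prior. Applying the same argument flip by flip is what rules out stopping strategies that shift beliefs in expectation.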

There’s a differ­ence be­tween met­rics that can’t be ex­ploited in the­ory and met­rics that can’t be ex­ploited in prac­tice, and if a mal­i­cious ex­per­i­menter re­ally wanted to abuse like­li­hood func­tions, they could prob­a­bly find some clever method. (At the least, they can always lie and make things up.) How­ever, p-val­ues aren’t even prov­ably in­ex­ploitable — they’re so easy to ex­ploit that some­times well-mean­ing hon­est re­searchers ex­ploit them by ac­ci­dent, and these ex­ploits are already com­mon­place and harm­ful. When build­ing bet­ter met­rics, start­ing with met­rics that are prov­ably in­ex­ploitable is a good start.

What if you pick the wrong hy­poth­e­sis class?

If you don’t re­port like­li­hoods for the hy­pothe­ses that some­one cares about, then that per­son won’t find your like­li­hood func­tion very helpful. The same prob­lem ex­ists when you re­port p-val­ues (what if you pick the wrong null and al­ter­na­tive hy­pothe­ses?). Like­li­hood func­tions make the prob­lem a lit­tle bet­ter, by mak­ing it easy to re­port how well the data sup­ports a wide va­ri­ety of hy­pothe­ses (in­stead of just ~2), but at the end of the day, there’s no sub­sti­tute for the raw data.

Like­li­hoods are a sum­mary of the data you saw. They’re a use­ful sum­mary, es­pe­cially if you re­port like­li­hoods for a broad set of plau­si­ble hy­pothe­ses. They’re a much bet­ter sum­mary than many other al­ter­na­tives, such as p-val­ues. But they’re still a sum­mary, and there’s just no sub­sti­tute for the raw data.

How does re­port­ing like­li­hoods help pre­vent pub­li­ca­tion bias?

When you’re re­port­ing p-val­ues, there’s a stark differ­ence be­tween p-val­ues that fa­vor the null hy­poth­e­sis (which are deemed “in­signifi­cant”) and p-val­ues that re­ject the null hy­poth­e­sis (which are deemed “sig­nifi­cant”). This “sig­nifi­cance” oc­curs at ar­bi­trary thresh­olds (e.g. p < 0.05), and sig­nifi­cance is counted only in one di­rec­tion (to be sig­nifi­cant, you must re­ject the null hy­poth­e­sis). Both these fea­tures con­tribute to pub­li­ca­tion bias: Jour­nals only want to ac­cept ex­per­i­ments that claim “sig­nifi­cance” and re­ject the null hy­poth­e­sis.

When you’re re­port­ing like­li­hood func­tions, a 20 : 1 ra­tio is a 20 : 1 ra­tio is a 20 : 1 ra­tio. It doesn’t mat­ter if your like­li­hood func­tion is peaked near “the coin is fair” or whether it’s peaked near “the coin is bi­ased 82% to­wards heads.” If the ra­tio be­tween the like­li­hood of one hy­poth­e­sis and the like­li­hood of an­other hy­poth­e­sis is 20 : 1 then the data pro­vides the same strength of ev­i­dence ei­ther way. Like­li­hood func­tions don’t sin­gle out one “null” hy­poth­e­sis and in­cen­tivize peo­ple to only re­port data that pushes away from that null hy­poth­e­sis; they just talk about the re­la­tion­ship be­tween the data and all the in­ter­est­ing hy­pothe­ses.

Fur­ther­more, there’s no ar­bi­trary sig­nifi­cance thresh­old for like­li­hood func­tions. If you didn’t have a ton of data, your like­li­hood func­tion will be pretty spread out, but it won’t be use­less. If you find \(5 : 1\) odds in fa­vor of \(H_1\) over \(H_2\), and I in­de­pen­dently find \(6 : 1\) odds in fa­vor of \(H_1\) over \(H_2\), and our friend in­de­pen­dently finds \(3 : 1\) odds in fa­vor of \(H_1\) over \(H_2,\) then our stud­ies as a whole con­sti­tute ev­i­dence that fa­vors \(H_1\) over \(H_2\) by a fac­tor of \(90 : 1\) — hardly in­signifi­cant! With like­li­hood ra­tios (and no ar­bi­trary “sig­nifi­cance” cut­offs) progress can be made in small steps.

Of course, this wouldn’t solve the prob­lem of pub­li­ca­tion bias in full, not by a long shot. There would still be in­cen­tives to re­port cool and in­ter­est­ing re­sults, and the sci­en­tific com­mu­nity might still ask for re­sults to pass some sort of “sig­nifi­cance” thresh­old be­fore ac­cept­ing them for pub­li­ca­tion. How­ever, re­port­ing like­li­hoods would be a good start.

How does re­port­ing like­li­hoods help ad­dress van­ish­ing effect sizes?

In a field where an effect does not ac­tu­ally ex­ist, we will of­ten ob­serve an ini­tial study that finds a very large effect, fol­lowed by a num­ber of at­tempts at repli­ca­tion that find smaller and smaller and smaller effects (un­til some­one pos­tu­lates that the effect doesn’t ex­ist, and does a meta-anal­y­sis to look for p-hack­ing and pub­li­ca­tion bias). This is known as the de­cline effect; see also The con­trol group is out of con­trol.

The decline effect is possible in part because p-values look only at whether the evidence says we should “accept” or “reject” a special null hypothesis, without any consideration for what that evidence says about the alternative hypotheses. Let’s say we have three studies, all of which reject the null hypothesis “the coin is fair.” The first study rejects the null hypothesis with a 95% confidence interval centered on a bias of 0.9 in favor of heads, but it was a small study and some of the experimenters were a bit sloppy. The second study is a bit bigger and a bit better organized, and rejects the null hypothesis with a 95% confidence interval centered on 0.62. The third study is high-powered, long-running, and rejects the null hypothesis with a 95% confidence interval centered on 0.511. It’s easy to say “look, three separate studies rejected the null hypothesis!”

But if you look at the like­li­hood func­tions, you’ll see that some­thing very fishy is go­ing on — none of the stud­ies ac­tu­ally agree with each other! The effect sizes are in­com­pat­i­ble. Like­li­hood func­tions make this phe­nomenon easy to de­tect, be­cause they tell you how much the data sup­ports all the rele­vant hy­pothe­ses (not just the null hy­poth­e­sis). If you com­bine the three like­li­hood func­tions, you’ll see that none of the con­fi­dence in­ter­vals fare very well. Like­li­hood func­tions make it ob­vi­ous when differ­ent stud­ies con­tra­dict each other di­rectly, which makes it much harder to sum­ma­rize con­tra­dic­tory data down to “three stud­ies re­jected the null hy­poth­e­sis”.

What if I want to re­ject the null hy­poth­e­sis with­out need­ing to have any par­tic­u­lar al­ter­na­tive in mind?

Maybe you don’t want to re­port like­li­hoods for a large hy­poth­e­sis class, be­cause you are pretty sure you can’t gen­er­ate a hy­poth­e­sis class that con­tains the cor­rect hy­poth­e­sis. “I don’t want to have to make up a bunch of al­ter­na­tives,” you protest, “I just want to show that the null hy­poth­e­sis is wrong, in iso­la­tion.”

For­tu­nately for you, that’s pos­si­ble us­ing like­li­hood func­tions! The tool you’re look­ing for is the no­tion of strict con­fu­sion. A hy­poth­e­sis \(H\) will tell you how low its like­li­hood is sup­posed to get, and if its like­li­hood goes a lot lower than that value, then you can be pretty con­fi­dent that you’ve got the wrong hy­poth­e­sis.

For example, let’s say that your one and only hypothesis is \(H_{0.9}\) = “the coin is biased 90% towards heads.” Now let’s say you flip the coin twenty times, and you see the sequence THTTHTTTHTTTTHTTTTTH. The log-likelihood that \(H_{0.9}\) expected to get on a sequence of 20 coin tosses was about −9.38 bits, for a likelihood score of about \(2^{-9.38} \approx 1.5 \cdot 10^{-3},\) on average. (Note: According to \(H_{0.9},\) each coin toss carries \(0.9 \log_2(0.9) + 0.1 \log_2(0.1) \approx -0.469\) bits of evidence, so after 20 coin tosses, \(H_{0.9}\) expects about \(20 \cdot 0.469 \approx 9.38\) bits of surprise. For more on why log likelihood is a convenient tool for measuring “evidence” and “surprise,” see Bayes’ rule: log odds form.) The log-likelihood that \(H_{0.9}\) actually gets on that sequence is −50.59 bits, for a likelihood score of about \(5.9 \cdot 10^{-16},\) which is more than twelve orders of magnitude less likely than expected. You don’t need to be clever enough to come up with an alternative hypothesis that explains the data in order to know that \(H_{0.9}\) is not the right hypothesis for you.
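The comparison between expected and actual log-likelihood can be sketched as follows. The function names are illustrative, and the code assumes independent tosses with a fixed bias.

```python
from math import log2

def expected_log_likelihood(bias, flips):
    """Log-likelihood (in bits) the hypothesis expects to score on `flips` tosses."""
    return flips * (bias * log2(bias) + (1 - bias) * log2(1 - bias))

def actual_log_likelihood(data, bias):
    """Log-likelihood (in bits) the hypothesis actually scores on `data`."""
    return data.count("H") * log2(bias) + data.count("T") * log2(1 - bias)

data = "THTTHTTTHTTTTHTTTTTH"
expected = expected_log_likelihood(0.9, len(data))
actual = actual_log_likelihood(data, 0.9)

print(f"expected: {expected:.1f} bits")           # expected: -9.4 bits
print(f"actual:   {actual:.1f} bits")             # actual:   -50.6 bits
print(f"shortfall: {expected - actual:.1f} bits")  # shortfall: 41.2 bits
```

A shortfall of about 41 bits means the data was roughly \(2^{41}\) times less probable than \(H_{0.9}\) itself expected, and no alternative hypothesis was needed to notice that.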

In fact, like­li­hood func­tions make it easy to show that lots of differ­ent hy­pothe­ses are strictly con­fused — you don’t need to have a good hy­poth­e­sis in your hy­poth­e­sis class in or­der for re­port­ing like­li­hood func­tions to be a use­ful ser­vice.

How does re­port­ing like­li­hoods make it eas­ier to com­bine mul­ti­ple stud­ies?

Want to com­bine two stud­ies that re­ported like­li­hood func­tions? Easy! Just mul­ti­ply the like­li­hood func­tions to­gether. If the first study re­ported 10 : 1 odds in fa­vor of “fair coin” over “bi­ased 55% to­wards heads,” and the sec­ond study re­ported 12 : 1 odds in fa­vor of “fair coin” over “bi­ased 55% to­wards heads,” then the com­bined stud­ies sup­port the “fair coin” hy­poth­e­sis over the “bi­ased 55% to­wards heads” hy­poth­e­sis at a like­li­hood ra­tio of 120 : 1.

Is it really that easy? Yes! That’s one of the benefits of using a representation of evidence supported by a large edifice of probability theory: likelihood functions are trivially easy to compose. You do have to ensure that the studies are independent first, because otherwise you’ll double-count the data. (If the combined likelihood ratios get really extreme, you should be suspicious about whether they were actually independent.) This isn’t exactly a new problem in experimental science; we can just add it to the list of reasons why replication studies had better be independent of the original study. Also, you can only multiply the likelihood functions together on places where they’re both defined: If one study doesn’t report the likelihood for a hypothesis that you care about, you might need access to the raw data in order to extend their likelihood function. But if the studies are independent and both report likelihood functions for the relevant hypotheses, then all you need to do is multiply.
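Here is a minimal sketch of that multiplication, with likelihood functions represented as dictionaries from a bias value to a likelihood. The study counts are hypothetical numbers invented for illustration, and the helper names are assumptions rather than an established API.

```python
from math import isclose

def likelihood_function(heads, tails, biases):
    """Likelihood of the observed tosses under each bias b = P(heads)."""
    return {b: b**heads * (1 - b)**tails for b in biases}

def combine(*functions):
    """Pointwise product over the hypotheses that every study reported."""
    shared = set(functions[0]).intersection(*functions[1:])
    combined = {}
    for b in sorted(shared):
        value = 1.0
        for f in functions:
            value *= f[b]
        combined[b] = value
    return combined

# Hypothetical counts, invented purely for illustration.
study_a = likelihood_function(heads=12, tails=8, biases=[0.5, 0.55])
study_b = likelihood_function(heads=27, tails=23, biases=[0.5, 0.55, 0.6])

combined = combine(study_a, study_b)
pooled = likelihood_function(heads=39, tails=31, biases=[0.5, 0.55])

# Multiplying the two functions matches the likelihood of the pooled data.
print(all(isclose(combined[b], pooled[b]) for b in combined))  # True
# Only hypotheses reported by both studies survive the combination.
print(sorted(combined))  # [0.5, 0.55]
```

The hypothesis 0.6 drops out because only one study reported it, which is the “both defined” caveat above; extending study A to cover it would require study A’s raw counts.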

(Don’t try this with p-val­ues. A p < 0.05 study and a p < 0.01 study don’t com­bine into any­thing re­motely like a p < 0.0005 study.)

How does re­port­ing like­li­hoods make it eas­ier to con­duct meta-analy­ses?

When stud­ies re­port p-val­ues, perform­ing a meta-anal­y­sis is a com­pli­cated pro­ce­dure that re­quires dozens of pa­ram­e­ters to be finely tuned, and (lo and be­hold) bias some­how seeps in, and meta-analy­ses of­ten find what­ever the an­a­lyzer set out to find. When stud­ies re­port like­li­hood func­tions, perform­ing a meta-anal­y­sis is triv­ial and doesn’t de­pend on you to tune a dozen pa­ram­e­ters. Just mul­ti­ply all the like­li­hood func­tions to­gether.

If you want to be ex­tra vir­tu­ous, you can check for anoma­lies, such as one like­li­hood func­tion that’s tightly peaked in a place that dis­agrees with all the other peaks. You can also check for strict con­fu­sion, to get a sense for how likely it is that the cor­rect hy­poth­e­sis is con­tained within the hy­poth­e­sis class that you con­sid­ered. But mostly, all you’ve got to do is mul­ti­ply the like­li­hood func­tions to­gether.

How does re­port­ing like­li­hood func­tions make it eas­ier to de­tect fishy stud­ies?

With like­li­hood func­tions, it’s much eas­ier to find the stud­ies that don’t match up with each other — look for the like­li­hood func­tion that has its peak in a differ­ent place than all the other peaks. That study de­serves scrutiny: ei­ther those ex­per­i­menters had some­thing spe­cial go­ing on in the back­ground of their ex­per­i­ment, or some­thing strange hap­pened in their data col­lec­tion and re­port­ing pro­cess.

Fur­ther­more, like­li­hoods com­bined with the no­tion of strict con­fu­sion make it easy to no­tice when some­thing has gone se­ri­ously wrong. As per the above an­swers, you can com­bine mul­ti­ple stud­ies by mul­ti­ply­ing their like­li­hood func­tions to­gether. What hap­pens if the like­li­hood func­tion is su­per small ev­ery­where? That means that ei­ther (a) some of the data is fishy, or (b) you haven’t con­sid­ered the right hy­poth­e­sis yet.

When you have con­sid­ered the right hy­poth­e­sis, it will have de­cently high like­li­hood un­der all the data. There’s only one real world un­der­ly­ing all our data, af­ter all — it’s not like differ­ent ex­per­i­menters are mea­sur­ing differ­ent un­der­ly­ing uni­verses. If you mul­ti­ply all the like­li­hood func­tions to­gether and all the hy­pothe­ses turn out look­ing wildly un­likely, then you’ve got some work to do — you haven’t yet con­sid­ered the right hy­poth­e­sis.

When re­port­ing p-val­ues, con­tra­dic­tory stud­ies feel like the norm. No­body even tries to make all the stud­ies fit to­gether, as if they were all mea­sur­ing the same world. With like­li­hood func­tions, we could ac­tu­ally as­pire to­wards a world where sci­en­tific stud­ies on the same topic are all com­bined. A world where peo­ple try to find hy­pothe­ses that fit all the data at once, and where a sin­gle study’s data be­ing out of place (and mak­ing all the hy­pothe­ses cur­rently un­der con­sid­er­a­tion be­come strictly con­fused) is a big glar­ing “look over here!” sig­nal. A world where it feels like stud­ies are sup­posed to fit to­gether, where if sci­en­tists haven’t been able to find a hy­poth­e­sis that ex­plains all the raw data, then they know they have their work cut out for them.

What­ever the right hy­poth­e­sis is, it will al­most surely not be strictly con­fused un­der the ac­tual data. Of course, when you come up with a com­pletely new hy­poth­e­sis (such as “the coin most of us have been us­ing is fair but study #317 ac­ci­den­tally used a differ­ent coin”) you’re go­ing to need ac­cess to the raw data of some of the pre­vi­ous stud­ies in or­der to ex­tend their like­li­hood func­tions and see how well they do on this new hy­poth­e­sis. As always, there’s just no sub­sti­tute for raw data.

Why would this make statis­tics eas­ier to do and un­der­stand?

p < 0.05 does not mean “the null hypothesis is less than 5% likely” (though that’s what young students of statistics often want it to mean). What p < 0.05 means is “given a particular experimental design (e.g., toss the coin 100 times and count the heads) and the data (e.g., the sequence of 100 coin flips), if the null hypothesis were true, then data at least as extreme as mine according to my chosen statistic (e.g., the number of heads) would occur less than 5% of the time, if we repeated this experiment over and over and over.”

Why the com­plex­ity? Statis­tics is de­signed to keep sub­jec­tive be­liefs out of the hal­lowed halls of sci­ence. Your sci­ence pa­per shouldn’t be able to con­clude “and, there­fore, I per­son­ally be­lieve that the coin is very likely to be bi­ased, and I’d bet on that at 20 : 1 odds.” Still, much of this com­plex­ity is un­nec­es­sary. Like­li­hood func­tions achieve the same goal of ob­jec­tivity, but with­out all the com­plex­ity.

\(\mathcal L_e(H) < 0.05\) also doesn’t mean “\(H\) is less than 5% likely”; it means “\(H\) assigned less than 0.05 probability to \(e\) happening.” The student still needs to learn to keep “probability of \(e\) given \(H\)” and “probability of \(H\) given \(e\)” distinctly separate in their heads. However, likelihood functions do have a simpler interpretation: \(\mathcal L_e(H)\) is the probability of the actual data \(e\) occurring if \(H\) were in fact true. No need to talk about experimental design, no need to choose a summary statistic, no need to talk about what “would have happened.” Just look at how much probability each hypothesis assigned to the actual data; that’s your likelihood function.

If you’re go­ing to re­port p-val­ues, you need to be metic­u­lous in con­sid­er­ing the com­plex­ities and sub­tleties of ex­per­i­ment de­sign, on pain of cre­at­ing p-val­ues that are bro­ken in non-ob­vi­ous ways (thereby con­tribut­ing to the repli­ca­tion crisis). When read­ing re­sults, you need to take the ex­per­i­menter’s in­ten­tions into ac­count. None of this is nec­es­sary with like­li­hoods.

To un­der­stand \(\mathcal L_e(H),\) all you need to know is how likely \(e\) was ac­cord­ing to \(H.\) Done.

Isn’t this just one ad­di­tional pos­si­ble tool in the toolbox? Why switch en­tirely away from p-val­ues?

This may all sound too good to be true. Can one sim­ple change re­ally solve that many prob­lems in mod­ern sci­ence?

First of all, you can be as­sured that re­port­ing like­li­hoods in­stead of p-val­ues would not “solve” all the prob­lems above, and it would surely not solve all prob­lems with mod­ern ex­per­i­men­tal sci­ence. Open ac­cess to raw data, pre­reg­is­tra­tion of stud­ies, a cul­ture that re­wards repli­ca­tion, and many other ideas are also cru­cial in­gre­di­ents to a sci­en­tific com­mu­nity that ze­roes in on truth.

How­ever, re­port­ing like­li­hoods would help solve lots of differ­ent prob­lems in mod­ern ex­per­i­men­tal sci­ence. This may come as a sur­prise. Aren’t like­li­hood func­tions just one more statis­ti­cal tech­nique, just an­other tool for the toolbox? Why should we think that one sin­gle tool can solve that many prob­lems?

The rea­son lies in prob­a­bil­ity the­ory. Ac­cord­ing to the ax­ioms of prob­a­bil­ity the­ory, there is only one good way to ac­count for ev­i­dence when up­dat­ing your be­liefs, and that way is via like­li­hood func­tions. Any other method is sub­ject to in­con­sis­ten­cies and patholo­gies, as per the co­her­ence the­o­rems of prob­a­bil­ity the­ory.

If you’re manipulating equations like \(2 + 2 = 4,\) and you’re using methods that may or may not let you throw in an extra 3 on the right-hand side (depending on the arithmetician’s state of mind), then it’s no surprise that you’ll occasionally get yourself into trouble and deduce that \(2 + 2 = 7.\) The laws of arithmetic show that there is only one correct set of tools for manipulating equations if you want to avoid inconsistency.

Similarly, the laws of prob­a­bil­ity the­ory show that there is only one cor­rect set of tools for ma­nipu­lat­ing un­cer­tainty if you want to avoid in­con­sis­tency. Ac­cord­ing to those rules, the right way to rep­re­sent ev­i­dence is through like­li­hood func­tions.
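For concreteness, the rule in question is Bayes’ rule, written here in odds form for two hypotheses \(H_1\) and \(H_2\) and evidence \(e.\) The symbol \(\mathbb P\) for a reader’s own probabilities is notation assumed for this sketch.

```latex
\[
\frac{\mathbb P(H_1 \mid e)}{\mathbb P(H_2 \mid e)}
  = \frac{\mathbb P(H_1)}{\mathbb P(H_2)}
    \cdot \frac{\mathcal L_e(H_1)}{\mathcal L_e(H_2)}
\]
```

The evidence enters only through the ratio of likelihoods. In the coin example, HHHHHT is about 5.93% likely under \(H_{0.75}\) and about 1.56% likely under \(H_{0.5},\) so whatever odds a reader started with, the data shifts them by a factor of roughly 3.8 towards \(H_{0.75}.\)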

Th­ese laws (and a solid un­der­stand­ing of them) are younger than the ex­per­i­men­tal sci­ence com­mu­nity, and the statis­ti­cal tools of that com­mu­nity pre­date a mod­ern un­der­stand­ing of prob­a­bil­ity the­ory. Thus, it makes a lot of sense that the ex­ist­ing liter­a­ture uses differ­ent tools. How­ever, now that hu­man­ity does pos­sess a solid un­der­stand­ing of prob­a­bil­ity the­ory, it should come as no sur­prise that many di­verse patholo­gies in statis­tics can be cleaned up by switch­ing to a policy of re­port­ing like­li­hoods in­stead of p-val­ues.

If it’s so great, why aren’t we doing it already?

Prob­a­bil­ity the­ory (and a solid un­der­stand­ing of all that it im­plies) is younger than the ex­per­i­men­tal sci­ence com­mu­nity, and the statis­ti­cal tools of that com­mu­nity pre­date a mod­ern un­der­stand­ing of prob­a­bil­ity the­ory. In par­tic­u­lar, mod­ern statis­ti­cal tools were de­signed in an at­tempt to keep sub­jec­tive rea­son­ing out of the hal­lowed halls of sci­ence. You shouldn’t be able to pub­lish a sci­en­tific pa­per which con­cludes “and there­fore, I per­son­ally be­lieve that this coin is bi­ased to­wards heads, and would bet on that at 20 : 1 odds.” Those aren’t the foun­da­tions upon which sci­ence can be built.

Like­li­hood func­tions are strongly as­so­ci­ated with Bayesian statis­tics, and Bayesian statis­ti­cal tools tend to ma­nipu­late sub­jec­tive prob­a­bil­ities. Thus, it wasn’t en­tirely clear how to use tools such as like­li­hood func­tions with­out let­ting sub­jec­tivity bleed into sci­ence.

Nowa­days, we have a bet­ter un­der­stand­ing of how to sep­a­rate out sub­jec­tive prob­a­bil­ities from ob­jec­tive claims, and it’s known that like­li­hood func­tions don’t carry any sub­jec­tive bag­gage with them. In fact, they carry less sub­jec­tive bag­gage than p-val­ues do: A like­li­hood func­tion de­pends only on the data that you ac­tu­ally saw, whereas p-val­ues de­pend on your ex­per­i­men­tal de­sign and your in­ten­tions.
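The coin example from the top of this page makes the contrast concrete. The Python sketch below is illustrative only: the function names are hypothetical, the p-values are one-sided against the null hypothesis “the coin is fair,” and the second design is taken to be “toss until the first tail.”

```python
from math import comb


def p_value_fixed_n(heads: int, n: int) -> float:
    """Design A: toss exactly n times.

    One-sided p-value under a fair coin: P(at least `heads` heads in n tosses).
    """
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n


def p_value_until_tails(heads_before_tail: int) -> float:
    """Design B: toss until the first tail.

    One-sided p-value under a fair coin: P(at least this many heads come
    before the first tail).
    """
    return 0.5 ** heads_before_tail


def likelihood(heads: int, tails: int, bias: float) -> float:
    """Probability of one specific sequence with these counts, given P(heads) = bias."""
    return bias ** heads * (1 - bias) ** tails


# Observed data: HHHHHT (5 heads, then 1 tail).
print(f"p-value, design A: {p_value_fixed_n(5, 6):.3f}")
print(f"p-value, design B: {p_value_until_tails(5):.3f}")
print(f"likelihood under the fair coin: {likelihood(5, 1, 0.5):.4f}")
```

The two p-values come out near the 0.11 and 0.03 quoted earlier, even though the data is identical. The likelihood of HHHHHT under any given bias is the same number under both designs, because it is computed from the observed sequence alone.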

There are good his­tor­i­cal rea­sons why the ex­ist­ing sci­en­tific com­mu­nity is us­ing p-val­ues, but now that hu­man­ity does pos­sess a solid the­o­ret­i­cal un­der­stand­ing of prob­a­bil­ity the­ory (and how to fac­tor sub­jec­tive prob­a­bil­ities out from ob­jec­tive claims), it’s no sur­prise that a wide ar­ray of di­verse prob­lems in mod­ern statis­tics can be cleaned up by re­port­ing like­li­hoods in­stead of p-val­ues.

Has this ever been tried?

No. Not yet. To our knowledge, most scientists haven’t even considered this proposal, and for good reason! There are a lot of big fish to fry when it comes to addressing the replication crisis, p-hacking, the problem of vanishing effect sizes, publication bias, and other problems facing science today. The scientific community at large is huge, decentralized, and has a lot of inertia. Most activists who are trying to shift it already have their hands full advocating for very important policies such as open access journals and preregistration of trials. So it makes sense that nobody’s advocating hard for reporting likelihoods instead of p-values, at least not yet.

Nev­er­the­less, there are good rea­sons to be­lieve that re­port­ing like­li­hoods in­stead of p-val­ues would help solve many of the is­sues in mod­ern ex­per­i­men­tal sci­ence.