Report likelihoods, not p-values

This page advocates for a change in the way that statistics is done in standard scientific journals. The key idea is to report likelihood functions instead of p-values, and this could have many benefits.

(Note: This page is a personal opinion page.)

What’s the difference?

The status quo across scientific journals is to test data for “statistical significance” using measures such as p-values. A p-value is a number calculated from a hypothesis (called the “null hypothesis”), an experiment, a result, and a summary statistic. For example, if the null hypothesis is “this coin is fair,” the experiment is “flip it 6 times,” the result is HHHHHT, and the summary statistic is “the sequence has at least five H values,” then the p-value is 0.11, which means “if the coin were fair, and we ran this experiment many times, then only 11% of the generated sequences would have at least five H values.” (Note: this does not mean that the coin is 89% likely to be biased! For example, if the only alternative is that the coin is biased towards tails, then HHHHHT is evidence that it’s fair. This is a common source of confusion with p-values.) If the p-value is lower than an arbitrary threshold (usually \(p < 0.05\)), the result is called “statistically significant” and the null hypothesis is “rejected.”
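
For readers who want to check the arithmetic, here is a minimal sketch of that p-value calculation in Python (the function name and parameters are ours, purely for illustration):

```python
# p-value for the example above: P(at least 5 heads in 6 flips | fair coin).
from math import comb

def p_value_at_least_k_heads(n, k, p_heads=0.5):
    """Probability, under the null hypothesis, of seeing k or more heads in n flips."""
    return sum(comb(n, j) * p_heads**j * (1 - p_heads)**(n - j) for j in range(k, n + 1))

print(p_value_at_least_k_heads(n=6, k=5))   # 7/64 = 0.109375, the 0.11 quoted above
```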

This page advocates that scientific articles should report likelihood functions instead of p-values. A likelihood function for a piece of evidence \(e\) is a function \(\mathcal L\) which gives, for each hypothesis \(H\) in some set of hypotheses, the probability that \(H\) assigns to \(e\), written \(\mathcal L_e(H)\). (Note: many authors write \(\mathcal L(H \mid e)\) instead. We think this is confusing, as then \(\mathcal L(H \mid e) = \mathbb P(e \mid H),\) and it’s hard enough for students of statistics to keep “probability of \(H\) given \(e\)” and “probability of \(e\) given \(H\)” straight as it is, without the notation getting swapped around every so often.) For example, if \(e\) is “this coin, flipped 6 times, generated HHHHHT,” and the hypotheses under consideration are \(H_{0.25} =\) “the coin only produces heads 25% of the time” and \(H_{0.5} =\) “the coin is fair,” then \(\mathcal L_e(H_{0.25}) = 0.25^5 \cdot 0.75 \approx 0.07\%\) and \(\mathcal L_e(H_{0.5}) = 0.5^6 \approx 1.56\%,\) for a likelihood ratio of about \(21 : 1\) in favor of the coin being fair (as opposed to biased 75% towards tails).

In fact, with a single likelihood function, we can report the amount of support \(e\) gives to every hypothesis \(H_b\) of the form “the coin has bias \(b\) towards heads”: (To learn how this graph was generated, see Bayes’ rule: Functional form.)
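
As a rough sketch of how such a likelihood function can be computed (the helper below and its hard-coded counts of five heads and one tail are illustrative, not anything prescribed by this page): each hypothesis \(H_b\) assigns probability \(b^5 (1 - b)\) to the sequence HHHHHT.

```python
# Illustrative sketch: the likelihood function of e = HHHHHT over the
# hypothesis class H_b = "the coin has bias b towards heads".
# Each H_b assigns probability b**5 * (1 - b) to the observed sequence.
def likelihood(b, heads=5, tails=1):
    return b**heads * (1 - b)**tails

# A coarse, text-only version of the graph: likelihoods at a few biases.
for b in [0.25, 0.5, 0.75, 5 / 6]:
    print(f"L_e(H_{b:.2f}) = {likelihood(b):.5f}")
# The function peaks at b = 5/6, the observed frequency of heads.
```

Dividing these values against each other gives the ratios quoted in the next paragraph.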

Note that this likelihood function is not telling us the probability that the coin is actually biased; it is only telling us how much the evidence supports each hypothesis. For example, the graph says that HHHHHT provides about 3.8 times as much evidence for \(H_{0.75}\) as for \(H_{0.5}\), and about 81 times as much evidence for \(H_{0.75}\) as for \(H_{0.25}.\)

Note also that the likelihood function doesn’t necessarily contain the right hypothesis; for example, the function above shows the support of \(e\) for every possible bias on the coin, but it doesn’t consider hypotheses like “the coin alternates between H and T.” Likelihood functions, like p-values, are essentially a mere summary of the raw data: there is no substitute for the raw data when it comes to allowing people to test hypotheses that the original researchers did not consider. (In other words, even if you report likelihoods instead of p-values, it’s still virtuous to share your raw data.)

Where p-values let you measure (roughly) how well the data supports a single “null hypothesis,” with an arbitrary 0.05 “not well enough” cutoff, the likelihood function shows the support of the evidence for lots and lots of different hypotheses at once, without any need for an arbitrary cutoff.

Why report likelihoods instead of p-values?

1. Likelihood functions are less arbitrary than p-values. To report a likelihood function, all you have to do is pick which hypothesis class to generate the likelihood function for. That’s your only degree of freedom. This introduces one source of arbitrariness, and if someone wants to check some other hypothesis they still need access to the raw data, but it is better than the p-value case, where you only report a number for a single “null” hypothesis.

Furthermore, in the p-value case, you have to pick not only a null hypothesis but also an experiment and a summary statistic, and these degrees of freedom can have a huge impact on the final report. These extra degrees of freedom are both unnecessary (to carry out a probabilistic update, all you need are your own personal beliefs and a likelihood function) and exploitable, and empirically, they’re actively harming scientific research.

2. Reporting likelihoods would solve p-hacking. If you’re using p-values, then you can game the statistics via your choice of experiment and summary statistics. In the example with the coin above, if you say your experiment and summary statistic are “flip the coin 6 times and count the number of heads,” then the p-value of HHHHHT with respect to \(H_{0.5}\) is 0.11, whereas if you say your experiment and summary statistic are “flip the coin until it comes up tails and count the number of heads,” then the p-value of HHHHHT with respect to \(H_{0.5}\) is 0.03, which is “significant.” This is called “p-hacking,” and it’s a serious problem in modern science.
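
Here is a minimal sketch of both calculations under the two experiments just described (the code itself is illustrative):

```python
# The same data, HHHHHT, gets different p-values under different experiments.
from math import comb

# Experiment 1: "flip 6 times"; statistic: at least 5 heads out of 6.
p_fixed_n = sum(comb(6, j) * 0.5**6 for j in range(5, 7))
print(p_fixed_n)         # 0.109375 -- "not significant"

# Experiment 2: "flip until the first tail"; statistic: at least 5 heads
# before that tail. Under a fair coin this means the first 5 flips are heads.
p_stop_at_tail = 0.5**5
print(p_stop_at_tail)    # 0.03125 -- "significant" at the 0.05 threshold
```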

With a likelihood function, the amount of support a piece of evidence gives to a hypothesis does not depend on which experiment the researcher had in mind. Likelihood functions depend only on the data you actually saw and the hypotheses you chose to report. The only way to cheat a likelihood function is to lie about the data you collected, or to refuse to report likelihoods for a particular hypothesis.
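
As a quick check, here is a sketch showing that the likelihood ratio for HHHHHT comes out the same no matter which of the two experiments above was intended (the helper name is ours):

```python
# The support HHHHHT gives each hypothesis depends only on the observed
# sequence. Under "flip 6 times" and under "flip until the first tail",
# the probability of seeing exactly HHHHHT is b**5 * (1 - b) either way,
# so the likelihood ratio between any two hypotheses is unchanged.
def sequence_likelihood(b):
    return b**5 * (1 - b)

print(sequence_likelihood(0.5) / sequence_likelihood(0.25))  # ~21:1 under both stopping rules
```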

If your paper fails to report likelihoods for some obvious hypotheses, then (a) that’s precisely analogous to you choosing the wrong null hypothesis to consider; (b) it’s just as easily noticeable as when your paper considers the wrong null hypothesis; and (c) it can be easily rectified given access to the raw data. By contrast, p-hacking can be subtle and hard to detect after the fact.

3. Likelihood functions are very difficult to game. There is no analog of p-hacking for likelihood functions. This is a theorem of probability theory known as conservation of expected evidence, which says that likelihood functions can’t be gamed unless you’re falsifying or omitting data (or screwing up the likelihood calculations). (Disclaimer: the theorem says likelihood functions can’t be gamed, but we still shouldn’t underestimate the guile of dishonest researchers struggling to make their results look important. Likelihood functions have not been put through the gauntlet of real scientific practice; p-values have. That said, when p-values were put through that gauntlet, they failed in a spectacular fashion. When rebuilding, it’s probably better to start from foundations that provably cannot be gamed.)
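
Here is a small numerical illustration of that theorem, using the two coin hypotheses from earlier and an arbitrary 50/50 prior (both the prior and the six-flip experiment are illustrative choices, not anything this page prescribes): however the experiment is chosen, the posterior averaged over all possible outcomes equals the prior, so no choice of experiment can push the expected update in a preferred direction.

```python
# Conservation of expected evidence, checked numerically for a toy example:
# averaging the posterior over all possible outcomes (weighted by how likely
# each outcome is) returns exactly the prior.
from itertools import product

prior = {0.5: 0.5, 0.75: 0.5}   # illustrative prior over two bias hypotheses

def seq_prob(seq, b):
    """P(this exact flip sequence | the coin has heads-bias b)."""
    p = 1.0
    for flip in seq:
        p *= b if flip == "H" else (1 - b)
    return p

expected_posterior = 0.0
for seq in product("HT", repeat=6):                     # every 6-flip outcome
    marginal = sum(prior[b] * seq_prob(seq, b) for b in prior)
    posterior_fair = prior[0.5] * seq_prob(seq, 0.5) / marginal
    expected_posterior += marginal * posterior_fair     # weight by P(outcome)

print(expected_posterior)   # 0.5 (up to float rounding): equal to the prior on "fair"
```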

4. Likelihood functions would help stop the “vanishing effect sizes” phenomenon. This phenomenon, also known as the decline effect, occurs when studies which reject a null hypothesis \(H_0\) have effect sizes that get smaller and smaller over time (the more someone tries to replicate the result). This is usually evidence that there is no actual effect, and that the initial “large effects” were a result of publication bias.

Likelihood functions help avoid the decline effect by treating different effect sizes differently. The likelihood function for coins of different biases shows that the evidence HHHHHT gives a different amount of support to \(H_{0.52}\), \(H_{0.61}\), and \(H_{0.8}\) (which correspond to small, medium, and large effect sizes, respectively). If three different studies find low support for \(H_{0.5}\), and one of them gives all of its support to the large effect, another gives all its support to the medium effect, and the third gives all of its support to the smallest effect, then likelihood functions reveal that something fishy is going on (because they’re all peaked in different places).

If instead we only use p-values, and always decide only whether to “keep” or “reject” the null hypothesis (without specifying how much support goes to different alternatives), then it’s hard to notice that the studies are actually contradictory (and that something very fishy is going on). Instead, it’s very tempting to exclaim “3 out of 3 studies reject \(H_{0.5}\)!” and move on.

5. Likelihood functions would help stop publication bias. When using p-values, if the data yields a p-value of 0.11 against a null hypothesis \(H_0\), the study is considered “insignificant,” and many journals have a strong bias towards publishing positive results. When reporting likelihood functions, there is no arbitrary “significance” threshold. A study that reports a relative likelihood of \(21 : 1\) in favor of \(H_a\) over \(H_0\) provides exactly the same strength of evidence as a study that reports \(21 : 1\) in favor of \(H_0\) over \(H_a\). It’s all just evidence, and it can all be added to the corpus.

6. Likelihood functions make it trivially easy to combine studies. When combining studies that used p-values, researchers have to perform complex meta-analyses with dozens of parameters to tune, and they often find exactly what they were expecting to find. By contrast, the way you combine multiple studies that reported likelihood functions is… (drumroll) …you just multiply the likelihood functions together. If study A reports that \(H_{0.75}\) was favored over \(H_{0.5}\) with a relative likelihood of \(3.8 : 1\), and study B reports that \(H_{0.75}\) was favored over \(H_{0.5}\) at \(5 : 1\), then the combined likelihood functions of both studies favor \(H_{0.75}\) over \(H_{0.5}\) at \((3.8 \cdot 5) : 1 = 19 : 1.\)

Want to combine a hundred studies on the same subject? Multiply a hundred functions together. Done. No parameter tuning, no degrees of freedom through which bias can be introduced; just multiply.
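
As a sketch of what that looks like in practice (the grid of biases and the second study’s data below are hypothetical, chosen only to make the multiplication concrete):

```python
# Combining studies: multiply the reported likelihood functions pointwise
# over a shared hypothesis class (here, a grid of coin biases).
def coin_likelihood(heads, tails):
    grid = [i / 100 for i in range(1, 100)]             # biases 0.01 .. 0.99
    return {b: b**heads * (1 - b)**tails for b in grid}

study_a = coin_likelihood(5, 1)    # the HHHHHT data from the running example
study_b = coin_likelihood(7, 2)    # a second, purely hypothetical data set
combined = {b: study_a[b] * study_b[b] for b in study_a}

# Relative likelihoods multiply, just like the 3.8 * 5 = 19 example above:
print(study_a[0.75] / study_a[0.5])      # ~3.8, as in the text
print(study_b[0.75] / study_b[0.5])      # ~4.3 for this hypothetical data set
print(combined[0.75] / combined[0.5])    # ~16.2: exactly the product of the two
```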

7. Likelihood functions make it obvious when something has gone wrong. If, when you multiply all the likelihood functions together, all hypotheses have extraordinarily low likelihoods, then something has gone wrong. Either a mistake has been made somewhere, or fraud has been committed, or the true hypothesis wasn’t in the hypothesis class you’re considering.

The actual hypothesis that explains all the data will have decently high likelihood across all the data. If none of the hypotheses fit that description, then either you aren’t considering the right hypothesis yet, or some of the studies went wrong. (Try looking for one study whose likelihood function is very different from all the others, and investigate that one.)

Likelihood functions won’t do your science for you (you still have to generate good hypotheses, and be honest in your data reporting), but they do make it obvious when something went wrong. Specifically, each hypothesis can tell you how low its likelihood is expected to be on the data, and if every hypothesis has a likelihood far lower than expected, then something’s fishy.
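
One way to make that check concrete is a sketch like the following, which is our own illustration rather than anything from this page: for \(n\) independent flips, \(H_b\) expects its log-likelihood to be about \(n\,(b \ln b + (1-b)\ln(1-b)),\) with a computable spread, so an actual log-likelihood many standard deviations below that is a red flag.

```python
import math

def actual_loglik(heads, tails, b):
    """Log-likelihood that H_b assigns to the observed flips."""
    return heads * math.log(b) + tails * math.log(1 - b)

def expected_loglik(n, b):
    """Log-likelihood H_b expects to get on n flips, if H_b were true."""
    return n * (b * math.log(b) + (1 - b) * math.log(1 - b))

def loglik_sd(n, b):
    """Standard deviation of that log-likelihood under H_b."""
    mean = b * math.log(b) + (1 - b) * math.log(1 - b)
    second_moment = b * math.log(b) ** 2 + (1 - b) * math.log(1 - b) ** 2
    return math.sqrt(n * (second_moment - mean**2))

# Hypothetical pooled data: 100 heads in 200 flips, checked against H_0.8.
n, heads = 200, 100
z = (actual_loglik(heads, n - heads, 0.8) - expected_loglik(n, 0.8)) / loglik_sd(n, 0.8)
print(z)   # ~ -10.6: H_0.8's likelihood is far lower than H_0.8 expects
```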


A scientific community using likelihood functions would produce scientific research that’s easier to use. If everyone’s reporting likelihood functions, then all you personally need to do in order to figure out what to believe is take your own personal (subjective) prior probabilities and multiply them by all the likelihood functions in order to get your own personal (subjective) posterior probabilities.

For example, let’s say you personally think the coin is probably fair, with \(10 : 1\) odds of being fair as opposed to 75% biased in favor of heads. Now let’s say that study A reports a likelihood function which favors \(H_{0.75}\) over \(H_{0.5}\) with a likelihood ratio of \(3.8 : 1\), and study B reports a \(5 : 1\) likelihood ratio in the same direction. Multiplying all of these together (prior odds of \(1 : 10\) for \(H_{0.75}\), times \(3.8 : 1\), times \(5 : 1\)), your personal posterior beliefs should be \(19 : 10\) in favor of \(H_{0.75}\) over \(H_{0.5}\). This is simply Bayes’ rule. Reporting likelihoods instead of p-values lets science remain objective, while allowing everyone to find their own personal posterior probabilities via a simple application of Bayes’ theorem.
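
In code, the whole update is a single multiplication of odds (the numbers are just the ones from this example):

```python
# Bayes' rule in odds form: prior odds times the likelihood ratios from
# each study give the posterior odds.
prior_odds = 1 / 10           # 10:1 for "fair" means 1:10 for H_0.75 over H_0.5
study_ratios = [3.8, 5.0]     # likelihood ratios favoring H_0.75, from studies A and B

posterior_odds = prior_odds
for ratio in study_ratios:
    posterior_odds *= ratio

print(posterior_odds)         # 1.9, i.e. 19:10 in favor of H_0.75 over H_0.5
```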

Why should we think this would work?

This may all sound too good to be true. Can one simple change really solve that many problems in modern science?

First of all, you can be assured that reporting likelihoods instead of p-values would not “solve” all the problems above, and it would surely not solve all problems with modern experimental science. Open access to raw data, preregistration of studies, a culture that rewards replication, and many other ideas are also crucial ingredients of a scientific community that zeroes in on truth.

However, reporting likelihoods would help solve lots of different problems in modern experimental science. This may come as a surprise. Aren’t likelihood functions just one more statistical technique, just another tool for the toolbox? Why should we think that one single tool can solve that many problems?

The reason lies in probability theory. According to the axioms of probability theory, there is only one good way to account for evidence when updating your beliefs, and that way is via likelihood functions. Any other method is subject to inconsistencies and pathologies, as per the coherence theorems of probability theory.

If you’re manipulating equations like \(2 + 2 = 4,\) and you’re using methods that may or may not let you throw in an extra 3 on the right-hand side (depending on the arithmetician’s state of mind), then it’s no surprise that you’ll occasionally get yourself into trouble and deduce that \(2 + 2 = 7.\) The laws of arithmetic show that there is only one correct set of tools for manipulating equations if you want to avoid inconsistency.

Similarly, the laws of probability theory show that there is only one correct set of tools for manipulating uncertainty if you want to avoid inconsistency. According to those rules, the right way to represent evidence is through likelihood functions.

These laws (and a solid understanding of them) are younger than the experimental science community, and the statistical tools of that community predate a modern understanding of probability theory. Thus, it makes a lot of sense that the existing literature uses different tools. However, now that humanity does possess a solid understanding of probability theory, it should come as no surprise that many diverse pathologies in statistics can be cleaned up by switching to a policy of reporting likelihoods instead of p-values.

What are the drawbacks?

The main drawback is inertia. Experimental science today reports p-values almost entirely across the board. Modern statistical toolsets have built-in support for p-values (and other related statistical tools) but very little support for reporting likelihood functions. Experimental scientists are trained mainly in frequentist statistics, and thus most are much more familiar with p-value-type tools than with likelihood-function-type tools. Making the switch would be painful.

Setting aside the switching costs, though, making the switch could well be a strict improvement over modern techniques, and would help solve some of the biggest problems facing science today.

See also the Likelihoods not p-values FAQ and Likelihood functions, p-values, and the replication crisis.
