Likelihood functions, p-values, and the replication crisis

Or: Switch­ing From Re­port­ing p-val­ues to Re­port­ing Like­li­hood Func­tions Might Help Fix the Repli­ca­tion Cri­sis: A per­sonal view by Eliezer Yud­kowsky.


  • This di­alogue was writ­ten by a Bayesian. The voice of the Scien­tist in the di­alogue be­low may fail to pass the Ide­olog­i­cal Tur­ing Test for fre­quen­tism, that is, it may fail to do jus­tice to fre­quen­tist ar­gu­ments and coun­ter­ar­gu­ments.

  • It does not seem so­ciolog­i­cally re­al­is­tic, to the au­thor, that the pro­posal be­low could be adopted by the sci­en­tific com­mu­nity at large within the next 10 years. It seemed worth writ­ing down nev­er­the­less.

If you don’t already know Bayes’ rule, check out Arbital’s Guide to Bayes’ Rule first.

Moder­a­tor: Hello, ev­ery­one. I’m here to­day with the Scien­tist, a work­ing ex­per­i­men­tal­ist in… chem­i­cal psy­chol­ogy, or some­thing; with the Bayesian, who’s go­ing to ex­plain why, on their view, we can make progress on the repli­ca­tion crisis by re­plac­ing p-val­ues with some sort of Bayesian thing--

Un­der­grad: Sorry, can you re­peat that?

Moder­a­tor: And fi­nally, the Con­fused Un­der­grad on my right. Bayesian, would you care to start by ex­plain­ing the rough idea?

Bayesian: Well, the rough idea is some­thing like this. Sup­pose we flip a pos­si­bly-un­fair coin six times, and ob­serve the se­quence HHHHHT. Should we be sus­pi­cious that the coin is bi­ased?

Scien­tist: No.

Bayesian: This isn’t a literal coin. Let’s say we present a series of experimental subjects with two cookies on a plate, one with green sprinkles and one with red sprinkles. The first five people took cookies with green sprinkles and the sixth person took a cookie with red sprinkles. (Note: they all saw separate plates, on a table in the waiting room marked “please take only one”, so nobody knew what was being tested, and none of them saw the others’ cookie choices.) Do we think most people prefer green-sprinkled cookies or do we think it was just random?

Un­der­grad: I think I would be sus­pi­cious that maybe peo­ple liked green sprin­kles bet­ter. Or at least that the sort of peo­ple who go to the uni­ver­sity and get used as test sub­jects like green sprin­kles bet­ter. Yes, even if I just saw that hap­pen in the first six cases. But I’m guess­ing I’m go­ing to get dumped-on for that.

Scien­tist: I think I would be gen­uinely not-yet-sus­pi­cious. There’s just too much stuff that looks good af­ter N=6 that doesn’t pan out with N=60.

Bayesian: I’d at least strongly sus­pect that peo­ple in the test pop­u­la­tion don’t mostly pre­fer red sprin­kles. But the rea­son I in­tro­duced this ex­am­ple is as an over­sim­plified ex­am­ple of how cur­rent sci­en­tific statis­tics calcu­late so-called “p-val­ues”, and what a Bayesian sees as the cen­tral prob­lem with that.

Scien­tist: And we can’t use a more re­al­is­tic ex­am­ple with 30 sub­jects?

Bayesian: That would not be nice to the Con­fused Un­der­grad.

Un­der­grad: Se­conded.

Bayesian: So: Heads, heads, heads, heads, heads, tails. I ask: is this “statis­ti­cally sig­nifi­cant”, as cur­rent con­ven­tional statis­ti­ci­ans would have the phrase?

Scien­tist: I re­ply: no. On the null hy­poth­e­sis that the coin is fair, or analo­gously that peo­ple have no strong prefer­ence be­tween green and red sprin­kles, we should ex­pect to see a re­sult as ex­treme as this in 14 out of 64 cases.

Undergrad: Okay, just to make sure I have this straight: That’s because we’re considering results like HHHTHH or TTTTTT to be equally or more extreme, and there are 14 total possible cases like that, and we flipped the coin 6 times which gives us \(2^6 = 64\) possible results. 14/64 ≈ 22%, which is not less than 5%, so this is not statistically significant at the \(p<.05\) level.
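The Undergrad's count can be checked by brute force. The following is a minimal Python sketch, assuming "as extreme" means a head count at least as far from 3 as the observed one; the function name `fixed_n_p_value` is made up for illustration.

```python
from itertools import product

def fixed_n_p_value(observed: str) -> float:
    """Two-sided p-value for a fair coin when the number of flips was fixed in advance."""
    n = len(observed)
    # How far the observed head count sits from the fair-coin expectation n/2.
    observed_dev = abs(observed.count("H") - n / 2)
    # Enumerate all 2^n equally likely sequences of n flips.
    outcomes = ["".join(flips) for flips in product("HT", repeat=n)]
    # Keep the sequences whose head count is at least as far from n/2.
    as_extreme = [o for o in outcomes if abs(o.count("H") - n / 2) >= observed_dev]
    return len(as_extreme) / len(outcomes)

print(fixed_n_p_value("HHHHHT"))  # 0.21875, i.e. 14/64
```

Enumerating all sequences only works for small N, but it makes the "class of equally or more extreme results" explicit.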

Scien­tist: That’s right. How­ever, I’d also like to ob­serve as a mat­ter of prac­tice that even if you get HHHHHH on your first six flips, I don’t ad­vise stop­ping there and send­ing in a pa­per where you claim that the coin is bi­ased to­wards heads.

Bayesian: Because if you can decide to stop flipping the coin at a time of your choice, then we have to ask “How likely is it that you can find some place to stop flipping the coin where it looks like there’s a significant number of heads?” That’s a whole different kettle of fish according to the p-value concept.

Scien­tist: I was just think­ing that N=6 is not a good num­ber of ex­per­i­men­tal sub­jects when it comes to test­ing cookie prefer­ences. But yes, that too.

Un­der­grad: Uh… why does it make a differ­ence if I can de­cide when to stop flip­ping the coin?

Bayesian: What an ex­cel­lent ques­tion.

Scientist: Well, this is where the concept of p-values is less straightforward than plugging the numbers into a statistics package and believing whatever the stats package says. If you previously decided to flip exactly six coins, and then stop, regardless of what results you got, then you would get a result as extreme as “HHHHHH” or “TTTTTT” 2/64 of the time, or 3.1%, so p<0.05. However, suppose that instead you are a bad fraudulent scientist, or maybe just an ignorant undergraduate who doesn’t realize what they’re doing wrong. Instead of picking the number of flips in advance, you keep flipping the coin until the statistics software package tells you that you got a result that would have been statistically significant if, contrary to the actual facts of the case, you’d decided in advance to flip the coin exactly that many times. But you didn’t decide in advance to flip the coin that many times. You decided it after checking the actual results. Which you are not allowed to do.

Un­der­grad: I’ve heard that be­fore, but I’m not sure I un­der­stand on a gut level why it’s bad for me to de­cide when I’ve col­lected enough data.

Scien­tist: What we’re try­ing to do here is set up a test that the null hy­poth­e­sis can­not pass—to make sure that where there’s no fire, there’s un­likely to be smoke. We want a com­plete ex­per­i­men­tal pro­cess which is un­likely to gen­er­ate a “statis­ti­cally sig­nifi­cant” dis­cov­ery if there’s no real phe­nomenon be­ing in­ves­ti­gated. If you flip the coin ex­actly six times, if you de­cide that in ad­vance, then you are less than 5% likely to get a re­sult as ex­treme as “six heads” or “six tails”. If you flip the coin re­peat­edly, and check re­peat­edly for a re­sult that would have had p<0.05 if you’d de­cided in ad­vance to flip the coin ex­actly that many times, your chance of get­ting a nod from the statis­tics pack­age is much greater than 5%. You’re car­ry­ing out a pro­cess which is much more than 5% likely to yell “smoke” in the ab­sence of fire.
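To put a number on "much greater than 5%", here is a rough Python sketch that computes, for a fair coin, the exact chance that a check-after-every-flip procedure ever sees p<0.05. It assumes a two-sided binomial test and a cap on the total number of flips; both function names are hypothetical.

```python
from math import comb

def two_sided_p(heads: int, n: int) -> float:
    """Fair-coin p-value for `heads` out of `n`, pretending n was fixed in advance."""
    dev = abs(heads - n / 2)
    return sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dev) / 2**n

def false_alarm_rate(max_flips: int, alpha: float = 0.05) -> float:
    """Chance a fair coin triggers 'p < alpha' at some point within max_flips flips."""
    alive = {0: 1.0}   # heads so far -> probability of being here without having stopped
    stopped = 0.0
    for n in range(1, max_flips + 1):
        # Flip once more: each surviving state splits evenly into heads / tails.
        nxt = {}
        for heads, prob in alive.items():
            for h in (heads, heads + 1):
                nxt[h] = nxt.get(h, 0.0) + prob / 2
        # Stop (and "publish") wherever the naive p-value dips below alpha.
        alive = {}
        for heads, prob in nxt.items():
            if two_sided_p(heads, n) < alpha:
                stopped += prob
            else:
                alive[heads] = prob
    return stopped

print(false_alarm_rate(6))    # 0.03125: only HHHHHH or TTTTTT can trigger by flip six
print(false_alarm_rate(100))  # well above 0.05
```

With the cap at six flips the procedure matches the honest fixed-N test; raising the cap only adds more chances to yell "smoke".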

Bayesian: The way I like to explain the problem is like this: Suppose you flip a coin six times and get HHHHHT. If you, in the secret depths of your mind, in your inner heart that no other mortal knows, decided to flip the coin exactly six times and then stop, this result is not statistically significant; p=0.22. If you decided in the secret depths of your heart, that you were going to keep flipping the coin until it came up tails, then the result HHHHHT is statistically significant with p=0.03, because the chance of a fair coin requiring you to wait six or more flips to get one tail is 1/32.

Un­der­grad: What.

Scientist: It’s a bit of a parody, obviously—nobody would really decide to flip a coin until they got one tail, then stop—but the Bayesian is technically correct about how the rules for p-values work. What we’re asking is how rare our outcome is, within the class of outcomes we could have gotten. The person who keeps flipping the coin until they get one tail has possible outcomes {T, HT, HHT, HHHT, HHHHT, HHHHHT, HHHHHHT…} and so on. The class of outcomes where you get to the sixth round or later is the class of outcomes {HHHHHT, HHHHHHT…} and so on, a set of outcomes which collectively have a probability of 1/64 + 1/128 + 1/256 + … = 1/32. Whereas if you flip the coin exactly six times, your class of possible outcomes is {TTTTTT, TTTTTH, TTTTHT, TTTTHH…}, a set of 64 possibilities within which the outcome HHHHHT is something we could lump in with {HHHHTH, HHHTHH, THTTTT…} and so on. So although it’s counterintuitive, if we really had decided to run the first experiment, HHHHHT would be a statistically significant result that a fair coin would be unlikely to give us. And if we’d decided to run the second experiment, HHHHHT would not be a statistically significant result because fair coins sometimes do something like that.
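The two p-values for the same six flips can be put side by side. This is a small Python sketch under the same conventions as the dialogue (two-sided for the fixed-N design, "six or more flips" for the flip-until-tails design); the function names are illustrative only.

```python
from math import comb

def p_fixed_flips(heads: int, n: int) -> float:
    """p-value if the plan was 'flip exactly n times': head counts at least as lopsided."""
    dev = abs(heads - n / 2)
    return sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dev) / 2**n

def p_flip_until_tail(flips_needed: int) -> float:
    """p-value if the plan was 'flip until the first tail': needing this many flips or more.

    Needing at least m flips means the first m - 1 flips all came up heads.
    """
    return 0.5 ** (flips_needed - 1)

# Same data, HHHHHT, under the two stopping rules.
print(p_fixed_flips(5, 6))    # 0.21875 (14/64), not significant
print(p_flip_until_tail(6))   # 0.03125 (1/32), "significant"

# The closed form agrees with the series 1/64 + 1/128 + 1/256 + ...
series = sum(0.5**n for n in range(6, 200))
print(abs(series - 1 / 32) < 1e-12)  # True
```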

Bayesian: And it doesn’t bother you that the mean­ing of your ex­per­i­ment de­pends on your pri­vate state of mind?

Scien­tist: It’s an honor sys­tem. Just like the pro­cess doesn’t work if you lie about which coin­flips you ac­tu­ally saw, the pro­cess doesn’t work—is not a fair test in which non-fires are un­likely to gen­er­ate smoke—if you lie about which ex­per­i­ment you performed. You must hon­estly say the rules you fol­lowed to flip the coin the way you did. Un­for­tu­nately, since what you were think­ing about the ex­per­i­ment is less clearly visi­ble than the ac­tual coin­flips, peo­ple are much more likely to twist how they se­lected their num­ber of ex­per­i­men­tal sub­jects, or how they se­lected which tests to run on the data, than they are to tell a blatant lie about what the data said. That’s p-hack­ing. There are, un­for­tu­nately, much sub­tler and less ob­vi­ous ways of gen­er­at­ing smoke with­out fire than claiming post facto to have fol­lowed the rule of flip­ping the coin un­til it came up tails. It’s a se­ri­ous prob­lem, and un­der­pins some large part of the great repli­ca­tion crisis, no­body’s sure ex­actly how much.

Un­der­grad: That… sorta makes sense, maybe? I’m guess­ing this is one of those cases where I have to work through a lot of ex­am­ple prob­lems be­fore it be­comes ob­vi­ous.

Bayesian: No.

Un­der­grad: No?

Bayesian: You were right the first time, Un­der­grad. If what the ex­per­i­men­tal­ist is think­ing has no causal im­pact on the coin, then the ex­per­i­men­tal­ist’s thoughts can­not pos­si­bly make any differ­ence to what the coin is say­ing to us about Na­ture. My dear Un­der­grad, you are be­ing taught weird, ad-hoc, over­com­pli­cated rules that aren’t even in­ter­nally con­sis­tent—rules that the­o­ret­i­cally out­put differ­ent wrong an­swers de­pend­ing on your pri­vate state of mind! And that is a prob­lem that runs far deeper into the repli­ca­tion crisis than peo­ple mis­re­port­ing their in­ner thoughts.

Scien­tist: A bold claim to say the least. But don’t be coy; tell us what you think we should be do­ing in­stead?

Bayesian: I analyze as follows: The exact result HHHHHT has a 1/64 or roughly 1.6% probability of being produced by a fair coin flipped six times. To simplify matters, suppose we for some reason were already pondering the hypothesis that the coin was biased to produce 5/6 heads—again, this is an unrealistic example we can de-simplify later. Then this hypothetical biased coin would have a \((5/6)^5 \cdot (1/6)^1 \approx 6.7\%\) probability of producing HHHHHT. So between our two hypotheses “The coin is fair” and “The coin is biased to produce 5/6ths heads”, our exact experimental result is 4.3 times more likely to be observed in the second case. HHHHHT is also 0.01% likely to be produced if the coin is biased to produce 1/6th heads and 5/6ths tails, so we’ve already seen some quite strong evidence against that particular hypothesis, if anyone was considering it. The exact experimental outcome HHHHHT is 146 times as likely to be produced by a fair coin as by a coin biased to produce 1/6th heads. (Note: recall the Bayesian’s earlier thought that, after seeing five subjects select green cookies followed by one subject selecting a red cookie, we’d already picked up strong evidence against the proposition: “Subjects in this experimental population lopsidedly prefer red cookies over green cookies.”)
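The Bayesian's arithmetic can be reproduced in a few lines of Python. This is only a sketch, assuming independent flips with a constant heads probability; the helper name `likelihood` is chosen for illustration.

```python
def likelihood(sequence: str, p_heads: float) -> float:
    """Probability of seeing exactly this sequence of H/T flips, given the coin's bias."""
    result = 1.0
    for flip in sequence:
        result *= p_heads if flip == "H" else 1.0 - p_heads
    return result

data = "HHHHHT"
fair = likelihood(data, 1 / 2)          # 1/64, about 1.6%
mostly_heads = likelihood(data, 5 / 6)  # about 6.7%
mostly_tails = likelihood(data, 1 / 6)  # about 0.01%

print(mostly_heads / fair)   # roughly 4.3
print(fair / mostly_tails)   # roughly 146
```

Nothing in `likelihood` asks why the flipping stopped; it only looks at the flips that were actually observed.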

Un­der­grad: Well, I think I can fol­low the calcu­la­tion you just did, but I’m not clear on what that calcu­la­tion means.

Bayesian: I’ll try to ex­plain the mean­ing shortly, but first, note this. That calcu­la­tion we just did has no de­pen­dency what­so­ever on why you flipped the coin six times. You could have stopped af­ter six be­cause you thought you’d seen enough coin­flips. You could have done an ex­tra coin­flip af­ter the first five be­cause Na­m­a­giri Tha­yar spoke to you in a dream. The coin doesn’t care. The coin isn’t af­fected. It re­mains true that the ex­act re­sult HHHHHT is 23% as likely to be pro­duced by a fair coin as by a bi­ased coin that comes up heads 5/​6ths of the time.

Scien­tist: I agree that this is an in­ter­est­ing prop­erty of the calcu­la­tion that you just did. And then what?

Bayesian: You report the results in a journal. Preferably including the raw data so that others can calculate the likelihoods for any other hypotheses of interest. Say somebody else suddenly becomes interested in the hypothesis “The coin is biased to produce 9/10ths heads.” Seeing HHHHHT is 5.9% likely in that case, so 88% as likely as if the coin is biased to produce 5/6ths heads (making the data 6.7% likely), or 3.7 times as likely as if the coin is fair (making the data 1.6% likely). But you shouldn’t have to think of all possible hypotheses in advance. Just report the raw data so that others can calculate whatever likelihoods they need. Since this calculation deals with the exact results we got, rather than summarizing it into some class or set of supposedly similar results, it puts a greater emphasis on reporting your exact experimental data to others.

Scien­tist: Re­port­ing raw data seems an im­por­tant leg of a good strat­egy for fight­ing the repli­ca­tion crisis, on this we agree. I nonethe­less don’t un­der­stand what ex­per­i­men­tal­ists are sup­posed to do with this “X is Q times as likely as Y” stuff.

Un­der­grad: Se­conded.

Bayesian: Okay, so… this isn’t triv­ial to de­scribe with­out mak­ing you run through a whole in­tro­duc­tion to Bayes’ rule--

Un­der­grad: Great. Just what I need, an­other weird com­pli­cated 4-credit course on statis­tics.

Bayesian: It’s liter­ally a 1-hour read if you’re good at math. It just isn’t liter­ally triv­ial to un­der­stand with no prior in­tro­duc­tion. Well, even with no in­tro­duc­tion what­so­ever, I may be able to fake it with state­ments that will sound like they might be rea­son­able—and the rea­son­ing is valid, it just might not be ob­vi­ous that it is. Any­way. It is a the­o­rem of prob­a­bil­ity that the fol­low­ing is valid rea­son­ing:

(the Bayesian takes a breath)

Bayesian: Sup­pose that Pro­fes­sor Plum and Miss Scar­let are two sus­pects in a mur­der. Based on their prior crim­i­nal con­vic­tions, we start out think­ing that Pro­fes­sor Plum is twice as likely to have com­mit­ted the mur­der as Miss Scar­let. We then dis­cover that the vic­tim was poi­soned. We think that, as­sum­ing he com­mit­ted a mur­der, Pro­fes­sor Plum would be 10% likely to use poi­son; as­sum­ing Miss Scar­let com­mit­ted a mur­der, she would be 60% likely to use poi­son. So Pro­fes­sor Plum is around one-sixth as likely to use poi­son as Miss Scar­let. Then af­ter ob­serv­ing the vic­tim was poi­soned, we should up­date to think Plum is around one-third as likely to have com­mit­ted the mur­der as Scar­let: \(2 \times \frac{1}{6} = \frac{1}{3}.\)

Un­der­grad: Just to check, what do you mean by say­ing that “Pro­fes­sor Plum is one-third as likely to have com­mit­ted the mur­der as Miss Scar­let”?

Bayesian: I mean that if these two people are our only suspects, we think Professor Plum has a 1/4 probability of having committed the murder and Miss Scarlet has a 3/4 probability of being guilty. So Professor Plum’s probability of guilt is one-third that of Miss Scarlet’s.
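The Plum and Scarlet update is just a multiplication of odds. Here is a minimal Python sketch using exact fractions, assuming (as the Bayesian does) that these two are the only suspects; the variable names are invented for this example.

```python
from fractions import Fraction

# Prior odds, Plum : Scarlet, based on their criminal records.
prior_odds = Fraction(2, 1)

# How likely each suspect would be to use poison, if they were the murderer.
p_poison_given_plum = Fraction(10, 100)
p_poison_given_scarlet = Fraction(60, 100)
likelihood_ratio = p_poison_given_plum / p_poison_given_scarlet   # 1/6

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
posterior_odds = prior_odds * likelihood_ratio                    # 1/3

# With only two suspects, odds of x : 1 correspond to probability x / (1 + x).
p_plum = posterior_odds / (1 + posterior_odds)
p_scarlet = 1 - p_plum

print(posterior_odds)     # 1/3
print(p_plum, p_scarlet)  # 1/4 3/4
```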

Scientist: Now I’d like to know what you mean by saying that Professor Plum had a 1/4 probability of committing the murder. Either Plum committed the murder or he didn’t; we can’t observe the murder be committed multiple times and Professor Plum doing it 1/4th of the time.

Bayesian: Are we go­ing there? I guess we’re go­ing there. My good Scien­tist, I mean that if you offered me ei­ther side of an even-money bet on whether Plum com­mit­ted the mur­der, I’d bet that he didn’t do it. But if you offered me a gam­ble that costs \$1 if Pro­fes­sor Plum is in­no­cent and pays out \$5 if he’s guilty, I’d cheer­fully ac­cept that gam­ble. We only ran the 2012 US Pres­i­den­tial Elec­tion one time, but that doesn’t mean that on Novem­ber 7th you should’ve re­fused a \$10 bet that paid out \$1000 if Obama won. In gen­eral when pre­dic­tion mar­kets and large liquid bet­ting pools put 60% bet­ting odds on some­body win­ning the pres­i­dency, that out­come tends to hap­pen 60% of the time; they are well-cal­ibrated for prob­a­bil­ities in that range. If they were sys­tem­at­i­cally un­cal­ibrated—if in gen­eral things hap­pened 80% of the time when pre­dic­tion mar­kets said 60%--you could use that fact to pump money out of pre­dic­tion mar­kets. And your pump­ing out that money would ad­just the pre­dic­tion-mar­ket prices un­til they were well-cal­ibrated. If things to which pre­dic­tion mar­kets as­sign 70% prob­a­bil­ity hap­pen around 7 times out of 10, why in­sist for rea­sons of ide­olog­i­cal pu­rity that the prob­a­bil­ity state­ment is mean­ingless?

Un­der­grad: I ad­mit, that sounds to me like it makes sense, if it’s not just the illu­sion of un­der­stand­ing due to my failing to grasp some far deeper de­bate.

Bayesian: There is in­deed a deeper de­bate, but what the deeper de­bate works out to is that your illu­sion of un­der­stand­ing is pretty much ac­cu­rate as illu­sions go.

Scientist: Yeah, I’m going to want to come back to that issue later. What if there are two agents who both seem ‘well-calibrated’ as you put it, but one agent says 60% and the other agent says 70%?


Bayesian: If I flip a coin and don’t look at it, so that I don’t know yet if it came up heads or tails, then my ignorance about the coin isn’t a fact about the coin, it’s a fact about me. Ignorance exists in the mind, not in the environment. A blank map does not correspond to a blank territory. If you peek at the coin and I don’t, it’s perfectly reasonable for the two of us to occupy different states of uncertainty about the coin. And given that I’m not absolutely certain, I can and should quantify my uncertainty using probabilities. There’s like 300 different theorems showing that I’ll get into trouble if my state of subjective uncertainty cannot be viewed as a coherent probability distribution. You kinda pick up on the trend after just the fourth time you see a slightly different clever proof that any violation of the standard probability axioms will cause the graves to vomit forth their dead, the seas to turn red as blood, and the skies to rain down dominated strategies and combinations of bets that produce certain losses--

Scien­tist: Sorry, I shouldn’t have said any­thing just then. Let’s come back to this later? I’d rather hear first what you think we should do with the like­li­hoods once we have them.

Bayesian: On the laws of prob­a­bil­ity the­ory, those like­li­hood func­tions are the ev­i­dence. They are the ob­jects that send our prior odds of 2 : 1 for Plum vs. Scar­let to pos­te­rior odds of 1 : 3 for Plum vs. Scar­let. For any two hy­pothe­ses you care to name, if you tell me the rel­a­tive like­li­hoods of the data given those hy­pothe­ses, I know how to up­date my be­liefs. If you change your be­liefs in any other fash­ion, the skies shall rain dom­i­nated strate­gies etcetera. Bayes’ the­o­rem: It’s not just a statis­ti­cal method, it’s the LAW.

Un­der­grad: I’m sorry, I still don’t un­der­stand. Let’s say we do an ex­per­i­ment and find data that’s 6 times as likely if Pro­fes­sor Plum kil­led Mr. Boddy than if Miss Scar­let did. Do we ar­rest Pro­fes­sor Plum?

Scientist: My guess is that you’re supposed to make up a ‘prior probability’ that sounds vaguely plausible, like ‘a priori, I think Professor Plum is 20% likely to have killed Mr. Boddy’. Then you combine that with your 6 : 1 likelihood ratio to get 3 : 2 posterior odds that Plum killed Mr. Boddy. So your paper reports that you’ve established a 60% posterior probability that Professor Plum is guilty, and the legal process does whatever it does with that.

Bayesian: No. Dear God, no! Is that re­ally what peo­ple think Bayesi­anism is?

Scien­tist: It’s not? I did always hear that the strength of Bayesi­anism is that it gives us pos­te­rior prob­a­bil­ities, which p-val­ues don’t ac­tu­ally do, and that the big weak­ness was that it got there by mak­ing up prior prob­a­bil­ities more or less out of thin air, which means that no­body will ever be able to agree on what the pos­te­ri­ors are.

Bayesian: Science pa­pers should re­port like­li­hoods. Or rather, they should re­port the raw data and helpfully calcu­late some like­li­hood func­tions on it. Not pos­te­ri­ors, never pos­te­ri­ors.

Un­der­grad: What’s a pos­te­rior? I’m trust­ing both of you to avoid the ob­vi­ous joke here.

Bayesian: A pos­te­rior prob­a­bil­ity is when you say, “There’s a 60% prob­a­bil­ity that Pro­fes­sor Plum kil­led Mr. Boddy.” Which, as the Scien­tist points out, is some­thing you never get from p-val­ues. It’s also some­thing that, in my own opinion, should never be re­ported in an ex­per­i­men­tal pa­per, be­cause it’s not the re­sult of an ex­per­i­ment.

Un­der­grad: But… okay, Scien­tist, I’m ask­ing you this one. Sup­pose we see data statis­ti­cally sig­nifi­cant at p<0.01, some­thing we’re less than 1% prob­a­ble to see if the null hy­poth­e­sis “Pro­fes­sor Plum didn’t kill Mr. Boddy” is true. Do we ar­rest him?

Scien­tist: First of all, that’s not a re­al­is­tic null hy­poth­e­sis. A null hy­poth­e­sis is some­thing like “No­body kil­led Mr. Boddy” or “All sus­pects are equally guilty.” But even if what you just said made sense, even if we could re­ject Pro­fes­sor Plum’s in­no­cence at p<0.01, you still can’t say any­thing like, “It is 99% prob­a­ble that Pro­fes­sor Plum is guilty.” That is just not what p-val­ues mean.

Un­der­grad: Then what do p-val­ues mean?

Scien­tist: They mean we saw data in­side a pre­s­e­lected class of pos­si­ble re­sults, which class, as a whole, is less than 1% likely to be pro­duced if the null hy­poth­e­sis is true. That’s all that it means. You can’t go from there to “Pro­fes­sor Plum is 99% likely to be guilty,” for rea­sons the Bayesian is prob­a­bly bet­ter at ex­plain­ing. You can’t go from there to any­where that’s some­place else. What you heard is what there is.

Un­der­grad: Now I’m dou­bly con­fused. I don’t un­der­stand what we’re sup­posed to do with p-val­ues or like­li­hood ra­tios. What kind of ex­per­i­ment does it take to throw Pro­fes­sor Plum in prison?

Scien­tist: Well, re­al­is­ti­cally, if you get a cou­ple more ex­per­i­ments at differ­ent labs also say­ing p<0.01, Pro­fes­sor Plum is prob­a­bly guilty.

Bayesian: And the ‘repli­ca­tion crisis’ is that it turns out he’s not guilty.

Scien­tist: Pretty much.

Un­der­grad: That’s not ex­actly re­as­sur­ing.

Scien­tist: Ex­per­i­men­tal sci­ence is not for the weak of nerve.

Un­der­grad: So… Bayesian, are you about to say similarly that once you get an ex­treme enough like­li­hood ra­tio, say, any­thing over 100 to 1, or some­thing, you can prob­a­bly take some­thing as true?

Bayesian: No, it’s a bit more complicated than that. Let’s say I flip a coin 20 times and get HHHTHHHTHTHTTHHHTHHH. Well, the hypothesis “This coin was rigged to produce exactly HHHTHHHTHTHTTHHHTHHH” has a likelihood advantage of roughly a million-to-one over the hypothesis “this is a fair coin”. On any reasonable system, unless you wrote down that single hypothesis in advance and handed it to me in an envelope and didn’t write down any other hypotheses or hand out any other envelopes, we’d say the hypothesis “This coin was rigged to produce HHHTHHHTHTHTTHHHTHHH” has a complexity penalty of at least \(2^{20} : 1\) because it takes 20 bits just to describe what the coin is rigged to do. In other words, the penalty to prior plausibility more than cancels out a million-to-one likelihood advantage. And that’s just the start of the issues. But, with that said, I think there’s a pretty good chance you could do okay out of just winging it, once you understood in an intuitive and common-sense way how Bayes’ rule worked. If there’s evidence pointing to Professor Plum with a likelihood of 1,000 : 1 over any other suspects you can think of, in a field that probably only contained six suspects to begin with, you can figure that the prior odds against Plum weren’t much more extreme than 10 : 1 and that you can legitimately be at least 99% sure now.
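Both calculations in the Bayesian's reply are odds multiplications, sketched below in Python with exact fractions. The sketch assumes a 20-flip sequence, a rigged-coin hypothesis that predicts the observed sequence with certainty, and a prior penalty of exactly \(2^{20} : 1\) (the dialogue says "at least"); all variable names are made up for illustration.

```python
from fractions import Fraction

n_flips = 20

# A fair coin gives any particular 20-flip sequence probability 2^-20;
# the "rigged to produce exactly this sequence" hypothesis gives it probability 1.
likelihood_ratio = Fraction(1) / Fraction(1, 2**n_flips)
print(likelihood_ratio)                 # 1048576, roughly a million to one

# Complexity penalty: 20 bits to specify the rigged sequence, so prior odds of 1 : 2^20.
prior_odds = Fraction(1, 2**n_flips)
print(prior_odds * likelihood_ratio)    # 1, the likelihood advantage is cancelled out

# Plum: prior odds no worse than 1 : 10 against, evidence favouring him 1,000 : 1.
plum_posterior_odds = Fraction(1, 10) * Fraction(1000)
plum_probability = plum_posterior_odds / (1 + plum_posterior_odds)
print(plum_probability, float(plum_probability) > 0.99)   # 100/101 True
```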

Scien­tist: But you say that this is not some­thing you should re­port in the pa­per.

Bayesian: That’s right. How can I put this… one of the great com­mand­ments of Bayesi­anism is that you ought to take into ac­count all the rele­vant ev­i­dence you have available; you can’t ex­clude some ev­i­dence from your calcu­la­tion just be­cause you don’t like it. Be­sides sound­ing like com­mon sense, this is also a rule you have to fol­low to pre­vent your calcu­la­tions from com­ing up with para­dox­i­cal re­sults, and there are var­i­ous par­tic­u­lar prob­lems where there’s a seem­ingly crazy con­clu­sion and the an­swer is, “Well, you also need to con­di­tion on blah blah blah.” My point be­ing, how do I, as an ex­per­i­men­tal­ist, know what all the rele­vant ev­i­dence is? Who am I to calcu­late a pos­te­rior? Maybe some­body else pub­lished a pa­per that in­cludes more ev­i­dence, with more like­li­hoods to be taken into ac­count, and I haven’t heard about it yet, but some­body else has. I just con­tribute my own data and its like­li­hood func­tion—that’s all! It’s not my place to claim that I’ve col­lected all the rele­vant ev­i­dence and can now calcu­late pos­te­rior odds, and even if I could, some­body else could pub­lish an­other pa­per a week later and the pos­te­rior odds would change again.

Un­der­grad: So, roughly your an­swer is, “An ex­per­i­men­tal­ist just pub­lishes the pa­per and calcu­lates the like­li­hood thin­gies for that dataset, and then some­body out­side has to figure out what to do with the like­li­hood thin­gies.”

Bayesian: Some­body out­side has to set up pri­ors—prob­a­bly just rea­son­able-sound­ing ig­no­rance pri­ors, max­i­mum en­tropy stuff or com­plex­ity-based penalties or what­ever—then try to make sure they’ve col­lected all the ev­i­dence, ap­ply the like­li­hood func­tions, check to see if the re­sult makes sense, etcetera. And then they might have to re­vise that es­ti­mate if some­body pub­lishes a new pa­per a week later--

Un­der­grad: That sounds awful.

Bayesian: It would be awful if we were do­ing meta-analy­ses of p-val­ues. Bayesian up­dates are a hell of a lot sim­pler! Like, you liter­ally just mul­ti­ply the old pos­te­rior by the new like­li­hood func­tion and nor­mal­ize. If ex­per­i­ment 1 has a like­li­hood ra­tio of 4 for hy­poth­e­sis A over hy­poth­e­sis B, and ex­per­i­ment 2 has a like­li­hood ra­tio of 9 for A over B, the two ex­per­i­ments to­gether have a like­li­hood ra­tio of 36.
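Combining experiments this way is a one-liner. The Python sketch below assumes the experiments are independent given each hypothesis, which is what licenses multiplying their likelihood ratios; the function names are hypothetical.

```python
from math import prod

def combined_likelihood_ratio(ratios):
    """Likelihood ratio for A over B from several experiments, assumed independent."""
    return prod(ratios)

def update_odds(prior_odds, ratios):
    """Apply each experiment's likelihood ratio to the running odds, one paper at a time."""
    odds = prior_odds
    for ratio in ratios:
        odds *= ratio
    return odds

print(combined_likelihood_ratio([4, 9]))   # 36

# The order in which the papers arrive makes no difference to the final odds.
print(update_odds(1, [4, 9]) == update_odds(1, [9, 4]))   # True
```

A new paper a week later means one more multiplication, not a fresh meta-analysis.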

Un­der­grad: And you can’t do that with p-val­ues, I mean, a p-value of 0.05 and a p-value of 0.01 don’t mul­ti­ply out to p<0.0005--

Scien­tist: No.

Bayesian: I should like to take this mo­ment to call at­ten­tion to my su­pe­rior smile.

Scien­tist: I am still wor­ried about the part of this pro­cess where some­body gets to make up prior prob­a­bil­ities.

Bayesian: Look, that just cor­re­sponds to the part of the pro­cess where some­body de­cides that, hav­ing seen 1 dis­cov­ery and 2 repli­ca­tions with p<0.01, they are will­ing to buy the new pill or what­ever.

Scien­tist: So your re­ply there is, “It’s sub­jec­tive, but so is what you do when you make de­ci­sions based on hav­ing seen some ex­per­i­ments with p-val­ues.” Hm. I was go­ing to say some­thing like, “If I set up a rule that says I want data with p<0.001, there’s no fur­ther ob­jec­tivity be­yond that,” but I guess you’d say that my ask­ing for p<0.001 in­stead of p<0.0001 cor­re­sponds to my pul­ling a prior out of my butt?

Bayesian: Well, ex­cept that ask­ing for a par­tic­u­lar p-value is not ac­tu­ally as good as pul­ling a prior out of your butt. One of the first of those 300 the­o­rems prov­ing doom if you vi­o­late prob­a­bil­ity ax­ioms, was Abra­ham Wald’s “com­plete class the­o­rem” in 1947. Wald set out to in­ves­ti­gate all the pos­si­ble ad­mis­si­ble strate­gies, where a strat­egy is a way of act­ing differ­ently based on what­ever ob­ser­va­tions you make, and differ­ent ac­tions get differ­ent pay­offs in differ­ent pos­si­ble wor­lds. Wald termed an ad­mis­si­ble strat­egy a strat­egy which was not dom­i­nated by some other strat­egy across all pos­si­ble mea­sures you could put on the pos­si­ble wor­lds. Wald found that the class of ad­mis­si­ble strate­gies was sim­ply the class that cor­re­sponded to hav­ing a prob­a­bil­ity dis­tri­bu­tion, do­ing Bayesian up­dat­ing on ob­ser­va­tions, and max­i­miz­ing ex­pected pay­off.

Un­der­grad: Can you per­haps re­peat that in slightly smaller words?

Bayesian: If you want to do differ­ent things de­pend­ing on what you ob­serve, and get differ­ent pay­offs de­pend­ing on what the real facts are, ei­ther your strat­egy can be seen as hav­ing a prob­a­bil­ity dis­tri­bu­tion and do­ing Bayesian up­dat­ing, or there’s an­other strat­egy that does bet­ter given at least some pos­si­ble mea­sures on the wor­lds and never does worse. So if you say any­thing as wild as “I’m wait­ing to see data with p<0.0001 to ban smok­ing,” in prin­ci­ple there must be some way of say­ing some­thing along the lines of, “I have a prior prob­a­bil­ity of 0.01% that smok­ing causes can­cer, let’s see those like­li­hood func­tions” which does at least as well or bet­ter no mat­ter what any­one else would say as their own prior prob­a­bil­ities over the back­ground facts.

Scien­tist: Huh.

Bayesian: In­deed. And that was when the Bayesian rev­olu­tion very slowly started; it’s sort of been gath­er­ing steam since then. It’s worth not­ing that Wald only proved his the­o­rem a cou­ple of decades af­ter “p-val­ues” were in­vented, which, from my per­spec­tive, helps ex­plain how sci­ence got wedged into its pe­cu­liar cur­rent sys­tem.

Scien­tist: So you think we should burn all p-val­ues and switch to re­port­ing all like­li­hood ra­tios all the time.

Bayesian: In a word… yes.

Scien­tist: I’m sus­pi­cious, in gen­eral, of one-size-fits-all solu­tions like that. I sus­pect you—I hope this is not too hor­ribly offen­sive—I sus­pect you of ideal­ism. In my ex­pe­rience, differ­ent peo­ple need differ­ent tools from the toolbox at differ­ent times, and it’s not wise to throw out all the tools in your toolbox ex­cept one.

Bayesian: Well, let’s be clear where I am and amn’t ideal­is­tic, then. Like­li­hood func­tions can­not solve the en­tire repli­ca­tion crisis. There are as­pects of this that can’t be solved by us­ing bet­ter statis­tics. Open ac­cess jour­nals aren’t some­thing that hinge on p-val­ues ver­sus like­li­hood func­tions. The bro­ken sys­tem of peer com­men­tary, presently in the form of peer re­view, is not some­thing like­li­hood func­tions can solve.

Scien­tist: But like­li­hood func­tions will solve ev­ery­thing else?

Bayesian: No, but they’ll at least help on a sur­pris­ing amount. Let me count the ways:

Bayesian: One. Like­li­hood func­tions don’t dis­t­in­guish be­tween ‘statis­ti­cally sig­nifi­cant’ re­sults and ‘failed’ repli­ca­tions. There are no ‘pos­i­tive’ and ‘nega­tive’ re­sults. What used to be called the null hy­poth­e­sis is now just an­other hy­poth­e­sis, with noth­ing spe­cial about it. If you flip a coin and get HHTHTTTHHH, you have not “failed to re­ject the null hy­poth­e­sis with p<0.05” or “failed to repli­cate”. You have found ex­per­i­men­tal data that fa­vors the fair-coin hy­poth­e­sis over the 5/​6ths-heads hy­poth­e­sis with a like­li­hood ra­tio of 3.78 : 1. This may help to fight the file-drawer effect—not en­tirely, be­cause there is a mind­set in the jour­nals of ‘pos­i­tive’ re­sults and bi­ased coins be­ing more ex­cit­ing than fair coins, and we need to tackle that mind­set di­rectly. But the p-value sys­tem en­courages that bad mind­set. That’s why p-hack­ing even ex­ists. So switch­ing to like­li­hoods won’t fix ev­ery­thing right away, but it sure will help.

Bayesian: Two. The sys­tem of like­li­hoods makes the im­por­tance of raw data clearer and will en­courage a sys­tem of pub­lish­ing the raw data when­ever pos­si­ble, be­cause Bayesian analy­ses cen­ter around the prob­a­bil­ity of the ex­act data we saw, given our var­i­ous hy­pothe­ses. The p-value sys­tem en­courages you to think in terms of the data as be­ing just one mem­ber of a class of ‘equally ex­treme’ re­sults. There’s a mind­set here of peo­ple hoard­ing their pre­cious data, which is not purely a mat­ter of statis­tics. But the p-value sys­tem en­courages that mind­set by en­courag­ing peo­ple to think of their re­sult as part of some undis­t­in­guished class of ‘equally or more ex­treme’ val­ues or what­ever, and that its mean­ing is en­tirely con­tained in it be­ing a ‘pos­i­tive’ re­sult that is ‘statis­ti­cally sig­nifi­cant’.

Bayesian: Three. The prob­a­bil­ity-the­o­retic view, or Bayesian view, makes it clear that differ­ent effect sizes are differ­ent hy­pothe­ses, as they must be, be­cause they as­sign differ­ent prob­a­bil­ities to the ex­act ob­ser­va­tions we see. If one ex­per­i­ment finds a ‘statis­ti­cally sig­nifi­cant’ effect size of 0.4 and an­other ex­per­i­ment finds a ‘statis­ti­cally sig­nifi­cant’ effect size of 0.1 on what­ever scale we’re work­ing in, the ex­per­i­ment has not repli­cated and we do not yet know what real state of af­fairs is gen­er­at­ing our ob­ser­va­tions. This di­rectly fights and negates the ‘amaz­ing shrink­ing effect size’ phe­nomenon that is part of the repli­ca­tion crisis.

Bayesian: Four. Work­ing in like­li­hood func­tions makes it far eas­ier to ag­gre­gate our data. It even helps to point up when our data is be­ing pro­duced un­der in­con­sis­tent con­di­tions or when the true hy­poth­e­sis is not be­ing con­sid­ered, be­cause in this case we will find like­li­hood func­tions that end up be­ing nearly zero ev­ery­where, or where the best available hy­poth­e­sis is achiev­ing a much lower like­li­hood on the com­bined data than that hy­poth­e­sis ex­pects it­self to achieve. It is a stric­ter con­cept of repli­ca­tion that helps quickly point up when differ­ent ex­per­i­ments are be­ing performed un­der differ­ent con­di­tions and yield­ing re­sults in­com­pat­i­ble with a sin­gle con­sis­tent phe­nomenon.

Bayesian: Five. Like­li­hood func­tions are ob­jec­tive facts about the data which do not de­pend on your state of mind. You can­not de­ceive some­body by re­port­ing like­li­hood func­tions un­less you are liter­ally ly­ing about the data or omit­ting data. There’s no equiv­a­lent of ‘p-hack­ing’.
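
(An illustrative aside, not part of the original dialogue: the 3.78 : 1 figure in point One and the ease of aggregation claimed in point Four can be checked with exact rational arithmetic. This is a minimal sketch. The helper names `likelihood` and `likelihood_ratio` and the second flip sequence are assumptions made up for illustration; only the first sequence and the two hypotheses come from the text.)

```python
from fractions import Fraction


def likelihood(flips: str, p_heads: Fraction) -> Fraction:
    """Probability of this exact flip sequence if each flip lands heads with probability p_heads."""
    result = Fraction(1)
    for flip in flips:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result


def likelihood_ratio(flips: str, p_a: Fraction, p_b: Fraction) -> Fraction:
    """How strongly the exact data favors hypothesis A over hypothesis B."""
    return likelihood(flips, p_a) / likelihood(flips, p_b)


fair = Fraction(1, 2)
biased = Fraction(5, 6)

# The sequence from point One: 6 heads and 4 tails.
first = "HHTHTTTHHH"
ratio_first = likelihood_ratio(first, fair, biased)
print(float(ratio_first))  # about 3.78, in favor of the fair coin

# A hypothetical second experiment, invented purely for illustration.
second = "THHTHT"
ratio_second = likelihood_ratio(second, fair, biased)

# Aggregation: likelihood ratios from independent experiments multiply,
# giving the same answer as analyzing the pooled raw data directly.
combined = likelihood_ratio(first + second, fair, biased)
assert combined == ratio_first * ratio_second
```

Because the calculation uses only the exact sequence observed, anyone holding the raw data can redo it for whatever pair of hypotheses they care about.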

Scien­tist: Okay, that last claim in par­tic­u­lar strikes me as very sus­pi­cious. What hap­pens if I want to per­suade you that a coin is bi­ased to­wards heads, so I keep flip­ping it un­til I ran­domly get to a point where there’s a pre­dom­i­nance of heads, and then choose to stop?

Bayesian: “Shrug,” I say. You can’t mis­lead me by tel­ling me what a real coin ac­tu­ally did.

Scien­tist: I’m ask­ing you what hap­pens if I keep flip­ping the coin, check­ing the like­li­hood each time, un­til I see that the cur­rent statis­tics fa­vor my pet the­ory, and then I stop.

Bayesian: As a pure ideal­ist se­duced by the se­duc­tively pure ideal­ism of prob­a­bil­ity the­ory, I say that so long as you pre­sent me with the true data, all I can and should do is up­date in the way Bayes’ the­o­rem says I should.

Scien­tist: Se­ri­ously.

Bayesian: I am se­ri­ous.

Scien­tist: So it doesn’t bother you if I keep check­ing the like­li­hood ra­tio and con­tin­u­ing to flip the coin un­til I can con­vince you of any­thing I want.

Bayesian: Go ahead and try it.

Scien­tist: What I’m ac­tu­ally go­ing to do is write a Python pro­gram which simu­lates flip­ping a fair coin up to 300 times, and I’m go­ing to see how many times I can get a 20:1 like­li­hood ra­tio falsely in­di­cat­ing that the coin is bi­ased to come up heads 55% of the time… why are you smil­ing?

Bayesian: I wrote pretty much the same Python pro­gram when I was first con­vert­ing to Bayesi­anism and find­ing out about like­li­hood ra­tios and feel­ing skep­ti­cal about the sys­tem maybe be­ing abus­able in some way, and then a friend of mine found out about like­li­hood ra­tios and he wrote es­sen­tially the same pro­gram, also in Python. And lo, he found that false ev­i­dence of 20:1 for the coin be­ing 55% bi­ased was found at least once, some­where along the way… 1.4% of the time. If you asked for more ex­treme like­li­hood ra­tios, the chances of find­ing them dropped off even faster.
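
(An illustrative aside: the original programs are not reproduced in the dialogue, so what follows is only a sketch of what such a simulation might look like. The function names, the trial count, and the seed are assumptions. The estimate it prints will vary with the seed and the number of trials.)

```python
import math
import random


def ever_reaches_ratio(rng, max_flips=300, p_biased=0.55, threshold=20.0):
    """Flip a fair coin up to max_flips times, checking after every flip.

    Returns True if the likelihood ratio (biased : fair) reaches the
    threshold at any point along the way, i.e. if optional stopping
    could have produced misleading evidence of that strength.
    """
    log_threshold = math.log(threshold)
    log_heads = math.log(p_biased / 0.5)        # evidence from one head
    log_tails = math.log((1 - p_biased) / 0.5)  # evidence from one tail
    log_ratio = 0.0
    for _ in range(max_flips):
        log_ratio += log_heads if rng.random() < 0.5 else log_tails
        if log_ratio >= log_threshold:
            return True
    return False


def fraction_fooled(trials=10000, seed=0):
    """Estimate how often a fair coin ever shows 20:1 evidence of a 55% bias."""
    rng = random.Random(seed)
    hits = sum(ever_reaches_ratio(rng) for _ in range(trials))
    return hits / trials


print(fraction_fooled())
```

However the stopping rule is tuned, the fraction cannot exceed 1 in 20 in the long run: under the fair-coin hypothesis the likelihood ratio has an expected value of 1 at every step, so the chance that it ever climbs to 20 is at most 1/20.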

Scien­tist: Okay, that’s not bad by the p-value way of look­ing at things. But what if there’s some more clever way of bi­as­ing it?

Bayesian: When I was… I must have been five years old, or maybe even younger, and first learn­ing about ad­di­tion, one of the ear­liest child­hood mem­o­ries I have at all, is of adding 3 to 5 by count­ing 5, 6, 7 and be­liev­ing that you could get differ­ent re­sults from adding num­bers de­pend­ing on ex­actly how you did it. Which is cute, yes, and also in­di­cates a kind of ex­plor­ing, of prob­ing, that was no doubt im­por­tant in my start­ing to un­der­stand ad­di­tion. But you still look back and find it hu­morous, be­cause now you’re a big grownup and you know you can’t do that. My writ­ing Python pro­grams to try to find clever ways to fool my­self by re­peat­edly check­ing the like­li­hood ra­tios was the same, in the sense that af­ter I ma­tured a bit more as a Bayesian, I re­al­ized that the feat I’d writ­ten those pro­grams to try to do was ob­vi­ously im­pos­si­ble. In the same way that try­ing to find a clever way to break apart the 3 into 2 and 1, and try­ing to add them sep­a­rately to 5, and then try­ing to add the 1 and then the 2, in hopes you can get to 7 or 9 in­stead of 8, is just never ever go­ing to work. The re­sults in ar­ith­metic are the­o­rems, and it doesn’t mat­ter in what clever or­der you switch things up, you are never go­ing to get any­thing ex­cept 8 when you carry out an op­er­a­tion that is val­idly equiv­a­lent to adding 3 plus 5. The the­o­rems of prob­a­bil­ity the­ory are also the­o­rems. If your Python pro­gram had ac­tu­ally worked, it would have pro­duced a con­tra­dic­tion in prob­a­bil­ity the­ory, and thereby a con­tra­dic­tion in Peano Arith­metic, which pro­vides a model for prob­a­bil­ity the­ory car­ried out us­ing ra­tio­nal num­bers. The thing you tried to do is ex­actly as hard as adding 3 and 5 us­ing the stan­dard ar­ith­metic ax­ioms and get­ting 7.

Un­der­grad: Uh, why?

Scien­tist: Se­conded.

Bayesian: Be­cause let­ting \(e\) de­note the ev­i­dence, \(H\) de­note the hy­poth­e­sis, \(\neg\) de­note the nega­tion of a propo­si­tion, \(\mathbb P(X)\) de­note the prob­a­bil­ity of propo­si­tion \(X\), and \(\mathbb P(X \mid Y)\) de­note the con­di­tional prob­a­bil­ity of \(X\) as­sum­ing \(Y\) to be true, it is a the­o­rem of prob­a­bil­ity that

$$\mathbb P(H) = \left(\mathbb P(H \mid e) \cdot \mathbb P(e)\right) + \left(\mathbb P(H\mid \neg e) \cdot \mathbb P(\neg e)\right).$$
Therefore likelihood functions can never be p-hacked by any possible clever setup without you outright lying, because you can’t have any possible procedure that a Bayesian knows in advance will make them update in a predictable net direction. For every update that we expect to be produced by a piece of evidence \(e,\) there’s an equal and opposite update, weighted by probability, that we expect from seeing \(\neg e.\)
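
(An illustrative aside: one way to see why the identity above rules out a predictable net update is to subtract \(\mathbb P(H) = \mathbb P(H) \cdot \mathbb P(e) + \mathbb P(H) \cdot \mathbb P(\neg e)\) from it, which holds because \(\mathbb P(e) + \mathbb P(\neg e) = 1.\) This is a sketch of the rearrangement, not a step taken from the dialogue itself.)

$$0 = \mathbb P(e) \cdot \left(\mathbb P(H \mid e) - \mathbb P(H)\right) + \mathbb P(\neg e) \cdot \left(\mathbb P(H \mid \neg e) - \mathbb P(H)\right).$$

(The two bracketed terms are the shifts in belief after seeing \(e\) and after seeing \(\neg e.\) Weighted by how likely each observation is, they must cancel exactly, so if one outcome would push the probability of \(H\) up, the other must push it down.)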

Un­der­grad: What?

Scien­tist: Se­conded.

Bayesian: Look… let me try to zoom out a bit, and yes, look at the ongoing replication crisis. The Scientist proclaimed suspicion of grand new sweeping ideals. Okay, but the shift to likelihood functions is the kind of thing that ought to be able to solve a lot of problems at once. Let’s say… I’m trying to think of a good analogy here. Let’s say there’s a corporation which is having a big crisis because their accountants are using floating-point numbers, only there’s three different parts of the firm using three different representations of floating-point numbers to do numerically unstable calculations. Somebody starts with 1.0 and adds 0.0001 a thousand times and then subtracts 0.1 and gets 0.999999999999989. Or you can go to the other side of the building and use a different floating-point representation and get a different result. And nobody has any conception that there’s anything wrong with this. Suppose there are BIG errors in the floating-point numbers, they’re using the floating-point-number equivalent of crude ideograms and Roman numerals, you can get big pragmatic differences depending on what representation you use. And naturally, people ‘division-hack’ to get whatever financial results they want. So all the spreadsheets are failing to replicate, and people are starting to worry the ‘cognitive priming’ subdivision has actually been bankrupt for 20 years. And then one day you come in and you say, “Hey. Everyone. Suppose that instead of these competing floating-point representations, we use my new representation instead. It can’t be fooled the same way, which will solve a surprising number of your problems.”

(The Bayesian now imi­tates the Scien­tist’s voice:) “I’m sus­pi­cious,” says the Se­nior Au­di­tor. “I sus­pect you of ideal­ism. In my ex­pe­rience, peo­ple need to use differ­ent float­ing-point rep­re­sen­ta­tions for differ­ent fi­nan­cial prob­lems, and it’s good to have a lot of differ­ent nu­mer­i­cal rep­re­sen­ta­tions of frac­tions in your toolbox.”

Bayesian: “Well,” I re­ply, “it may sound ideal­is­tic, but in point of fact, this thing I’m about to show you is the rep­re­sen­ta­tion of frac­tions, in which you can­not get differ­ent re­sults de­pend­ing on which way you add things or what or­der you do the op­er­a­tions in. It might be slightly more com­pu­ta­tion­ally ex­pen­sive, but it is now no longer 1920 like when you first adopted the old sys­tem, and se­ri­ously, you can af­ford the com­put­ing power in a very large frac­tion of cases where you’re only work­ing with 30,000,000 bank ac­counts or some triv­ial num­ber like that. Yes, if you want to do some­thing like take square roots, it gets a bit more com­pli­cated, but very few of you are ac­tu­ally tak­ing the square root of bank ac­count bal­ances. For the vast ma­jor­ity of things you are try­ing to do on a day-to-day ba­sis, this sys­tem is un­hack­able with­out ac­tu­ally mis­re­port­ing the num­bers.” And then I show them how to rep­re­sent ar­bi­trary-mag­ni­tude finite in­te­gers pre­cisely, and how to rep­re­sent a ra­tio­nal num­ber as the ra­tio of two in­te­gers. What we would, nowa­days, con­sider to be a di­rect, pre­cise, com­pu­ta­tional rep­re­sen­ta­tion of the sys­tem of ra­tio­nal num­bers. The one unique ax­io­m­a­tized math­e­mat­i­cal sys­tem of ra­tio­nal num­bers, to which float­ing-point num­bers are a mere ap­prox­i­ma­tion. And if you’re just work­ing with 30,000,000 bank ac­count bal­ances and your crude ap­prox­i­mate float­ing-point num­bers are in prac­tice blow­ing up and failing to repli­cate and be­ing ex­ploited by peo­ple to get what­ever re­sults they want, and it is no longer 1920 and you can af­ford real com­put­ers now, it is an ob­vi­ous step to have all the ac­coun­tants switch to us­ing the ra­tio­nal num­bers. Just as Bayesian up­dates are the ra­tio­nal up­dates, in the unique math­e­mat­i­cal ax­io­m­a­tized sys­tem of prob­a­bil­ities. And that’s why you can’t p-hack them.
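
(An illustrative aside: the accounting analogy can be tried directly. This sketch repeats the calculation described earlier, starting at 1.0, adding 0.0001 a thousand times, and subtracting 0.1, once with ordinary binary floating point and once with exact ratios of integers. The variable names are assumptions, and the floating-point result depends on the platform using IEEE 754 doubles, as standard Python builds do.)

```python
from fractions import Fraction

# Binary floating point: 0.0001 has no exact binary representation,
# so every addition rounds and the rounding errors accumulate.
float_total = 1.0
for _ in range(1000):
    float_total += 0.0001
float_total -= 0.1

# Exact rationals: each value is stored as a ratio of two integers,
# so nothing is ever rounded.
exact_total = Fraction(1)
for _ in range(1000):
    exact_total += Fraction(1, 10000)
exact_total -= Fraction(1, 10)

print(float_total)  # close to, but not exactly, 1.0
print(exact_total)  # 1

# With exact rationals the order of operations cannot change the answer.
terms = [Fraction(1, 10000)] * 1000 + [Fraction(-1, 10), Fraction(1)]
assert sum(terms) == sum(reversed(terms)) == 1
```

The exact version is slower and uses more memory per number, which is the cost the Bayesian concedes, but no reordering of the additions can change its result.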

Scien­tist: That is a rather… au­da­cious claim. And I con­fess, even if ev­ery­thing you said about the math were true, I would still be skep­ti­cal of the prag­mat­ics. The cur­rent sys­tem of sci­en­tific statis­tics is some­thing that’s grown up over time and ma­tured. Has this bright Bayesian way ac­tu­ally been tried?

Bayesian: It hasn’t been tried very much in sci­ence. In ma­chine learn­ing, where, uh, not to put too fine a point on it, we can ac­tu­ally see where the mod­els are break­ing be­cause our AI doesn’t work, it’s been ten years since I’ve read a pa­per that tries to go at things from a fre­quen­tist an­gle and I can’t ever re­call see­ing an AI al­gorithm calcu­late the p-value of any­thing. If you’re do­ing any­thing prin­ci­pled at all from a prob­a­bil­ity-the­o­retic stance, it’s prob­a­bly Bayesian, and pretty much never fre­quen­tist. If you’re clas­sify­ing data us­ing n-hot en­cod­ings, your loss func­tion is the cross-en­tropy, not… I’m not even sure what the equiv­a­lent of try­ing to use 1920s-style p-val­ues in AI would be like. I would frankly at­tribute this to peo­ple in ma­chine learn­ing hav­ing to use statis­ti­cal tools that visi­bly suc­ceed or fail; rather than need­ing to get pub­lished by go­ing through a par­tic­u­lar tra­di­tional rit­ual of p-value re­port­ing, and failure to repli­cate not be­ing all that bad for your ca­reer.

Scien­tist: So you’re ac­tu­ally more of a com­puter sci­ence guy than an ex­per­i­men­tal­ist your­self. Why does this not sur­prise me? It’s not im­pos­si­ble that some bet­ter statis­ti­cal sys­tem than p-val­ues could ex­ist, but I’d ad­vise you to re­spect the wis­dom of ex­pe­rience. The fact that we know what p-hack­ing is, and are cur­rently fight­ing it, is be­cause we’ve had time to see where the edges of the sys­tem have prob­lems, and we’re figur­ing out how to fight those prob­lems. This shiny new sys­tem will also have prob­lems; you just have no idea what they’ll be. Per­haps they’ll be worse.

Bayesian: It’s not im­pos­si­ble that the ac­coun­tants would figure out new shenani­gans to pull with ra­tio­nal num­bers, es­pe­cially if they were do­ing some things com­pu­ta­tion­ally in­ten­sive enough that they could no longer af­ford to use the ra­tio­nal num­bers and had to use some ap­prox­i­ma­tion in­stead. But I stand by my state­ment that if your fi­nan­cial spread­sheets are right now blow­ing up in a gi­ant repli­ca­tion crisis in ways that seem clearly linked to us­ing p-val­ues, and the p-val­ues are, frankly, bloody ad-hoc in­con­sis­tent non­sense, an ob­vi­ous first step is to try us­ing the ra­tio­nal up­dates in­stead. Although, it’s pos­si­ble we don’t dis­agree too much in prac­tice. I’d also prag­mat­i­cally fa­vor try­ing to roll things out one step at a time, like, maybe just switch over the psy­cholog­i­cal sci­ences and see how that goes.

Scien­tist: How would you per­suade them to do that?

Bayesian: I have no god­damn idea. Hon­estly, I’m not ex­pect­ing any­one to ac­tu­ally fix any­thing. Peo­ple will just go on us­ing p-val­ues un­til the end of the world, prob­a­bly. It’s just one more Nice Thing We Can’t Have. But there’s a chance the idea will catch on. I was pleas­antly sur­prised when open ac­cess caught on as quickly as it did. I was pleas­antly sur­prised when peo­ple, like, ac­tu­ally no­ticed the repli­ca­tion crisis and it be­came a big is­sue that peo­ple cared about. Maybe I’ll be pleas­antly sur­prised again and peo­ple will ac­tu­ally take up the cru­sade to bury the p-value at a cross­roads at mid­night and put a stake through its heart. If so, I’ll have done my part by mak­ing an un­der­stand­ing of Bayes’ rule and like­li­hoods more ac­cessible to ev­ery­one.

Scien­tist: Or it could turn out that peo­ple don’t like like­li­hoods, and that part of the wis­dom of ex­pe­rience is the les­son that p-val­ues are a kind of thing that ex­per­i­men­tal­ists ac­tu­ally find use­ful and easy to use.

Bayesian: If the ex­pe­rience of learn­ing tra­di­tional statis­tics trau­ma­tized them so heav­ily that the thought of need­ing to learn a new sys­tem sends them scream­ing into the night, then yes, change might need to be im­posed from out­side. I’m hop­ing though that the Un­der­grad will read a short, cheer­ful in­tro­duc­tion to Bayesian prob­a­bil­ity, com­pare this with his om­i­nous heavy tra­di­tional statis­tics text­book, and come back go­ing “Please let me use like­li­hoods please let me use like­li­hoods oh god please let me use like­li­hoods.”

Undergrad: I guess I’ll look into it and see?

Bayesian: Weigh your de­ci­sion care­fully, Un­der­grad. Some changes in sci­ence de­pend upon stu­dents grow­ing up fa­mil­iar with mul­ti­ple ideas and choos­ing the right one. Max Planck said so in a fa­mous apho­rism, so it must be true. Ergo, the en­tire abil­ity of sci­ence to dis­t­in­guish good and bad ideas within that class must rest upon the cog­ni­tive ca­pac­i­ties of un­der­grads.

Scien­tist: Oh, now that is just--

Moder­a­tor: And we’re out of time. Thanks for join­ing us, ev­ery­one!