Likelihood functions, p-values, and the replication crisis

Or: Switching From Reporting p-values to Reporting Likelihood Functions Might Help Fix the Replication Crisis: A personal view by Eliezer Yudkowsky.

Disclaimers:

  • This dialogue was written by a Bayesian. The voice of the Scientist in the dialogue below may fail to pass the Ideological Turing Test for frequentism, that is, it may fail to do justice to frequentist arguments and counterarguments.

  • It does not seem sociologically realistic, to the author, that the proposal below could be adopted by the scientific community at large within the next 10 years. It seemed worth writing down nevertheless.

If you don’t already know Bayes’ rule, check out Arbital’s Guide to Bayes’ Rule before reading further.


Moderator: Hello, everyone. I’m here today with the Scientist, a working experimentalist in… chemical psychology, or something; with the Bayesian, who’s going to explain why, on their view, we can make progress on the replication crisis by replacing p-values with some sort of Bayesian thing--

Undergrad: Sorry, can you repeat that?

Moderator: And finally, the Confused Undergrad on my right. Bayesian, would you care to start by explaining the rough idea?

Bayesian: Well, the rough idea is something like this. Suppose we flip a possibly-unfair coin six times, and observe the sequence HHHHHT. Should we be suspicious that the coin is biased?

Scientist: No.

Bayesian: This isn’t a literal coin. Let’s say we present a series of experimental subjects with two cookies on a plate, one with green sprinkles and one with red sprinkles. The first five people took cookies with green sprinkles and the sixth person took a cookie with red sprinkles. (Note: they all saw separate plates, on a table in the waiting room marked “please take only one,” so nobody knew what was being tested, and none of them saw the others’ cookie choices.) Do we think most people prefer green-sprinkled cookies, or do we think it was just random?

Undergrad: I think I would be suspicious that maybe people liked green sprinkles better. Or at least that the sort of people who go to the university and get used as test subjects like green sprinkles better. Yes, even if I just saw that happen in the first six cases. But I’m guessing I’m going to get dumped-on for that.

Scientist: I think I would be genuinely not-yet-suspicious. There’s just too much stuff that looks good after N=6 that doesn’t pan out with N=60.

Bayesian: I’d at least strongly suspect that people in the test population don’t mostly prefer red sprinkles. But the reason I introduced this example is as an oversimplified example of how current scientific statistics calculate so-called “p-values”, and what a Bayesian sees as the central problem with that.

Scientist: And we can’t use a more realistic example with 30 subjects?

Bayesian: That would not be nice to the Confused Undergrad.

Undergrad: Seconded.

Bayesian: So: Heads, heads, heads, heads, heads, tails. I ask: is this “statistically significant”, as current conventional statisticians would have the phrase?

Scientist: I reply: no. On the null hypothesis that the coin is fair, or analogously that people have no strong preference between green and red sprinkles, we should expect to see a result as extreme as this in 14 out of 64 cases.

Undergrad: Okay, just to make sure I have this straight: That’s because we’re considering results like HHHTHH or TTTTTT to be equally or more extreme, and there are 14 total possible cases like that, and we flipped the coin 6 times, which gives us \(2^6 = 64\) possible results. 14/64 ≈ 22%, which is not less than 5%, so this is not statistically significant at the \(p<.05\) level.
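A minimal Python sketch of the Undergrad's count, assuming "as extreme" means five or more of either face:

```python
from itertools import product

# Enumerate all 2^6 = 64 sequences of six flips and count those at least
# as extreme as HHHHHT, i.e. with five or more of either face.
outcomes = list(product("HT", repeat=6))
extreme = [s for s in outcomes if s.count("H") >= 5 or s.count("T") >= 5]

print(len(extreme), len(outcomes))    # 14 64
print(len(extreme) / len(outcomes))   # 0.21875, i.e. ~22%: not below 0.05
```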

Scientist: That’s right. However, I’d also like to observe as a matter of practice that even if you get HHHHHH on your first six flips, I don’t advise stopping there and sending in a paper where you claim that the coin is biased towards heads.

Bayesian: Because if you can decide to stop flipping the coin at a time of your choice, then we have to ask, “How likely is it that you can find some place to stop flipping the coin where it looks like there’s a significant excess of heads?” That’s a whole different kettle of fish according to the p-value concept.

Scientist: I was just thinking that N=6 is not a good number of experimental subjects when it comes to testing cookie preferences. But yes, that too.

Undergrad: Uh… why does it make a difference if I can decide when to stop flipping the coin?

Bayesian: What an excellent question.

Scientist: Well, this is where the concept of p-values is less straightforward than plugging the numbers into a statistics package and believing whatever the stats package says. If you previously decided to flip exactly six coins, and then stop, regardless of what results you got, then you would get a result as extreme as “HHHHHH” or “TTTTTT” 2/64 of the time, or 3.1%, so p<0.05. However, suppose that instead you are a bad fraudulent scientist, or maybe just an ignorant undergraduate who doesn’t realize what they’re doing wrong. Instead of picking the number of flips in advance, you keep flipping the coin until the statistics software package tells you that you got a result that would have been statistically significant if, contrary to the actual facts of the case, you’d decided in advance to flip the coin exactly that many times. But you didn’t decide in advance to flip the coin that many times. You decided it after checking the actual results. Which you are not allowed to do.

Undergrad: I’ve heard that before, but I’m not sure I understand on a gut level why it’s bad for me to decide when I’ve collected enough data.

Scientist: What we’re trying to do here is set up a test that the null hypothesis cannot pass—to make sure that where there’s no fire, there’s unlikely to be smoke. We want a complete experimental process which is unlikely to generate a “statistically significant” discovery if there’s no real phenomenon being investigated. If you flip the coin exactly six times, if you decide that in advance, then you are less than 5% likely to get a result as extreme as “six heads” or “six tails”. If you flip the coin repeatedly, and check repeatedly for a result that would have had p<0.05 if you’d decided in advance to flip the coin exactly that many times, your chance of getting a nod from the statistics package is much greater than 5%. You’re carrying out a process which is much more than 5% likely to yell “smoke” in the absence of fire.
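A rough simulation of the Scientist's point; the specific choices here (peek after every flip from the tenth onward, give up after 60 flips) are illustrative assumptions, not part of the dialogue:

```python
import random
from math import comb

def two_sided_p(heads, n):
    """Fixed-n p-value: probability, under a fair coin, of a head-count
    at least as far from n/2 as the one observed."""
    dev = abs(heads - n / 2)
    return sum(comb(n, k) for k in range(n + 1)
               if abs(k - n / 2) >= dev) / 2 ** n

def yells_smoke(max_flips=60):
    """Flip a fair coin, peeking at the p-value after every flip from the
    tenth onward; return True if any peek shows p < 0.05."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += random.random() < 0.5
        if n >= 10 and two_sided_p(heads, n) < 0.05:
            return True
    return False

trials = 1000
print(sum(yells_smoke() for _ in range(trials)) / trials)  # well above 0.05
```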

Bayesian: The way I like to explain the problem is like this: Suppose you flip a coin six times and get HHHHHT. If you, in the secret depths of your mind, in your inner heart that no other mortal knows, decided to flip the coin exactly six times and then stop, this result is not statistically significant; p=0.22. If you decided in the secret depths of your heart that you were going to keep flipping the coin until it came up tails, then the result HHHHHT is statistically significant with p=0.03, because the chance of a fair coin requiring you to wait six or more flips to get one tail is 1/32.

Undergrad: What.

Scientist: It’s a bit of a parody, obviously—nobody would really decide to flip a coin until they got one tail, then stop—but the Bayesian is technically correct about how the rules for p-values work. What we’re asking is how rare our outcome is, within the class of outcomes we could have gotten. The person who keeps flipping the coin until they get one tail has possible outcomes {T, HT, HHT, HHHT, HHHHT, HHHHHT, HHHHHHT…} and so on. The class of outcomes where you get to the sixth round or later is the class of outcomes {HHHHHT, HHHHHHT…} and so on, a set of outcomes which collectively have a probability of 1/64 + 1/128 + 1/256 + … = 1/32. Whereas if you flip the coin exactly six times, your class of possible outcomes is {TTTTTT, TTTTTH, TTTTHT, TTTTHH…}, a set of 64 possibilities within which the outcome HHHHHT is something we could lump in with {HHHHTH, HHHTHH, THTTTT…} and so on. So although it’s counterintuitive, if we really had decided to run the first experiment, HHHHHT would be a statistically significant result that a fair coin would be unlikely to give us. And if we’d decided to run the second experiment, HHHHHT would not be a statistically significant result because fair coins sometimes do something like that.
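Both p-value computations for the same data, HHHHHT, as a quick sketch:

```python
from math import comb

# Design 1: flip exactly six times, decided in advance.  "As extreme as
# HHHHHT" means five-or-more heads or five-or-more tails.
p_fixed = sum(comb(6, k) for k in (0, 1, 5, 6)) / 2 ** 6
print(p_fixed)      # 0.21875: not significant

# Design 2: flip until the first tail.  "As extreme" means the first tail
# arrives on flip six or later, i.e. the first five flips are all heads.
p_stopping = (1 / 2) ** 5
print(p_stopping)   # 0.03125: "significant" at p < 0.05
```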

Bayesian: And it doesn’t bother you that the meaning of your experiment depends on your private state of mind?

Scientist: It’s an honor system. Just like the process doesn’t work if you lie about which coinflips you actually saw, the process doesn’t work—is not a fair test in which non-fires are unlikely to generate smoke—if you lie about which experiment you performed. You must honestly say the rules you followed to flip the coin the way you did. Unfortunately, since what you were thinking about the experiment is less clearly visible than the actual coinflips, people are much more likely to twist how they selected their number of experimental subjects, or how they selected which tests to run on the data, than they are to tell a blatant lie about what the data said. That’s p-hacking. There are, unfortunately, much subtler and less obvious ways of generating smoke without fire than claiming post facto to have followed the rule of flipping the coin until it came up tails. It’s a serious problem, and underpins some large part of the great replication crisis, nobody’s sure exactly how much.

Undergrad: That… sorta makes sense, maybe? I’m guessing this is one of those cases where I have to work through a lot of example problems before it becomes obvious.

Bayesian: No.

Undergrad: No?

Bayesian: You were right the first time, Undergrad. If what the experimentalist is thinking has no causal impact on the coin, then the experimentalist’s thoughts cannot possibly make any difference to what the coin is saying to us about Nature. My dear Undergrad, you are being taught weird, ad-hoc, overcomplicated rules that aren’t even internally consistent—rules that theoretically output different wrong answers depending on your private state of mind! And that is a problem that runs far deeper into the replication crisis than people misreporting their inner thoughts.

Scientist: A bold claim to say the least. But don’t be coy; tell us what you think we should be doing instead?

Bayesian: I analyze as follows: The exact result HHHHHT has a 1/64, or roughly 1.6%, probability of being produced by a fair coin flipped six times. To simplify matters, suppose we for some reason were already pondering the hypothesis that the coin was biased to produce 5/6ths heads—again, this is an unrealistic example we can de-simplify later. Then this hypothetical biased coin would have a \((5/6)^5 \cdot (1/6)^1 \approx 6.7\%\) probability of producing HHHHHT. So between our two hypotheses “The coin is fair” and “The coin is biased to produce 5/6ths heads”, our exact experimental result is 4.3 times more likely to be observed in the second case. HHHHHT is also only 0.01% likely to be produced if the coin is biased to produce 1/6th heads and 5/6ths tails, so we’ve already seen some quite strong evidence against that particular hypothesis, if anyone was considering it. The exact experimental outcome HHHHHT is 146 times as likely to be produced by a fair coin as by a coin biased to produce 1/6th heads. (Recall the Bayesian’s earlier thought that, after seeing five subjects select green cookies followed by one subject selecting a red cookie, we’d already picked up strong evidence against the proposition: “Subjects in this experimental population lopsidedly prefer red cookies over green cookies.”)
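The same likelihood arithmetic in a few lines of Python, for anyone who wants to reproduce the 4.3 and the 146:

```python
def likelihood(p_heads, sequence="HHHHHT"):
    """Probability of producing this exact sequence, flip by flip."""
    result = 1.0
    for flip in sequence:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result

fair = likelihood(1 / 2)         # 1/64 ~ 0.0156
five_sixths = likelihood(5 / 6)  # ~ 0.0670
one_sixth = likelihood(1 / 6)    # ~ 0.000107, the 0.01%

print(five_sixths / fair)        # ~ 4.3: favors the 5/6ths-heads coin
print(fair / one_sixth)          # ~ 146: strong evidence against 1/6th heads
```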

Undergrad: Well, I think I can follow the calculation you just did, but I’m not clear on what that calculation means.

Bayesian: I’ll try to explain the meaning shortly, but first, note this. That calculation we just did has no dependency whatsoever on why you flipped the coin six times. You could have stopped after six because you thought you’d seen enough coinflips. You could have done an extra coinflip after the first five because Namagiri Thayar spoke to you in a dream. The coin doesn’t care. The coin isn’t affected. It remains true that the exact result HHHHHT is 23% as likely to be produced by a fair coin as by a biased coin that comes up heads 5/6ths of the time.

Scientist: I agree that this is an interesting property of the calculation that you just did. And then what?

Bayesian: You report the results in a journal. Preferably including the raw data so that others can calculate the likelihoods for any other hypotheses of interest. Say somebody else suddenly becomes interested in the hypothesis “The coin is biased to produce 9/10ths heads.” Seeing HHHHHT is 5.9% likely in that case, so 88% as likely as if the coin is biased to produce 5/6ths heads (making the data 6.7% likely), or 3.7 times as likely as if the coin is fair (making the data 1.6% likely). But you shouldn’t have to think of all possible hypotheses in advance. Just report the raw data so that others can calculate whatever likelihoods they need. Since this calculation deals with the exact results we got, rather than summarizing them into some class of supposedly similar results, it puts a greater emphasis on reporting your exact experimental data to others.
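A sketch of what publishing the raw data buys: with the exact sequence in hand, anyone can evaluate the likelihood of a hypothesis nobody considered at publication time:

```python
data = "HHHHHT"   # the published raw data

def likelihood(p_heads):
    """Likelihood of the exact published sequence under a given heads-bias."""
    result = 1.0
    for flip in data:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result

# Anyone can now evaluate hypotheses the original authors never considered,
# e.g. the latecomer's "biased to produce 9/10ths heads":
print(likelihood(9 / 10))                      # ~ 0.059
print(likelihood(9 / 10) / likelihood(5 / 6))  # ~ 0.88, vs. the 5/6ths coin
print(likelihood(9 / 10) / likelihood(1 / 2))  # ~ 3.7, vs. the fair coin

# Or sweep the whole likelihood function over a grid of possible biases:
for theta in [i / 10 for i in range(1, 10)]:
    print(f"bias {theta:.1f}: likelihood {likelihood(theta):.6f}")
# It peaks near the observed frequency of heads, 5/6 ~ 0.83.
```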

Scientist: Reporting raw data seems an important leg of a good strategy for fighting the replication crisis, on this we agree. I nonetheless don’t understand what experimentalists are supposed to do with this “X is Q times as likely as Y” stuff.

Undergrad: Seconded.

Bayesian: Okay, so… this isn’t trivial to describe without making you run through a whole introduction to Bayes’ rule--

Undergrad: Great. Just what I need, another weird complicated 4-credit course on statistics.

Bayesian: It’s literally a 1-hour read if you’re good at math. It just isn’t literally trivial to understand with no prior introduction. Well, even with no introduction whatsoever, I may be able to fake it with statements that will sound like they might be reasonable—and the reasoning is valid, it just might not be obvious that it is. Anyway. It is a theorem of probability that the following is valid reasoning:

(the Bayesian takes a breath)

Bayesian: Suppose that Professor Plum and Miss Scarlet are two suspects in a murder. Based on their prior criminal convictions, we start out thinking that Professor Plum is twice as likely to have committed the murder as Miss Scarlet. We then discover that the victim was poisoned. We think that, assuming he committed a murder, Professor Plum would be 10% likely to use poison; assuming Miss Scarlet committed a murder, she would be 60% likely to use poison. So Professor Plum is around one-sixth as likely to use poison as Miss Scarlet. Then after observing the victim was poisoned, we should update to think Plum is around one-third as likely to have committed the murder as Scarlet: \(2 \times \frac{1}{6} = \frac{1}{3}.\)

Undergrad: Just to check, what do you mean by saying that “Professor Plum is one-third as likely to have committed the murder as Miss Scarlet”?

Bayesian: I mean that if these two people are our only suspects, we think Professor Plum has a 1/4 probability of having committed the murder and Miss Scarlet has a 3/4 probability of being guilty. So Professor Plum’s probability of guilt is one-third that of Miss Scarlet’s.
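The Plum-and-Scarlet update as a worked computation, assuming, as the Bayesian stipulates, that these two are the only suspects:

```python
# Prior odds 2 : 1 for Plum over Scarlet; the poison clue is 10% likely
# if Plum is the murderer, 60% likely if Scarlet is.
prior_plum, prior_scarlet = 2.0, 1.0
like_plum, like_scarlet = 0.10, 0.60

post_plum = prior_plum * like_plum           # unnormalized: 0.2
post_scarlet = prior_scarlet * like_scarlet  # unnormalized: 0.6, so 1 : 3

total = post_plum + post_scarlet             # normalize over the two suspects
print(post_plum / total)                     # ~ 0.25: Plum at 1/4
print(post_scarlet / total)                  # ~ 0.75: Scarlet at 3/4
```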

Scientist: Now I’d like to know what you mean by saying that Professor Plum had a 1/4 probability of committing the murder. Either Plum committed the murder or he didn’t; we can’t observe the murder being committed multiple times, with Professor Plum doing it 1/4th of the time.

Bayesian: Are we going there? I guess we’re going there. My good Scientist, I mean that if you offered me either side of an even-money bet on whether Plum committed the murder, I’d bet that he didn’t do it. But if you offered me a gamble that costs $1 if Professor Plum is innocent and pays out $5 if he’s guilty, I’d cheerfully accept that gamble. We only ran the 2012 US Presidential Election one time, but that doesn’t mean that on November 7th you should’ve refused a $10 bet that paid out $1000 if Obama won. In general when prediction markets and large liquid betting pools put 60% betting odds on somebody winning the presidency, that outcome tends to happen 60% of the time; they are well-calibrated for probabilities in that range. If they were systematically uncalibrated—if in general things happened 80% of the time when prediction markets said 60%--you could use that fact to pump money out of prediction markets. And your pumping out that money would adjust the prediction-market prices until they were well-calibrated. If things to which prediction markets assign 70% probability happen around 7 times out of 10, why insist for reasons of ideological purity that the probability statement is meaningless?

Undergrad: I admit, that sounds to me like it makes sense, if it’s not just the illusion of understanding due to my failing to grasp some far deeper debate.

Bayesian: There is indeed a deeper debate, but what the deeper debate works out to is that your illusion of understanding is pretty much accurate as illusions go.

Scientist: Yeah, I’m going to want to come back to that issue later. What if there are two agents who both seem ‘well-calibrated’ as you put it, but one agent says 60% and the other agent says 70%?

Bayesian: If I flip a coin and don’t look at it, so that I don’t know yet if it came up heads or tails, then my ignorance about the coin isn’t a fact about the coin, it’s a fact about me. Ignorance exists in the mind, not in the environment. A blank map does not correspond to a blank territory. If you peek at the coin and I don’t, it’s perfectly reasonable for the two of us to occupy different states of uncertainty about the coin. And given that I’m not absolutely certain, I can and should quantify my uncertainty using probabilities. There’s like 300 different theorems showing that I’ll get into trouble if my state of subjective uncertainty cannot be viewed as a coherent probability distribution. You kinda pick up on the trend after just the fourth time you see a slightly different clever proof that any violation of the standard probability axioms will cause the graves to vomit forth their dead, the seas to turn red as blood, and the skies to rain down dominated strategies and combinations of bets that produce certain losses--

Scientist: Sorry, I shouldn’t have said anything just then. Let’s come back to this later? I’d rather hear first what you think we should do with the likelihoods once we have them.

Bayesian: On the laws of probability theory, those likelihood functions are the evidence. They are the objects that send our prior odds of 2 : 1 for Plum vs. Scarlet to posterior odds of 1 : 3 for Plum vs. Scarlet. For any two hypotheses you care to name, if you tell me the relative likelihoods of the data given those hypotheses, I know how to update my beliefs. If you change your beliefs in any other fashion, the skies shall rain dominated strategies etcetera. Bayes’ theorem: It’s not just a statistical method, it’s the LAW.

Undergrad: I’m sorry, I still don’t understand. Let’s say we do an experiment and find data that’s 6 times as likely if Professor Plum killed Mr. Boddy as if Miss Scarlet did. Do we arrest Professor Plum?

Scientist: My guess is that you’re supposed to make up a ‘prior probability’ that sounds vaguely plausible, like ‘a priori, I think Professor Plum is 20% likely to have killed Mr. Boddy’. Then you combine that with your 6 : 1 likelihood ratio to get 3 : 2 posterior odds that Plum killed Mr. Boddy. So your paper reports that you’ve established a 60% posterior probability that Professor Plum is guilty, and the legal process does whatever it does with that.

Bayesian: No. Dear God, no! Is that really what people think Bayesianism is?

Scientist: It’s not? I did always hear that the strength of Bayesianism is that it gives us posterior probabilities, which p-values don’t actually do, and that the big weakness was that it got there by making up prior probabilities more or less out of thin air, which means that nobody will ever be able to agree on what the posteriors are.

Bayesian: Science papers should report likelihoods. Or rather, they should report the raw data and helpfully calculate some likelihood functions on it. Not posteriors, never posteriors.

Undergrad: What’s a posterior? I’m trusting both of you to avoid the obvious joke here.

Bayesian: A posterior probability is when you say, “There’s a 60% probability that Professor Plum killed Mr. Boddy.” Which, as the Scientist points out, is something you never get from p-values. It’s also something that, in my own opinion, should never be reported in an experimental paper, because it’s not the result of an experiment.

Undergrad: But… okay, Scientist, I’m asking you this one. Suppose we see data statistically significant at p<0.01, something we’re less than 1% probable to see if the null hypothesis “Professor Plum didn’t kill Mr. Boddy” is true. Do we arrest him?

Scientist: First of all, that’s not a realistic null hypothesis. A null hypothesis is something like “Nobody killed Mr. Boddy” or “All suspects are equally guilty.” But even if what you just said made sense, even if we could reject Professor Plum’s innocence at p<0.01, you still can’t say anything like, “It is 99% probable that Professor Plum is guilty.” That is just not what p-values mean.

Undergrad: Then what do p-values mean?

Scientist: They mean we saw data inside a preselected class of possible results, which class, as a whole, is less than 1% likely to be produced if the null hypothesis is true. That’s all that it means. You can’t go from there to “Professor Plum is 99% likely to be guilty,” for reasons the Bayesian is probably better at explaining. You can’t go from there to anywhere that’s someplace else. What you heard is what there is.

Undergrad: Now I’m doubly confused. I don’t understand what we’re supposed to do with p-values or likelihood ratios. What kind of experiment does it take to throw Professor Plum in prison?

Scientist: Well, realistically, if you get a couple more experiments at different labs also saying p<0.01, Professor Plum is probably guilty.

Bayesian: And the ‘replication crisis’ is that it turns out he’s not guilty.

Scientist: Pretty much.

Undergrad: That’s not exactly reassuring.

Scientist: Experimental science is not for the weak of nerve.

Undergrad: So… Bayesian, are you about to say similarly that once you get an extreme enough likelihood ratio, say, anything over 100 to 1, or something, you can probably take something as true?

Bayesian: No, it’s a bit more complicated than that. Let’s say I flip a coin 20 times and get HHHTHHHTHTHTTHHHTHHH. Well, the hypothesis “This coin was rigged to produce exactly HHHTHHHTHTHTTHHHTHHH” has a likelihood advantage of roughly a million-to-one over the hypothesis “this is a fair coin”. On any reasonable system, unless you wrote down that single hypothesis in advance and handed it to me in an envelope and didn’t write down any other hypotheses or hand out any other envelopes, we’d say the hypothesis “This coin was rigged to produce HHHTHHHTHTHTTHHHTHHH” has a complexity penalty of at least \(2^{20} : 1\) because it takes 20 bits just to describe what the coin is rigged to do. In other words, the penalty to prior plausibility more than cancels out a million-to-one likelihood advantage. And that’s just the start of the issues. But, with that said, I think there’s a pretty good chance you could do okay out of just winging it, once you understood in an intuitive and common-sense way how Bayes’ rule worked. If there’s evidence pointing to Professor Plum with a likelihood of 1,000 : 1 over any other suspects you can think of, in a field that probably only contained six suspects to begin with, you can figure that the prior odds against Plum weren’t much more extreme than 10 : 1 and that you can legitimately be at least 99% sure now.
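The envelope arithmetic, spelled out:

```python
# "Rigged to produce exactly this 20-flip sequence" vs. "fair coin":
likelihood_ratio = 2 ** 20        # 1048576, the "roughly a million to one"
complexity_penalty = 2 ** 20      # at least 2^20 : 1 against, a priori,
                                  # since the sequence takes 20 bits to state
print(likelihood_ratio / complexity_penalty)   # 1.0: the penalty cancels it

# And the "winging it" Plum example: 1000 : 1 evidence against prior odds
# of no worse than 10 : 1 against him.
posterior_odds = 1000 / 10
print(posterior_odds / (posterior_odds + 1))   # ~ 0.990: at least 99% sure
```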

Scientist: But you say that this is not something you should report in the paper.

Bayesian: That’s right. How can I put this… one of the great commandments of Bayesianism is that you ought to take into account all the relevant evidence you have available; you can’t exclude some evidence from your calculation just because you don’t like it. Besides sounding like common sense, this is also a rule you have to follow to prevent your calculations from coming up with paradoxical results, and there are various particular problems where there’s a seemingly crazy conclusion and the answer is, “Well, you also need to condition on blah blah blah.” My point being, how do I, as an experimentalist, know what all the relevant evidence is? Who am I to calculate a posterior? Maybe somebody else published a paper that includes more evidence, with more likelihoods to be taken into account, and I haven’t heard about it yet, but somebody else has. I just contribute my own data and its likelihood function—that’s all! It’s not my place to claim that I’ve collected all the relevant evidence and can now calculate posterior odds, and even if I could, somebody else could publish another paper a week later and the posterior odds would change again.

Undergrad: So, roughly your answer is, “An experimentalist just publishes the paper and calculates the likelihood thingies for that dataset, and then somebody outside has to figure out what to do with the likelihood thingies.”

Bayesian: Somebody outside has to set up priors—probably just reasonable-sounding ignorance priors, maximum entropy stuff or complexity-based penalties or whatever—then try to make sure they’ve collected all the evidence, apply the likelihood functions, check to see if the result makes sense, etcetera. And then they might have to revise that estimate if somebody publishes a new paper a week later--

Undergrad: That sounds awful.

Bayesian: It would be awful if we were doing meta-analyses of p-values. Bayesian updates are a hell of a lot simpler! Like, you literally just multiply the old posterior by the new likelihood function and normalize. If experiment 1 has a likelihood ratio of 4 for hypothesis A over hypothesis B, and experiment 2 has a likelihood ratio of 9 for A over B, the two experiments together have a likelihood ratio of 36.
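A toy sketch of that aggregation; the second lab's sequence "HHTHHH" is hypothetical, made up for illustration:

```python
from math import prod

# The headline arithmetic: independent likelihood ratios multiply.
print(4 * 9)   # 36

# More generally, whole likelihood functions multiply pointwise.
def seq_likelihood(data):
    """Likelihood function of a published flip sequence, as a function
    of the coin's per-flip heads probability."""
    return lambda p: prod(p if f == "H" else 1 - p for f in data)

lab1 = seq_likelihood("HHHHHT")   # the dialogue's data
lab2 = seq_likelihood("HHTHHH")   # a hypothetical second lab's data

def combined(p):
    return lab1(p) * lab2(p)      # pointwise product = pooled evidence

print(combined(5 / 6) / combined(1 / 2))   # ~ 18.4 : 1 for the 5/6ths coin
```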

Undergrad: And you can’t do that with p-values, I mean, a p-value of 0.05 and a p-value of 0.01 don’t multiply out to p<0.0005--

Scientist: No.

Bayesian: I should like to take this moment to call attention to my superior smile.

Scientist: I am still worried about the part of this process where somebody gets to make up prior probabilities.

Bayesian: Look, that just corresponds to the part of the process where somebody decides that, having seen 1 discovery and 2 replications with p<0.01, they are willing to buy the new pill or whatever.

Scientist: So your reply there is, “It’s subjective, but so is what you do when you make decisions based on having seen some experiments with p-values.” Hm. I was going to say something like, “If I set up a rule that says I want data with p<0.001, there’s no further objectivity beyond that,” but I guess you’d say that my asking for p<0.001 instead of p<0.0001 corresponds to my pulling a prior out of my butt?

Bayesian: Well, except that asking for a particular p-value is not actually as good as pulling a prior out of your butt. One of the first of those 300 theorems proving doom if you violate the probability axioms was Abraham Wald’s “complete class theorem” in 1947. Wald set out to investigate all the possible admissible strategies, where a strategy is a way of acting differently based on whatever observations you make, and different actions get different payoffs in different possible worlds. Wald called a strategy admissible if it was not dominated by some other strategy across all possible measures you could put on the possible worlds. Wald found that the class of admissible strategies was simply the class that corresponded to having a probability distribution, doing Bayesian updating on observations, and maximizing expected payoff.

Undergrad: Can you perhaps repeat that in slightly smaller words?

Bayesian: If you want to do different things depending on what you observe, and get different payoffs depending on what the real facts are, either your strategy can be seen as having a probability distribution and doing Bayesian updating, or there’s another strategy that does better given at least some possible measures on the worlds and never does worse. So if you say anything as wild as “I’m waiting to see data with p<0.0001 to ban smoking,” in principle there must be some way of saying something along the lines of “I have a prior probability of 0.01% that smoking causes cancer; let’s see those likelihood functions” which does at least as well, or better, no matter what prior probabilities over the background facts anyone else puts forward.

Scientist: Huh.

Bayesian: Indeed. And that was when the Bayesian revolution very slowly started; it’s sort of been gathering steam since then. It’s worth noting that Wald only proved his theorem a couple of decades after “p-values” were invented, which, from my perspective, helps explain how science got wedged into its peculiar current system.

Scientist: So you think we should burn all p-values and switch to reporting all likelihood ratios all the time.

Bayesian: In a word… yes.

Scientist: I’m suspicious, in general, of one-size-fits-all solutions like that. I suspect you—I hope this is not too horribly offensive—I suspect you of idealism. In my experience, different people need different tools from the toolbox at different times, and it’s not wise to throw out all the tools in your toolbox except one.

Bayesian: Well, let’s be clear where I am and amn’t idealistic, then. Likelihood functions cannot solve the entire replication crisis. There are aspects of this that can’t be solved by using better statistics. Open access journals aren’t something that hinge on p-values versus likelihood functions. The broken system of peer commentary, presently in the form of peer review, is not something likelihood functions can solve.

Scientist: But likelihood functions will solve everything else?

Bayesian: No, but they’ll at least help on a surprising amount. Let me count the ways:

Bayesian: One. Likelihood functions don’t distinguish between ‘statistically significant’ results and ‘failed’ replications. There are no ‘positive’ and ‘negative’ results. What used to be called the null hypothesis is now just another hypothesis, with nothing special about it. If you flip a coin and get HHTHTTTHHH, you have not “failed to reject the null hypothesis with p<0.05” or “failed to replicate”. You have found experimental data that favors the fair-coin hypothesis over the 5/6ths-heads hypothesis with a likelihood ratio of 3.78 : 1. This may help to fight the file-drawer effect—not entirely, because there is a mindset in the journals of ‘positive’ results and biased coins being more exciting than fair coins, and we need to tackle that mindset directly. But the p-value system encourages that bad mindset. That’s why p-hacking even exists. So switching to likelihoods won’t fix everything right away, but it sure will help.

Bayesian: Two. The system of likelihoods makes the importance of raw data clearer and will encourage a system of publishing the raw data whenever possible, because Bayesian analyses center around the probability of the exact data we saw, given our various hypotheses. The p-value system encourages you to think in terms of the data as being just one member of a class of ‘equally extreme’ results. There’s a mindset here of people hoarding their precious data, which is not purely a matter of statistics. But the p-value system encourages that mindset by encouraging people to think of their result as part of some undistinguished class of ‘equally or more extreme’ values or whatever, and that its meaning is entirely contained in it being a ‘positive’ result that is ‘statistically significant’.

Bayesian: Three. The probability-theoretic view, or Bayesian view, makes it clear that different effect sizes are different hypotheses, as they must be, because they assign different probabilities to the exact observations we see. If one experiment finds a ‘statistically significant’ effect size of 0.4 and another experiment finds a ‘statistically significant’ effect size of 0.1 on whatever scale we’re working in, the experiment has not replicated and we do not yet know what real state of affairs is generating our observations. This directly fights and negates the ‘amazing shrinking effect size’ phenomenon that is part of the replication crisis.

Bayesian: Four. Working in likelihood functions makes it far easier to aggregate our data. It even helps to point up when our data is being produced under inconsistent conditions or when the true hypothesis is not being considered, because in this case we will find likelihood functions that end up being nearly zero everywhere, or where the best available hypothesis is achieving a much lower likelihood on the combined data than that hypothesis expects itself to achieve. It is a stricter concept of replication that helps quickly point up when different experiments are being performed under different conditions and yielding results incompatible with a single consistent phenomenon.

Bayesian: Five. Likelihood functions are objective facts about the data which do not depend on your state of mind. You cannot deceive somebody by reporting likelihood functions unless you are literally lying about the data or omitting data. There’s no equivalent of ‘p-hacking’.

Scientist: Okay, that last claim in particular strikes me as very suspicious. What happens if I want to persuade you that a coin is biased towards heads, so I keep flipping it until I randomly get to a point where there’s a predominance of heads, and then choose to stop?

Bayesian: “Shrug,” I say. You can’t mislead me by telling me what a real coin actually did.

Scientist: I’m asking you what happens if I keep flipping the coin, checking the likelihood each time, until I see that the current statistics favor my pet theory, and then I stop.

Bayesian: As a pure idealist seduced by the seductively pure idealism of probability theory, I say that so long as you present me with the true data, all I can and should do is update in the way Bayes’ theorem says I should.

Scientist: Seriously.

Bayesian: I am serious.

Scientist: So it doesn’t bother you if I keep checking the likelihood ratio and continuing to flip the coin until I can convince you of anything I want.

Bayesian: Go ahead and try it.

Scientist: What I’m actually going to do is write a Python program which simulates flipping a fair coin up to 300 times, and I’m going to see how many times I can get a 20:1 likelihood ratio falsely indicating that the coin is biased to come up heads 55% of the time… why are you smiling?

Bayesian: I wrote pretty much the same Python program when I was first converting to Bayesianism and finding out about likelihood ratios and feeling skeptical about the system maybe being abusable in some way, and then a friend of mine found out about likelihood ratios and he wrote essentially the same program, also in Python. And lo, he found that false evidence of 20:1 for the coin being 55% biased was found at least once, somewhere along the way… 1.4% of the time. If you asked for more extreme likelihood ratios, the chances of finding them dropped off even faster.
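A reconstruction of such a program (not the author's original) following the description in the dialogue: flip a genuinely fair coin up to 300 times, peeking at the running likelihood ratio for a 55%-heads coin over a fair coin, and stopping if it ever reaches 20 : 1:

```python
import random

def finds_false_evidence(threshold=20, max_flips=300):
    """Flip a genuinely fair coin, tracking the running likelihood ratio
    for 'biased to 55% heads' over 'fair'; stop the moment it hits the
    threshold, the way a would-be likelihood-hacker would."""
    lr = 1.0
    for _ in range(max_flips):
        if random.random() < 0.5:      # the coin really is fair
            lr *= 0.55 / 0.50          # heads favor the biased hypothesis
        else:
            lr *= 0.45 / 0.50          # tails count against it
        if lr >= threshold:
            return True
    return False

trials = 20000
hits = sum(finds_false_evidence() for _ in range(trials))
print(hits / trials)   # roughly 0.014, matching the dialogue's 1.4%
```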

Scientist: Okay, that’s not bad by the p-value way of looking at things. But what if there’s some more clever way of biasing it?

Bayesian: When I was… I must have been five years old, or maybe even younger, and first learning about addition, one of the earliest childhood memories I have at all, is of adding 3 to 5 by counting 5, 6, 7 and believing that you could get different results from adding numbers depending on exactly how you did it. Which is cute, yes, and also indicates a kind of exploring, of probing, that was no doubt important in my starting to understand addition. But you still look back and find it humorous, because now you’re a big grownup and you know you can’t do that. My writing Python programs to try to find clever ways to fool myself by repeatedly checking the likelihood ratios was the same, in the sense that after I matured a bit more as a Bayesian, I realized that the feat I’d written those programs to try to do was obviously impossible. In the same way that trying to find a clever way to break apart the 3 into 2 and 1, and trying to add them separately to 5, and then trying to add the 1 and then the 2, in hopes you can get to 7 or 9 instead of 8, is just never ever going to work. The results in arithmetic are theorems, and it doesn’t matter in what clever order you switch things up, you are never going to get anything except 8 when you carry out an operation that is validly equivalent to adding 3 plus 5. The theorems of probability theory are also theorems. If your Python program had actually worked, it would have produced a contradiction in probability theory, and thereby a contradiction in Peano Arithmetic, which provides a model for probability theory carried out using rational numbers. The thing you tried to do is exactly as hard as adding 3 and 5 using the standard arithmetic axioms and getting 7.

Undergrad: Uh, why?

Scientist: Seconded.

Bayesian: Because letting \(e\) denote the evidence, \(H\) denote the hypothesis, \(\neg\) denote the negation of a proposition, \(\mathbb P(X)\) denote the probability of proposition \(X\), and \(\mathbb P(X \mid Y)\) denote the conditional probability of \(X\) assuming \(Y\) to be true, it is a theorem of probability that \(\mathbb P(H) = \mathbb P(H \mid e) \cdot \mathbb P(e) + \mathbb P(H \mid \neg e) \cdot \mathbb P(\neg e).\) Therefore likelihood functions can never be p-hacked by any possible clever setup without you outright lying, because you can’t have any possible procedure that a Bayesian knows in advance will make them update in a predictable net direction. For every update that we expect to be produced by a piece of evidence \(e,\) there’s an equal and opposite update that we expect to probably occur from seeing \(\neg e.\)
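A numeric check of that theorem, with an arbitrary prior and arbitrary likelihoods:

```python
# Pick any prior and any likelihoods; the identity holds exactly.
p_H = 0.3             # prior probability of the hypothesis (arbitrary)
p_e_given_H = 0.8     # arbitrary likelihoods of the evidence
p_e_given_notH = 0.2

p_e = p_e_given_H * p_H + p_e_given_notH * (1 - p_H)
posterior_if_e = p_e_given_H * p_H / p_e
posterior_if_not_e = (1 - p_e_given_H) * p_H / (1 - p_e)

# P(H) = P(H|e) P(e) + P(H|not-e) P(not-e): the expected posterior equals
# the prior, so no experiment can be expected in advance to move you in a
# fixed direction.
print(posterior_if_e * p_e + posterior_if_not_e * (1 - p_e))  # 0.3, = p_H
```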

Undergrad: What?

Scientist: Seconded.

Bayesian: Look… let me try to zoom out a bit, and yes, look at the ongoing replication crisis. The Scientist proclaimed suspicion of grand new sweeping ideals. Okay, but the shift to likelihood functions is the kind of thing that ought to be able to solve a lot of problems at once. Let’s say… I’m trying to think of a good analogy here. Let’s say there’s a corporation which is having a big crisis because their accountants are using floating-point numbers, only there’s three different parts of the firm using three different representations of floating-point numbers to do numerically unstable calculations. Somebody starts with 1.0 and adds 0.0001 a thousand times and then subtracts 0.1 and gets 0.999999999999989. Or you can go to the other side of the building and use a different floating-point representation and get a different result. And nobody has any conception that there’s anything wrong with this. Suppose there are BIG errors in the floating-point numbers, they’re using the floating-point-number equivalent of crude ideograms and Roman numerals, you can get big pragmatic differences depending on what representation you use. And naturally, people ‘division-hack’ to get whatever financial results they want. So all the spreadsheets are failing to replicate, and people are starting to worry the ‘cognitive priming’ subdivision has actually been bankrupt for 20 years. And then one day you come in and you say, “Hey. Everyone. Suppose that instead of these competing floating-point representations, we use my new representation instead. It can’t be fooled the same way, which will solve a surprising number of your problems.”

(The Bayesian now imitates the Scientist’s voice:) “I’m suspicious,” says the Senior Auditor. “I suspect you of idealism. In my experience, people need to use different floating-point representations for different financial problems, and it’s good to have a lot of different numerical representations of fractions in your toolbox.”

Bayesian: “Well,” I reply, “it may sound idealistic, but in point of fact, this thing I’m about to show you is the representation of fractions, in which you cannot get different results depending on which way you add things or what order you do the operations in. It might be slightly more computationally expensive, but it is now no longer 1920 like when you first adopted the old system, and seriously, you can afford the computing power in a very large fraction of cases where you’re only working with 30,000,000 bank accounts or some trivial number like that. Yes, if you want to do something like take square roots, it gets a bit more complicated, but very few of you are actually taking the square root of bank account balances. For the vast majority of things you are trying to do on a day-to-day basis, this system is unhackable without actually misreporting the numbers.” And then I show them how to represent arbitrary-magnitude finite integers precisely, and how to represent a rational number as the ratio of two integers. What we would, nowadays, consider to be a direct, precise, computational representation of the system of rational numbers. The one unique axiomatized mathematical system of rational numbers, to which floating-point numbers are a mere approximation. And if you’re just working with 30,000,000 bank account balances and your crude approximate floating-point numbers are in practice blowing up and failing to replicate and being exploited by people to get whatever results they want, and it is no longer 1920 and you can afford real computers now, it is an obvious step to have all the accountants switch to using the rational numbers. Just as Bayesian updates are the rational updates, in the unique mathematical axiomatized system of probabilities. And that’s why you can’t p-hack them.
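The accounting parable can be run literally; Python's fractions.Fraction is an arbitrary-precision rational representation of the kind the Bayesian describes:

```python
from fractions import Fraction

x = 1.0
for _ in range(1000):
    x += 0.0001
x -= 0.1
print(x)        # close to 1.0 but not exactly 1.0: the float drift

y = Fraction(1)
for _ in range(1000):
    y += Fraction(1, 10000)
y -= Fraction(1, 10)
print(y)        # prints 1: exactly right, every time
```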

Scientist: That is a rather… audacious claim. And I confess, even if everything you said about the math were true, I would still be skeptical of the pragmatics. The current system of scientific statistics is something that’s grown up over time and matured. Has this bright Bayesian way actually been tried?

Bayesian: It hasn’t been tried very much in science. In machine learning, where, uh, not to put too fine a point on it, we can actually see where the models are breaking because our AI doesn’t work, it’s been ten years since I’ve read a paper that tries to go at things from a frequentist angle and I can’t ever recall seeing an AI algorithm calculate the p-value of anything. If you’re doing anything principled at all from a probability-theoretic stance, it’s probably Bayesian, and pretty much never frequentist. If you’re classifying data using n-hot encodings, your loss function is the cross-entropy, not… I’m not even sure what the equivalent of trying to use 1920s-style p-values in AI would be like. I would frankly attribute this to people in machine learning having to use statistical tools that visibly succeed or fail; rather than needing to get published by going through a particular traditional ritual of p-value reporting, and failure to replicate not being all that bad for your career.

Scientist: So you’re actually more of a computer science guy than an experimentalist yourself. Why does this not surprise me? It’s not impossible that some better statistical system than p-values could exist, but I’d advise you to respect the wisdom of experience. The fact that we know what p-hacking is, and are currently fighting it, is because we’ve had time to see where the edges of the system have problems, and we’re figuring out how to fight those problems. This shiny new system will also have problems; you just have no idea what they’ll be. Perhaps they’ll be worse.

Bayesian: It’s not impossible that the accountants would figure out new shenanigans to pull with rational numbers, especially if they were doing some things computationally intensive enough that they could no longer afford to use the rational numbers and had to use some approximation instead. But I stand by my statement that if your financial spreadsheets are right now blowing up in a giant replication crisis in ways that seem clearly linked to using p-values, and the p-values are, frankly, bloody ad-hoc inconsistent nonsense, an obvious first step is to try using the rational updates instead. Although, it’s possible we don’t disagree too much in practice. I’d also pragmatically favor trying to roll things out one step at a time, like, maybe just switch over the psychological sciences and see how that goes.

Scientist: How would you persuade them to do that?

Bayesian: I have no goddamn idea. Honestly, I’m not expecting anyone to actually fix anything. People will just go on using p-values until the end of the world, probably. It’s just one more Nice Thing We Can’t Have. But there’s a chance the idea will catch on. I was pleasantly surprised when open access caught on as quickly as it did. I was pleasantly surprised when people, like, actually noticed the replication crisis and it became a big issue that people cared about. Maybe I’ll be pleasantly surprised again and people will actually take up the crusade to bury the p-value at a crossroads at midnight and put a stake through its heart. If so, I’ll have done my part by making an understanding of Bayes’ rule and likelihoods more accessible to everyone.

Scientist: Or it could turn out that people don’t like likelihoods, and that part of the wisdom of experience is the lesson that p-values are a kind of thing that experimentalists actually find useful and easy to use.

Bayesian: If the experience of learning traditional statistics traumatized them so heavily that the thought of needing to learn a new system sends them screaming into the night, then yes, change might need to be imposed from outside. I’m hoping though that the Undergrad will read a short, cheerful introduction to Bayesian probability, compare this with his ominous heavy traditional statistics textbook, and come back going “Please let me use likelihoods please let me use likelihoods oh god please let me use likelihoods.”

Undergrad: I guess I’ll look into it and see?

Bayesian: Weigh your decision carefully, Undergrad. Some changes in science depend upon students growing up familiar with multiple ideas and choosing the right one. Max Planck said so in a famous aphorism, so it must be true. Ergo, the entire ability of science to distinguish good and bad ideas within that class must rest upon the cognitive capacities of undergrads.

Scientist: Oh, now that is just--

Moderator: And we’re out of time. Thanks for joining us, everyone!
