Bayes' rule: Probability form
The formulation of Bayes’ rule you are most likely to see in textbooks runs as follows:
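$$P(H_i \mid e) = \frac{P(e \mid H_i) \cdot P(H_i)}{\sum_k P(e \mid H_k) \cdot P(H_k)}$$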
Where:
Hi is the hypothesis we’re interested in.
e is the piece of evidence we observed.
∑k(expression containing k) means “Add up, for every possible value of k, the value of the (expression containing k).”
H is a set of mutually exclusive and exhaustive hypotheses that includes Hi as one of the possibilities, and the term Hk inside the sum ranges over all the possible hypotheses in H.
As a quick example, let’s say there’s a bathtub full of potentially biased coins.
Coin type 1 is fair, 50% heads / 50% tails. 40% of the coins in the bathtub are type 1.
Coin type 2 produces 70% heads. 35% of the coins are type 2.
Coin type 3 produces 20% heads. 25% of the coins are type 3.
We want to know the posterior probability that a randomly drawn coin is of type 2, after flipping the coin once and seeing it produce heads once.
Let H1,H2,H3 stand for the hypotheses that the coin is of types 1, 2, and 3 respectively. Then using conditional probability notation, we want to know the probability P(H2∣heads).
The probability form of Bayes’ theorem says:
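$$P(H_2 \mid \text{heads}) = \frac{P(\text{heads} \mid H_2) \cdot P(H_2)}{\sum_k P(\text{heads} \mid H_k) \cdot P(H_k)}$$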
Expanding the sum:
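$$P(H_2 \mid \text{heads}) = \frac{P(\text{heads} \mid H_2) \cdot P(H_2)}{P(\text{heads} \mid H_1) \cdot P(H_1) + P(\text{heads} \mid H_2) \cdot P(H_2) + P(\text{heads} \mid H_3) \cdot P(H_3)}$$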
Computing the actual quantities:
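$$P(H_2 \mid \text{heads}) = \frac{0.70 \times 0.35}{0.50 \times 0.40 + 0.70 \times 0.35 + 0.20 \times 0.25} = \frac{0.245}{0.20 + 0.245 + 0.05} = \frac{0.245}{0.495} \approx 0.49$$

So the posterior probability that the coin is of type 2, after seeing it come up heads once, is about 49%.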
This calculation was big and messy. Which is fine, because the probability form of Bayes’ theorem is okay for directly grinding through the numbers, but not so good for doing things in your head.
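If you'd rather let code do the grinding, here is a minimal Python sketch of the same calculation, using the priors and likelihoods from the bathtub example above:

```python
# Posterior over coin types after observing one heads, via the
# probability form of Bayes' theorem:
#   P(Hi | e) = P(e | Hi) P(Hi) / sum_k P(e | Hk) P(Hk)

priors = [0.40, 0.35, 0.25]        # P(H1), P(H2), P(H3): how common each coin type is
likelihoods = [0.50, 0.70, 0.20]   # P(heads | H1), P(heads | H2), P(heads | H3)

# Numerators: each hypothesis's contribution to our expectation of seeing heads.
joint = [p * l for p, l in zip(priors, likelihoods)]

# Denominator: the total expectation of seeing heads, summed over all hypotheses.
total = sum(joint)

posteriors = [j / total for j in joint]
print(posteriors)  # [0.404..., 0.494..., 0.101...] -- P(H2 | heads) is about 0.49
```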
Meaning
We can think of the advice of Bayes’ theorem as saying:
“Think of how much each hypothesis in H contributed to our expectation of seeing the evidence e, including both the likelihood of seeing e if Hk is true, and the prior probability of Hk. The posterior of Hi after seeing e, is the amount Hi contributed to our expectation of seeing e, within the total expectation of seeing e contributed by every hypothesis in H.”
Or to say it at somewhat greater length:
Imagine each hypothesis H1,H2,H3… as an expert who has to distribute the probability of their predictions among all possible pieces of evidence. We can imagine this more concretely by visualizing “probability” as a lump of clay.
The total amount of clay is one kilogram (probability 1). Each expert Hk has been allocated a fraction P(Hk) of that kilogram. For example, if P(H4)=1/5, then expert 4 has been allocated 200 grams of clay.
We’re playing a game with the experts to determine which one is the best predictor.
Each time we’re about to make an observation E, each expert has to divide up all their clay among the possible outcomes e1,e2,….
After we observe that E=ej, we take away all the clay that wasn’t put onto ej. And then our new belief in all the experts is the relative amount of clay that each expert has left.
So to know how much we now believe in expert H4 after observing e3, say, we need to know two things: First, the amount of clay that H4 put onto e3, and second, the total amount of clay that all experts (including H4) put onto e3.
In turn, to know that, we need to know how much clay H4 started with, and what fraction of its clay H4 put onto e3. And similarly, to compute the total clay on e3, we need to know how much clay each expert Hk started with, and what fraction of their clay Hk put onto e3.
So Bayes’ theorem here would say:
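$$P(H_4 \mid e_3) = \frac{P(e_3 \mid H_4) \cdot P(H_4)}{\sum_k P(e_3 \mid H_k) \cdot P(H_k)}$$

…with the numerator being the clay H4 put onto e3, and the denominator being the total clay that all the experts put onto e3.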
What are the incentives of this game of clay?
On each round, the experts who gain the most are the experts who put the most clay on the observed ej, so if you know for certain that e3 is about to be observed, your incentive is to put all your clay on e3.
But putting literally all your clay on e3 is risky; if e5 is observed instead, you lose all your clay and are out of the game. Once an expert’s amount of clay goes all the way to zero, there’s no way for them to recover over any number of future rounds. That hypothesis is done, dead, and removed from the game. (“Falsification,” some people call that.) If you’re not certain that e5 is literally impossible, you’d be wiser to put at least a little clay on e5 as well. That is to say: if your mind puts some probability on e5, you’d better put some clay there too!
(As it happens, if at the end of the game we score each expert by the logarithm of the amount of clay they have left, then each expert is incentivized to place clay exactly proportionally to their honest probability on each successive round.)
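Here is a small Python sketch of one round of this clay game (the starting clay amounts and the experts’ predictions below are made-up numbers, purely for illustration). Note that each expert’s share of the surviving clay comes out to exactly the Bayesian posterior P(Hk∣ej):

```python
# One round of the clay game: each expert (hypothesis) starts with clay equal
# to its prior probability, spreads that clay over the possible outcomes in
# proportion to its predictions, and keeps only the clay it put on the outcome
# actually observed. (The numbers here are illustrative, not from the article.)

clay = {"H1": 0.40, "H2": 0.35, "H3": 0.25}   # prior clay, in kilograms

# Each expert's predictive distribution over the possible outcomes e1, e2, e3.
predictions = {
    "H1": {"e1": 0.2, "e2": 0.5, "e3": 0.3},
    "H2": {"e1": 0.1, "e2": 0.1, "e3": 0.8},
    "H3": {"e1": 0.6, "e2": 0.3, "e3": 0.1},
}

observed = "e3"

# Clay each expert placed on the observed outcome: P(e3 | Hk) * P(Hk).
surviving = {h: clay[h] * predictions[h][observed] for h in clay}

# Relative amount of clay remaining = posterior probability P(Hk | e3).
total = sum(surviving.values())
posterior = {h: s / total for h, s in surviving.items()}
print(posterior)  # H2 keeps the largest share: it predicted e3 best.
```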
It’s an important part of the game that we make the experts put down their clay in advance. If we let the experts put down their clay afterwards, they might be tempted to cheat by putting down all their clay on whichever ej had actually been observed. But since we make the experts put down their clay in advance, they have to divide up their clay among the possible outcomes: to give more clay to e3, that clay has to be taken away from some other outcome, like e5. To put a very high probability on e3 and gain a lot of relative credibility if e3 is observed, an expert has to stick their neck out and risk losing a lot of credibility if some other outcome like e5 happens instead. If we force the experts to make advance predictions, that is!
We can also derive from this game that the question “does evidence e3 support hypothesis H4?” depends on how well H4 predicted e3 compared to the competition. It’s not enough for H4 to predict e3 well if every other hypothesis also predicted e3 well—your amazing new theory of physics gets no points for predicting that the sky is blue. Hk only goes up in probability when it predicts ej better than the alternatives. And that means we have to ask what the alternative hypotheses predicted, even if we think those hypotheses are false.
If you get in a car accident, and don’t want to relinquish the hypothesis that you’re a great driver, then you can find all sorts of reasons (“the road was slippery! my car freaked out!”) why P(e∣GoodDriver) is not too low. But P(e∣BadDriver) is also part of the update equation, and the “bad driver” hypothesis better predicts the evidence. Thus, your first impulse, when deciding how to update your beliefs in the face of a car accident, should not be “But my preferred hypothesis allows for this evidence!” It should instead be “Points to the ‘bad driver’ hypothesis for predicting this evidence better than the alternatives!” (And remember, you’re allowed to increase P(BadDriver) a little bit, while still thinking that it’s less than 50% probable.)
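To see the numbers at work (these particular figures are invented purely for illustration): suppose your prior P(BadDriver) is 0.10, and suppose a bad driver has a 10% chance of getting into an accident in a given year while a good driver has a 3% chance. Then:

$$P(\text{BadDriver} \mid \text{accident}) = \frac{0.10 \times 0.10}{0.10 \times 0.10 + 0.03 \times 0.90} = \frac{0.010}{0.037} \approx 0.27$$

The accident moves you from 10% to about 27%: a real update in favor of “bad driver,” even while you still think the hypothesis is less than 50% probable.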
Proof
The proof of Bayes’ theorem follows from the definition of conditional probability:
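$$P(H_i \mid e) = \frac{P(H_i \wedge e)}{P(e)}, \qquad P(H_i \wedge e) = P(e \mid H_i) \cdot P(H_i)$$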
And from the law of marginal probability:
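$$P(e) = \sum_k P(e \wedge H_k) = \sum_k P(e \mid H_k) \cdot P(H_k)$$

(using the fact that the hypotheses Hk are mutually exclusive and exhaustive).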
Therefore:
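$$P(H_i \mid e) = \frac{P(H_i \wedge e)}{P(e)} = \frac{P(e \mid H_i) \cdot P(H_i)}{\sum_k P(e \mid H_k) \cdot P(H_k)}$$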
QED.
Comments

Is it possible to have a visualisation of the clay allocation scenario?

This experts-with-clay analogy I found EXTREMELY helpful. I appreciate different explanations work for different people, but I really do think this could have come a LOT earlier in the essay.

On “you’re allowed to increase P(BadDriver) a little bit”: no, you’re really not. You’re only allowed to replace P(BadDriver) with P(BadDriver∣HadOneAccident). If you have a second accident, you replace that in turn with P(BadDriver∣HadOneAccident∧HadASecondAccident), which, if you are rational, you might reexamine and update to P(BadDriver∣HadTwoAccidents∧HadQuiteALotOfNearMissesIfWeAreBeingHonest). But my point is: when applying each new piece of evidence, you have to remember the conditions that produced your current probability, or you end up with naive Bayes, and after seeing a few new bookcases you believe in aliens.