# Bayes' rule: Probability form

The formulation of Bayes’ rule you are most likely to see in textbooks runs as follows:

$$\mathbb P(H_i\mid e) = \dfrac{\mathbb P(e\mid H_i) \cdot \mathbb P(H_i)}{\sum_k \mathbb P(e\mid H_k) \cdot \mathbb P(H_k)}$$

Where:

• $$H_i$$ is the hypothesis we’re interested in.

• $$e$$ is the piece of evidence we observed.

• $$\sum_k (\text {expression containing } k)$$ means “add up, for every $$k$$, the value of the (expression containing $$k$$).”

• $$\mathbf H$$ is a set of mutually exclusive and exhaustive hypotheses that includes $$H_i$$ as one of the possibilities, and the expression $$H_k$$ inside the sum ranges over all the possible hypotheses in $$\mathbf H$$.

As a quick example, let’s say there’s a bathtub full of potentially biased coins.

• Coin type 1 is fair, 50% heads / 50% tails. 40% of the coins in the bathtub are type 1.

• Coin type 2 produces 70% heads. 35% of the coins are type 2.

• Coin type 3 produces 20% heads. 25% of the coins are type 3.

We want to know the posterior probability that a randomly drawn coin is of type 2, after flipping the coin once and seeing it produce heads once.

Let $$H_1, H_2, H_3$$ stand for the hypotheses that the coin is of types 1, 2, and 3 respectively. Then using conditional probability notation, we want to know the probability $$\mathbb P(H_2 \mid heads).$$

The probability form of Bayes’ theorem says:

$$\mathbb P(H_2 \mid heads) = \frac{\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)}{\sum_k \mathbb P(heads \mid H_k) \cdot \mathbb P(H_k)}$$

Expanding the sum:

$$\mathbb P(H_2 \mid heads) = \frac{\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)}{[\mathbb P(heads \mid H_1) \cdot \mathbb P(H_1)] + [\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)] + [\mathbb P(heads \mid H_3) \cdot \mathbb P(H_3)]}$$

Computing the actual quantities:

$$\mathbb P(H_2 \mid heads) = \frac{0.70 \cdot 0.35 }{[0.50 \cdot 0.40] + [0.70 \cdot 0.35] + [0.20 \cdot 0.25]} = \frac{0.245}{0.20 + 0.245 + 0.05} = 0.\overline{49}$$

That calculation was big and messy, which is typical: the probability form of Bayes’ theorem is fine for grinding directly through the numbers, but not so good for doing the update in your head.
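The grinding-through-the-numbers above can be checked mechanically. Here is a minimal Python sketch of the coin example (the dictionaries and variable names are illustrative, not from the original):

```python
# Posterior probability for each coin type, after one observation of
# heads, via the probability form of Bayes' rule.

priors = {1: 0.40, 2: 0.35, 3: 0.25}        # P(H_k): fraction of each coin type
likelihoods = {1: 0.50, 2: 0.70, 3: 0.20}   # P(heads | H_k)

# Denominator: total probability of observing heads (marginal probability).
p_heads = sum(likelihoods[k] * priors[k] for k in priors)

posteriors = {k: likelihoods[k] * priors[k] / p_heads for k in priors}

print(p_heads)         # ≈ 0.495
print(posteriors[2])   # ≈ 0.4949..., i.e. 49/99
```

Note that the denominator is computed once and shared by all three hypotheses; only the numerator changes from one posterior to the next.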

# Meaning

We can think of the advice of Bayes’ theorem as saying:

“Think of how much each hypothesis in $$\mathbf H$$ contributed to our expectation of seeing the evidence $$e$$, including both the likelihood of seeing $$e$$ if $$H_k$$ is true, and the prior probability of $$H_k$$. The posterior of $$H_i$$ after seeing $$e$$ is the amount $$H_i$$ contributed to our expectation of seeing $$e,$$ as a fraction of the total expectation of seeing $$e$$ contributed by every hypothesis in $$\mathbf H.$$”

Or to say it at somewhat greater length:

Imagine each hypothesis $$H_1,H_2,H_3\ldots$$ as an expert who has to distribute the probability of their predictions among all possible pieces of evidence. We can imagine this more concretely by visualizing “probability” as a lump of clay.

The total amount of clay is one kilogram (probability $$1$$). Each expert $$H_k$$ has been allocated a fraction $$\mathbb P(H_k)$$ of that kilogram. For example, if $$\mathbb P(H_4)=\frac{1}{5}$$ then expert 4 has been allocated 200 grams of clay.

We’re playing a game with the experts to determine which one is the best predictor.

Each time we’re about to make an observation $$E,$$ each expert has to divide up all their clay among the possible outcomes $$e_1, e_2, \ldots.$$

After we observe that $$E = e_j,$$ we take away all the clay that wasn’t put onto $$e_j.$$ And then our new belief in all the experts is the relative amount of clay that each expert has left.

So to know how much we now believe in expert $$H_4$$ after observing $$e_3,$$ say, we need to know two things: first, the amount of clay that $$H_4$$ put onto $$e_3,$$ and second, the total amount of clay that all experts (including $$H_4$$) put onto $$e_3.$$

In turn, to know that, we need to know how much clay $$H_4$$ started with, and what fraction of its clay $$H_4$$ put onto $$e_3.$$ And similarly, to compute the total clay on $$e_3,$$ we need to know how much clay each expert $$H_k$$ started with, and what fraction of their clay $$H_k$$ put onto $$e_3.$$

So Bayes’ theorem here would say:

$$\mathbb P(H_4 \mid e_3) = \frac{\mathbb P(e_3 \mid H_4) \cdot \mathbb P(H_4)}{\sum_k \mathbb P(e_3 \mid H_k) \cdot \mathbb P(H_k)}$$
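One round of this clay game can be sketched in a few lines of Python (the function and variable names are illustrative, not part of the original exposition); running it on the earlier coin example reproduces the posterior computed there:

```python
# One round of the clay game: each expert H_k starts with clay equal to
# their prior P(H_k), places it on the outcomes in proportion to their
# predicted probabilities, and keeps only the clay on the observed
# outcome. The relative clay remaining is the Bayesian posterior.

def update_clay(clay, predictions, observed):
    """clay: {expert: kg}; predictions: {expert: {outcome: probability}}."""
    remaining = {h: clay[h] * predictions[h][observed] for h in clay}
    total = sum(remaining.values())
    return {h: r / total for h, r in remaining.items()}

# The coin example from earlier: experts bet on "heads" vs "tails".
clay = {"H1": 0.40, "H2": 0.35, "H3": 0.25}
predictions = {
    "H1": {"heads": 0.5, "tails": 0.5},
    "H2": {"heads": 0.7, "tails": 0.3},
    "H3": {"heads": 0.2, "tails": 0.8},
}
posterior = update_clay(clay, predictions, "heads")
# posterior["H2"] ≈ 0.4949, matching the earlier calculation
```

Because the result is renormalized each round, the output of one round can be fed straight back in as the `clay` of the next, which is exactly how repeated observations compound.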

What are the incentives of this game of clay?

On each round, the experts who gain the most are the experts who put the most clay on the observed $$e_j,$$ so if you know for certain that $$e_3$$ is about to be observed, your incentive is to put all your clay on $$e_3.$$

But putting literally all your clay on $$e_3$$ is risky; if $$e_5$$ is observed instead, you lose all your clay and are out of the game. Once an expert’s amount of clay goes all the way to zero, there’s no way for them to recover over any number of future rounds. That hypothesis is done, dead, and removed from the game. (“Falsification,” some people call that.) If you’re not certain that $$e_5$$ is literally impossible, you’d be wiser to put at least a little clay on $$e_5$$ instead. That is to say: if your mind puts some probability on $$e_5,$$ you’d better put some clay there too!

(As it happens, if at the end of the game we score each expert by the logarithm of the amount of clay they have left, then each expert is incentivized to place clay exactly proportionally to their honest probability on each successive round.)
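The parenthetical about logarithmic scoring can be checked numerically. This sketch (the function name and the specific numbers are mine, for illustration) compares the expected log score of an honest report against two dishonest ones, for a binary outcome with true probability 0.7:

```python
import math

# A numeric check (not a proof) that the logarithmic scoring rule
# rewards honesty: if your true probability of an outcome is p, your
# *expected* log score is maximized by betting fraction q = p of your
# clay on that outcome.

def expected_log_score(p, q):
    """Expected log of remaining clay: outcome has true probability p,
    and you place fraction q of your clay on it (1 - q elsewhere)."""
    return p * math.log(q) + (1 - p) * math.log(1 - q)

p = 0.7
honest = expected_log_score(p, 0.7)
assert honest > expected_log_score(p, 0.9)   # overclaiming loses on average
assert honest > expected_log_score(p, 0.5)   # hedging too much also loses
```

Sweeping `q` over a grid would show the expected score peaking exactly at `q = p`; the assertions above just spot-check two deviations.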

It’s an important part of the game that we make the experts put down their clay in advance. If we let the experts put down their clay afterwards, they might be tempted to cheat by putting down all their clay on whichever $$e_j$$ had actually been observed. But since we make the experts put down their clay in advance, they have to divide up their clay among the possible outcomes: to give more clay to $$e_3,$$ that clay has to be taken away from some other outcome, like $$e_5.$$ To put a very high probability on $$e_3$$ and gain a lot of relative credibility if $$e_3$$ is observed, an expert has to stick their neck out and risk losing a lot of credibility if some other outcome like $$e_5$$ happens instead. If we force the experts to make advance predictions, that is!

We can also derive from this game that the question “does evidence $$e_3$$ support hypothesis $$H_4$$?” depends on how well $$H_4$$ predicted $$e_3$$ compared to the competition. It’s not enough for $$H_4$$ to predict $$e_3$$ well if every other hypothesis also predicted $$e_3$$ well—your amazing new theory of physics gets no points for predicting that the sky is blue. $$H_k$$ only goes up in probability when it predicts $$e_j$$ better than the alternatives. And that means we have to ask what the alternative hypotheses predicted, even if we think those hypotheses are false.

If you get in a car accident, and don’t want to relinquish the hypothesis that you’re a great driver, then you can find all sorts of reasons (“the road was slippery! my car freaked out!”) why $$\mathbb P(e \mid GoodDriver)$$ is not too low. But $$\mathbb P(e \mid BadDriver)$$ is also part of the update equation, and the “bad driver” hypothesis better predicts the evidence. Thus, your first impulse, when deciding how to update your beliefs in the face of a car accident, should not be “But my preferred hypothesis allows for this evidence!” It should instead be “Points to the ‘bad driver’ hypothesis for predicting this evidence better than the alternatives!” (And remember, you’re allowed to increase $$\mathbb P(BadDriver)$$ a little bit, while still thinking that it’s less than 50% probable.)

# Proof

The proof of Bayes’ theorem follows from the definition of conditional probability:

$$\mathbb P(X \mid Y) = \frac{\mathbb P(X \wedge Y)}{\mathbb P (Y)}$$

And from the law of marginal probability:

$$\mathbb P(Y) = \sum_k \mathbb P(Y \wedge X_k)$$

Therefore:

$$\mathbb P(H_i \mid e) = \frac{\mathbb P(H_i \wedge e)}{\mathbb P (e)} \tag{defn. conditional prob.}$$

$$\mathbb P(H_i \mid e) = \frac{\mathbb P(e \wedge H_i)}{\sum_k \mathbb P (e \wedge H_k)} \tag {law of marginal prob.}$$

$$\mathbb P(H_i \mid e) = \frac{\mathbb P(e \mid H_i) \cdot \mathbb P(H_i)}{\sum_k \mathbb P (e \mid H_k) \cdot \mathbb P(H_k)} \tag {defn. conditional prob.}$$

QED.
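The derivation can also be checked numerically. This sketch (variable names are mine) builds a small joint distribution from the coin-example numbers, then verifies that the first and last lines of the proof agree:

```python
# Numeric sanity check of the proof: build the joint distribution
# P(H_k ∧ outcome) from the coin example, then confirm that the
# definition of conditional probability and the probability form of
# Bayes' rule give the same posterior.

joint = {  # P(H_k ∧ outcome) = P(outcome | H_k) * P(H_k)
    "H1": {"heads": 0.40 * 0.5, "tails": 0.40 * 0.5},
    "H2": {"heads": 0.35 * 0.7, "tails": 0.35 * 0.3},
    "H3": {"heads": 0.25 * 0.2, "tails": 0.25 * 0.8},
}

# Law of marginal probability: P(heads) = sum over k of P(heads ∧ H_k).
p_heads = sum(joint[h]["heads"] for h in joint)

# Definition of conditional probability: P(H2 | heads) = P(H2 ∧ heads) / P(heads).
direct = joint["H2"]["heads"] / p_heads

# Probability form of Bayes' rule, recovering priors and likelihoods first.
prior = {h: sum(joint[h].values()) for h in joint}
likelihood = {h: joint[h]["heads"] / prior[h] for h in joint}
bayes = (likelihood["H2"] * prior["H2"]
         / sum(likelihood[h] * prior[h] for h in joint))

assert abs(direct - bayes) < 1e-9  # the two computations agree
```

The check mirrors the proof exactly: the `direct` line is the definition of conditional probability plus marginalization, and the `bayes` line is the final rewritten form.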

