Bayes' rule: Probability form

The for­mu­la­tion of Bayes’ rule you are most likely to see in text­books runs as fol­lows:

$$\mathbb P(H_i\mid e) = \dfrac{\mathbb P(e\mid H_i) \cdot \mathbb P(H_i)}{\sum_k \mathbb P(e\mid H_k) \cdot \mathbb P(H_k)}$$


  • \(H_i\) is the hy­poth­e­sis we’re in­ter­ested in.

  • \(e\) is the piece of ev­i­dence we ob­served.

  • \(\sum_k (\text {expression containing } k)\) means “Evaluate the (expression containing \(k\)) for every \(k\), and add up the results.”

  • \(\mathbf H\) is a set of mu­tu­ally ex­clu­sive and ex­haus­tive hy­pothe­ses that in­clude \(H_i\) as one of the pos­si­bil­ities, and the ex­pres­sion \(H_k\) in­side the sum ranges over all the pos­si­ble hy­pothe­ses in \(\mathbf H\).

As a quick ex­am­ple, let’s say there’s a bath­tub full of po­ten­tially bi­ased coins.

  • Coin type 1 is fair, 50% heads / 50% tails. 40% of the coins in the bathtub are type 1.

  • Coin type 2 pro­duces 70% heads. 35% of the coins are type 2.

  • Coin type 3 pro­duces 20% heads. 25% of the coins are type 3.

We want to know the posterior probability that a randomly drawn coin is of type 2, after flipping it once and seeing it come up heads.

Let \(H_1, H_2, H_3\) stand for the hy­pothe­ses that the coin is of types 1, 2, and 3 re­spec­tively. Then us­ing con­di­tional prob­a­bil­ity no­ta­tion, we want to know the prob­a­bil­ity \(\mathbb P(H_2 \mid heads).\)

The prob­a­bil­ity form of Bayes’ the­o­rem says:

$$\mathbb P(H_2 \mid heads) = \frac{\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)}{\sum_k \mathbb P(heads \mid H_k) \cdot \mathbb P(H_k)}$$

Ex­pand­ing the sum:

$$\mathbb P(H_2 \mid heads) = \frac{\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)}{[\mathbb P(heads \mid H_1) \cdot \mathbb P(H_1)] + [\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)] + [\mathbb P(heads \mid H_3) \cdot \mathbb P(H_3)]}$$

Com­put­ing the ac­tual quan­tities:

$$\mathbb P(H_2 \mid heads) = \frac{0.70 \cdot 0.35 }{[0.50 \cdot 0.40] + [0.70 \cdot 0.35] + [0.20 \cdot 0.25]} = \frac{0.245}{0.20 + 0.245 + 0.05} = 0.\overline{49}$$

This calculation was big and messy, which is fine: the probability form of Bayes’ theorem is okay for directly grinding through the numbers, but not so good for doing things in your head.
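Grinding through the numbers is also the kind of thing that is easy to hand to a computer. Here is a minimal Python sketch of the coin calculation; the dictionary keys and the function name `posterior` are just illustrative choices, not anything fixed by the formula.

```python
# Priors and likelihoods taken from the bathtub example above.
priors = {"type1": 0.40, "type2": 0.35, "type3": 0.25}    # P(H_k)
p_heads = {"type1": 0.50, "type2": 0.70, "type3": 0.20}   # P(heads | H_k)


def posterior(priors, likelihoods):
    """Return P(H_k | e) for every k, given P(H_k) and P(e | H_k)."""
    # Numerators of Bayes' rule: P(e | H_k) * P(H_k) for each hypothesis.
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Denominator: the sum of those terms over all hypotheses.
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


post = posterior(priors, p_heads)
print(round(post["type2"], 4))  # 0.4949
```

The same function returns the posteriors for coin types 1 and 3 as well, since all three share one denominator.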


We can think of the ad­vice of Bayes’ the­o­rem as say­ing:

“Think of how much each hypothesis in \(\mathbf H\) contributed to our expectation of seeing the evidence \(e\), including both the likelihood of seeing \(e\) if \(H_k\) is true, and the prior probability of \(H_k\). The posterior of \(H_i\) after seeing \(e\) is the amount \(H_i\) contributed to our expectation of seeing \(e\), as a fraction of the total expectation of seeing \(e\) contributed by every hypothesis in \(\mathbf H\).”

Or to say it at some­what greater length:

Imag­ine each hy­poth­e­sis \(H_1,H_2,H_3\ldots\) as an ex­pert who has to dis­tribute the prob­a­bil­ity of their pre­dic­tions among all pos­si­ble pieces of ev­i­dence. We can imag­ine this more con­cretely by vi­su­al­iz­ing “prob­a­bil­ity” as a lump of clay.

The to­tal amount of clay is one kilo­gram (prob­a­bil­ity \(1\)). Each ex­pert \(H_k\) has been al­lo­cated a frac­tion \(\mathbb P(H_k)\) of that kilo­gram. For ex­am­ple, if \(\mathbb P(H_4)=\frac{1}{5}\) then ex­pert 4 has been al­lo­cated 200 grams of clay.

We’re play­ing a game with the ex­perts to de­ter­mine which one is the best pre­dic­tor.

Each time we’re about to make an ob­ser­va­tion \(E,\) each ex­pert has to di­vide up all their clay among the pos­si­ble out­comes \(e_1, e_2, \ldots.\)

After we ob­serve that \(E = e_j,\) we take away all the clay that wasn’t put onto \(e_j.\) And then our new be­lief in all the ex­perts is the rel­a­tive amount of clay that each ex­pert has left.

So to know how much we now be­lieve in ex­pert \(H_4\) af­ter ob­serv­ing \(e_3,\) say, we need to know two things: First, the amount of clay that \(H_4\) put onto \(e_3,\) and sec­ond, the to­tal amount of clay that all ex­perts (in­clud­ing \(H_4\)) put onto \(e_3.\)

In turn, to know that, we need to know how much clay \(H_4\) started with, and what frac­tion of its clay \(H_4\) put onto \(e_3.\) And similarly, to com­pute the to­tal clay on \(e_3,\) we need to know how much clay each ex­pert \(H_k\) started with, and what frac­tion of their clay \(H_k\) put onto \(e_3.\)

So Bayes’ the­o­rem here would say:

$$\mathbb P(H_4 \mid e_3) = \frac{\mathbb P(e_3 \mid H_4) \cdot \mathbb P(H_4)}{\sum_k \mathbb P(e_3 \mid H_k) \cdot \mathbb P(H_k)}$$
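One round of the clay game can be sketched in a few lines of Python. Only the 200 grams for expert 4 comes from the text above; the other starting amounts and every entry of `bets` are made-up numbers for illustration, and the function names are hypothetical.

```python
# Starting clay in grams (prior probability times 1000 g).
# H4's 200 g matches P(H_4) = 1/5; the rest is invented for this sketch.
clay = {"H1": 400.0, "H2": 300.0, "H3": 100.0, "H4": 200.0}

# Fraction of its clay each expert places on each outcome: P(e_j | H_k).
# Hypothetical values; each expert's fractions add up to 1.
bets = {
    "H1": {"e1": 0.5, "e2": 0.4, "e3": 0.1},
    "H2": {"e1": 0.3, "e2": 0.5, "e3": 0.2},
    "H3": {"e1": 0.2, "e2": 0.2, "e3": 0.6},
    "H4": {"e1": 0.1, "e2": 0.1, "e3": 0.8},
}


def play_round(clay, bets, observed):
    """Take away all clay that was not placed on the observed outcome."""
    return {h: clay[h] * bets[h][observed] for h in clay}


def relative_clay(clay):
    """Each expert's share of the clay that is left: the new belief."""
    total = sum(clay.values())
    return {h: amount / total for h, amount in clay.items()}


left = play_round(clay, bets, "e3")
new_beliefs = relative_clay(left)
print(left)
print(new_beliefs)
```

With these numbers, \(H_4\) keeps 160 of the 320 grams left on \(e_3,\) so its share rises from one fifth to about one half.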

What are the in­cen­tives of this game of clay?

On each round, the ex­perts who gain the most are the ex­perts who put the most clay on the ob­served \(e_j,\) so if you know for cer­tain that \(e_3\) is about to be ob­served, your in­cen­tive is to put all your clay on \(e_3.\)

But putting literally all your clay on \(e_3\) is risky; if \(e_5\) is observed instead, you lose all your clay and are out of the game. Once an expert’s amount of clay goes all the way to zero, there’s no way for them to recover over any number of future rounds. That hypothesis is done, dead, and removed from the game. (“Falsification,” some people call that.) If you’re not certain that \(e_5\) is literally impossible, you’d be wiser to put at least a little clay on \(e_5\) as well. That is to say: if your mind puts some probability on \(e_5,\) you’d better put some clay there too!
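To see why zero clay is permanent while a sliver of clay is not, here is a small Python sketch of repeated rounds. The three experts and all of their numbers are hypothetical: "all_in" put nothing on \(e_5,\) "hedger" put 1% there, and "rival" put 50% there.

```python
def update(beliefs, likelihoods):
    """One round: multiply each share by the fraction placed on the
    observed outcome, then renormalize so the shares sum to 1."""
    joint = {h: beliefs[h] * likelihoods[h] for h in beliefs}
    total = sum(joint.values())
    return {h: value / total for h, value in joint.items()}


# Hypothetical starting shares.
beliefs = {"rival": 0.4, "hedger": 0.3, "all_in": 0.3}

# Round 1: e5 is observed. These are the fractions each expert put on e5.
beliefs = update(beliefs, {"rival": 0.5, "hedger": 0.01, "all_in": 0.0})

# Ten later rounds in which all_in predicts perfectly, hedger predicts
# well, and rival predicts poorly.
for _ in range(10):
    beliefs = update(beliefs, {"rival": 0.3, "hedger": 0.9, "all_in": 1.0})

print(beliefs["all_in"])       # 0.0
print(beliefs["hedger"] > 0.99)  # True
```

Zero times anything is still zero, so "all_in" never comes back no matter how good its later predictions are, whereas "hedger" recovers from a share of about 1.5% after the first round.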

(As it hap­pens, if at the end of the game we score each ex­pert by the log­a­r­ithm of the amount of clay they have left, then each ex­pert is in­cen­tivized to place clay ex­actly pro­por­tion­ally to their hon­est prob­a­bil­ity on each suc­ces­sive round.)
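A quick numerical check of that claim, for a single round with two outcomes. This is only a sketch under stated assumptions: the honest probability `p = 0.7` is a made-up value, the expert's choice `q` is searched over a coarse grid rather than optimized exactly, and the comparison with a linear score is an addition for contrast.

```python
import math

p = 0.7  # hypothetical: the expert's honest probability of outcome e1


def expected_log_clay(q, p):
    """Expected log of the fraction of clay kept, if the expert puts a
    fraction q on e1 and 1 - q on e2, and e1 really has probability p."""
    return p * math.log(q) + (1 - p) * math.log(1 - q)


def expected_clay(q, p):
    """Expected fraction of clay kept (a linear score, for comparison)."""
    return p * q + (1 - p) * (1 - q)


# Candidate fractions 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]

best_log = max(candidates, key=lambda q: expected_log_clay(q, p))
best_linear = max(candidates, key=lambda q: expected_clay(q, p))

print(best_log)     # 0.7
print(best_linear)  # 0.99
```

Under the logarithmic score the best bet matches the honest probability, while under the linear score the expert is pushed toward the edge of the grid, that is, toward putting nearly everything on the more likely outcome.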

It’s an im­por­tant part of the game that we make the ex­perts put down their clay in ad­vance. If we let the ex­perts put down their clay af­ter­wards, they might be tempted to cheat by putting down all their clay on whichever \(e_j\) had ac­tu­ally been ob­served. But since we make the ex­perts put down their clay in ad­vance, they have to di­vide up their clay among the pos­si­ble out­comes: to give more clay to \(e_3,\) that clay has to be taken away from some other out­come, like \(e_5.\) To put a very high prob­a­bil­ity on \(e_3\) and gain a lot of rel­a­tive cred­i­bil­ity if \(e_3\) is ob­served, an ex­pert has to stick their neck out and risk los­ing a lot of cred­i­bil­ity if some other out­come like \(e_5\) hap­pens in­stead. If we force the ex­perts to make ad­vance pre­dic­tions, that is!

We can also de­rive from this game that the ques­tion “does ev­i­dence \(e_3\) sup­port hy­poth­e­sis \(H_4\)?” de­pends on how well \(H_4\) pre­dicted \(e_3\) com­pared to the com­pe­ti­tion. It’s not enough for \(H_4\) to pre­dict \(e_3\) well if ev­ery other hy­poth­e­sis also pre­dicted \(e_3\) well—your amaz­ing new the­ory of physics gets no points for pre­dict­ing that the sky is blue. \(H_k\) only goes up in prob­a­bil­ity when it pre­dicts \(e_j\) bet­ter than the al­ter­na­tives. And that means we have to ask what the al­ter­na­tive hy­pothe­ses pre­dicted, even if we think those hy­pothe­ses are false.
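The "sky is blue" point can be checked directly: when every hypothesis assigns the same likelihood to the evidence, the posterior equals the prior. The priors and likelihoods in this Python sketch are invented for illustration.

```python
def posterior(priors, likelihoods):
    """P(H_k | e) for every k, from P(H_k) and P(e | H_k)."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


# Hypothetical priors for a new theory and the established one.
priors = {"new_physics": 0.01, "standard_physics": 0.99}

# Both theories predict a blue sky equally well: no relative update.
sky_is_blue = {"new_physics": 0.99, "standard_physics": 0.99}
after_sky = posterior(priors, sky_is_blue)

# A made-up observation that the new theory predicts much better.
surprising_result = {"new_physics": 0.9, "standard_physics": 0.1}
after_surprise = posterior(priors, surprising_result)

print(after_sky["new_physics"])
print(after_surprise["new_physics"])
```

The first posterior stays at the prior of 1% (up to floating-point rounding), and the second rises to roughly 8%, because only the second observation separates the hypotheses.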

If you get in a car ac­ci­dent, and don’t want to re­lin­quish the hy­poth­e­sis that you’re a great driver, then you can find all sorts of rea­sons (“the road was slip­pery! my car freaked out!”) why \(\mathbb P(e \mid GoodDriver)\) is not too low. But \(\mathbb P(e \mid BadDriver)\) is also part of the up­date equa­tion, and the “bad driver” hy­poth­e­sis bet­ter pre­dicts the ev­i­dence. Thus, your first im­pulse, when de­cid­ing how to up­date your be­liefs in the face of a car ac­ci­dent, should not be “But my preferred hy­poth­e­sis al­lows for this ev­i­dence!” It should in­stead be “Points to the ‘bad driver’ hy­poth­e­sis for pre­dict­ing this ev­i­dence bet­ter than the al­ter­na­tives!” (And re­mem­ber, you’re al­lowed to in­crease \(\mathbb P(BadDriver)\) a lit­tle bit, while still think­ing that it’s less than 50% prob­a­ble.)
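To put some numbers on the car accident, here is a Python sketch. Every figure in it is an assumption chosen purely for illustration: a 10% prior on being a bad driver, and an accident being four times as likely for a bad driver as for a good one.

```python
# All numbers below are hypothetical, chosen only to illustrate the update.
prior = {"GoodDriver": 0.90, "BadDriver": 0.10}

# P(accident | H): "not too low" for a good driver, but higher for a bad one.
p_accident = {"GoodDriver": 0.05, "BadDriver": 0.20}

# Bayes' rule in probability form.
joint = {h: p_accident[h] * prior[h] for h in prior}
total = sum(joint.values())
post = {h: joint[h] / total for h in joint}

print(round(post["BadDriver"], 3))  # 0.308
```

With these assumed numbers, \(\mathbb P(BadDriver)\) goes from 10% to about 31%: a real increase, yet still less than 50%.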


The proof of Bayes’ the­o­rem fol­lows from the defi­ni­tion of con­di­tional prob­a­bil­ity:

$$\mathbb P(X \mid Y) = \frac{\mathbb P(X \wedge Y)}{\mathbb P (Y)}$$

And from the law of marginal probability, where the \(X_k\) are mutually exclusive and exhaustive:

$$\mathbb P(Y) = \sum_k \mathbb P(Y \wedge X_k)$$

Applying these to \(H_i\) and \(e\) in turn:

$$ \mathbb P(H_i \mid e) = \frac{\mathbb P(H_i \wedge e)}{\mathbb P (e)} \tag{defn. conditional prob.} $$

$$ \mathbb P(H_i \mid e) = \frac{\mathbb P(e \wedge H_i)}{\sum_k \mathbb P (e \wedge H_k)} \tag {law of marginal prob.} $$

$$ \mathbb P(H_i \mid e) = \frac{\mathbb P(e \mid H_i) \cdot \mathbb P(H_i)}{\sum_k \mathbb P (e \mid H_k) \cdot \mathbb P(H_k)} \tag {defn. conditional prob.} $$
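The steps of the proof can be followed numerically on the coin example. This Python sketch uses exact fractions; note that the joint table is itself built from the example's priors and likelihoods, so it is a consistency check of the algebra rather than an independent proof.

```python
from fractions import Fraction as F

# Numbers from the bathtub example, as exact fractions.
priors = {"H1": F(40, 100), "H2": F(35, 100), "H3": F(25, 100)}
p_heads = {"H1": F(50, 100), "H2": F(70, 100), "H3": F(20, 100)}

# Joint probabilities P(H_k and e) for e in {heads, tails}.
joint = {}
for h in priors:
    joint[(h, "heads")] = priors[h] * p_heads[h]
    joint[(h, "tails")] = priors[h] * (1 - p_heads[h])

# Law of marginal probability: P(heads) = sum_k P(heads and H_k).
p_e = sum(joint[(h, "heads")] for h in priors)

# Definition of conditional probability: P(H_2 | heads).
via_definition = joint[("H2", "heads")] / p_e

# Probability form of Bayes' rule.
via_bayes = (p_heads["H2"] * priors["H2"]) / sum(
    p_heads[h] * priors[h] for h in priors
)

print(via_definition, via_bayes)  # 49/99 49/99
```

Both routes give \(\frac{49}{99} = 0.\overline{49},\) matching the worked example.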



