Bayes' rule: Functional form

Bayes’ rule gen­er­al­izes to con­tin­u­ous func­tions, and states, “The pos­te­rior prob­a­bil­ity den­sity is pro­por­tional to the like­li­hood func­tion times the prior prob­a­bil­ity den­sity.”

$$\mathbb P(H_x\mid e) \propto \mathcal L_e(H_x) \cdot \mathbb P(H_x)$$
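Written with its normalizing constant made explicit, the same statement reads as follows. Here \(x'\) is only a dummy integration variable introduced for this note, ranging over the same hypotheses as \(x\):

$$\mathbb P(H_x\mid e) = \frac{\mathcal L_e(H_x) \cdot \mathbb P(H_x)}{\int \mathcal L_e(H_{x'}) \cdot \mathbb P(H_{x'}) \, \mathrm dx'}$$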


Suppose we have a biased coin with an unknown bias \(b\) between 0 and 1 of coming up heads on each individual coinflip. Since the bias \(b\) is a continuous variable, we express our beliefs about the coin’s bias using a probability density function \(\mathbb P(b),\) where \(\mathbb P(b)\cdot \mathrm{d}b\) is the probability that the bias is in the interval \([b, b + \mathrm{d}b]\) for \(\mathrm db\) small. (Specifically, the probability that the bias is in the interval \([a, c]\) is \(\int_a^c \mathbb P(b) \, \mathrm db.\))

By hy­poth­e­sis, we start out com­pletely ig­no­rant of the bias \(b,\) mean­ing that all ini­tial val­ues for \(b\) are equally likely. Thus, \(\mathbb P(b) = 1\) for all val­ues of \(b,\) which means that \(\mathbb P(b)\, \mathrm db = \mathrm db\) (e.g., the chance of \(b\) be­ing found in the in­ter­val from 0.72 to 0.76 is 0.04).
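As a quick numerical check of the uniform prior, the sketch below approximates the probability of an interval with a midpoint-rule sum. The names `uniform_prior` and `interval_probability` are illustrative choices for this example, not part of any library.

```python
def uniform_prior(b: float) -> float:
    """Density of the uniform prior on [0, 1]: P(b) = 1 inside, 0 outside."""
    return 1.0 if 0.0 <= b <= 1.0 else 0.0


def interval_probability(density, lo: float, hi: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of `density` over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(density(lo + (i + 0.5) * width) for i in range(steps)) * width


# Probability that the bias lies between 0.72 and 0.76 under the uniform prior.
p = interval_probability(uniform_prior, 0.72, 0.76)
print(round(p, 6))
```

Because the density is constant, the result is just the width of the interval, matching the 0.04 quoted above.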

plot y = 1 + x * 0, x = 0 to 1

We then flip the coin, and ob­serve it to come up tails. This is our first piece of ev­i­dence. The like­li­hood \(\mathcal L_{t_1}(b)\) of ob­ser­va­tion \(t_1\) given bias \(b\) is a con­tin­u­ous func­tion of \(b\), equal to 0.4 if \(b = 0.6,\) 0.67 if \(b = 0.33,\) and so on (be­cause \(b\) is the prob­a­bil­ity of heads and the ob­ser­va­tion was tails).

Graph­ing the like­li­hood func­tion \(\mathcal L_{t_1}(b)\) as it takes in the fixed ev­i­dence \(t_1\) and ranges over vari­able \(b,\) we ob­tain the straight­for­ward graph \(\mathcal L_{t_1}(b) = 1 - b.\)

plot y = 1 - x, x = 0 to 1

If we mul­ti­ply the like­li­hood func­tion by the prior prob­a­bil­ity func­tion as it ranges over \(b\), we ob­tain a rel­a­tive prob­a­bil­ity func­tion on the pos­te­rior, \(\mathbb O(b\mid t_1) = \mathcal L_{t_1}(b) \cdot \mathbb P(b) = 1 - b,\) which gives us the same graph again:

plot y = 1 - x, x = 0 to 1

But this can’t be our posterior probability function because it doesn’t integrate to 1: \(\int_0^1 (1 - b) \, \mathrm db = \frac{1}{2}.\) (The area of a triangle is half the base times the height.) Normalizing this relative probability function will give us the posterior probability function:

\(\mathbb P(b \mid t_1) = \dfrac{\mathbb O(b \mid t_1)}{\int_0^1 \mathbb O(b \mid t_1) \, \mathrm db} = 2 \cdot (1 - b)\)

plot y = 2(1 - x), x = 0 to 1

The shapes are the same; only the y-axis labels have changed to reflect the different heights of the pre-normalized and normalized functions.
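The multiply-then-normalize step for the first tails can be sketched numerically. This is a minimal illustration that assumes a simple midpoint rule is accurate enough; `integrate`, `unnormalized`, and `posterior` are names made up for the example.

```python
def prior(b: float) -> float:
    """Uniform prior density on [0, 1]."""
    return 1.0


def likelihood_tails(b: float) -> float:
    """Probability of observing tails if the heads-bias is b."""
    return 1.0 - b


def integrate(f, lo: float = 0.0, hi: float = 1.0, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def unnormalized(b: float) -> float:
    """Relative posterior density: likelihood times prior."""
    return likelihood_tails(b) * prior(b)


z = integrate(unnormalized)  # area under 1 - b, approximately 1/2


def posterior(b: float) -> float:
    """Normalized posterior density, approximately 2 * (1 - b)."""
    return unnormalized(b) / z


print(round(z, 6), round(posterior(0.25), 6))
```

Dividing by `z` rescales the curve without changing its shape, which is why the two graphs above look identical apart from their heights.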

Suppose we now flip the coin another two times, and it comes up heads then tails. We’ll denote this piece of evidence \(h_2t_3.\) Although these two coin tosses pull our beliefs about \(b\) in opposite directions, they don’t cancel out. Far from it! In fact, one value of \(b\) (“the coin is always tails”) is completely eliminated by this evidence, and many extreme values of \(b\) (“almost always heads” and “almost always tails”) are hit badly. That is, while the heads and the tails pull our beliefs in opposite directions, they don’t pull with the same strength on all possible values of \(b.\)
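To see how unevenly the pair of flips treats different biases, the sketch below tabulates the likelihood of "heads then tails" for a few values of \(b\). The function name and the sample values are chosen only for illustration. Note that \(b = 1\) also gets likelihood zero here, but that value had already been ruled out by the first tails.

```python
def likelihood_heads_then_tails(b: float) -> float:
    """Probability of observing heads then tails if the heads-bias is b."""
    return b * (1.0 - b)


# Extreme biases are penalized heavily; b = 0.5 is favored most.
for b in (0.0, 0.01, 0.25, 0.5, 0.75, 0.99, 1.0):
    print(f"b = {b:<4}  likelihood = {likelihood_heads_then_tails(b):.4f}")
```

The likelihood peaks at \(b = 0.5\) and falls to zero at both ends, so moderate biases lose little relative weight while extreme ones lose almost all of it.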

We mul­ti­ply the old belief

plot y = 2(1 - x), x = 0 to 1

by the additional pieces of evidence

plot y = x(1 - x), x = 0 to 1


and ob­tain the pos­te­rior rel­a­tive density

plot y = 2(1 - x)x(1 - x), x = 0 to 1

which is pro­por­tional to the nor­mal­ized pos­te­rior probability

plot y = 12(1 - x)x(1 - x), x = 0 to 1
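Because every function in this example is a polynomial in \(b\), the update can also be carried out exactly. The sketch below represents a polynomial as a list of coefficients (constant term first) and uses Python's `fractions.Fraction` for exact arithmetic. The helper names `poly_mul`, `integral_0_1`, and `normalize` are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (constant term first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, c in enumerate(q):
            out[i + j] += a * c
    return out


def integral_0_1(p):
    """Exact integral over [0, 1]: the coefficient of b^k contributes c / (k + 1)."""
    return sum(c / (k + 1) for k, c in enumerate(p))


def normalize(p):
    """Rescale a non-negative polynomial so that it integrates to 1 over [0, 1]."""
    z = integral_0_1(p)
    return [c / z for c in p]


prior = [Fraction(1)]                 # P(b) = 1
tails = [Fraction(1), Fraction(-1)]   # likelihood of tails: 1 - b
heads = [Fraction(0), Fraction(1)]    # likelihood of heads: b

# Sequential update: first t_1, then h_2 t_3 applied to the old posterior.
after_t1 = normalize(poly_mul(prior, tails))
after_all = normalize(poly_mul(poly_mul(after_t1, heads), tails))

# Batch update: all three flips applied to the prior at once.
batch = normalize(poly_mul(poly_mul(poly_mul(prior, tails), heads), tails))

print(after_t1)   # coefficients of 2 - 2b
print(after_all)  # coefficients of 12b - 24b^2 + 12b^3
```

The sequential and batch routes give the same coefficients, since normalizing between updates only rescales the function by a constant.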

Writ­ing out the whole op­er­a­tion from scratch:

$$\mathbb P(b \mid t_1h_2t_3) = \frac{\mathcal L_{t_1h_2t_3}(b) \cdot \mathbb P(b)}{\mathbb P(t_1h_2t_3)} = \frac{(1 - b) \cdot b \cdot (1 - b) \cdot 1}{\int_0^1 (1 - b) \cdot b \cdot (1 - b) \cdot 1 \, \mathrm{d}b} = {12\cdot b(1 - b)^2}$$
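The normalizing integral in the denominator can be evaluated by expanding the polynomial:

$$\int_0^1 b(1 - b)^2 \, \mathrm db = \int_0^1 \left(b - 2b^2 + b^3\right) \mathrm db = \frac{1}{2} - \frac{2}{3} + \frac{1}{4} = \frac{1}{12}$$

Dividing by \(\frac{1}{12}\) is what produces the factor of 12 in the posterior.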

Note that it’s okay for a pos­te­rior prob­a­bil­ity den­sity to be greater than 1, so long as the to­tal prob­a­bil­ity mass isn’t greater than 1. If there’s prob­a­bil­ity den­sity 1.2 over an in­ter­val of 0.1, that’s only a prob­a­bil­ity of 0.12 for the true value to be found in that in­ter­val.
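The posterior from this example illustrates the point. The sketch below evaluates the density \(12 \cdot b(1 - b)^2\) at its peak and computes the exact probability mass on an interval, using the antiderivative \(6b^2 - 8b^3 + 3b^4\). The function names and the interval from 0.3 to 0.4 are choices made for this example.

```python
from fractions import Fraction


def posterior_density(b: Fraction) -> Fraction:
    """Posterior density after tails, heads, tails: 12 * b * (1 - b)^2."""
    return 12 * b * (1 - b) ** 2


def posterior_cdf(b: Fraction) -> Fraction:
    """Antiderivative of the density with value 0 at b = 0: 6b^2 - 8b^3 + 3b^4."""
    return 6 * b**2 - 8 * b**3 + 3 * b**4


# The density peaks at b = 1/3, where it exceeds 1.
peak = posterior_density(Fraction(1, 3))

# Probability mass on the interval [0.3, 0.4], which contains the peak.
mass = posterior_cdf(Fraction(2, 5)) - posterior_cdf(Fraction(3, 10))

print(float(peak), float(mass), posterior_cdf(Fraction(1)))
```

The density is above 1 throughout that interval, yet the interval is narrow enough that the probability it holds stays well below 1, and the total mass over \([0, 1]\) is exactly 1.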

Thus, intuitively, Bayes’ rule “just works” when calculating the posterior probability density from the prior probability density function and the (continuous) likelihood function. A proof is beyond the scope of this guide; refer to Proof of Bayes’ rule in the continuous case.

