# Bayes' rule: Functional form

Bayes’ rule generalizes to continuous variables, and states, “The posterior probability density is proportional to the likelihood function times the prior probability density.”

$$\mathbb P(H_x\mid e) \propto \mathcal L_e(H_x) \cdot \mathbb P(H_x)$$

## Example

Suppose we have a biased coin with an unknown bias $$b$$ between 0 and 1 of coming up heads on each individual coinflip. Since the bias $$b$$ is a continuous variable, we express our beliefs about the coin’s bias using a probability density function $$\mathbb P(b),$$ where $$\mathbb P(b)\cdot \mathrm{d}b$$ is the probability that $$b$$ is in the interval $$[b, b + \mathrm{d}b]$$ for $$\mathrm db$$ small. (Specifically, the probability that $$b$$ is in the interval $$[x, y]$$ is $$\int_x^y \mathbb P(b) \, \mathrm db.$$)

By hypothesis, we start out completely ignorant of the bias $$b,$$ meaning that all initial values for $$b$$ are equally likely. Thus, $$\mathbb P(b) = 1$$ for all values of $$b,$$ which means that $$\mathbb P(b)\, \mathrm db = \mathrm db$$ (e.g., the chance of $$b$$ being found in the interval from 0.72 to 0.76 is 0.04).

We then flip the coin, and observe it to come up tails. This is our first piece of evidence. The likelihood $$\mathcal L_{t_1}(b)$$ of observation $$t_1$$ given bias $$b$$ is a continuous function of $$b$$, equal to 0.4 if $$b = 0.6,$$ 0.67 if $$b = 0.33,$$ and so on (because $$b$$ is the probability of heads and the observation was tails).

Graphing the likelihood function $$\mathcal L_{t_1}(b)$$ as it takes in the fixed evidence $$t_1$$ and ranges over variable $$b,$$ we obtain the straightforward graph $$\mathcal L_{t_1}(b) = 1 - b.$$
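As a quick numeric sketch (the function name `likelihood_tails` is ours, not from the text), this likelihood is easy to compute directly:

```python
# Likelihood of observing tails on a single flip, as a function of the bias b.
# b is the probability of heads, so tails occurs with probability 1 - b.
def likelihood_tails(b):
    return 1 - b

# Matches the values quoted above (up to floating-point rounding):
print(likelihood_tails(0.6))   # ≈ 0.4
print(likelihood_tails(0.33))  # ≈ 0.67
```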

If we multiply the likelihood function by the prior probability function as it ranges over $$b$$, we obtain a relative probability function on the posterior, $$\mathbb O(b\mid t_1) = \mathcal L_{t_1}(b) \cdot \mathbb P(b) = 1 - b,$$ which traces out the same graph again.

But this can’t be our posterior probability function, because it doesn’t integrate to 1: $$\int_0^1 (1 - b) \, \mathrm db = \frac{1}{2}.$$ (The area under a triangle is half the base times the height.) Normalizing this relative probability function will give us the posterior probability function:

$$\mathbb P(b \mid t_1) = \dfrac{\mathbb O(b \mid t_1)}{\int_0^1 \mathbb O(b \mid t_1) \, \mathrm db} = 2 \cdot (1 - b)$$
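We can check this normalization numerically. The following is an illustrative sketch of ours (assuming NumPy and a discretized grid of biases, not part of the original text):

```python
import numpy as np

b = np.linspace(0, 1, 10001)      # grid of candidate biases
relative = 1 - b                  # unnormalized posterior after one tails
Z = np.trapz(relative, b)         # normalizing constant; should be 1/2
posterior = relative / Z          # normalized density, 2 * (1 - b)

print(Z)                          # ≈ 0.5
print(np.trapz(posterior, b))     # ≈ 1.0, so it is a proper density
```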

The shapes are the same; only the y-axis labels have changed to reflect the different heights of the pre-normalized and normalized functions.

Suppose we now flip the coin another two times, and it comes up heads then tails. We’ll denote this piece of evidence $$h_2t_3.$$ Although these two coin tosses pull our beliefs about $$b$$ in opposite directions, they don’t cancel out; far from it! In fact, one value of $$b$$ (“the coin is always tails”) is completely eliminated by this evidence, and many extreme values of $$b$$ (“almost always heads” and “almost always tails”) are hit badly. That is, while the heads and the tails pull our beliefs in opposite directions, they don’t pull with the same strength on all possible values of $$b.$$

We multiply the old belief $$\mathbb P(b \mid t_1) = 2 \cdot (1 - b)$$ by the additional pieces of evidence $$\mathcal L_{h_2}(b) = b$$ and $$\mathcal L_{t_3}(b) = 1 - b,$$ and obtain the posterior relative density $$2 \cdot b \cdot (1 - b)^2,$$ which is proportional to the normalized posterior probability $$12 \cdot b \cdot (1 - b)^2.$$

Writing out the whole operation from scratch:

$$\mathbb P(b \mid t_1h_2t_3) = \frac{\mathcal L_{t_1h_2t_3}(b) \cdot \mathbb P(b)}{\mathbb P(t_1h_2t_3)} = \frac{(1 - b) \cdot b \cdot (1 - b) \cdot 1}{\int_0^1 (1 - b) \cdot b \cdot (1 - b) \cdot 1 \, \mathrm{d}b} = 12 \cdot b \cdot (1 - b)^2$$
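As a sanity check on this arithmetic, a small grid-based sketch of ours (assuming NumPy; not part of the original text) recovers the same closed form numerically:

```python
import numpy as np

b = np.linspace(0, 1, 10001)               # grid of candidate biases
prior = np.ones_like(b)                    # uniform prior density P(b) = 1
relative = (1 - b) * b * (1 - b) * prior   # likelihoods of tails, heads, tails
posterior = relative / np.trapz(relative, b)

closed_form = 12 * b * (1 - b) ** 2
print(np.max(np.abs(posterior - closed_form)))  # tiny discretization error
```

The normalizing integral comes out to about 1/12, which is where the factor of 12 in the closed form comes from.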

Note that it’s okay for a posterior probability density to be greater than 1, so long as the total probability mass isn’t greater than 1. If there’s probability density 1.2 over an interval of 0.1, that’s only a probability of 0.12 for the true value to be found in that interval.

Thus, intuitively, Bayes’ rule “just works” when calculating the posterior probability density from the prior probability density function and the (continuous) likelihood function. A proof is beyond the scope of this guide; refer to Proof of Bayes’ rule in the continuous case.
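To illustrate the “just works” point, here is a generic update sketch of ours (the function name `bayes_update` and the NumPy grid discretization are assumptions, not from the text): each observation multiplies the current density by its likelihood and renormalizes.

```python
import numpy as np

def bayes_update(grid, density, likelihood):
    """Posterior density is proportional to likelihood times prior density,
    renormalized so it integrates to 1 over the grid."""
    relative = likelihood(grid) * density
    return relative / np.trapz(relative, grid)

b = np.linspace(0, 1, 10001)
density = np.ones_like(b)          # uniform prior over the bias
for flip in "THT":                 # the three observations t1, h2, t3
    if flip == "T":
        density = bayes_update(b, density, lambda g: 1 - g)
    else:
        density = bayes_update(b, density, lambda g: g)

# The result matches the closed form 12 * b * (1 - b)**2 derived above.
print(np.trapz(density, b))        # ≈ 1.0
```

Updating flip by flip and normalizing only once at the end give the same posterior, which is one way to see why the order of the evidence doesn’t matter.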

Parents:

• Bayes' rule

Bayes’ rule is the core theorem of probability theory saying how to revise our beliefs when we make a new observation.

• If you’re going to start using probability density functions instead of just probability functions, I’d introduce their type and consider using a different symbol (e.g. lowercase p) for PDFs. I expect some people to get fairly confused when you use $$\mathbb P$$ but drop in a $$\operatorname{d}\!f$$ out of nowhere: “how did a tiny distance come into this!?” cries a reader-model.

Also, I’m not sure it’s standard to include the delta in the density. I’m more familiar with notation saying that the uniform PDF is 1 everywhere, and the cumulative distribution is the integral times $$\operatorname{d}\!f$$.

• The formula uses “x”s, but should it use “f”s instead?

• @2 I think there should be a small change here? Variable f becomes x and back to f, and I believe it should just be f?

Or, writing out the whole operation from scratch:

$$\mathbb P(f\mid e\!=\!\textbf{THT}) = \dfrac{\mathcal L(e\!=\!\textbf{THT}\mid f) \cdot \mathbb P(f)}{\mathbb P(e\!=\!\textbf{THT})} = \dfrac{(1 - x) \cdot x \cdot (1 - x) \cdot 1}{\int_0^1 (1 - x) \cdot x \cdot (1 - x) \cdot 1 \, \operatorname{d}\!f} = 12 \cdot f(1 - f)^2$$