Probability notation for Bayes' rule: Intro (Math 1)

To denote some of the quantities used in Bayes’ rule, we’ll need conditional probabilities. The conditional probability \(\mathbb{P}(X\mid Y)\) means “The probability of \(X\) given \(Y\).” That is, \(\mathbb P(\mathrm{left}\mid \mathrm{right})\) means “The probability that \(\mathrm{left}\) is true, assuming that \(\mathrm{right}\) is true.”

\(\mathbb P(\mathrm{yellow}\mid \mathrm{banana})\) is the probability that a banana is yellow—if we know something to be a banana, what is the probability that it is yellow? \(\mathbb P(\mathrm{banana}\mid \mathrm{yellow})\) is the probability that a yellow thing is a banana—if the known, right-hand side is yellowness, then the question on the left asks: what is the probability that this is a banana?

In probability theory, the definition of “conditional probability” is that the conditional probability of \(L,\) given \(R,\) is found by looking at the probability of possibilities with both \(L\) and \(R\) within all possibilities with \(R.\) Using \(L \wedge R\) to denote the logical proposition “\(L\) and \(R\) are both true”:

\(\mathbb P(L\mid R) = \frac{\mathbb P(L \wedge R)}{\mathbb P(R)}\)

Suppose you have a bag containing objects that are either red or blue, and either square or round:

$$\begin{array}{l|r|r} & \text{Red} & \text{Blue} \\ \hline \text{Square} & 1 & 2 \\ \hline \text{Round} & 3 & 4 \end{array}$$

If you reach in and feel a round object, the conditional probability that it is red is:

\(\mathbb P(\mathrm{red} \mid \mathrm{round}) = \dfrac{\mathbb P(\mathrm{red} \wedge \mathrm{round})}{\mathbb P(\mathrm{round})} = \dfrac{3/10}{7/10} = \dfrac{3}{7}\)

If you look at the object nearest the top, and can see that it’s blue, but cannot see the shape, then the conditional probability that it’s a square is:

\(\mathbb P(\mathrm{square} \mid \mathrm{blue}) = \dfrac{\mathbb P(\mathrm{square} \wedge \mathrm{blue})}{\mathbb P(\mathrm{blue})} = \dfrac{2/10}{6/10} = \dfrac{1}{3}\)
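As a sanity check, the definition can be applied mechanically to the counts in the table. This is a minimal sketch, not standard library functionality; the helper names `p` and `p_given` are our own:

```python
from fractions import Fraction

# Counts from the bag: {(color, shape): number of such objects}
counts = {
    ("red", "square"): 1, ("blue", "square"): 2,
    ("red", "round"): 3, ("blue", "round"): 4,
}
total = sum(counts.values())  # 10 objects in all

def p(pred):
    """Probability that a uniformly drawn object satisfies pred."""
    return Fraction(sum(n for k, n in counts.items() if pred(k)), total)

def p_given(pred_l, pred_r):
    """Conditional probability P(L | R) = P(L and R) / P(R)."""
    return p(lambda k: pred_l(k) and pred_r(k)) / p(pred_r)

print(p_given(lambda k: k[0] == "red", lambda k: k[1] == "round"))    # 3/7
print(p_given(lambda k: k[1] == "square", lambda k: k[0] == "blue"))  # 1/3
```

Using `Fraction` keeps the arithmetic exact, so the results come out as the same fractions computed in the text.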

(Figure: the bag of red/blue, square/round objects, illustrating the conditional probabilities above.)

Updating as conditioning

Bayes’ rule is useful because the process of observing new evidence can be interpreted as conditioning a probability distribution.

Again, the Diseasitis problem:

20% of the patients in the screening population start out with Diseasitis. Among patients with Diseasitis, 90% turn the tongue depressor black. 30% of the patients without Diseasitis will also turn the tongue depressor black. Among all the patients with black tongue depressors, how many have Diseasitis?

Consider a single patient, before observing any evidence. There are four possible worlds we could be in, the product of (sick vs. healthy) times (positive vs. negative result):

$$\begin{array}{l|r|r} & \text{Sick} & \text{Healthy} \\ \hline \text{Test +} & 18\% & 24\% \\ \hline \text{Test -} & 2\% & 56\% \end{array}$$
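The four entries in this table can be derived directly from the problem’s three given numbers. A quick sketch (the variable names are our own, chosen for illustration):

```python
# The three numbers given in the Diseasitis problem.
p_sick = 0.20                 # prior: 20% of the screening population is sick
p_black_given_sick = 0.90     # 90% of sick patients blacken the depressor
p_black_given_healthy = 0.30  # 30% of healthy patients blacken it too

# Joint probability of each of the four possible worlds.
worlds = {
    ("sick", "positive"): p_sick * p_black_given_sick,                    # 18%
    ("sick", "negative"): p_sick * (1 - p_black_given_sick),              #  2%
    ("healthy", "positive"): (1 - p_sick) * p_black_given_healthy,        # 24%
    ("healthy", "negative"): (1 - p_sick) * (1 - p_black_given_healthy),  # 56%
}

# The four worlds are exhaustive, so their masses sum to 1.
assert abs(sum(worlds.values()) - 1.0) < 1e-9
```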

To actually observe that the patient gets a positive result is to eliminate from further consideration the possible worlds where the patient gets a negative result:

(Figure: the four-world table, with the worlds inconsistent with the observation eliminated.)

Once we observe the result \(\mathrm{positive},\) all of our future reasoning should take place, not in our old \(\mathbb P(\cdot),\) but in our new \(\mathbb P(\cdot \mid \mathrm{positive}).\) This is why, after observing “\(\mathrm{positive}\)” and revising our probability distribution, when we ask about the probability the patient is sick, we are interested in the new probability \(\mathbb P(\mathrm{sick}\mid \mathrm{positive})\) and not the old probability \(\mathbb P(\mathrm{sick}).\)
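Conditioning can be carried out quite literally as “delete the eliminated worlds, then renormalize what remains.” A small sketch under the Diseasitis numbers (the `condition` helper is our own illustration, not standard terminology):

```python
from fractions import Fraction

# Masses of the four worlds from the Diseasitis table, as exact fractions.
worlds = {
    ("sick", "positive"): Fraction(18, 100),
    ("sick", "negative"): Fraction(2, 100),
    ("healthy", "positive"): Fraction(24, 100),
    ("healthy", "negative"): Fraction(56, 100),
}

def condition(worlds, observed_result):
    """Keep only worlds consistent with the observation, then renormalize."""
    kept = {k: v for k, v in worlds.items() if k[1] == observed_result}
    mass = sum(kept.values())          # total surviving probability mass
    return {k: v / mass for k, v in kept.items()}

posterior = condition(worlds, "positive")
print(posterior[("sick", "positive")])  # 3/7
```

Note that the answer, 3/7, is exactly \(18\% / (18\% + 24\%)\): the surviving mass for “sick” divided by all surviving mass.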

Example: Socks-dresser problem

Realizing that observing evidence corresponds to eliminating probability mass, and concerning ourselves only with the probability mass that remains, is the key to solving the socks-dresser search problem:

You left your socks somewhere in your room. You think there’s a 4/5 chance that they’re in your dresser, so you start looking through your dresser’s 8 drawers. After checking 6 drawers at random, you haven’t found your socks yet. What is the probability you will find your socks in the next drawer you check?

We initially have 20% of the probability mass in “Socks outside the dresser”, and 80% of the probability mass in “Socks inside the dresser”. This corresponds to 10% probability mass for each of the 8 drawers.

After eliminating the probability mass in 6 of the drawers, we have 40% of the original mass remaining: 20% for “Socks outside the dresser” and 10% each for the remaining 2 drawers.

Since this remaining 40% probability mass is now our whole world, the effect on our probability distribution is like amplifying the 40% until it expands back up to 100%, i.e., renormalizing the probability distribution. This is why we divide \(\mathbb P(L \wedge R)\) by \(\mathbb P(R)\) to get the new probabilities.

In this case, we divide the 20% probability of being outside the dresser by 40%, and divide the 10% probability mass in each of the two remaining drawers by 40%. So the new probabilities are 1/2 for outside the dresser, and 1/4 each for the 2 drawers. Or more simply, we could observe that, among the remaining probability mass of 40%, the “outside the dresser” hypothesis has half of it, and the two drawers have a quarter each.

So the probability of finding our socks in the next drawer is 25%.

Note that as we open successive drawers, we both become more confident that the socks are not in the dresser at all (since we have eliminated several drawers they could have been in), and also expect more strongly that we might find the socks in the next drawer we open (since so few drawers remain).
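Both trends can be seen by computing the renormalized distribution after each successive empty drawer. A sketch under the problem’s assumptions (a uniform 1/10 of mass per drawer; the function name is our own):

```python
from fractions import Fraction

p_in_dresser = Fraction(4, 5)            # prior: socks somewhere in the dresser
n_drawers = 8
p_per_drawer = p_in_dresser / n_drawers  # 1/10 of mass in each drawer

def after_empty_drawers(k):
    """Posterior after k randomly chosen drawers have turned up empty."""
    remaining = 1 - k * p_per_drawer            # mass left after elimination
    p_outside = (1 - p_in_dresser) / remaining  # socks not in the dresser
    p_next = p_per_drawer / remaining           # socks in the very next drawer
    return p_outside, p_next

# Both probabilities rise as more drawers come up empty.
for k in range(n_drawers):
    p_outside, p_next = after_empty_drawers(k)
    print(k, p_outside, p_next)
```

At \(k = 6\) this gives 1/2 for “outside the dresser” and 1/4 for the next drawer, matching the calculation above.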

Priors, likelihoods, and posteriors

When we apply Bayes’ theorem, we are generally inquiring about some question of the form \(\mathbb P(\mathrm{hypothesis}\mid \mathrm{evidence})\): the \(\mathrm{evidence}\) is known or assumed, so that we are now mentally living in the revised probability distribution \(\mathbb P(\cdot\mid \mathrm{evidence}),\) and we are asking what we infer or guess about the \(\mathrm{hypothesis}.\) This quantity is the posterior probability of the \(\mathrm{hypothesis}.\)

To carry out a Bayesian revision, we also need to know what our beliefs were before we saw the evidence. (E.g., in the Diseasitis problem, the chance that a patient who hasn’t been tested yet is sick.) This is often written \(\mathbb P(\mathrm{hypothesis}),\) with the hypothesis’s probability not conditioned on anything, because it is our prior belief.

The remaining pieces of key information are the likelihoods of the evidence, given each hypothesis. To interpret the meaning of the positive test result as evidence, we need to imagine ourselves in the world where the patient is sick—assume the patient to be sick, as if that were known—and then ask, just as if we hadn’t seen any test result yet, what we think the probability of the evidence would be in that world. And then we have to do a similar operation again, this time mentally inhabiting the world where the patient is healthy. Unfortunately, the standard notation is such that this idea is denoted \(\mathbb P(\mathrm{evidence}\mid \mathrm{hypothesis})\), looking deceptively like the notation for the posterior probability, but written in the reverse order. Not surprisingly, this trips people up a bunch until they get used to it. (You would at least hope that the standard symbol \(\mathbb P(\cdot \mid \cdot)\) wouldn’t be symmetrical, but it is. Alas.)
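Putting the three pieces together for the Diseasitis problem: multiply each hypothesis’s prior by its likelihood of producing the evidence, then normalize. A sketch with our own variable names:

```python
from fractions import Fraction

prior_sick = Fraction(1, 5)        # P(sick): the prior
lik_pos_sick = Fraction(9, 10)     # P(positive | sick): a likelihood
lik_pos_healthy = Fraction(3, 10)  # P(positive | healthy): a likelihood

# Unnormalized posterior mass for each hypothesis: prior times likelihood.
joint_sick = prior_sick * lik_pos_sick              # P(sick and positive)
joint_healthy = (1 - prior_sick) * lik_pos_healthy  # P(healthy and positive)

# Normalize over both hypotheses to get the posterior.
posterior_sick = joint_sick / (joint_sick + joint_healthy)
print(posterior_sick)  # 3/7
```

This reproduces the answer from conditioning the four-world table, because dividing by \(\mathbb P(\mathrm{positive})\) and normalizing over hypotheses are the same operation.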


Suppose you’re Sherlock Holmes investigating a case in which a red hair was left at the scene of the crime.

The Scotland Yard detective says, “Aha! Then it’s Miss Scarlet. She has red hair, so if she was the murderer she almost certainly would have left a red hair there. \(\mathbb P(\mathrm{redhair}\mid \mathrm{Scarlet}) = 99\%,\) let’s say, which is a near-certain conviction, so we’re done.”

“But no,” replies Sherlock Holmes. “You see, detective, but you do not correctly track the meaning of the conditional probabilities. The knowledge we require for a conviction is not \(\mathbb P(\mathrm{redhair}\mid \mathrm{Scarlet}),\) the chance that Miss Scarlet would leave a red hair, but rather \(\mathbb P(\mathrm{Scarlet}\mid \mathrm{redhair}),\) the chance that this red hair was left by Scarlet. There are other people in this city who have red hair.”

“So you’re saying…” the detective said slowly, “that \(\mathbb P(\mathrm{redhair}\mid \mathrm{Scarlet})\) is actually much lower than \(1\)?”

“No, detective. I am saying that just because \(\mathbb P(\mathrm{redhair}\mid \mathrm{Scarlet})\) is high does not imply that \(\mathbb P(\mathrm{Scarlet}\mid \mathrm{redhair})\) is high. It is the latter probability in which we are interested—the degree to which, knowing that a red hair was left at the scene, we infer that Miss Scarlet was the murderer. The posterior, as the Bayesians say. This is not the same quantity as the degree to which, assuming Miss Scarlet was the murderer, we would guess that she might leave a red hair. That is merely the likelihood of the evidence, conditional on Miss Scarlet having done it.”


Using the waterfall for the Diseasitis problem:

(Figure: the waterfall diagram, labeled with the probabilities from the Diseasitis problem.)
