# Probability notation for Bayes' rule: Intro (Math 1)

To denote some of the quantities used in Bayes’ rule, we’ll need conditional probabilities. The conditional probability $$\mathbb{P}(X\mid Y)$$ means “The probability of $$X$$ given $$Y$$.” That is, $$\mathbb P(\mathrm{left}\mid \mathrm{right})$$ means “The probability that $$\mathrm{left}$$ is true, assuming that $$\mathrm{right}$$ is true.”

$$\mathbb P(\mathrm{yellow}\mid \mathrm{banana})$$ is the probability that a banana is yellow—if we know something to be a banana, what is the probability that it is yellow? $$\mathbb P(\mathrm{banana}\mid \mathrm{yellow})$$ is the probability that a yellow thing is a banana—if the known, right-hand side is yellowness, we ask the question on the left: what is the probability that this thing is a banana?

In probability theory, the definition of “conditional probability” is that the conditional probability of $$L,$$ given $$R,$$ is found by taking the probability mass of the possibilities where both $$L$$ and $$R$$ hold, as a fraction of the total mass of possibilities where $$R$$ holds. Using $$L \wedge R$$ to denote the logical proposition “$$L$$ and $$R$$ both true”:

$$\mathbb P(L\mid R) = \frac{\mathbb P(L \wedge R)}{\mathbb P(R)}$$

Suppose you have a bag containing objects that are either red or blue, and either square or round:

$$\begin{array}{l|r|r} & Red & Blue \\ \hline Square & 1 & 2 \\ \hline Round & 3 & 4 \end{array}$$

If you reach in and feel a round object, the conditional probability that it is red is:

$$\mathbb P(\mathrm{red} \mid \mathrm{round}) = \dfrac{\mathbb P(\mathrm{red} \wedge \mathrm{round})}{\mathbb P(\mathrm{round})} = \dfrac{3}{3 + 4} = \frac{3}{7}$$

If you look at the object nearest the top, and can see that it’s blue, but not see the shape, then the conditional probability that it’s a square is:

$$\mathbb P(\mathrm{square} \mid \mathrm{blue}) = \dfrac{\mathbb P(\mathrm{square} \wedge \mathrm{blue})}{\mathbb P(\mathrm{blue})} = \dfrac{2}{2 + 4} = \frac{1}{3}$$
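As a sanity check, both conditional probabilities can be computed directly from the table’s counts. The following is a minimal Python sketch; the names and helper functions are my own, not from the text:

```python
# Counts from the table above: (color, shape) -> number of objects in the bag.
counts = {
    ("red", "square"): 1, ("blue", "square"): 2,
    ("red", "round"): 3,  ("blue", "round"): 4,
}
total = sum(counts.values())  # 10 objects in all

def p(pred):
    """Probability that a uniformly drawn object satisfies `pred`."""
    return sum(n for (c, s), n in counts.items() if pred(c, s)) / total

def p_cond(pred_l, pred_r):
    """Conditional probability P(L | R) = P(L and R) / P(R)."""
    return p(lambda c, s: pred_l(c, s) and pred_r(c, s)) / p(pred_r)

p_red_given_round = p_cond(lambda c, s: c == "red", lambda c, s: s == "round")
p_square_given_blue = p_cond(lambda c, s: s == "square", lambda c, s: c == "blue")
print(p_red_given_round)    # 3/7 ≈ 0.4286
print(p_square_given_blue)  # 2/6 ≈ 0.3333
```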

# Updating as conditioning

Bayes’ rule is useful because the process of observing new evidence can be interpreted as conditioning a probability distribution.

Again, the Diseasitis problem:

20% of the patients in the screening population start out with Diseasitis. Among patients with Diseasitis, 90% turn the tongue depressor black. 30% of the patients without Diseasitis will also turn the tongue depressor black. Among all the patients with black tongue depressors, how many have Diseasitis?

Consider a single patient, before observing any evidence. There are four possible worlds we could be in, the product of (sick vs. healthy) times (positive vs. negative result):

$$\begin{array}{l|r|r} & Sick & Healthy \\ \hline Test + & 18\% & 24\% \\ \hline Test - & 2\% & 56\% \end{array}$$
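The four entries above are just products of the problem’s given numbers. A short Python sketch (the variable names are mine, chosen for illustration):

```python
# Given quantities from the Diseasitis problem statement.
p_sick = 0.20
p_pos_given_sick = 0.90     # sick patients turning the depressor black
p_pos_given_healthy = 0.30  # healthy patients turning the depressor black

# Joint distribution over the four possible worlds.
joint = {
    ("sick", "+"):    p_sick * p_pos_given_sick,                  # 0.18
    ("sick", "-"):    p_sick * (1 - p_pos_given_sick),            # 0.02
    ("healthy", "+"): (1 - p_sick) * p_pos_given_healthy,         # 0.24
    ("healthy", "-"): (1 - p_sick) * (1 - p_pos_given_healthy),   # 0.56
}
print(joint)  # the four probabilities sum to 1
```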

To actually observe that the patient gets a positive result is to eliminate from further consideration the possible worlds where the patient gets a negative result.

Once we observe the result $$\mathrm{positive}$$, all of our future reasoning should take place, not in our old $$\mathbb P(\cdot),$$ but in our new $$\mathbb P(\cdot \mid \mathrm{positive}).$$ This is why, after observing “$$\mathrm{positive}$$” and revising our probability distribution, when we ask about the probability the patient is sick, we are interested in the new probability $$\mathbb P(\mathrm{sick}\mid \mathrm{positive})$$ and not the old probability $$\mathbb P(\mathrm{sick}).$$
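This conditioning step can be sketched in a few lines of Python: delete the worlds that disagree with the observation, then renormalize what survives (the names here are illustrative, not from the text):

```python
# Joint distribution over (disease state, test result) from the table above.
joint = {("sick", "+"): 0.18, ("sick", "-"): 0.02,
         ("healthy", "+"): 0.24, ("healthy", "-"): 0.56}

# Observing a positive result eliminates the negative-result worlds...
surviving = {k: v for k, v in joint.items() if k[1] == "+"}

# ...and the surviving mass is renormalized so it sums to 1 again.
norm = sum(surviving.values())                     # 0.18 + 0.24 = 0.42
posterior = {k: v / norm for k, v in surviving.items()}

print(posterior[("sick", "+")])  # 0.18 / 0.42 = 3/7 ≈ 0.4286
```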

## Example: Socks-dresser problem

Realizing that observing evidence corresponds to eliminating probability mass, and concerning ourselves only with the probability mass that remains, is the key to solving the socks-dresser search problem:

You left your socks somewhere in your room. You think there’s a 4/5 chance that they’re in your dresser, so you start looking through your dresser’s 8 drawers. After checking 6 drawers at random, you haven’t found your socks yet. What is the probability you will find your socks in the next drawer you check?

We initially have 20% of the probability mass in “Socks outside the dresser”, and 80% of the probability mass for “Socks inside the dresser”. This corresponds to 10% probability mass for each of the 8 drawers.

After eliminating the probability mass in 6 of the drawers, we have 40% of the original mass remaining: 20% for “Socks outside the dresser” and 10% each for the remaining 2 drawers.

Since this remaining 40% probability mass is now our whole world, the effect on our probability distribution is like amplifying the 40% until it expands back up to 100%, aka renormalizing the probability distribution. This is why we divide $$\mathbb P(L \wedge R)$$ by $$\mathbb P(R)$$ to get the new probabilities.

In this case, we divide “20% probability of being outside the dresser” by 40%, and then divide the 10% probability mass in each of the two drawers by 40%. So the new probabilities are 1/2 for outside the dresser, and 1/4 each for the 2 drawers. Or more simply, we could observe that, among the remaining probability mass of 40%, the “outside the dresser” hypothesis has half of it, and the two drawers have a quarter each.

So the probability of finding our socks in the next drawer is 25%.

Note that as we open successive drawers, we both become more confident that the socks are not in the dresser at all (since we eliminated several drawers they could have been in), and also expect more strongly that we might find the socks in the next drawer we open (since there are so few remaining).
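Both trends can be checked with a short Python sketch of the drawer-by-drawer update; the structure and names are my own:

```python
# Initial distribution: 20% outside the dresser, 10% in each of 8 drawers.
p_outside = 0.20
drawers = [0.10] * 8

for opened in range(6):
    drawers.pop()               # eliminate one empty drawer's probability mass
    remaining = p_outside + sum(drawers)
    p_outside /= remaining      # renormalize the surviving mass to sum to 1
    drawers = [p / remaining for p in drawers]

print(p_outside)   # ≈ 0.5  — more confident the socks aren't in the dresser
print(drawers[0])  # ≈ 0.25 — but a higher chance for each remaining drawer
```

(The drawers are symmetric, so popping any one of them eliminates the same 10%-derived mass.)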

# Priors, likelihoods, and posteriors

Bayes’ theorem is generally used to answer some question of the form $$\mathbb P(\mathrm{hypothesis}\mid \mathrm{evidence})$$ - the $$\mathrm{evidence}$$ is known or assumed, so that we are now mentally living in the revised probability distribution $$\mathbb P(\cdot\mid \mathrm{evidence}),$$ and we are asking what we infer or guess about the $$\mathrm{hypothesis}.$$ This quantity is the posterior probability of the $$\mathrm{hypothesis}.$$

To carry out a Bayesian revision, we also need to know what our beliefs were before we saw the evidence. (E.g., in the Diseasitis problem, the chance that a patient who hasn’t been tested yet is sick.) This is often written $$\mathbb P(\mathrm{hypothesis}),$$ and the hypothesis’s probability isn’t being conditioned on anything because it is our prior belief.

The remaining pieces of key information are the likelihoods of the evidence, given each hypothesis. To interpret the meaning of the positive test result as evidence, we need to imagine ourselves in the world where the patient is sick—assume the patient to be sick, as if that were known—and then ask, just as if we hadn’t seen any test result yet, what we think the probability of the evidence would be in that world. Then we have to do a similar operation again, this time mentally inhabiting the world where the patient is healthy. Unfortunately, the standard notation is such that this idea is denoted $$\mathbb P(\mathrm{evidence}\mid \mathrm{hypothesis})$$ - looking deceptively like the notation for the posterior probability, but written in the reverse order. Not surprisingly, this trips people up until they get used to it. (You would at least hope that the standard symbol $$\mathbb P(\cdot \mid \cdot)$$ wouldn’t be symmetrical, but it is. Alas.)
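Putting the three named pieces together for the Diseasitis problem, in a short Python sketch (the variable names are mine, chosen to mirror the terminology):

```python
# Prior: belief before seeing any test result.
prior = 0.20                  # P(sick)

# Likelihoods: probability of the evidence in each hypothetical world.
likelihood_sick = 0.90        # P(positive | sick)
likelihood_healthy = 0.30     # P(positive | healthy)

# Total probability of the evidence, summed over both hypotheses.
p_evidence = likelihood_sick * prior + likelihood_healthy * (1 - prior)

# Posterior: P(sick | positive) = P(positive | sick) * P(sick) / P(positive).
posterior = likelihood_sick * prior / p_evidence
print(posterior)  # 0.18 / 0.42 = 3/7 ≈ 0.4286
```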

## Example

Suppose you’re Sherlock Holmes investigating a case in which a red hair was left at the scene of the crime.

The Scotland Yard detective says, “Aha! Then it’s Miss Scarlet. She has red hair, so if she was the murderer she almost certainly would have left a red hair there. $$\mathbb P(\mathrm{redhair}\mid \mathrm{Scarlet}) = 99\%,$$ let’s say, which is a near-certain conviction, so we’re done.”

“But no,” replies Sherlock Holmes. “You see, but you do not correctly track the meaning of the conditional probabilities, detective. The knowledge we require for a conviction is not $$\mathbb P(\mathrm{redhair}\mid \mathrm{Scarlet}),$$ the chance that Miss Scarlet would leave a red hair, but rather $$\mathbb P(\mathrm{Scarlet}\mid \mathrm{redhair}),$$ the chance that this red hair was left by Scarlet. There are other people in this city who have red hair.”

“So you’re saying…” the detective said slowly, “that $$\mathbb P(\mathrm{redhair}\mid \mathrm{Scarlet})$$ is actually much lower than $$1$$?”

“No, detective. I am saying that just because $$\mathbb P(\mathrm{redhair}\mid \mathrm{Scarlet})$$ is high does not imply that $$\mathbb P(\mathrm{Scarlet}\mid \mathrm{redhair})$$ is high. It is the latter probability in which we are interested—the degree to which, knowing that a red hair was left at the scene, we infer that Miss Scarlet was the murderer. The posterior, as the Bayesians say. This is not the same quantity as the degree to which, assuming Miss Scarlet was the murderer, we would guess that she might leave a red hair. That is merely the likelihood of the evidence, conditional on Miss Scarlet having done it.”
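Holmes’s point can be made numerically. In the sketch below, every figure other than the 99% likelihood is invented purely for illustration (a 1% prior on Scarlet, and a 10% chance that some other culprit would leave a red hair):

```python
# Hypothetical numbers (not from the story) showing that a high
# P(redhair | Scarlet) does not imply a high P(Scarlet | redhair).
p_scarlet = 0.01                # assumed prior: one suspect among many
p_hair_given_scarlet = 0.99     # she has red hair, so near-certain
p_hair_given_other = 0.10       # other residents have red hair too

p_hair = (p_hair_given_scarlet * p_scarlet
          + p_hair_given_other * (1 - p_scarlet))
p_scarlet_given_hair = p_hair_given_scarlet * p_scarlet / p_hair

print(p_scarlet_given_hair)  # ≈ 0.09: a 99% likelihood, but only a ~9% posterior
```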

## Visualization

(Waterfall visualization of the Diseasitis problem.)

