# Bayes' rule: Vector form

todo: This page conflates two concepts: (1) You can perform a Bayesian update on multiple hypotheses at once, by representing hypotheses via vectors; and (2) you can perform multiple Bayesian updates by multiplying by all the likelihood functions (and only normalizing once at the end). We should probably have one page for each concept, and we should possibly split this page in order to make them. (It’s not yet clear whether we want one unified page for both ideas, as this one currently is.)
comment: Comment from ESY: it seems to me that these two concepts are sufficiently closely related, and sufficiently combined in their demonstration, that we want to explain them on the same page. They could arguably have different concept pages, though.

Bayes’ rule in the odds form says that for every pair of hypotheses, their relative prior odds, times the relative likelihood of the evidence, equals the relative posterior odds.

Let $$\mathbf H$$ be a vector of hypotheses $$H_1, H_2, \ldots$$ Because Bayes’ rule holds between every pair of hypotheses in $$\mathbf H,$$ we can simply multiply an odds vector by a likelihood vector in order to get the correct posterior vector:

$$\mathbb O(\mathbf H) \times \mathcal L_e(\mathbf H) = \mathbb O(\mathbf H \mid e)$$

comment: Comment from EN: It seems to me that the dot product would be more appropriate.

where $$\mathbb O(\mathbf H)$$ is the vector of relative prior odds between all the $$H_i$$, $$\mathcal L_e(\mathbf H)$$ is the vector of relative likelihoods with which each $$H_i$$ predicted $$e,$$ and $$\mathbb O(\mathbf H \mid e)$$ is the vector of relative posterior odds between all the $$H_i.$$
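A minimal sketch of this elementwise update in Python (the function names here are ours, not notation from the page):

```python
def bayes_update(prior_odds, likelihoods):
    """Elementwise product: relative odds times relative likelihoods.

    Both arguments are lists with one nonnegative entry per hypothesis;
    the result is the (unnormalized) relative posterior odds.
    """
    return [o * l for o, l in zip(prior_odds, likelihoods)]

def to_probabilities(odds):
    """Normalize an odds vector so its entries sum to 1."""
    total = sum(odds)
    return [o / total for o in odds]

# Odds (1 : 2 : 3) updated on evidence with likelihoods (3 : 2 : 1):
posterior = bayes_update([1, 2, 3], [3, 2, 1])   # [3, 4, 3]
```

Because the odds are relative, the result can be normalized once at the end rather than after every update.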

In fact, we can keep multiplying by likelihood vectors to perform multiple updates at once:

$$\begin{array}{r} \mathbb O(\mathbf H) \\ \times\ \mathcal L_{e_1}(\mathbf H) \\ \times\ \mathcal L_{e_2}(\mathbf H \wedge e_1) \\ \times\ \mathcal L_{e_3}(\mathbf H \wedge e_1 \wedge e_2) \\ = \mathbb O(\mathbf H \mid e_1 \wedge e_2 \wedge e_3) \end{array}$$

For example, suppose there’s a bathtub full of coins. Half of the coins are “fair” and have a 50% probability of producing heads on each coinflip. A third of the coins are biased towards heads and produce heads 75% of the time. The remaining coins are biased against heads, and produce heads only 25% of the time. You pull out a coin at random, flip it 3 times, and get the result THT. What’s the chance that this was a fair coin?

We have three hypotheses, which we’ll call $$H_{fair},$$ $$H_{heads},$$ and $$H_{tails}$$ respectively, with relative odds of $$(1/2 : 1/3 : 1/6).$$ The relative likelihoods that these three hypotheses assign to a coin landing heads are $$(2 : 3 : 1)$$; the relative likelihoods that they assign to a coin landing tails are $$(2 : 1 : 3).$$ Thus, the posterior odds for all three hypotheses are:

$$\begin{array}{rll} (1/2 : 1/3 : 1/6) = & (3 : 2 : 1) & \\ \times & (2 : 1 : 3) & \\ \times & (2 : 3 : 1) & \\ \times & (2 : 1 : 3) & \\ = & (24 : 6 : 9) & = (8 : 2 : 3) = (8/13 : 2/13 : 3/13) \end{array}$$

…so there is an 8/13 or ~62% probability that the coin is fair.
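The arithmetic above can be checked mechanically. This sketch (the helper is our own; the odds vectors are the ones from the example) multiplies the vectors entrywise:

```python
from functools import reduce

def multiply_odds(*vectors):
    """Entrywise product of several odds/likelihood vectors."""
    return [reduce(lambda a, b: a * b, entries) for entries in zip(*vectors)]

prior = [3, 2, 1]   # fair : heads-biased : tails-biased, i.e. (1/2 : 1/3 : 1/6)
heads = [2, 3, 1]   # relative likelihood of heads under each hypothesis
tails = [2, 1, 3]   # relative likelihood of tails under each hypothesis

# The observed sequence is T, H, T:
posterior = multiply_odds(prior, tails, heads, tails)   # [24, 6, 9]
p_fair = posterior[0] / sum(posterior)                  # 24/39 = 8/13
```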

If you were only familiar with the probability form of Bayes’ rule, which only works for one hypothesis at a time and which only uses probabilities (and so normalizes the odds into probabilities at every step)…

$$\mathbb P(H_i\mid e) = \dfrac{\mathbb P(e\mid H_i)\,\mathbb P(H_i)}{\sum_k \mathbb P(e\mid H_k)\,\mathbb P(H_k)}$$

…then you might have had some gratuitous difficulty solving this problem.

Also, if you hear the idiom of “convert to odds, multiply lots and lots of things, convert back to probabilities” and think “hmm, this sounds like a place where transforming into log-space (where all multiplications become additions) might yield efficiency gains,” then congratulations, you just invented the log-odds form of Bayes’ rule. Not only is it efficient, it also gives rise to a natural unit of measure for “strength of evidence” and “strength of belief”.
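A sketch of that log-odds idiom (the helper name is ours): each update becomes an addition, and we only exponentiate once at the end.

```python
import math

def log_odds_update(prior_odds, *likelihood_vectors):
    """Accumulate log-odds by adding log-likelihoods, then normalize."""
    logs = [math.log(o) for o in prior_odds]
    for lik in likelihood_vectors:
        logs = [lg + math.log(l) for lg, l in zip(logs, lik)]
    m = max(logs)                        # subtract the max for numerical stability
    exps = [math.exp(lg - m) for lg in logs]
    total = sum(exps)
    return [e / total for e in exps]

# The bathtub example: prior (3 : 2 : 1), then evidence T, H, T.
probs = log_odds_update([3, 2, 1], [2, 1, 3], [2, 3, 1], [2, 1, 3])
# probs[0] is 8/13, matching the odds-form calculation
```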

# Naive Bayes

Multiplying an array of odds by an array of likelihoods is the idiom used in Bayesian spam filters. Suppose that there are three categories of email, “Business”, “Personal”, and “Spam”, and that a user, hand-labeling their last 100 emails, has labeled 50 as Business, 30 as Personal, and 20 as Spam. The word “buy” has appeared in 10 Business emails, 3 Personal emails, and 10 Spam emails. The word “rationality” has appeared in 30 Business emails, 15 Personal emails, and 1 Spam email.

First, we assume that the frequencies in our data are representative of the ‘true’ frequencies. (Taken literally, if we see a word we’ve never seen before, we’ll be multiplying by a zero probability. Good-Turing frequency estimation would do better.)

Second, we make the naive Bayes assumption that a spam email which contains the word “buy” is no more or less likely than any other spam email to contain the word “rationality”, and so on with the other categories.

Then we’d filter a message containing the phrase “buy rationality” as follows:

Prior odds: $$(5 : 3 : 2)$$

Likelihood ratio for “buy”:

$$\left(\frac{10}{50} : \frac{3}{30} : \frac{10}{20}\right) = \left(\frac{1}{5} : \frac{1}{10} : \frac{1}{2}\right) = (2 : 1 : 5)$$

Likelihood ratio for “rationality”:

$$\left(\frac{30}{50} : \frac{15}{30} : \frac{1}{20}\right) = \left(\frac{3}{5} : \frac{1}{2} : \frac{1}{20}\right) = (12 : 10 : 1)$$

Posterior odds:

$$(5 : 3 : 2) \times (2 : 1 : 5) \times (12 : 10 : 1) = (120 : 30 : 10) = \left(\frac{12}{16} : \frac{3}{16} : \frac{1}{16}\right)$$

comment: 12/16 is intentionally not in lowest form so that the 12 : 3 : 1 ratio can be clear.

This email would be 75% likely to be a business email, if the Naive Bayes assumptions are true. They’re almost certainly not true, for reasons discussed in more detail below. But while Naive Bayes calculations are usually quantitatively wrong, they often point in the right qualitative direction—this email may indeed be more likely than not to be a business email.

(An actual implementation should add log-likelihoods rather than multiplying by ratios, so as not to risk floating-point overflow or underflow.)
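Putting the spam-filter example and the log-space advice together, a sketch (the structure and names are ours; the counts are the ones from the example above):

```python
import math

# Hand-labeled counts from the example: Business, Personal, Spam.
CATEGORY_COUNTS = [50, 30, 20]
WORD_COUNTS = {
    "buy":         [10, 3, 10],
    "rationality": [30, 15, 1],
}

def classify(words):
    """Naive Bayes in log space: log prior plus log word frequencies."""
    scores = [math.log(c) for c in CATEGORY_COUNTS]
    for word in words:
        counts = WORD_COUNTS[word]
        scores = [s + math.log(counts[i] / CATEGORY_COUNTS[i])
                  for i, s in enumerate(scores)]
    m = max(scores)                      # subtract the max before exponentiating
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = classify(["buy", "rationality"])
# probs is [0.75, 0.1875, 0.0625], i.e. the (12 : 3 : 1) posterior
```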

To do a multiple update less naively, we must do the equivalent of asking about the probability that a Business email contains the word “rationality”, given that it contained the word “buy”.

As a real-life example, in a certain rationality workshop, one participant was observed to have taken another participant to a museum, and also, on a different day, to see their workplace. A betting market soon developed on whether the two were romantically involved. One participant argued that, as an eyeball estimate, someone was 12 times as likely to take a fellow participant to a museum, or to their workplace, if they were romantically involved, vs. just being strangers. They then multiplied their prior odds by a 12 : 1 likelihood ratio for the museum trip and another 12 : 1 likelihood ratio for the workplace trip, and concluded that these two were almost certainly romantically attracted.

It later turned out that the two were childhood acquaintances who were not romantically involved. What went wrong?

If we want to update hypotheses on multiple pieces of evidence, we need to mentally stay inside the world of each hypothesis, and condition the likelihood of future evidence on the evidence already observed. Suppose the two are not romantically attracted. We observe them visit a museum. Arguendo, we might indeed suppose that this has a probability of, say, 1% (we don’t usually expect strangers to visit museums together), which might be about 1/12 the probability of making that observation if the two were romantically involved.

But after this, when we observe the workplace visit, we need to ask about the probability of the workplace visit, given that the two were not romantically attracted and that they visited a museum. And plausibly, if two non-attracted people visit a museum together for whatever reason, they don’t just have the default probability, for a non-attracted couple, of making a workplace visit. In other words:

$$\mathbb P({workplace}\mid \neg {romance} \wedge {museum}) \neq \mathbb P({workplace}\mid \neg {romance})$$

Naive Bayes, in contrast, would try to approximate the quantity $$\mathbb P({museum} \wedge {workplace} \mid \neg {romance})$$ as the product of $$\mathbb P({museum}\mid \neg {romance}) \cdot \mathbb P({workplace}\mid \neg {romance}).$$ This is what the participants did when they multiplied by a 1/12 likelihood ratio twice.

The result was a kind of double-counting of the evidence — they took into account the prior improbability of a random non-romantic couple “going places together” twice in a row, for the two pieces of evidence, and ended up performing a total update that was much too strong.
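To see the double-counting numerically, here is a toy sketch with made-up probabilities (only the 12 : 1 ratio comes from the story; every other number is our own illustrative assumption):

```python
# P(museum | not romance): rare for strangers, per the story's eyeball estimate.
p_museum_given_not_romance = 0.01        # about 1/12 of the romance probability

# Naive Bayes pretends the workplace visit is just as surprising again:
p_both_naive = p_museum_given_not_romance * 0.01

# Non-naively, a non-romantic pair that already visited a museum together
# is plausibly the kind of pair that goes places together (assumed value):
p_workplace_given_not_romance_and_museum = 0.30
p_both_nonnaive = (p_museum_given_not_romance
                   * p_workplace_given_not_romance_and_museum)

# Under these numbers, the naive product understates the probability of the
# evidence given non-romance, overstating the update toward romance.
overstatement = p_both_nonnaive / p_both_naive   # about 30
```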

Naive Bayes spam filters often end up assigning ludicrously extreme odds, on the order of googols to one, that an email is spam or personal; and then they’re sometimes wrong anyway. If an email contains both “pharmaceutical” and “pharmacy”, a spam filter will double-count the improbability of a personal email talking about pharmacies, rather than considering that if I actually do get a personal email talking about a pharmacy, it is much more likely to contain the word “pharmaceutical” as well. So because of the Naive Bayes assumption, naive Bayesian spam filters are not anything remotely like well-calibrated, and they update much too extremely on the evidence. On the other hand, they’re often extreme in the correct qualitative direction — something assigned googol-to-one odds of being spam isn’t always spam but it might be spam, say, 99.999% of the time.

To do non-naive Bayesian updates on multiple pieces of evidence, just remember to mentally inhabit the world where the hypothesis is true, and then ask about the likelihood of each successive piece of evidence, in the world where the hypothesis is true and the previous pieces of evidence were observed. Don’t ask, “What is the likelihood that a non-romantic couple would visit one person’s workplace?” but “What is the likelihood that a non-romantic couple which previously visited a museum for some unknown reason would also visit the workplace?”

In our example with the coins in the bathtub, the likelihoods of the evidence were independent on each step—assuming a coin to be fair, it’s no more or less likely to produce heads on the second flip after producing heads on the first flip. So in our bathtub-coins example, the Naive Bayes assumption was actually true.

Parents:

• Bayes' rule

Bayes’ rule is the core theorem of probability theory saying how to revise our beliefs when we make a new observation.

# Comments

• Are there going to be visual explanations put here for the examples? I found that quite helpful in the former pages. I’d say this is the first part of the new Bayes Guide that feels very similar (in terms of clarity) to the old one. Although, I might be biased as I’ve found I much prefer visual explanations of things.

• Where did the ’16′ come from in $$(12/16 : 3/16 : 1/16)$$?

• It’s 12 + 3 + 1. I’ll edit to make clearer, but your comment exposed a bug in our LaTeX parsing so I’m waiting to edit until that resolves. :)

• What’s the bathtub coins example? I’ve read the entire advanced sequence up to here and I don’t remember reading about that. Maybe it was edited and removed? (Or maybe I wasn’t paying attention or something?)

• I believe that this should be $$(2 : 3 : 1)$$ rather than $$(3 : 2 : 1)$$.

• I believe it should be, “the two were not romantically attracted” as that is consistent with the formula below.

• Actually, there should be diagonal matrices instead of vectors. Cross product doesn’t work like this, and dot product gives us a sum of coordinates of the vector we need instead of the vector itself, so we can’t continue updating our probabilities (or make any sense of the result). Diagonal matrices, on the other hand, do exactly what we need: $$C = AB; \; c_{ii} = a_{ii} b_{ii}; \; \forall i \neq j,\ c_{ij} = 0$$.

• Arguendo: more random non-common Latin. Consider “For the sake of argument” or “Perhaps”.

• I find the entire explanation described below very misleading and perhaps even largely incorrect. The workshop participants had it wrong mostly for two reasons:

1. They did not consider the likelihood of visiting a museum / workplace given any other alternative (mutually exclusive) relationship: not strangers but also not romantically involved; i.e., friends. Being acquaintances is not a relevant type of relationship as it is not mutually exclusive with a romantic relationship (a pair can be both dating and working together).

2. They did not know the prior probability of an arbitrary pair of people being romantically involved. A naive assumption of 50% of them being romantically involved is wrong; the prior should be set by observing the proportions of romantic relationships in the population.

In terms of the previous coins-fairness example, they (a) only considered that one type of coin (fair) is 2 times as likely to turn up heads as another type of coin (tail-biased), but did not consider how likely the remaining type of coin (head-biased) is to turn up heads; and (b) they did not know the proportions of coin types in the bathtub.

The explanation below also fails to mention the important assumption that the trait being assessed in all of the examples (coins, emails, workshop) is constant and doesn’t change over time. It is important to mention because it may not be so trivial for every example, yet it reduces the complexity of the estimations tremendously. A coin is not expected to change its bias significantly over time, yet a relationship does, and so does the magnitude of “spamness” in a given mail for a given person (for instance, when I get older I may be more interested in pharmaceutical ads).

• I believe it is essential to explain why it is independent in the case of the bathtub example and not in the other examples.

In the bathtub example, the evidence presents an event which is directly described by the assessed trait; i.e., the fairness of a coin is directly concerned with the appearance of either heads or tails. In contrast, the definition of the degree of “spamness” in an email is not directly concerned with the appearance of a word in the email, but is rather concerned with the abstract concept of the meaning a person assigns to the email.

The appearance of a word in an email is hence only an attempt at estimating the degree of “spamness”, a proxy. In the case of a proxy, we need to consider the option that the proxy is flawed in a way which makes the pieces of evidence in fact dependent on one another. This is not necessarily true, but it is possible, unlike in the case of hypothetical coins (in reality, a coin toss might actually be physically affected by the previous toss).