Bayes' rule: Vector form

todo: This page conflates two concepts: (1) You can perform a Bayesian update on multiple hypotheses at once, by representing hypotheses via vectors; and (2) you can perform multiple Bayesian updates by multiplying by all the likelihood functions (and only normalizing once at the end). We should probably have one page for each concept, and we should possibly split this page in order to make them. (It's not yet clear whether we want one unified page for both ideas, as this one currently is.)
comment: Comment from ESY: it seems to me that these two concepts are sufficiently closely related, and sufficiently combined in their demonstration, that we want to explain them on the same page. They could arguably have different concept pages, though.

Bayes' rule in the odds form says that for every pair of hypotheses, their relative prior odds, times the relative likelihood of the evidence, equals the relative posterior odds.

Let \(\mathbf H\) be a vector of hypotheses \(H_1, H_2, \ldots\) Because Bayes' rule holds between every pair of hypotheses in \(\mathbf H,\) we can simply multiply an odds vector by a likelihood vector in order to get the correct posterior vector:

$$\mathbb O(\mathbf H) \times \mathcal L_e(\mathbf H) = \mathbb O(\mathbf H \mid e)$$

comment: Comment from EN: It seems to me that the dot product would be more appropriate.

where \(\mathbb O(\mathbf H)\) is the vector of relative prior odds between all the \(H_i\), \(\mathcal L_e(\mathbf H)\) is the vector of relative likelihoods with which each \(H_i\) predicted \(e,\) and \(\mathbb O(\mathbf H \mid e)\) is the relative posterior odds between all the \(H_i.\)

In fact, we can keep multiplying by likelihood vectors to perform multiple updates at once:

$$\begin{array}{r} \mathbb O(\mathbf H) \\ \times\ \mathcal L_{e_1}(\mathbf H) \\ \times\ \mathcal L_{e_2}(\mathbf H \wedge e_1) \\ \times\ \mathcal L_{e_3}(\mathbf H \wedge e_1 \wedge e_2) \\ = \mathbb O(\mathbf H \mid e_1 \wedge e_2 \wedge e_3) \end{array}$$
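This chained update can be sketched in a few lines of Python (a minimal illustration; the function name `update_odds` and its arguments are hypothetical, not from the text):

```python
def update_odds(odds, *likelihood_vectors):
    """Multiply a relative-odds vector elementwise by one likelihood
    vector per piece of evidence, normalizing only once at the end."""
    for likelihoods in likelihood_vectors:
        odds = [o * l for o, l in zip(odds, likelihoods)]
    total = sum(odds)
    return [o / total for o in odds]  # posterior probabilities
```

For instance, two hypotheses at even prior odds, with evidence twice as likely under the first, give `update_odds([1, 1], [2, 1])`, approximately `[0.667, 0.333]`.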

For example, suppose there's a bathtub full of coins. Half of the coins are "fair" and have a 50% probability of producing heads on each coinflip. A third of the coins are biased towards heads and produce heads 75% of the time. The remaining coins are biased against heads, which they produce only 25% of the time. You pull out a coin at random, flip it 3 times, and get the result THT. What's the chance that this was a fair coin?

We have three hypotheses, which we'll call \(H_{fair},\) \(H_{heads},\) and \(H_{tails}\) respectively, with relative odds of \((1/2 : 1/3 : 1/6).\) The relative likelihoods that these three hypotheses assign to a coin landing heads are \((2 : 3 : 1)\); the relative likelihoods that they assign to a coin landing tails are \((2 : 1 : 3).\) Thus, the posterior odds for all three hypotheses are:

$$\begin{array}{rll} (1/2 : 1/3 : 1/6) = & (3 : 2 : 1) & \\ \times & (2 : 1 : 3) & \\ \times & (2 : 3 : 1) & \\ \times & (2 : 1 : 3) & \\ = & (24 : 6 : 9) & = (8 : 2 : 3) = (8/13 : 2/13 : 3/13) \end{array}$$

…so there is an 8/13 or ~62% probability that the coin is fair.
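As a sanity check, the bathtub computation can be replayed with exact arithmetic (a sketch; the variable names are mine):

```python
from fractions import Fraction

odds = [3, 2, 1]   # prior odds (fair : heads-biased : tails-biased)
heads = [2, 3, 1]  # relative likelihood of H under each hypothesis
tails = [2, 1, 3]  # relative likelihood of T under each hypothesis

# Multiply in one likelihood vector per flip of the observed sequence THT.
for flip in (tails, heads, tails):
    odds = [o * l for o, l in zip(odds, flip)]

# odds is now [24, 6, 9]; normalizing once at the end gives 8/13, 2/13, 3/13.
total = sum(odds)
posterior = [Fraction(o, total) for o in odds]
```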

If you were only familiar with the probability form of Bayes' rule, which only works for one hypothesis at a time and which only uses probabilities (and so normalizes the odds into probabilities at every step)…

$$\mathbb P(H_i\mid e) = \dfrac{\mathbb P(e\mid H_i)\mathbb P(H_i)}{\sum_k \mathbb P(e\mid H_k)\mathbb P(H_k)}$$

…then you might have had some gratuitous difficulty solving this problem.

Also, if you hear the idiom of "convert to odds, multiply lots and lots of things, convert back to probabilities" and think "hmm, this sounds like a place where transforming into log-space (where all multiplications become additions) might yield efficiency gains," then congratulations, you just invented the log-odds form of Bayes' rule. Not only is it efficient, it also gives rise to a natural unit of measure for "strength of evidence" and "strength of belief".
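For instance, the coin update above can be carried out entirely in log-space, where each flip adds a vector of log-likelihoods instead of multiplying one in (a sketch, using base 2 so each term is measured in bits):

```python
from math import log2

log_odds = [log2(x) for x in (3, 2, 1)]  # log of the prior odds (3 : 2 : 1)
heads = [2, 3, 1]
tails = [2, 1, 3]

# Each piece of evidence is an addition rather than a multiplication.
for flip in (tails, heads, tails):  # observed sequence THT
    log_odds = [lo + log2(l) for lo, l in zip(log_odds, flip)]

# Exponentiating recovers the posterior odds (24 : 6 : 9).
posterior_odds = [2 ** lo for lo in log_odds]
```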

Naive Bayes

Multiplying an array of odds by an array of likelihoods is the idiom used in Bayesian spam filters. Suppose that there are three categories of email, "Business", "Personal", and "Spam", and that the user hand-labeling the last 100 emails has labeled 50 as Business, 30 as Personal, and 20 as Spam. The word "buy" has appeared in 10 Business emails, 3 Personal emails, and 10 Spam emails. The word "rationality" has appeared in 30 Business emails, 15 Personal emails, and 1 Spam email.

First, we assume that the frequencies in our data are representative of the 'true' frequencies. (Taken literally, if we see a word we've never seen before, we'll be multiplying by a zero probability. Good-Turing frequency estimation would do better.)

Second, we make the naive Bayes assumption that a spam email which contains the word "buy" is no more or less likely than any other spam email to contain the word "rationality", and so on with the other categories.

Then we'd filter a message containing the phrase "buy rationality" as follows:

Prior odds: \((5 : 3 : 2)\)

Likelihood ratio for "buy":

$$\left(\frac{10}{50} : \frac{3}{30} : \frac{10}{20}\right) = \left(\frac{1}{5} : \frac{1}{10} : \frac{1}{2}\right) = (2 : 1 : 5)$$

Likelihood ratio for "rationality":

$$\left(\frac{30}{50} : \frac{15}{30} : \frac{1}{20}\right) = \left(\frac{3}{5} : \frac{1}{2} : \frac{1}{20}\right) = (12 : 10 : 1)$$

Posterior odds:

$$(5 : 3 : 2) \times (2 : 1 : 5) \times (12 : 10 : 1) = (120 : 30 : 10) = \left(\frac{12}{16} : \frac{3}{16} : \frac{1}{16}\right)$$

comment: 12/16 is intentionally not in lowest form so that the 12 : 3 : 1 ratio can be clear.

This email would be 75% likely to be a business email, if the Naive Bayes assumptions are true. They're almost certainly not true, for reasons discussed in more detail below. But while Naive Bayes calculations are usually quantitatively wrong, they often point in the right qualitative direction: this email may indeed be more likely than not to be a business email.

(An actual implementation should add log-likelihoods rather than multiplying by ratios, so as not to risk floating-point overflow or underflow.)
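A minimal sketch of that log-space idiom, using the counts above (the dictionary layout and variable names are illustrative, not a real spam-filter API):

```python
from math import exp, log

label_counts = {"Business": 50, "Personal": 30, "Spam": 20}
word_counts = {
    "buy":         {"Business": 10, "Personal": 3,  "Spam": 10},
    "rationality": {"Business": 30, "Personal": 15, "Spam": 1},
}

# Start from log priors, then add one log-likelihood term per word.
scores = {label: log(n) for label, n in label_counts.items()}
for word in ("buy", "rationality"):
    for label in scores:
        scores[label] += log(word_counts[word][label] / label_counts[label])

# Normalize back to probabilities only once, at the end.
total = sum(exp(s) for s in scores.values())
posterior = {label: exp(s) / total for label, s in scores.items()}
# posterior["Business"] comes out at approximately 0.75, matching 12/16 above
```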

Non-naive multiple updates

To do a multiple update less naively, we must do the equivalent of asking about the probability that a Business email contains the word "rationality", given that it contained the word "buy".

As a real-life example, in a certain rationality workshop, one participant was observed to have taken another participant to a museum, and also, on a different day, to see their workplace. A betting market soon developed on whether the two were romantically involved. One participant argued that, as an eyeball estimate, someone was 12 times as likely to take a fellow participant to a museum, or to their workplace, if they were romantically involved, vs. just being strangers. They then multiplied their prior odds by a 12 : 1 likelihood ratio for the museum trip and another 12 : 1 likelihood ratio for the workplace trip, and concluded that these two were almost certainly romantically attracted.

It later turned out that the two were childhood acquaintances who were not romantically involved. What went wrong?

If we want to update hypotheses on multiple pieces of evidence, we need to mentally stay inside the world of each hypothesis, and condition the likelihood of future evidence on the evidence already observed. Suppose the two are not romantically attracted. We observe them visit a museum. Arguendo, we might indeed suppose that this has a probability of, say, 1% (we don't usually expect strangers to visit museums together), which might be about 1/12 the probability of making that observation if the two were romantically involved.

But after this, when we observe the workplace visit, we need to ask about the probability of the workplace visit, given that the two were not romantically attracted and that they visited a museum. It may well be that if two non-attracted people visit a museum together for whatever reason, they don't just have the default probability of a non-attracted couple of making a workplace visit. In other words:

$$\mathbb P({workplace}\mid \neg {romance} \wedge {museum}) \neq \mathbb P({workplace}\mid \neg {romance})$$

Naive Bayes, in contrast, would try to approximate the quantity \(\mathbb P({museum} \wedge {workplace} \mid \neg {romance})\) as the product of \(\mathbb P({museum}\mid \neg {romance}) \cdot \mathbb P({workplace}\mid \neg {romance}).\) This is what the participants did when they multiplied by a 1/12 likelihood ratio twice.

The result was a kind of double-counting of the evidence: they took into account the prior improbability of a random non-romantic couple "going places together" twice in a row, for the two pieces of evidence, and ended up performing a total update that was much too strong.

Naive Bayes spam filters often end up assigning ludicrously extreme odds, on the order of googols to one, that an email is spam or personal; and then they're sometimes wrong anyways. If an email contains the words "pharmaceutical" and "pharmacy", a spam filter will double-count the improbability of a personal email talking about pharmacies, rather than considering that if I actually do get a personal email talking about a pharmacy, it is much more likely to contain the word "pharmaceutical" as well. So because of the Naive Bayes assumption, naive Bayesian spam filters are not anything remotely like well-calibrated, and they update much too extremely on the evidence. On the other hand, they're often extreme in the correct qualitative direction: something assigned googol-to-one odds of being spam isn't always spam, but it might be spam, say, 99.999% of the time.

To do non-naive Bayesian updates on multiple pieces of evidence, just remember to mentally inhabit the world where the hypothesis is true, and then ask about the likelihood of each successive piece of evidence, in the world where the hypothesis is true and the previous pieces of evidence were observed. Don't ask, "What is the likelihood that a non-romantic couple would visit one person's workplace?" but "What is the likelihood that a non-romantic couple which previously visited a museum for some unknown reason would also visit the workplace?"

In our example with the coins in the bathtub, the likelihoods of the evidence were independent on each step: assuming a coin to be fair, it's no more or less likely to produce heads on the second flip after producing heads on the first flip. So in our bathtub-coins example, the Naive Bayes assumption was actually true.

