Bayes' rule: Vector form

Bayes’ rule in the odds form says that for every pair of hypotheses, their relative prior odds, times the relative likelihood of the evidence, equals the relative posterior odds.

Let \(\mathbf H\) be a vector of hypotheses \(H_1, H_2, \ldots.\) Because Bayes’ rule holds between every pair of hypotheses in \(\mathbf H,\) we can simply multiply an odds vector elementwise by a likelihood vector in order to get the correct posterior vector:

$$\mathbb O(\mathbf H) \times \mathcal L_e(\mathbf H) = \mathbb O(\mathbf H \mid e)$$

where \(\mathbb O(\mathbf H)\) is the vector of relative prior odds between all the \(H_i\), \(\mathcal L_e(\mathbf H)\) is the vector of relative likelihoods with which each \(H_i\) predicted \(e,\) and \(\mathbb O(\mathbf H \mid e)\) is the relative posterior odds between all the \(H_i.\)

In fact, we can keep multiplying by likelihood vectors to perform multiple updates at once:

$$\begin{array}{r} \mathbb O(\mathbf H) \\ \times\ \mathcal L_{e_1}(\mathbf H) \\ \times\ \mathcal L_{e_2}(\mathbf H \wedge e_1) \\ \times\ \mathcal L_{e_3}(\mathbf H \wedge e_1 \wedge e_2) \\ = \mathbb O(\mathbf H \mid e_1 \wedge e_2 \wedge e_3) \end{array}$$

For example, suppose there’s a bathtub full of coins. Half of the coins are “fair” and have a 50% probability of producing heads on each coinflip. A third of the coins are biased towards heads and produce heads 75% of the time. The remaining sixth of the coins are biased against heads, which they produce only 25% of the time. You pull out a coin at random, flip it 3 times, and get the result THT. What’s the chance that this was a fair coin?

We have three hypotheses, which we’ll call \(H_{fair},\) \(H_{heads},\) and \(H_{tails}\) respectively, with relative prior odds of \((1/2 : 1/3 : 1/6).\) The relative likelihoods that these three hypotheses assign to a coin landing heads are \((2 : 3 : 1)\); the relative likelihoods that they assign to a coin landing tails are \((2 : 1 : 3).\) Thus, the posterior odds for all three hypotheses are:

$$\begin{array}{rll} (1/2 : 1/3 : 1/6) = & (3 : 2 : 1) & \\ \times & (2 : 1 : 3) & \\ \times & (2 : 3 : 1) & \\ \times & (2 : 1 : 3) & \\ = & (24 : 6 : 9) & = (8 : 2 : 3) = (8/13 : 2/13 : 3/13) \end{array}$$

…so there is an 8/13 or ~62% probability that the coin is fair.
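The bathtub calculation can be checked with a short Python sketch. The helper names `update_odds` and `normalize` are illustrative choices, not anything defined by this page; exact fractions are used so the result can be compared directly with the 8/13 above.

```python
from fractions import Fraction

def update_odds(odds, *likelihoods):
    """Multiply an odds vector elementwise by each likelihood vector in turn."""
    for likelihood in likelihoods:
        odds = [o * l for o, l in zip(odds, likelihood)]
    return odds

def normalize(odds):
    """Convert relative odds into probabilities that sum to 1."""
    total = sum(odds)
    return [Fraction(o) / total for o in odds]

# Hypotheses in order: fair, heads-biased, tails-biased.
prior = [3, 2, 1]   # (1/2 : 1/3 : 1/6) rescaled to whole numbers
heads = [2, 3, 1]   # relative likelihoods of a coin landing heads
tails = [2, 1, 3]   # relative likelihoods of a coin landing tails

# Observed sequence THT: multiply everything, normalize once at the end.
posterior_odds = update_odds(prior, tails, heads, tails)
posterior = normalize(posterior_odds)

print(posterior_odds)  # [24, 6, 9]
print(posterior[0])    # 8/13
```

Because odds are only defined up to a common factor, rescaling the prior to \((3 : 2 : 1)\) before multiplying does not change the normalized answer.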

If you were only familiar with the probability form of Bayes’ rule, which only works for one hypothesis at a time and which only uses probabilities (and so normalizes the odds into probabilities at every step)…

$$\mathbb P(H_i\mid e) = \dfrac{\mathbb P(e\mid H_i)\,\mathbb P(H_i)}{\sum_k \mathbb P(e\mid H_k)\,\mathbb P(H_k)}$$

…then you might have had some gratuitous difficulty solving this problem.

Also, if you hear the idiom of “convert to odds, multiply lots and lots of things, convert back to probabilities” and think “hmm, this sounds like a place where transforming into log-space (where all multiplications become additions) might yield efficiency gains,” then congratulations, you just invented the log-odds form of Bayes’ rule. Not only is it efficient, it also gives rise to a natural unit of measure for “strength of evidence” and “strength of belief”.
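As a rough sketch of that idea, here is the bathtub-coins update redone in log-space, where each multiplication becomes an addition. The function names are hypothetical, and natural logarithms are an arbitrary choice made for this example (any base works, as long as it is used consistently).

```python
import math

def log_update(log_odds, *log_likelihoods):
    """Add each log-likelihood vector to the log-odds vector, elementwise."""
    for log_likelihood in log_likelihoods:
        log_odds = [a + b for a, b in zip(log_odds, log_likelihood)]
    return log_odds

def to_probabilities(log_odds):
    """Exponentiate and normalize; subtracting the max first avoids overflow."""
    largest = max(log_odds)
    weights = [math.exp(x - largest) for x in log_odds]
    total = sum(weights)
    return [w / total for w in weights]

# Hypotheses in order: fair, heads-biased, tails-biased.
log_prior = [math.log(x) for x in (3, 2, 1)]
log_heads = [math.log(x) for x in (2, 3, 1)]
log_tails = [math.log(x) for x in (2, 1, 3)]

# Observed sequence THT.
probs = to_probabilities(log_update(log_prior, log_tails, log_heads, log_tails))
print(round(probs[0], 4))  # 0.6154, i.e. 8/13
```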

Naive Bayes

Multiplying an array of odds by an array of likelihoods is the idiom used in Bayesian spam filters. Suppose that there are three categories of email, “Business”, “Personal”, and “Spam”, and that the user hand-labeling the last 100 emails has labeled 50 as Business, 30 as Personal, and 20 as Spam. The word “buy” has appeared in 10 Business emails, 3 Personal emails, and 10 Spam emails. The word “rationality” has appeared in 30 Business emails, 15 Personal emails, and 1 Spam email.

First, we assume that the frequencies in our data are representative of the ‘true’ frequencies. (Taken literally, if we see a word we’ve never seen before, we’ll be multiplying by a zero probability. Good-Turing frequency estimation would do better.)

Second, we make the naive Bayes assumption that a spam email which contains the word “buy” is no more or less likely than any other spam email to contain the word “rationality”, and so on with the other categories.

Then we’d filter a message containing the phrase “buy rationality” as follows:

Prior odds: \((5 : 3 : 2)\)

Likelihood ratio for “buy”:

$$\left(\frac{10}{50} : \frac{3}{30} : \frac{10}{20}\right) = \left(\frac{1}{5} : \frac{1}{10} : \frac{1}{2}\right) = (2 : 1 : 5)$$

Likelihood ratio for “rationality”:

$$\left(\frac{30}{50} : \frac{15}{30} : \frac{1}{20}\right) = \left(\frac{3}{5} : \frac{1}{2} : \frac{1}{20}\right) = (12 : 10 : 1)$$

Posterior odds:

$$(5 : 3 : 2) \times (2 : 1 : 5) \times (12 : 10 : 1) = (120 : 30 : 10) = \left(\frac{12}{16} : \frac{3}{16} : \frac{1}{16}\right)$$

(The fraction 12/16 is intentionally not reduced to lowest terms, so that the 12 : 3 : 1 ratio stays visible.)

This email would be 75% likely to be a business email, if the Naive Bayes assumptions are true. They’re almost certainly not true, for reasons discussed in more detail below. But while Naive Bayes calculations are usually quantitatively wrong, they often point in the right qualitative direction—this email may indeed be more likely than not to be a business email.

(An actual implementation should add log-likelihoods rather than multiplying by ratios, so as not to risk floating-point overflow or underflow.)
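A minimal sketch of the “buy rationality” calculation using summed log-likelihoods might look like the following. The data layout and the `classify` function are assumptions made for illustration, and the raw-frequency estimate is kept deliberately, with no smoothing, so that it reproduces the numbers above.

```python
import math

# Hand-labeled counts from the example.
category_counts = {"Business": 50, "Personal": 30, "Spam": 20}
word_counts = {
    "buy":         {"Business": 10, "Personal": 3,  "Spam": 10},
    "rationality": {"Business": 30, "Personal": 15, "Spam": 1},
}

def classify(words):
    """Return per-category posterior probabilities under the naive Bayes assumption."""
    # Start from the log prior odds (the category frequencies).
    scores = {c: math.log(n) for c, n in category_counts.items()}
    for word in words:
        for c in scores:
            # Raw frequency estimate; a zero count would raise ValueError here.
            scores[c] += math.log(word_counts[word][c] / category_counts[c])
    # Normalize only once, at the end.
    largest = max(scores.values())
    weights = {c: math.exp(s - largest) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

posterior = classify(["buy", "rationality"])
print({c: round(p, 4) for c, p in posterior.items()})
# {'Business': 0.75, 'Personal': 0.1875, 'Spam': 0.0625}
```

With only two words the logarithms are unnecessary, but with hundreds of words per email the raw product of small likelihoods can underflow to zero, while the sum of logs stays representable.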

Non-naive multiple updates

To do a multiple update less naively, we must do the equivalent of asking about the probability that a Business email contains the word “rationality”, given that it contained the word “buy”.

As a real-life example, in a certain rationality workshop, one participant was observed to have taken another participant to a museum, and also, on a different day, to see their workplace. A betting market soon developed on whether the two were romantically involved. One participant argued that, as an eyeball estimate, someone was 12 times as likely to take a fellow participant to a museum, or to their workplace, if they were romantically involved, vs. just being strangers. They then multiplied their prior odds by a 12 : 1 likelihood ratio for the museum trip and another 12 : 1 likelihood ratio for the workplace trip, and concluded that these two were almost certainly romantically attracted.

It later turned out that the two were childhood acquaintances who were not romantically involved. What went wrong?

If we want to update hypotheses on multiple pieces of evidence, we need to mentally stay inside the world of each hypothesis, and condition the likelihood of future evidence on the evidence already observed. Suppose the two are not romantically attracted. We observe them visit a museum. Arguendo, we might indeed suppose that this has a probability of, say, 1% (we don’t usually expect strangers to visit museums together), which might be about 1/12 the probability of making that observation if the two were romantically involved.

But after this, when we observe the workplace visit, we need to ask about the probability of the workplace visit, given that the two were not romantically attracted and that they visited a museum. This might suggest that if two non-attracted people visit a museum together for whatever reason, they don’t just have the default probability of a non-attracted couple of making a workplace visit. In other words:

$$\mathbb P({workplace}\mid \neg {romance} \wedge {museum}) \neq \mathbb P({workplace}\mid \neg {romance})$$

Naive Bayes, in contrast, would try to approximate the quantity \(\mathbb P({museum} \wedge {workplace} \mid \neg {romance})\) as the product \(\mathbb P({museum}\mid \neg {romance}) \cdot \mathbb P({workplace}\mid \neg {romance}).\) This is what the participants did when they multiplied by a 1/12 likelihood ratio twice.

The result was a kind of double-counting of the evidence — they took into account the prior improbability of a random non-romantic couple “going places together” twice in a row, for the two pieces of evidence, and ended up performing a total update that was much too strong.
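For comparison, the exact joint likelihood follows from the chain rule of probability, with the second factor conditioned on the evidence already seen:

$$\mathbb P({museum} \wedge {workplace} \mid \neg {romance}) = \mathbb P({museum}\mid \neg {romance}) \cdot \mathbb P({workplace}\mid \neg {romance} \wedge {museum})$$

The naive product agrees with this only when the museum trip carries no information about the workplace trip within the non-romance world.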

Naive Bayes spam filters often end up assigning ludicrously extreme odds, on the order of googols to one, that an email is spam or personal; and then they’re sometimes wrong anyways. If an email contains the words “pharmaceutical” and “pharmacy”, a spam filter will double-count the improbability of a personal email talking about pharmacies, rather than considering that if I actually do get a personal email talking about a pharmacy, it is much more likely to contain the word “pharmaceutical” as well. So because of the Naive Bayes assumption, naive Bayesian spam filters are not anything remotely like well-calibrated, and they update much too extremely on the evidence. On the other hand, they’re often extreme in the correct qualitative direction: something assigned googol-to-one odds of being spam isn’t always spam, but it might be spam, say, 99.999% of the time.

To do non-naive Bayesian updates on multiple pieces of evidence, just remember to mentally inhabit the world where the hypothesis is true, and then ask about the likelihood of each successive piece of evidence, in the world where the hypothesis is true and the previous pieces of evidence were observed. Don’t ask, “What is the likelihood that a non-romantic couple would visit one person’s workplace?” but “What is the likelihood that a non-romantic couple which previously visited a museum for some unknown reason would also visit the workplace?”
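To make the difference concrete, here is a small sketch with made-up numbers. Only the 12 : 1 ratios come from the story; every individual probability, and in particular the conditional value of 0.06, is a hypothetical chosen purely for illustration.

```python
# All probabilities below are illustrative assumptions, not measurements.
p_museum = {"romance": 0.12, "no_romance": 0.01}      # 12 : 1, as in the story
p_workplace = {"romance": 0.12, "no_romance": 0.01}   # unconditioned, also 12 : 1

# Conditioned on the museum trip having already happened (hypothetical values):
# a non-romantic pair that already went to a museum together is assumed to be
# much more likely than a random pair to also make a workplace visit.
p_workplace_given_museum = {"romance": 0.12, "no_romance": 0.06}

def ratio(p):
    """Likelihood ratio in favor of romance over no romance."""
    return p["romance"] / p["no_romance"]

naive = ratio(p_museum) * ratio(p_workplace)
non_naive = ratio(p_museum) * ratio(p_workplace_given_museum)

print(round(naive), round(non_naive))  # 144 24
```

Under these assumed numbers the naive update is six times stronger than the conditioned one, which is the double-counting described above.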

In our example with the coins in the bathtub, the likelihoods of the evidence were independent on each step—assuming a coin to be fair, it’s no more or less likely to produce heads on the second flip after producing heads on the first flip. So in our bathtub-coins example, the Naive Bayes assumption was actually true.

