Bayes' rule: Vector form


Bayes’ rule in the odds form says that for every pair of hypotheses, their relative prior odds, times the relative likelihood of the evidence, equals the relative posterior odds.

Let $$\mathbf H$$ be a vector of hypotheses $$H_1, H_2, \ldots$$ Because Bayes’ rule holds between every pair of hypotheses in $$\mathbf H,$$ we can simply multiply an odds vector by a likelihood vector in order to get the correct posterior vector:

$$\mathbb O(\mathbf H) \times \mathcal L_e(\mathbf H) = \mathbb O(\mathbf H \mid e)$$


where $$\mathbb O(\mathbf H)$$ is the vector of relative prior odds between all the $$H_i$$, $$\mathcal L_e(\mathbf H)$$ is the vector of relative likelihoods with which each $$H_i$$ predicted $$e,$$ and $$\mathbb O(\mathbf H \mid e)$$ is the relative posterior odds between all the $$H_i.$$
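The multiplication here is componentwise: each hypothesis's prior odds are multiplied by that same hypothesis's likelihood, and the result is again a vector with one entry per hypothesis. A minimal sketch of that operation follows; the function name `update_odds` and the example numbers are illustrative assumptions, not part of the original page.

```python
from typing import List, Sequence


def update_odds(odds: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """Componentwise product of an odds vector and a likelihood vector.

    This is neither a dot product nor a cross product: entry i of the
    result depends only on entry i of each input.
    """
    if len(odds) != len(likelihoods):
        raise ValueError("need exactly one likelihood per hypothesis")
    return [o * l for o, l in zip(odds, likelihoods)]


# Illustrative numbers only: three hypotheses with prior odds 3 : 2 : 1
# and relative likelihoods 2 : 1 : 3 for some piece of evidence.
posterior_odds = update_odds([3, 2, 1], [2, 1, 3])
```

Because the output has the same shape as the input, it can be fed straight back into `update_odds` with the next likelihood vector.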

In fact, we can keep multiplying by likelihood vectors to perform multiple updates at once:

$$\begin{array}{r} \mathbb O(\mathbf H) \\ \times\ \mathcal L_{e_1}(\mathbf H) \\ \times\ \mathcal L_{e_2}(\mathbf H \wedge e_1) \\ \times\ \mathcal L_{e_3}(\mathbf H \wedge e_1 \wedge e_2) \\ = \mathbb O(\mathbf H \mid e_1 \wedge e_2 \wedge e_3) \end{array}$$

For example, suppose there’s a bathtub full of coins. Half of the coins are “fair” and have a 50% probability of producing heads on each coinflip. A third of the coins are biased towards heads and produce heads 75% of the time. The remaining sixth are biased against heads, which they produce only 25% of the time. You pull out a coin at random, flip it 3 times, and get the result THT. What’s the chance that this was a fair coin?

We have three hypotheses, which we’ll call $$H_{fair},$$ $$H_{heads},$$ and $$H_{tails},$$ with relative prior odds of $$(1/2 : 1/3 : 1/6)$$ respectively. The relative likelihoods that these three hypotheses assign to a coin landing heads are $$(2 : 3 : 1)$$; the relative likelihoods that they assign to a coin landing tails are $$(2 : 1 : 3).$$ Thus, the posterior odds for all three hypotheses are:

$$\begin{array}{rll} (1/2 : 1/3 : 1/6) = & (3 : 2 : 1) & \\ \times & (2 : 1 : 3) & \\ \times & (2 : 3 : 1) & \\ \times & (2 : 1 : 3) & \\ = & (24 : 6 : 9) & = (8 : 2 : 3) = (8/13 : 2/13 : 3/13) \end{array}$$

…so there is an 8/13 or ~62% probability that the coin is fair.
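The bathtub calculation can be checked mechanically. The sketch below uses exact fractions so that no rounding creeps in; the helper names `multiply` and `normalize` are assumptions made for illustration.

```python
from fractions import Fraction
from functools import reduce


def multiply(a, b):
    """Componentwise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


def normalize(odds):
    """Turn relative odds into probabilities that sum to 1."""
    total = sum(odds)
    return [Fraction(o) / total for o in odds]


# Order of hypotheses: fair, heads-biased, tails-biased.
prior = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
likelihood = {"H": [2, 3, 1], "T": [2, 1, 3]}

# Multiply by one likelihood vector per flip, normalizing only at the end.
posterior_odds = reduce(multiply, (likelihood[flip] for flip in "THT"), prior)
posterior = normalize(posterior_odds)
```

Here `posterior` comes out as 8/13, 2/13, and 3/13 for the fair, heads-biased, and tails-biased coins respectively.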

If you were only familiar with the probability form of Bayes’ rule, which only works for one hypothesis at a time and which only uses probabilities (and so normalizes the odds into probabilities at every step)…

$$\mathbb P(H_i\mid e) = \dfrac{\mathbb P(e\mid H_i)\,\mathbb P(H_i)}{\sum_k \mathbb P(e\mid H_k)\,\mathbb P(H_k)}$$

…then you might have had some gratuitous difficulty solving this problem.
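For comparison, here is a sketch of the same bathtub problem worked in the probability form, renormalizing after every flip. It reaches the same answer, just with more arithmetic along the way. The function name `bayes_step` is an illustrative assumption.

```python
from fractions import Fraction


def bayes_step(probs, likelihoods):
    """One update in the probability form: multiply, then divide by P(e)."""
    joint = [p * l for p, l in zip(probs, likelihoods)]
    evidence = sum(joint)  # the denominator, summed over all hypotheses
    return [j / evidence for j in joint]


# Order of hypotheses: fair, heads-biased, tails-biased.
p_heads = [Fraction(1, 2), Fraction(3, 4), Fraction(1, 4)]
p_tails = [1 - p for p in p_heads]

probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
for flip in "THT":
    probs = bayes_step(probs, p_tails if flip == "T" else p_heads)
```

The three intermediate normalizations cancel out in the final ratios, which is why the odds form can skip them.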

Also, if you hear the idiom of “convert to odds, multiply lots and lots of things, convert back to probabilities” and think “hmm, this sounds like a place where transforming into log-space (where all multiplications become additions) might yield efficiency gains,” then congratulations, you just invented the log-odds form of Bayes’ rule. Not only is it efficient, it also gives rise to a natural unit of measure for “strength of evidence” and “strength of belief”.
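As a rough sketch of that idea, the bathtub update can be done by adding logarithms instead of multiplying odds. Natural logarithms are used here purely for convenience; the choice of base is an assumption of this example, and any base works as long as it is used consistently.

```python
import math

# Order of hypotheses: fair, heads-biased, tails-biased.
log_prior = [math.log(x) for x in (3, 2, 1)]
log_likelihood = {
    "H": [math.log(x) for x in (2, 3, 1)],
    "T": [math.log(x) for x in (2, 1, 3)],
}

# Each update is now a componentwise addition.
log_posterior = list(log_prior)
for flip in "THT":
    log_posterior = [lp + ll for lp, ll in zip(log_posterior, log_likelihood[flip])]

# Convert back to probabilities once, at the end. Subtracting the maximum
# first keeps exp() in a safe range when the log-odds are extreme.
shift = max(log_posterior)
weights = [math.exp(x - shift) for x in log_posterior]
probs = [w / sum(weights) for w in weights]
```

Up to floating-point error, `probs[0]` agrees with the 8/13 found above.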

Naive Bayes

Multiplying an array of odds by an array of likelihoods is the idiom used in Bayesian spam filters. Suppose that there are three categories of email, “Business”, “Personal”, and “Spam”, and that the user hand-labeling the last 100 emails has labeled 50 as Business, 30 as Personal, and 20 as Spam. The word “buy” has appeared in 10 Business emails, 3 Personal emails, and 10 Spam emails. The word “rationality” has appeared in 30 Business emails, 15 Personal emails, and 1 Spam email.

First, we assume that the frequencies in our data are representative of the ‘true’ frequencies. (Taken literally, if we see a word we’ve never seen before, we’ll be multiplying by a zero probability. Good-Turing frequency estimation would do better.)

Second, we make the naive Bayes assumption that a spam email which contains the word “buy” is no more or less likely than any other spam email to contain the word “rationality”, and so on with the other categories.

Then we’d filter a message containing the phrase “buy rationality” as follows:

Prior odds: $$(5 : 3 : 2)$$

Likelihood ratio for “buy”:

$$\left(\frac{10}{50} : \frac{3}{30} : \frac{10}{20}\right) = \left(\frac{1}{5} : \frac{1}{10} : \frac{1}{2}\right) = (2 : 1 : 5)$$

Likelihood ratio for “rationality”:

$$\left(\frac{30}{50} : \frac{15}{30} : \frac{1}{20}\right) = \left(\frac{3}{5} : \frac{1}{2} : \frac{1}{20}\right) = (12 : 10 : 1)$$

Posterior odds:

$$(5 : 3 : 2) \times (2 : 1 : 5) \times (12 : 10 : 1) = (120 : 30 : 10) = \left(\frac{12}{16} : \frac{3}{16} : \frac{1}{16}\right)$$

(The fraction 12/16 is intentionally not in lowest terms, so that the 12 : 3 : 1 ratio stays visible; the denominator is 12 + 3 + 1 = 16.)

This email would be 75% likely to be a business email, if the Naive Bayes assumptions are true. They’re almost certainly not true, for reasons discussed in more detail below. But while Naive Bayes calculations are usually quantitatively wrong, they often point in the right qualitative direction—this email may indeed be more likely than not to be a business email.

(An actual implementation should add log-likelihoods rather than multiplying by ratios, so as not to risk floating-point overflow or underflow.)
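A minimal sketch of such a log-space filter, using the counts from the example above, might look like the following. The names `category_counts`, `word_counts`, and `classify` are assumptions for illustration, and the code does no smoothing, so a word with a zero count in some category would make `math.log` raise an error.

```python
import math

# Hand-labeled counts from the example above.
category_counts = {"Business": 50, "Personal": 30, "Spam": 20}
word_counts = {
    "buy": {"Business": 10, "Personal": 3, "Spam": 10},
    "rationality": {"Business": 30, "Personal": 15, "Spam": 1},
}


def classify(words):
    """Return naive Bayes posterior probabilities for each category."""
    log_scores = {}
    for category, n in category_counts.items():
        score = math.log(n)  # log prior odds
        for word in words:
            # Naive Bayes assumption: add each word's log-likelihood
            # as if the words were independent given the category.
            score += math.log(word_counts[word][category] / n)
        log_scores[category] = score
    # Normalize once at the end; subtract the max for numerical safety.
    shift = max(log_scores.values())
    weights = {c: math.exp(s - shift) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


result = classify(["buy", "rationality"])
```

For the message “buy rationality” this reproduces the 12 : 3 : 1 odds, i.e. roughly 75% Business, 18.75% Personal, and 6.25% Spam.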

To do a multiple update less naively, we must do the equivalent of asking about the probability that a Business email contains the word “rationality”, given that it contained the word “buy”.

As a real-life example, in a certain rationality workshop, one participant was observed to have taken another participant to a museum, and also, on a different day, to see their workplace. A betting market soon developed on whether the two were romantically involved. One participant argued that, as an eyeball estimate, someone was 12 times as likely to take a fellow participant to a museum, or to their workplace, if they were romantically involved, vs. just being strangers. They then multiplied their prior odds by a 12 : 1 likelihood ratio for the museum trip and another 12 : 1 likelihood ratio for the workplace trip, and concluded that these two were almost certainly romantically attracted.

It later turned out that the two were childhood acquaintances who were not romantically involved. What went wrong?

If we want to update hypotheses on multiple pieces of evidence, we need to mentally stay inside the world of each hypothesis, and condition the likelihood of future evidence on the evidence already observed. Suppose the two are not romantically attracted. We observe them visit a museum. Arguendo, we might indeed suppose that this has a probability of, say, 1% (we don’t usually expect strangers to visit museums together) which might be about 1/12 the probability of making that observation if the two were romantically involved.

But after this, when we observe the workplace visit, we need to ask about the probability of the workplace visit, given that the two were not romantically attracted and that they visited a museum. This might suggest that if two non-attracted people visit a museum together for whatever reason, they don’t just have the default probability of a non-attracted couple of making a workplace visit. In other words:

$$\mathbb P({workplace}\mid \neg {romance} \wedge {museum}) \neq \mathbb P({workplace}\mid \neg {romance})$$

Naive Bayes, in contrast, would try to approximate the quantity $$\mathbb P({museum} \wedge {workplace} \mid \neg {romance})$$ as the product of $$\mathbb P({museum}\mid \neg {romance}) \cdot \mathbb P({workplace}\mid \neg {romance}).$$ This is what the participants did when they multiplied by a 1/12 likelihood ratio twice.

The result was a kind of double-counting of the evidence — they took into account the prior improbability of a random non-romantic couple “going places together” twice in a row, for the two pieces of evidence, and ended up performing a total update that was much too strong.

Naive Bayes spam filters often end up assigning ludicrously extreme odds, on the order of googols to one, that an email is spam or personal; and then they’re sometimes wrong anyways. If an email contains the words “pharmaceutical” and “pharmacy”, a spam filter will double-count the improbability of a personal email talking about pharmacies, rather than considering that if I actually do get a personal email talking about a pharmacy, it is much more likely to contain the word “pharmaceutical” as well. So because of the Naive Bayes assumption, naive Bayesian spam filters are not anything remotely like well-calibrated, and they update much too extremely on the evidence. On the other hand, they’re often extreme in the correct qualitative direction: something assigned googol-to-one odds of being spam isn’t always spam, but it might be spam, say, 99.999% of the time.

To do non-naive Bayesian updates on multiple pieces of evidence, just remember to mentally inhabit the world where the hypothesis is true, and then ask about the likelihood of each successive piece of evidence, in the world where the hypothesis is true and the previous pieces of evidence were observed. Don’t ask, “What is the likelihood that a non-romantic couple would visit one person’s workplace?” but “What is the likelihood that a non-romantic couple which previously visited a museum for some unknown reason would also visit the workplace?”
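To make the difference concrete, here is a small sketch comparing the naive update with a conditioned one for the museum and workplace example. Only the 1% figure and the 12 : 1 ratio come from the text; every other number, in particular the 6% chance of a workplace visit for a non-romantic pair who already visited a museum together, is a made-up assumption chosen for illustration.

```python
# P(museum | hypothesis), from the eyeball estimates in the text.
p_museum = {"romance": 0.12, "no_romance": 0.01}

# Naive version: reuse the unconditional 12 : 1 estimate for the workplace.
p_workplace_naive = {"romance": 0.12, "no_romance": 0.01}

# Conditioned version: P(workplace | hypothesis AND museum).
# HYPOTHETICAL numbers: a non-romantic pair who already went to a museum
# together plausibly has some other reason to go places together.
p_workplace_given_museum = {"romance": 0.12, "no_romance": 0.06}


def likelihood_ratio(first, second):
    """Combined likelihood ratio (romance : no romance) for two observations."""
    for_romance = first["romance"] * second["romance"]
    against_romance = first["no_romance"] * second["no_romance"]
    return for_romance / against_romance


naive_ratio = likelihood_ratio(p_museum, p_workplace_naive)
conditioned_ratio = likelihood_ratio(p_museum, p_workplace_given_museum)
```

Under these assumed numbers the naive update is 144 : 1 while the conditioned update is only 24 : 1, which illustrates how treating dependent evidence as independent overstates its strength.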

In our example with the coins in the bathtub, the likelihoods of the evidence were independent on each step—assuming a coin to be fair, it’s no more or less likely to produce heads on the second flip after producing heads on the first flip. So in our bathtub-coins example, the Naive Bayes assumption was actually true.


• It might be good to see an example worked out correctly; all we see here is an incorrect example.

• Are there going to be visual explanations put here for the examples? I found that quite helpful in the former pages. I’d say this is the first part of the new Bayes Guide that feels very similar (in terms of clarity) to the old one. Although, I might be biased as I’ve found I much prefer visual explanations of things.

• Where did the “16” come from in (12/16 : 3/16 : 1/16)?

• It’s 12 + 3 + 1. I’ll edit to make clearer, but your comment exposed a bug in our LaTeX parsing so I’m waiting to edit until that resolves. :)

• What’s the bathtub coins example? I’ve read the entire advanced sequence up to here and I don’t remember reading about that. Maybe it was edited and removed? (Or maybe I wasn’t paying attention or something?)

• I believe that this should be $$(2 : 3 : 1)$$ rather than $$(3 : 2 : 1)$$.

• I believe it should be, “the two were not romantically attracted” as that is consistent with the formula below.

• Actually, there should be diagonal matrices instead of vectors. Cross product doesn’t work like this, and dot product gives us a sum of coordinates of the vector we need instead of the vector itself, so we can’t continue updating our probabilities (or make any sense of the result). Diagonal matrices, on the other hand, do exactly what we need: $$C = AB;\ c_{ii} = a_{ii} \cdot b_{ii};\ \forall i \neq j,\ c_{ij} = 0$$.

• Arguendo: more random non-common latin. Consider “For the sake of argument” or “Perhaps”

• I find the entire explanation described below very misleading and perhaps even largely incorrect. The workshop participants had it wrong mostly for two reasons:

1. They did not consider the likelihood of visiting a museum or workplace under any other alternative (mutually exclusive) relationship: not strangers, but also not romantically involved, i.e., friends. Being acquaintances is not a relevant type of relationship, as it is not mutually exclusive with a romantic relationship (a pair can be both dating and working together).

2. They did not know the prior probability of an arbitrary pair of people being romantically involved. A naive assumption that 50% of pairs are romantically involved is wrong; the prior should instead be estimated from the proportion of romantic relationships in the population.

In terms of the previous coins-fairness example, they (a) only considered that one type of coin (fair) is 2 times as likely to turn up heads as another type of coin (tail-biased), but did not consider how likely are the other type of coins (head-biased) to turn up heads; and (b) they did not know the proportions of coin types in the bathtub.

The explanation below also fails to mention the important assumption that the trait being assessed in all of the examples (coins, emails, workshop) is constant and doesn’t change over time. It is important to mention because it may not be so trivial for every example, yet it reduces the complexity of the estimations tremendously. A coin is not expected to change its bias significantly over time, yet a relationship does, and so does the magnitude of “spamness” in a given mail for a given person (for instance, when I get older I may be more interested in pharmaceutical ads).

• I believe it is essential to explain why it is independent in the case of the bathtub example and not in the other examples.

In the bathtub example, the evidence presents an event which is directly described by the assessed trait; i.e., the fairness of a coin is directly concerned with the appearance of either heads or tails. In contrast, the definition of the degree of “spamness” in an email is not directly concerned with the appearance of a word in the email, but is rather concerned with the abstract concept of the meaning a person assigns to the email.

The appearance of a word in an email is hence only an attempt at estimating the degree of “spamness”, a proxy. In the case of a proxy, we need to consider the possibility that the proxy is flawed in a way that makes the pieces of evidence depend on one another. This is not necessarily true, but it is possible, unlike in the case of hypothetical coins (in reality, a coin toss might actually be physically affected by the previous toss).