Bayes' rule: Log-odds form

The odds form of Bayes’ rule states that the prior odds times the likelihood ratio equals the posterior odds. We can take the log of both sides of this equation, yielding an equivalent equation that uses addition instead of multiplication.

Letting \(H_i\) and \(H_j\) denote hypotheses and \(e\) denote evidence, the log-odds form of Bayes’ rule states:

$$ \log \left ( \dfrac {\mathbb P(H_i\mid e)} {\mathbb P(H_j\mid e)} \right ) = \log \left ( \dfrac {\mathbb P(H_i)} {\mathbb P(H_j)} \right ) + \log \left ( \dfrac {\mathbb P(e\mid H_i)} {\mathbb P(e\mid H_j)} \right ). $$

This can be numerically convenient when you’re carrying out lots of updates one after another. But a more important reason to think in log odds is to get a better grasp on the notion of ‘strength of evidence’.
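For concreteness, here is a minimal Python sketch of this rule (the function and variable names are ours, purely illustrative): updating on several independent pieces of evidence is just summing their log likelihood ratios.

```python
import math

def posterior_log_odds(prior_log_odds, log_likelihood_ratios):
    """Log-odds form of Bayes' rule: posterior = prior + sum of evidence."""
    return prior_log_odds + sum(log_likelihood_ratios)

# Prior odds of 1:1, plus evidence with likelihood ratios 4:1 and 2:1,
# all measured in bits (logarithms base 2).
prior = math.log2(1 / 1)                          # 0 bits
evidence = [math.log2(4 / 1), math.log2(2 / 1)]   # +2 bits, +1 bit
print(posterior_log_odds(prior, evidence))        # 3.0 bits, i.e. odds of 8:1
```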

Logarithms of likelihood ratios

Suppose you’re visiting your friends Andrew and Betty, who are a couple. They promised that one of them would pick you up from the airport when you arrive. You’re not sure which one is in fact going to pick you up (prior odds of 50:50), but you do know three things:

  1. They have both a blue car and a red car. Andrew prefers to drive the blue car, Betty prefers to drive the red car, but the correlation is relatively weak. (Sometimes, which car they drive depends on which one their child is using.) Andrew is 2x as likely to drive the blue car as Betty.

  2. Betty tends to honk the horn at you to get your attention. Andrew does this too, but less often. Betty is 4x as likely to honk as Andrew.

  3. Andrew tends to run a little late (more often than Betty). Betty is 2x as likely to have the car already at the airport when you arrive.

All three observations are independent as far as you know (that is, you don’t think Betty’s any more or less likely to be late if she’s driving the blue car, and so on).

Let’s say we see a blue car, already at the airport, which honks.

The odds form of this calculation would be a \((1 : 1)\) prior for Betty vs. Andrew, times likelihood ratios of \((1 : 2) \times (4 : 1) \times (2 : 1),\) yielding posterior odds of \((1 \times 4 \times 2 : 2 \times 1 \times 1) = (8 : 2) = (4 : 1)\), so it’s \(\frac{4}{5} = 80\%\) likely to be Betty.

Here’s the log odds form of the same calculation, using 1 bit to denote each factor of \(2\) in belief or evidence:

  • Prior belief in Betty of \(\log_2 (\frac{1}{1}) = 0\) bits.

  • Evidence of \(\log_2 (\frac{1}{2}) = {-1}\) bits for Betty (that is, 1 bit against).

  • Evidence of \(\log_2 (\frac{4}{1}) = {+2}\) bits for Betty.

  • Evidence of \(\log_2 (\frac{2}{1}) = {+1}\) bit for Betty.

  • Posterior belief of \(0 + (-1) + (+2) + (+1) = +2\) bits that Betty is picking us up.

If your posterior belief is +2 bits, then your posterior odds are \((2^{+2} : 1) = (4 : 1),\) yielding a posterior probability of 80% that Betty is picking you up.
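The same airport calculation can be checked with a few lines of Python (again, a sketch with illustrative names):

```python
import math

# Prior of 1:1 for Betty, then likelihood ratios of 1:2 (blue car),
# 4:1 (honking), and 2:1 (car already at the airport).
bits = math.log2(1 / 1) + math.log2(1 / 2) + math.log2(4 / 1) + math.log2(2 / 1)
odds = 2 ** bits                    # 4.0, i.e. posterior odds of 4:1
print(bits, odds / (1 + odds))      # 2.0 bits, probability 0.8
```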

Evidence and belief represented this way is additive, which can make it an easier fit for intuitions about “strength of credence” and “strength of evidence”; we’ll soon develop this point in further depth.

The log-odds line

Imagine you start out thinking that the hypothesis \(H\) is just as likely as \(\lnot H,\) its negation. Then you get five separate independent \(2 : 1\) updates in favor of \(H.\) What happens to your probabilities?

Your odds (for \(H\)) go from \((1 : 1)\) to \((2 : 1)\) to \((4 : 1)\) to \((8 : 1)\) to \((16 : 1)\) to \((32 : 1).\)

Thus, your probabilities go from \(\frac{1}{2} = 50\%\) to \(\frac{2}{3} \approx 67\%\) to \(\frac{4}{5} = 80\%\) to \(\frac{8}{9} \approx 89\%\) to \(\frac{16}{17} \approx 94\%\) to \(\frac{32}{33} \approx 97\%.\)
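A short Python loop (illustrative, not from the original text) reproduces this sequence:

```python
# Five successive independent 2:1 updates, starting from 1:1 odds.
odds = 1.0
for n in range(6):
    print(f"{n} updates: odds {odds:g} : 1, probability {odds / (1 + odds):.0%}")
    odds *= 2
# 0 updates: 1:1, 50% ... 5 updates: 32:1, 97%
```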

Graphically representing these changing probabilities on a line that goes from 0 to 1:

[Figure: the 0 to 1 probability line after 0, 3, and 5 updates, with the marks scrunching up toward the right end.]

We observe that the probabilities approach 1 but never get there — they just keep stepping across a fraction of the remaining distance, eventually getting all scrunched up near the right end.

If we instead convert the probabilities into log-odds, the story is much nicer. 50% probability becomes 0 bits of credence, and every independent \((2 : 1)\) observation in favor of \(H\) shifts belief by one unit along the line.

[Figure: the log-odds line, where each \(2 : 1\) update moves belief one bit to the right.]

As for what happens when we approach the end of the line: there isn’t one! 0% probability becomes \(-\infty\) bits of credence and 100% probability becomes \(+\infty\) bits of credence. (For readers with more math background: this un-scrunching of the interval \((0,1)\) into the entire real line is done by the logit function, the inverse of the logistic function.)
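In code, the conversion in both directions might look like this (a minimal sketch; the names prob_to_bits and bits_to_prob are ours):

```python
import math

def prob_to_bits(p):
    """Probability to log-odds in bits (the base-2 logit function)."""
    return math.log2(p / (1 - p))

def bits_to_prob(b):
    """Log-odds in bits back to probability (the base-2 logistic function)."""
    return 1 / (1 + 2 ** (-b))

print(prob_to_bits(0.5))   # 0.0 bits
print(prob_to_bits(0.8))   # +2.0 bits
print(bits_to_prob(5))     # 0.9696..., i.e. 32/33
```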

Intuitions about the log-odds line

There are a number of ways in which this infinite log-odds line is a better place to anchor your intuitions about “belief” than the usual \([0, 1]\) probability interval. For example:

  • Evidence that you are twice as likely to see if the hypothesis is true as if it is false is \(+1\) bit of evidence and produces a \(+1\)-bit update, regardless of how confident or unconfident you were to start with; the strength of new evidence, and the distance we update, shouldn’t depend on our prior belief.

  • If your credence in something is 0 bits (neither positive nor negative belief), then you think the odds are \(1 : 1.\)

  • The distance between \(0.01\) and \(0.000001\) is much greater than the distance between \(0.11\) and \(0.100001.\)

To expand on the final point: on the 0 to 1 probability line, the difference between 0.01 (a 1% chance) and 0.000001 (a 1 in a million chance) is roughly the same as the distance between 11% and 10%. This doesn’t match our intuitive sense of the strength of a claim: the difference between “1 in 100!” and “1 in a million!” feels like a far bigger jump than the difference between “11% probability” and “a hair over 10% probability.”

On the log-odds line, a 1 in 100 credibility is about \(-2\) orders of magnitude, and a “1 in a million” credibility is about \(-6\) orders of magnitude. The distance between them is \(-4\) orders of magnitude: \(\log_{10}(10^{-6}) - \log_{10}(10^{-2}) = -4\) magnitudes, or roughly \(-13.3\) bits. On the other hand, going from 11% to 10% is \(\log_{10}(\frac{0.10}{0.90}) - \log_{10}(\frac{0.11}{0.89}) \approx {-0.954} - ({-0.908}) \approx {-0.046}\) magnitudes, or about \(-0.153\) bits.
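Those numbers are easy to verify with a hypothetical helper that computes base-10 log-odds:

```python
import math

def magnitudes(p):
    """Probability to log-odds in orders of magnitude (base-10 logit)."""
    return math.log10(p / (1 - p))

print(magnitudes(1e-6) - magnitudes(0.01))   # ≈ -4.004 magnitudes (~ -13.3 bits)
print(magnitudes(0.10) - magnitudes(0.11))   # ≈ -0.046 magnitudes (~ -0.153 bits)
```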

The log-odds line doesn’t compress the vast differences available near the ends of the probability spectrum. Instead, it exhibits a “belief bar” carrying on indefinitely in both directions—every time you see evidence with a likelihood ratio of \(2 : 1,\) it adds one more bit of credibility.

The Weber-Fechner law says that most human sensory perceptions are logarithmic, in the sense that a factor-of-2 intensity change feels like around the same amount of increase no matter where you are on the scale. Doubling the physical intensity of a sound feels to a human like around the same amount of change in that sound whether the initial sound was 40 decibels or 60 decibels. That’s why the decibel scale of sound intensity is logarithmic in the first place!

Thus the log-odds form should be, in a certain sense, the most intuitive variant of Bayes’ rule to use: Just add the evidence-strength to the belief-strength! If you can make your feelings of evidence-strength and belief-strength be proportional to the logarithms of ratios, that is.

Finally, the log-odds representation gives us an even easier way to see how extraordinary claims require extraordinary evidence: If your prior belief in \(H\) is −30 bits, and you see evidence on the order of +5 bits for \(H\), then you’re going to wind up with −25 bits of belief in \(H\), which means you still think it’s far less likely than the alternatives.
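In code, under the same assumed numbers:

```python
# A prior of -30 bits plus +5 bits of evidence still leaves -25 bits of belief:
posterior_bits = -30 + 5
print(1 / (1 + 2 ** (-posterior_bits)))   # ≈ 2.98e-08, about 1 in 33.6 million
```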

Example: Blue oysters

Consider the blue oyster example problem:

You’re collecting exotic oysters in Nantucket, and there are two different bays from which you could harvest oysters.

  • In both bays, 11% of the oysters contain valuable pearls and 89% are empty.

  • In the first bay, 4% of the pearl-containing oysters are blue, and 8% of the non-pearl-containing oysters are blue.

  • In the second bay, 13% of the pearl-containing oysters are blue, and 26% of the non-pearl-containing oysters are blue.

Would you rather have a blue oyster from the first bay or the second bay? Well, we first note that the likelihood ratio of the observation “blue oyster” for full vs. empty oysters is \(1 : 2\) in either case, so both kinds of blue oyster are equally valuable. (Take a moment to reflect on how non-obvious this would have seemed before learning about Bayes’ rule!)

But what’s the chance of (either kind of) a blue oyster containing a pearl? Hint: this would be a good time to convert your credences into bits (factors of 2).

89% is around 8 times as much as 11%, so we start out with \(-3\) bits of belief that a random oyster contains a pearl.

Full oysters are \(\frac{1}{2}\) as likely to be blue as empty oysters, so seeing that an oyster is blue is 1 bit of evidence against its containing a pearl.

Posterior belief should be around \(-4\) bits, or \((1 : 16)\) against, or a probability of \(1/17\)… so a bit more than 5% (\(1/20\)) maybe? (From the rounded \(-4\) bits, \(1/17 \approx 5.88\%\); computed exactly, about 5.82%.)
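Here is the whole oyster calculation as a Python sketch, using the first bay’s numbers (the second bay yields the same likelihood ratio):

```python
import math

p_pearl = 0.11                             # prior: 11% of oysters contain pearls
p_blue_pearl, p_blue_empty = 0.04, 0.08    # first bay; 0.13 vs 0.26 in the second

prior_bits = math.log2(p_pearl / (1 - p_pearl))           # ≈ -3.02 bits
evidence_bits = math.log2(p_blue_pearl / p_blue_empty)    # exactly -1 bit
posterior_bits = prior_bits + evidence_bits               # ≈ -4.02 bits
print(1 / (1 + 2 ** (-posterior_bits)))                   # ≈ 0.0582
```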

Real-life example: HIV test

A study of Chinese blood donors (citation needed) found that roughly 1 in 100,000 of them had HIV (as determined by a very reliable gold-standard test). The non-gold-standard test used for initial screening had a sensitivity of 99.7% and a specificity of 99.8%, meaning respectively that \(\mathbb P({positive}\mid {HIV}) = .997\) and \(\mathbb P({negative}\mid \neg {HIV}) = .998,\) i.e., \(\mathbb P({positive} \mid \neg {HIV}) = .002.\)

That is: the prior odds for HIV are \(1 : 100,000,\) and a positive result in an initial screening favors HIV with a likelihood ratio of \(500 : 1.\)

Using log base 10 (because those are easier to do in your head):

  • The prior belief in HIV was about −5 magnitudes.

  • The evidence was a tad less than +3 magnitudes strong, since 500 is less than 1,000 (\(\log_{10}(500) \approx 2.7\)).

So the posterior belief in HIV is a tad underneath −2 magnitudes, i.e., less than a 1 in 100 chance of HIV.
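The same calculation as a Python sketch:

```python
import math

prior = math.log10(1 / 100_000)          # prior odds 1:100,000  ->  -5 magnitudes
evidence = math.log10(0.997 / 0.002)     # likelihood ratio ≈ 500:1  ->  ≈ +2.7 magnitudes
posterior_odds = 10 ** (prior + evidence)
print(posterior_odds / (1 + posterior_odds))   # ≈ 0.005, about a 1 in 200 chance
```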

Even though the screening had a \(500 : 1\) likelihood ratio in favor of HIV, someone with a positive screening result really should not panic!

Admittedly, this setup had people being screened randomly, in a relatively non-AIDS-stricken country. You’d need separate statistics for people who are getting tested for HIV because of specific worries or concerns, or in countries where HIV is highly prevalent. Nonetheless, the points that “only a tiny fraction of people have illness X” and that “preliminary observations Y may not have correspondingly tiny false positive rates” are worth remembering for many illnesses X and observations Y.

Exposing infinite credences

The log-odds representation exposes the degree to which \(0\) and \(1\) are very unusual among the classical probabilities. For example, if you ever assign probability absolutely 0 or 1 to a hypothesis, then no amount of evidence can change your mind about it, ever.

On the log-odds line, credences range from \(-\infty\) to \(+\infty,\) with the infinite extremes corresponding to probabilities \(0\) and \(1,\) which can thereby be seen as “infinite credences”. That’s not to say that \(0\) and \(1\) probabilities should never be used. For an ideal reasoner, the probability \(\mathbb P(X) + \mathbb P(\lnot X)\) should be 1, where \(\lnot X\) is the logical negation of \(X\). (For us mere mortals, consider avoiding extreme probabilities even then.) Nevertheless, these infinite credences of 0 and 1 behave like ‘special objects’ with a qualitatively different behavior from the ordinary credence spectrum. Statements like “after seeing a piece of strong evidence, my belief should never be exactly what it was previously” are false for extreme credences, just as statements like “subtracting 1 from a number produces a lower number” are false if you insist on regarding infinity as a number.

Evidence in decibels

E. T. Jaynes, in Probability Theory: The Logic of Science (section 4.2), reports that measuring evidence in decibels makes it easier for humans to grasp and use.

If a hypothesis has a likelihood ratio of \(o\), then its evidence in decibels is given by the formula \(e = 10\log_{10}(o)\).

In this scheme, multiplying the likelihood ratio by 2 adds approximately 3 dB of evidence, and multiplying it by 10 adds exactly 10 dB.
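A minimal sketch of this conversion (the helper name decibels is ours):

```python
import math

def decibels(likelihood_ratio):
    """Evidence in decibels: e = 10 * log10(o)."""
    return 10 * math.log10(likelihood_ratio)

print(decibels(2))     # ≈ 3.01 dB
print(decibels(10))    # 10.0 dB
print(decibels(500))   # ≈ 26.99 dB (the 500:1 HIV screening ratio above)
```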

Jaynes reports having used base-10 logarithms at first, for ease of calculation, and having tried to switch to natural logarithms when pocket calculators arrived; but base-10 logarithms turned out to be easier to grasp intuitively.
