Bayes' rule: Log-odds form
The odds form of Bayes’s Rule states that the prior odds times the likelihood ratio equals the posterior odds. We can take the log of both sides of this equation, yielding an equivalent equation which uses addition instead of multiplication.
Letting \(H_i\) and \(H_j\) denote hypotheses and \(e\) denote evidence, the log-odds form of Bayes’ rule states:
This can be numerically efficient for when you’re carrying out lots of updates one after another. But a more important reason to think in log odds is to get a better grasp on the notion of ‘strength of evidence’.
Logarithms of likelihood ratios
Suppose you’re visiting your friends Andrew and Betty, who are a couple. They promised that one of them would pick you up from the airport when you arrive. You’re not sure which one is in fact going to pick you up (prior odds of 50:50), but you do know three things:
They have both a blue car and a red car. Andrew prefers to drive the blue car, Betty prefers to drive the red car, but the correlation is relatively weak. (Sometimes, which car they drive depends on which one their child is using.) Andrew is 2x as likely to drive the blue car as Betty.
Betty tends to honk the horn at you to get your attention. Andrew does this too, but less often. Betty is 4x as likely to honk as Andrew.
Andrew tends to run a little late (more often than Betty). Betty is 2x as likely to have the car already at the airport when you arrive.
All three observations are independent as far as you know (that is, you don’t think Betty’s any more or less likely to be late if she’s driving the blue car, and so on).
Let’s say we see a blue car, already at the airport, which honks.
The odds form of this calculation would be a \((1 : 1)\) prior for Betty vs. Andrew, times likelihood ratios of \((1 : 2) \times (4 : 1) \times (2 : 1),\) yielding posterior odds of \((1 \times 4 \times 2 : 2 \times 1 \times 1) = (8 : 2) = (4 : 1)\), so it’s 4⁄5 = 80% likely to be Betty.
Here’s the log odds form of the same calculation, using 1 bit to denote each factor of \(2\) in belief or evidence:
Prior belief in Betty of \(\log_2 (\frac{1}{1}) = 0\) bits.
Evidence of \(\log_2 (\frac{1}{2}) = {-1}\) bits against Betty.
Evidence of \(\log_2 (\frac{4}{1}) = {+2}\) bits for Betty.
Evidence of \(\log_2 (\frac{2}{1}) = {+1}\) bit for Betty.
Posterior belief of \(0 + {^-1} + {^+2} + {^+1} = {^+2}\) bits that Betty is picking us up.
If your posterior belief is +2 bits, then your posterior odds are \((2^{+2} : 1) = (4 : 1),\) yielding a posterior probability of 80% that Betty is picking you up.
Evidence and belief represented this way is additive, which can make it an easier fit for intuitions about “strength of credence” and “strength of evidence”; we’ll soon develop this point in further depth.
The log-odds line
Imagine you start out thinking that the hypothesis \(H\) is just as likely as \(\lnot H,\) its negation. Then you get five separate independent \(2 : 1\) updates in favor of \(H.\) What happens to your probabilities?
Your odds (for \(H\)) go from \((1 : 1)\) to \((2 : 1)\) to \((4 : 1)\) to \((8 : 1)\) to \((16 : 1)\) to \((32 : 1).\)
Thus, your probabilities go from \(\frac{1}{2} = 50\%\) to \(\frac{2}{3} \approx 67\%\) to \(\frac{4}{5} = 80\%\) to \(\frac{8}{9} \approx 89\%\) to \(\frac{16}{17} \approx 94\%\) to \(\frac{32}{33} \approx 97\%.\)
Graphically representing these changing probabilities on a line that goes from 0 to 1:
We observe that the probabilities approach 1 but never get there — they just keep stepping across a fraction of the remaining distance, eventually getting all scrunched up near the right end.
If we instead convert the probabilities into log-odds, the story is much nicer. 50% probability becomes 0 bits of credence, and every independent \((2 : 1)\) observation in favor of \(H\) shifts belief by one unit along the line.
(As for what happens when we approach the end of the line, there isn’t one! 0% probability becomes \(-\infty\) bits of credence and 100% probability becomes \(+\infty\) bits of credence.
Intuitions about the log-odds line
There are a number of ways in which this infinite log-odds line is a better place to anchor your intuitions about “belief” than the usual 1 probability interval. For example:
Evidence you are twice as likely to see if the hypothesis is true than if it is false is \({+1}\) bits of evidence and a \({^+1}\)-bit update, regardless of how confident or unconfident you were to start with—the strength of new evidence, and the distance we update, shouldn’t depend on our prior belief.
If your credence in something is 0 bits—neither positive or negative belief—then you think the odds are 1:1.
The distance between \(0.01\) and \(0.000001\) is much greater than the distance between \(0.11\) and \(0.100001.\)
To expand on the final point: on the 0-1 probability line, the difference between 0.01 (a 1% chance) and 0.000001 (a 1 in a million chance) is roughly the same as the distance between 11% and 10%. This doesn’t match our sense for the intuitive strength of a claim: The difference between “1 in 100!” and “1 in a million!” feels like a far bigger jump than the difference between “11% probability” and “a hair over 10% probabiility.”
On the log-odds line, a 1 in 100 credibility is \({^-2}\) orders of magnitude, and a “1 in a million” credibility is \({^-6}\) orders of magnitude. The distance between them is minus 4 orders of magnitude, that is, \(\log_{10}(10^{-6}) - \log_{10}(10^{-2})\) yields \({^-4}\) magnitudes, or roughly \({^-13.3}\) bits. On the other hand, 11% to 10% is \(\log_{10}(\frac{0.10}{0.90}) - \log_{10}(\frac{0.11}{0.89}) \approx {^-0.954}-{^-0.907} \approx {^-0.046}\) magnitudes, or \({^-0.153}\) bits.
The log-odds line doesn’t compress the vast differences available near the ends of the probability spectrum. Instead, it exhibits a “belief bar” carrying on indefinitely in both directions—every time you see evidence with a likelihood ratio of \(2 : 1,\) it adds one more bit of credibility.
The Weber-Fechner law says that most human sensory perceptions are logarithmic, in the sense that a factor-of-2 intensity change feels like around the same amount of increase no matter where you are on the scale. Doubling the physical intensity of a sound feels to a human like around the same amount of change in that sound whether the initial sound was 40 decibels or 60 decibels. That’s why there’s an exponential decibel scale of sound intensities in the first place!
Thus the log-odds form should be, in a certain sense, the most intuitive variant of Bayes’ rule to use: Just add the evidence-strength to the belief-strength! If you can make your feelings of evidence-strength and belief-strength be proportional to the logarithms of ratios, that is.
Finally, the log-odds representation gives us an even easier way to see how extraordinary claims require extraordinary evidence: If your prior belief in \(H\) is −30 bits, and you see evidence on the order of +5 bits for \(H\), then you’re going to wind up with −25 bits of belief in \(H\), which means you still think it’s far less likely than the alternatives.
Example: Blue oysters
Consider the blue oyster example problem:
You’re collecting exotic oysters in Nantucket, and there are two different bays from which you could harvest oysters.
In both bays, 11% of the oysters contain valuable pearls and 89% are empty.
In the first bay, 4% of the pearl-containing oysters are blue, and 8% of the non-pearl-containing oysters are blue.
In the second bay, 13% of the pearl-containing oysters are blue, and 26% of the non-pearl-containing oysters are blue.
Would you rather have a blue oyster from the first bay or the second bay? Well, we first note that the likelihood ratio from “blue oyster” to “full vs. empty” is \(1 : 2\) in either case, so both kinds of blue oyster are equally valuable. (Take a moment to reflect on how obvious this would not seem before learning about Bayes’ rule!)
But what’s the chance of (either kind of) a blue oyster containing a pearl? Hint: this would be a good time to convert your credences into bits (factors of 2).
Full oysters are 1⁄2 as likely to be blue as empty oysters, so seeing that an oyster is blue is \({^-1}\) bits of evidence against it containing a pearl.
Posterior belief should be around \({^-4}\) bits or \((1 : 16)\) against, or a probability of 1/17… so a bit more than 5% (1/20) maybe? (Actually 5.88%.) <div><div>
Real-life example: HIV test
Find & cite the referenced study A study of Chinese blood donors noteCitation needed found that roughly 1 in 100,000 of them had HIV (as determined by a very reliable gold-standard test). The non-gold-standard test used for initial screening had a sensitivity of 99.7% and a specificity of 99.8%, meaning respectively that \(\mathbb P({positive}\mid {HIV}) = .997\) and \(\mathbb P({negative}\mid \neg {HIV}) = .998\), i.e., \(\mathbb P({positive} \mid \neg {HIV}) = .002.\)
That is: the prior odds are \(1 : 100,000\) against HIV, and a positive result in an initial screening favors HIV with a likelihood ratio of \(500 : 1.\)
Using log base 10 (because those are easier to do in your head):
The prior belief in HIV was about −5 magnitudes.
The evidence was a tad less than +3 magnitudes strong, since 500 is less than 1,000. ($\log_{10}(500) \approx 2.7$).
So the posterior belief in HIV is a tad underneath −2 magnitudes, i.e., less than a 1 in 100 chance of HIV.
Even though the screening had a \(500 : 1\) likelihood ratio in favor of HIV, someone with a positive screening result really should not panic!
Admittedly, this setup had people being screened randomly, in a relatively non-AIDS-stricken country. You’d need separate statistics for people who are getting tested for HIV because of specific worries or concerns, or in countries where HIV is highly prevalent. Nonetheless, the points that “only a tiny fraction of people have illness X” and that “preliminary observations Y may not have correspondingly tiny false positive rates” are worth remembering for many illnesses X and observations Y.
Exposing infinite credences
The log-odds representation exposes the degree to which \(0\) and \(1\) are very unusual among the classical probabilities. For example, if you ever assign probability absolutely 0 or 1 to a hypothesis, then no amount of evidence can change your mind about it, ever.
On the log-odds line, credences range from \(-\infty\) to \(+\infty,\) with the infinite extremes corresponding to probability \(0\) and \(1\) which can thereby be seen as “infinite credences”. That’s not to say that \(0\) and \(1\) probabilities should never be used. For an ideal reasoner, the probability \(\mathbb P(X) + \mathbb P(\lnot X)\) should be 1 (where \(\lnot X\) is the logical negation of \(X\)).noteFor us mere mortals, consider avoiding extreme probabilities even then. Nevertheless, these infinite credences of 0 and 1 behave like ‘special objects’ with a qualitatively different behavior from the ordinary credence spectrum. Statements like “After seeing a piece of strong evidence, my belief should never be exactly what it was previously” are false for extreme credences, just as statements like “subtracting 1 from a number produces a lower number” are false if you insist on regarding
Evidence in decibels
E.T. Jaynes, in Probability Theory: The Logic of Science (section 4.2), reports that using decibels of evidence makes them easier to grasp and use by humans.
If an hypothesis has a likelihood ratio of \(o\), then its evidence in decibels is given by the formula \(e = 10\log_{10}(o)\).
In this scheme, multiplying the likelihood ratio by 2 means approximately adding 3dB. Multiplying by 10 means adding 10dB.
Jayne reports having used decimal logarithm first, for their ease of calculation and having tried to switch to natural logarithms with the advent of pocket calculators. But decimal logarithms were found to be easier to grasp.
Parents:
- Bayes' rule
Bayes’ rule is the core theorem of probability theory saying how to revise our beliefs when we make a new observation.
Note: easier to grasp than natural logarithms. Both log base e and log base 10 behave in basically the same way, and have the same arguable pitfall. However, note that this behavior is, I think, on purpose, and intended to convey the idea that 10x more probable is 10x more probable, no matter where you are on the scale of improbability. (You just have to remember where you are on the scale before making your final judgements.) Consider the opposite problem raised by the 0-1 scale, as discussed earlier in the page: 0.00001% isn’t that far from 1%, same as between 10.00001% and 11%; but intuitively, clearly 0.00001% is much more different from 1% than 10.00001% is from 11%. This is the non-intuitiveness that the log-form is intended to fix, I think.
Actually, both pairs differ by the same factor of about 1.85256. (Same for base 100k.) (In the first case, the evidence is around 9, rather than 8, which likely accounts for the difference you got.) I’m actually a little surprised this is true—log-form is used in order to deal with things additively, rather than multiplicatively, so I wouldn’t have guessed without calculation that changing the base would preserve proportions. (And I’d still want to do some math to confirm this in general.)
Sorry, I just have a question. In the problem with pearls I can calculate
P(pearl/blue) = P(pearl AND blue) / P(blue) = 0.11 * 0.04 / (0.11 * 0.04 + 0.89 * 0.08) = 0.058
or with log-odds
P(pearl/blue) = 2^(log2(11/89) + log2(1/2)) = 2^(-4.0163018123) = 0.061
Where does the discrepancy come from?
UPD: I looked up the solution and realized that 2^(-4.016) are odds so probability is 1 / (2^4.01 + 1) = 0.058
Honestly this whole section is confusing
I recommend rethinking the magnet metaphor, on the grounds that it is physically wrong. If you have two magnets on either end of a ruler, and one is twice as strong as the other, then an iron ball at the center of the ruler is going to roll all the way to the larger magnet (accelerating as it goes, because inverse square law), unless I’m missing something. Perhaps a better physical metaphor would be something like rubber bands, with each bit of evidence adding another rubber band from the belief level to pins at the ends of the ruler?
I don’t understand this sentence.
odds ratios?
The thing inside the log(this part) is an odds ratio, right?
I would expect this sentence only after another telling me that the observations were red car, honking, and punctuality. I think the next sentence should be broken apart and this should be inserted inside.
It would be nice to show how to go from 99.8% to the 500:1 ratio.
I don’t think these terms have been defined yet. The difference between “strength of credence” and “strength of evidence” isn’t obvious to me, but it seems like it’s assumed throughout the rest of the article that the reader knows what they mean.
Is “-1 against” the same as “+1 for”?
Expressing the first practical example entirely in terms of negative numbers seems like a poor pedagogical choice.
Phrasing as “3 bits against” and then “a further 1 bit against” may help.
Adding that the blue ones are not a great pick if you want pearls may help people understand the direction of “against”.
“Extreme credences” here should likely be “infinite credences”.
Even so, previous page made the exact counterpoint:
While strong evidence may not change your view of things, things, extreme evidence absolutely should make you revisit your estimate of even an infinite credence level.
One of these does log( prob/ 1 - prob) the other does log( prob) …
I get your point about orders of magnitude difference, but for me this ends up more confusing then anything.
Wrong, they are exactly the same distances. I read the next paragraph so I get where you were going with this, but I find it confusing to start off with a blatantly wrong claim, especially when the next line compares
0.11
to0.1
(11% to 10%) -- not to0.100001
-- in order to describe how the significance of0.00001
gets “lost in translation” when speaking in probabilities and not in bits.But that really gives a different magnitude to the evidence. Why not be consistent with the log base?
For example, if we were to use log base 2, the prior would be ~16.6 magnitudes strong and the evidence ~8. This means that the evidence would alter the prior by (slightly) less than half the order of magnitudes, where’s in the case of log base 10 the alteration is (slightly) more than half the order of magnitudes (5 vs 2.7).
Also, imagine the absurd choice of log base 100k. The prior would remain practically intact in terms of this kind of order of magnitudes.
It is really confusing to apply one of the initial steps of a study as evidence to a prior which is the result (last step) of the same study.
Easier to grasp perhaps, but dangerously misleading. Increasing the likelihood of an event from
10^-100
to10^-99
is very different and much less significant than increasing it from10^-2
(1%) to10^-1
(10%). I hope this is covered later in this guide.