# Shift towards the hypothesis of least surprise

The log-odds form of Bayes’ rule says that strength of belief and strength of evidence can both be measured in bits. These evidence-bits can also be used to measure a quantity called “Bayesian surprise,” which yields yet another intuition for understanding Bayes’ rule.

Roughly speaking, we can measure how surprised a hypothesis $$H_i$$ was by the evidence $$e$$ by measuring how much probability it put on $$e.$$ If $$H_i$$ put 100% of its probability mass on $$e$$, then $$e$$ is completely unsurprising (to $$H_i$$). If $$H_i$$ put 0% of its probability mass on $$e$$, then $$e$$ is as surprising as possible. Any measure of $$\mathbb P(e \mid H_i),$$ the probability $$H_i$$ assigned to $$e$$, that obeys this property is worthy of the label “surprise.” Bayesian surprise is $$-\!\log(\mathbb P(e \mid H_i)),$$ which is a quantity that obeys these intuitive constraints and has some other interesting features.
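This definition is short enough to sketch directly in code; the function name `surprise_bits` is just an illustrative choice, not standard terminology:

```python
import math

def surprise_bits(p_e_given_h):
    """Bayesian surprise, in bits, of a hypothesis that assigned
    probability p_e_given_h to the evidence that was observed."""
    return -math.log2(p_e_given_h)

# A hypothesis that was certain of the evidence is not surprised at all:
print(surprise_bits(1.0))   # 0.0
# Each halving of the assigned probability adds one bit of surprise:
print(surprise_bits(0.5))   # 1.0
print(surprise_bits(0.25))  # 2.0
```

As $$\mathbb P(e \mid H_i)$$ falls toward 0, the surprise grows without bound, matching the constraint that an observation a hypothesis ruled out entirely is as surprising as possible.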

Consider again the blue oysters problem, and the hypotheses $$H$$ and $$\lnot H$$, which say “the oyster will contain a pearl” and “no it won’t”, respectively. To keep the numbers easy, let’s say we draw an oyster from a third bay, where $$\frac{1}{8}$$ of pearl-carrying oysters are blue and $$\frac{1}{4}$$ of empty oysters are blue.

Imagine what happens when the oyster is blue. $$H$$ predicted blueness with $$\frac{1}{8}$$ of its probability mass, while $$\lnot H$$ predicted blueness with $$\frac{1}{4}$$ of its probability mass. Thus, $$\lnot H$$ did better than $$H,$$ and goes up in probability. Previously, we’ve been combining both $$\mathbb P(e \mid H)$$ and $$\mathbb P(e \mid \lnot H)$$ into unified likelihood ratios, like $$\left(\frac{1}{8} : \frac{1}{4}\right)$$ $$=$$ $$(1 : 2),$$ which says that the ‘blue’ observation carries 1 bit of evidence against $$H.$$ However, we can also take the logs first, and combine second.
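Because $$\log(a/b) = \log a - \log b,$$ combining the likelihoods into a ratio and then taking the log gives the same answer as taking the logs first and then subtracting. A quick sketch:

```python
import math

# Combine first, then take the log:
likelihood_ratio = (1/8) / (1/4)            # the (1 : 2) ratio against H
combined = math.log2(likelihood_ratio)      # -1 bit for H

# Take the logs first, then combine:
separate = math.log2(1/8) - math.log2(1/4)  # (-3) - (-2) = -1 bit for H

print(combined, separate)  # -1.0 -1.0
```

Either route, the ‘blue’ observation shifts belief in $$H$$ by the same single bit.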

Because $$H$$ assigned only an eighth of its probability mass to the ‘blue’ observation, and because a Bayesian update works by eliminating incorrect probability mass, we have to adjust our belief in $$H$$ by $$\log_2\left(\frac{1}{8}\right) = -3$$ bits away from $$H.$$ (Each negative bit means “throw away half of $$H$$’s probability mass,” and we have to do that 3 times in order to remove the probability that $$H$$ failed to assign to $$e$$.)

Similarly, because $$\lnot H$$ assigned only a quarter of its probability mass to the ‘blue’ observation, we have to adjust our belief in $$H$$ by $$\log_2\left(\frac{1}{4}\right) = -2$$ bits away from $$\lnot H.$$

Thus, when the ‘blue’ observation comes in, we move our belief (measured in bits) 3 notches away from $$H$$ and then 2 notches back towards $$H.$$ On net, our belief shifts 1 notch away from $$H$$.

$$H$$ assigned 1/8th of its probability mass to blueness, so it emits $$-\!\log_2\left(\frac{1}{8}\right)=3$$ bits of surprise pushing away from $$H$$. $$\lnot H$$ assigned 1/4th of its probability mass to blueness, so it emits $$-\!\log_2\left(\frac{1}{4}\right)=2$$ bits of surprise pushing away from $$\lnot H$$ (and towards $$H$$). Thus, belief in $$H$$ moves 1 bit towards $$\lnot H$$, on net.

If instead $$H$$ predicted blue with probability 4% (penalty $$\log_2(0.04) \approx -4.64$$) and $$\lnot H$$ predicted blue with probability 8% (penalty $$\log_2(0.08) \approx -3.64$$), then we would have shifted a bit over 4.6 notches towards $$\lnot H$$ and a bit over 3.6 notches back towards $$H,$$ but we would have shifted the same number of notches on net. This shows that only the difference between the number of bits docked from $$H$$ and the number of bits docked from $$\lnot H$$ matters.
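A small sketch (with an illustrative helper name of my own) confirms that only the difference between the two penalties survives in the net shift:

```python
import math

def net_shift_bits(p_e_given_h, p_e_given_not_h):
    """Net movement of belief in H, in bits.
    Negative values mean belief moves away from H."""
    return math.log2(p_e_given_h) - math.log2(p_e_given_not_h)

# Oyster example: penalties of 3 and 2 bits, net 1 bit away from H.
print(net_shift_bits(1/8, 1/4))    # -1.0
# 4% vs. 8%: penalties of ~4.64 and ~3.64 bits, but (up to floating-point
# rounding) the same net shift of 1 bit away from H.
print(net_shift_bits(0.04, 0.08))
```

Any pair of likelihoods in a 1 : 2 ratio produces the same one-bit shift, no matter how large the individual surprise penalties are.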

In general, given an observation $$e$$ and a hypothesis $$H,$$ the number of bits we need to dock from our belief in $$H$$ is $$\log_2(\mathbb P(e \mid H)),$$ that is, the log of the probability that $$H$$ assigned to $$e.$$ This quantity is never positive, because the logarithm of $$x$$ for $$0 \le x \le 1$$ lies in the range $$[-\infty, 0]$$. Negating it gives a non-negative quantity relating $$H$$ to $$e$$: it is 0 when $$H$$ was certain that $$e$$ was going to happen, it is infinite when $$H$$ was certain that $$e$$ wasn’t going to happen, and it is measured in the same units as evidence and belief. This quantity is therefore often called “surprise,” and intuitively, it measures how surprised the hypothesis $$H$$ was by $$e$$ (in bits).

There is some correlation between Bayesian surprise and the times when a human would feel surprised (at seeing something they thought was unlikely), but, of course, the human emotion is quite different. (A human can feel surprised for reasons other than “my hypotheses failed to predict the data,” and humans are also great at ignoring evidence instead of feeling surprised.)

Given this definition of Bayesian surprise, we can view Bayes’ rule as saying that surprise repels belief. When you make an observation $$e,$$ each hypothesis emits a repulsive “surprise” signal, which pushes your belief around. In the oyster example, when $$H$$ predicts the observation you made with $$\frac{1}{8}$$ of its probability mass, and $$\lnot H$$ predicts it with $$\frac{1}{4}$$ of its probability mass, we can imagine $$H$$ emitting a surprise signal with a strength of 3 bits away from $$H$$ and $$\lnot H$$ emitting a surprise signal with a strength of 2 bits away from $$\lnot H$$. Those signals push the belief in $$H$$ in different directions, and it ends up 1 bit closer to $$\lnot H$$ (which emitted the weaker surprise signal).
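Putting the pieces together, the whole update can be sketched as two surprise signals acting on a log-odds belief. The starting point of 1 : 1 prior odds (0 bits) below is a hypothetical choice for illustration, not part of the problem statement:

```python
import math

def update_log_odds(prior_log_odds, p_e_given_h, p_e_given_not_h):
    """Update log-odds belief in H: each hypothesis's surprise signal
    pushes belief away from the hypothesis that emitted it."""
    surprise_h = -math.log2(p_e_given_h)          # pushes away from H
    surprise_not_h = -math.log2(p_e_given_not_h)  # pushes toward H
    return prior_log_odds - surprise_h + surprise_not_h

# Hypothetical 1 : 1 prior odds, i.e. 0 bits of belief in H:
posterior = update_log_odds(0.0, 1/8, 1/4)
print(posterior)                        # -1.0, i.e. odds of 1 : 2 for H
print(2**posterior / (1 + 2**posterior))  # back to probability: 1/3
```

The 3-bit push away from $$H$$ and the 2-bit push back toward it net out to the single bit of evidence that the likelihood ratio gave us directly.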

In other words, whenever you find yourself feeling surprised by something you saw, think of the least surprising explanation for that evidence — and then award that hypothesis a few bits of belief.

Parents:

• Bayes' rule

Bayes’ rule is the core theorem of probability theory saying how to revise our beliefs when we make a new observation.
