Shift towards the hypothesis of least surprise

The log-odds form of Bayes’ rule says that strength of belief and strength of evidence can both be measured in bits. These evidence-bits can also be used to measure a quantity called “Bayesian surprise”, which yields yet another intuition for understanding Bayes’ rule.

Roughly speaking, we can measure how surprised a hypothesis \(H_i\) was by the evidence \(e\) by measuring how much probability it put on \(e.\) If \(H_i\) put 100% of its probability mass on \(e\), then \(e\) is completely unsurprising (to \(H_i\)). If \(H_i\) put 0% of its probability mass on \(e\), then \(e\) is as surprising as possible. Any measure of \(\mathbb P(e \mid H_i),\) the probability \(H_i\) assigned to \(e\), that obeys this property is worthy of the label “surprise.” Bayesian surprise is \(-\!\log(\mathbb P(e \mid H_i)),\) a quantity that obeys these intuitive constraints and has some other interesting features.
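For concreteness, here is that definition in code (a minimal sketch; the function name surprise_bits is our own, not standard terminology):

```python
import math

def surprise_bits(p_e_given_h: float) -> float:
    """Bayesian surprise of a hypothesis at evidence e, in bits: -log2 P(e | H)."""
    if p_e_given_h == 0:
        return math.inf  # e was ruled out entirely: maximal surprise
    return -math.log2(p_e_given_h)

print(surprise_bits(1.0))    # 0.0 -- e was completely unsurprising
print(surprise_bits(0.125))  # 3.0 -- the hypothesis gave e only 1/8 of its mass
print(surprise_bits(0.0))    # inf -- e was as surprising as possible
```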

Consider again the blue oysters problem, and the hypotheses \(H\) and \(\lnot H\), which say “the oyster will contain a pearl” and “no it won’t”, respectively. To keep the numbers easy, let’s say we draw an oyster from a third bay, where \(\frac{1}{8}\) of pearl-carrying oysters are blue and \(\frac{1}{4}\) of empty oysters are blue.

Imagine what happens when the oyster is blue. \(H\) predicted blueness with \(\frac{1}{8}\) of its probability mass, while \(\lnot H\) predicted blueness with \(\frac{1}{4}\) of its probability mass. Thus, \(\lnot H\) did better than \(H,\) and goes up in probability. Previously, we’ve been combining both \(\mathbb P(e \mid H)\) and \(\mathbb P(e \mid \lnot H)\) into unified likelihood ratios, like \(\left(\frac{1}{8} : \frac{1}{4}\right)\) \(=\) \((1 : 2),\) which says that the ‘blue’ observation carries 1 bit of evidence against \(H.\) However, we can also take the logs first, and combine second.
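The familiar “combine into a likelihood ratio first, then take the log” route looks like this in code (a minimal sketch; the variable names are ours):

```python
import math

p_blue_given_pearl = 1 / 8  # P(blue | H)
p_blue_given_empty = 1 / 4  # P(blue | not-H)

# Combine into a likelihood ratio first: (1/8 : 1/4) = (1 : 2).
likelihood_ratio = p_blue_given_pearl / p_blue_given_empty

# Then take the log: 1 bit of evidence against H.
print(math.log2(likelihood_ratio))  # -1.0
```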

Because \(H\) assigned only an eighth of its probability mass to the ‘blue’ observation, and because a Bayesian update works by eliminating incorrect probability mass, we have to adjust our belief in \(H\) by \(\log_2\left(\frac{1}{8}\right) = -3\) bits, that is, 3 bits away from \(H.\) (Each negative bit means “throw away half of \(H\)’s probability mass,” and we have to do that 3 times in order to remove the probability that \(H\) failed to assign to \(e\).)

Similarly, because \(\lnot H\) assigned only a quarter of its probability mass to the ‘blue’ observation, we have to adjust our belief by \(\log_2\left(\frac{1}{4}\right) = -2\) bits, that is, 2 bits away from \(\lnot H\) (and therefore back towards \(H\)).

Thus, when the ‘blue’ observation comes in, we move our belief (measured in bits) 3 notches away from \(H\) and then 2 notches back towards \(H.\) On net, our belief shifts 1 notch away from \(H\).
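Taking the logs first and combining second lands in the same place, as this small sketch (with variable names of our own choosing) illustrates:

```python
import math

# Take logs first: how many bits the 'blue' observation docks from each hypothesis.
dock_h = math.log2(1 / 8)      # -3 bits: 3 notches away from H
dock_not_h = math.log2(1 / 4)  # -2 bits: 2 notches away from not-H, i.e. back towards H

# Combine second: the net shift in our belief in H.
print(dock_h - dock_not_h)  # -1.0, i.e. 1 notch away from H on net
```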

(Image: hypotheses emitting surprise)

\(H\) assigned 1/8th of its probability mass to blueness, so it emits \(-\!\log_2\left(\frac{1}{8}\right)=3\) bits of surprise pushing away from \(H\). \(\lnot H\) assigned 1/4th of its probability mass to blueness, so it emits \(-\!\log_2\left(\frac{1}{4}\right)=2\) bits of surprise pushing away from \(\lnot H\) (and towards \(H\)). Thus, belief in \(H\) moves 1 bit towards \(\lnot H\), on net.

If instead \(H\) had predicted blue with probability 4% (penalty \(\log_2(0.04) \approx -4.64\)) and \(\lnot H\) had predicted blue with probability 8% (penalty \(\log_2(0.08) \approx -3.64\)), then we would have shifted a bit over 4.6 notches towards \(\lnot H\) and a bit over 3.6 notches back towards \(H,\) but we would have shifted the same number of notches on net. In other words, only the relative difference between the number of bits docked from \(H\) and the number of bits docked from \(\lnot H\) matters.
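To check that only the relative difference matters, we can compare the two cases directly (again a rough sketch, with names of our own choosing):

```python
import math

def net_shift_towards_h(p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Net change in belief in H, in bits, after observing e."""
    return math.log2(p_e_given_h) - math.log2(p_e_given_not_h)

# Oyster example: penalties of 3 and 2 bits.
print(net_shift_towards_h(1 / 8, 1 / 4))  # -1.0

# Much larger penalties (about 4.64 and 3.64 bits), same relative difference.
print(net_shift_towards_h(0.04, 0.08))    # -1.0, up to floating-point rounding
```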

In general, given an observation \(e\) and a hypothesis \(H,\) we adjust our belief in \(H\) by \(\log_2(\mathbb P(e \mid H))\) bits, that is, by the log of the probability that \(H\) assigned to \(e.\) This quantity is never positive, because the logarithm of \(x\) for \(0 \le x \le 1\) lies in the range \([-\infty, 0]\). If we negate it, we get a non-negative quantity that relates \(H\) to \(e\): it is 0 when \(H\) was certain that \(e\) was going to happen, it is infinite when \(H\) was certain that \(e\) wasn’t going to happen, and it is measured in the same units as evidence and belief. Thus, this quantity is often called “surprise,” and intuitively, it measures how surprised the hypothesis \(H\) was by \(e\) (in bits).
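These claimed properties are easy to check numerically (a small illustrative sketch, restating the surprise function from above):

```python
import math

def surprise_bits(p_e_given_h: float) -> float:
    """-log2 P(e | H): zero for a certain prediction, infinite for an impossible one."""
    return math.inf if p_e_given_h == 0 else -math.log2(p_e_given_h)

assert surprise_bits(1.0) == 0.0                              # H was certain e would happen
assert surprise_bits(0.0) == math.inf                         # H was certain e wouldn't happen
assert all(surprise_bits(p) >= 0 for p in (0.01, 0.5, 0.99))  # never negative
```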

There is some correlation between Bayesian surprise and the times when a human would feel surprised (at seeing something they thought was unlikely), but, of course, the human emotion is quite different. (A human can feel surprised for reasons other than “my hypotheses failed to predict the data,” and humans are also great at ignoring evidence instead of feeling surprised.)

Given this definition of Bayesian surprise, we can view Bayes’ rule as saying that surprise repels belief. When you make an observation \(e,\) each hypothesis emits repulsive “surprise” signals, which shift your belief. Referring again to the image above, when \(H\) predicts the observation you made with \(\frac{1}{8}\) of its probability mass, and \(\lnot H\) predicts it with \(\frac{1}{4}\) of its probability mass, we can imagine \(H\) emitting a surprise signal with a strength of 3 bits pushing away from \(H\), and \(\lnot H\) emitting a surprise signal with a strength of 2 bits pushing away from \(\lnot H\). Those signals push the belief in \(H\) in different directions, and it ends up 1 bit closer to \(\lnot H\) (which emitted the weaker surprise signal).
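Putting the pieces together, the “surprise repels belief” picture is just Bayes’ rule in log-odds form, as in this sketch (the function name and the choice of an even prior are our own):

```python
import math

def update_log_odds(prior_log_odds_h: float,
                    p_e_given_h: float,
                    p_e_given_not_h: float) -> float:
    """Posterior log-odds of H: each hypothesis's surprise pushes belief away from itself."""
    surprise_h = -math.log2(p_e_given_h)          # bits pushing away from H
    surprise_not_h = -math.log2(p_e_given_not_h)  # bits pushing away from not-H (towards H)
    return prior_log_odds_h - surprise_h + surprise_not_h

# Start at even odds (0 bits) and observe the blue oyster:
# H emits 3 bits of surprise, not-H emits 2, so belief in H drops by 1 bit.
print(update_log_odds(0.0, 1 / 8, 1 / 4))  # -1.0
```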

In other words, whenever you find yourself feeling surprised by something you saw, think of the least surprising explanation for that evidence — and then award that hypothesis a few bits of belief.

Parents:

  • Bayes' rule

    Bayes’ rule is the core theorem of probability theory saying how to revise our beliefs when we make a new observation.