# Shift towards the hypothesis of least surprise

The log-odds form of Bayes’ rule says that strength of belief and strength of evidence can both be measured in bits. These evidence-bits can also be used to measure a quantity called “Bayesian surprise,” which yields yet another intuition for understanding Bayes’ rule.

Roughly speaking, we can measure how surprised a hypothesis $$H_i$$ was by the evidence $$e$$ by measuring how much probability it put on $$e.$$ If $$H_i$$ put 100% of its probability mass on $$e$$, then $$e$$ is completely unsurprising (to $$H_i$$). If $$H_i$$ put 0% of its probability mass on $$e$$, then $$e$$ is as surprising as possible. Any measure of $$\mathbb P(e \mid H_i),$$ the probability $$H_i$$ assigned to $$e$$, that obeys this property is worthy of the label “surprise.” Bayesian surprise is $$-\!\log(\mathbb P(e \mid H_i)),$$ which is a quantity that obeys these intuitive constraints and has some other interesting features.
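This definition is short enough to sketch directly in code; the function name `surprise_bits` is just an illustrative choice, not standard terminology:

```python
import math

def surprise_bits(p_e_given_h):
    """Bayesian surprise, in bits, of a hypothesis that assigned
    probability p_e_given_h to the evidence that was observed."""
    return -math.log2(p_e_given_h)

# A hypothesis that was certain of the evidence is not surprised at all:
print(surprise_bits(1.0))   # 0.0
# Each halving of the assigned probability adds one bit of surprise:
print(surprise_bits(0.5))   # 1.0
print(surprise_bits(0.25))  # 2.0
```

As $$\mathbb P(e \mid H_i)$$ falls toward 0, the surprise grows without bound, matching the constraint that an observation a hypothesis ruled out entirely is as surprising as possible.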

Consider again the blue oysters problem, and the hypotheses $$H$$ and $$\lnot H$$, which say “the oyster will contain a pearl” and “no it won’t”, respectively. To keep the numbers easy, let’s say we draw an oyster from a third bay, where $$\frac{1}{8}$$ of pearl-carrying oysters are blue and $$\frac{1}{4}$$ of empty oysters are blue.

Imagine what happens when the oyster is blue. $$H$$ predicted blueness with $$\frac{1}{8}$$ of its probability mass, while $$\lnot H$$ predicted blueness with $$\frac{1}{4}$$ of its probability mass. Thus, $$\lnot H$$ did better than $$H,$$ and goes up in probability. Previously, we’ve been combining both $$\mathbb P(e \mid H)$$ and $$\mathbb P(e \mid \lnot H)$$ into unified likelihood ratios, like $$\left(\frac{1}{8} : \frac{1}{4}\right)$$ $$=$$ $$(1 : 2),$$ which says that the ‘blue’ observation carries 1 bit of evidence against $$H.$$ However, we can also take the logs first, and combine second.
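Because $$\log(a/b) = \log a - \log b,$$ combining the likelihoods into a ratio and then taking the log gives the same answer as taking the logs first and then subtracting. A quick sketch:

```python
import math

# Combine first, then take the log:
likelihood_ratio = (1/8) / (1/4)            # the (1 : 2) ratio against H
combined = math.log2(likelihood_ratio)      # -1 bit for H

# Take the logs first, then combine:
separate = math.log2(1/8) - math.log2(1/4)  # (-3) - (-2) = -1 bit for H

print(combined, separate)  # -1.0 -1.0
```

Either route, the ‘blue’ observation shifts belief in $$H$$ by the same single bit.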

Because $$H$$ assigned only an eighth of its probability mass to the ‘blue’ observation, and because a Bayesian update works by eliminating incorrect probability mass, we have to adjust our belief in $$H$$ by $$\log_2\left(\frac{1}{8}\right) = -3$$ bits away from $$H.$$ (Each negative bit means “throw away half of $$H$$’s probability mass,” and we have to do that 3 times in order to remove the probability that $$H$$ failed to assign to $$e$$.)

Similarly, because $$\lnot H$$ assigned only a quarter of its probability mass to the ‘blue’ observation, we have to adjust our belief in $$H$$ by $$\log_2\left(\frac{1}{4}\right) = -2$$ bits away from $$\lnot H.$$

Thus, when the ‘blue’ observation comes in, we move our belief (measured in bits) 3 notches away from $$H$$ and then 2 notches back towards $$H.$$ On net, our belief shifts 1 notch away from $$H$$.

$$H$$ assigned 1/8th of its probability mass to blueness, so it emits $$-\!\log_2\left(\frac{1}{8}\right)=3$$ bits of surprise pushing away from $$H$$. $$\lnot H$$ assigned 1/4th of its probability mass to blueness, so it emits $$-\!\log_2\left(\frac{1}{4}\right)=2$$ bits of surprise pushing away from $$\lnot H$$ (and towards $$H$$). Thus, belief in $$H$$ moves 1 bit towards $$\lnot H$$, on net.

If instead $$H$$ predicted blue with probability 4% (penalty $$\log_2(0.04) \approx -4.64$$) and $$\lnot H$$ predicted blue with probability 8% (penalty $$\log_2(0.08) \approx -3.64$$), then we would have shifted a bit over 4.6 notches towards $$\lnot H$$ and a bit over 3.6 notches back towards $$H,$$ but we would have shifted the same number of notches on net. This shows that only the difference between the number of bits docked from $$H$$ and the number of bits docked from $$\lnot H$$ matters.
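A small sketch (with an illustrative helper name of my own) confirms that only the difference between the two penalties survives in the net shift:

```python
import math

def net_shift_bits(p_e_given_h, p_e_given_not_h):
    """Net movement of belief in H, in bits.
    Negative values mean belief moves away from H."""
    return math.log2(p_e_given_h) - math.log2(p_e_given_not_h)

# Oyster example: penalties of 3 and 2 bits, net 1 bit away from H.
print(net_shift_bits(1/8, 1/4))    # -1.0
# 4% vs. 8%: penalties of ~4.64 and ~3.64 bits, but (up to floating-point
# rounding) the same net shift of 1 bit away from H.
print(net_shift_bits(0.04, 0.08))
```

Any pair of likelihoods in a 1 : 2 ratio produces the same one-bit shift, no matter how large the individual surprise penalties are.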

In general, given an observation $$e$$ and a hypothesis $$H,$$ the number of bits we need to dock from our belief in $$H$$ is $$\log_2(\mathbb P(e \mid H)),$$ that is, the log of the probability that $$H$$ assigned to $$e.$$ This quantity is never positive, because the logarithm of $$x$$ for $$0 \le x \le 1$$ lies in the range $$[-\infty, 0]$$. Negating it gives a non-negative quantity relating $$H$$ to $$e$$: it is 0 when $$H$$ was certain that $$e$$ was going to happen, it is infinite when $$H$$ was certain that $$e$$ wasn’t going to happen, and it is measured in the same units as evidence and belief. This quantity is therefore often called “surprise,” and intuitively, it measures how surprised the hypothesis $$H$$ was by $$e$$ (in bits).

There is some correlation between Bayesian surprise and the times when a human would feel surprised (at seeing something they thought was unlikely), but, of course, the human emotion is quite different. (A human can feel surprised for reasons other than “my hypotheses failed to predict the data,” and humans are also great at ignoring evidence instead of feeling surprised.)

Given this definition of Bayesian surprise, we can view Bayes’ rule as saying that surprise repels belief. When you make an observation $$e,$$ each hypothesis emits a repulsive “surprise” signal, which pushes your belief around. In the oyster example, when $$H$$ predicts the observation you made with $$\frac{1}{8}$$ of its probability mass, and $$\lnot H$$ predicts it with $$\frac{1}{4}$$ of its probability mass, we can imagine $$H$$ emitting a surprise signal with a strength of 3 bits away from $$H$$ and $$\lnot H$$ emitting a surprise signal with a strength of 2 bits away from $$\lnot H$$. Those signals push the belief in $$H$$ in different directions, and it ends up 1 bit closer to $$\lnot H$$ (which emitted the weaker surprise signal).
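Putting the pieces together, the whole update can be sketched as two surprise signals acting on a log-odds belief. The starting point of 1 : 1 prior odds (0 bits) below is a hypothetical choice for illustration, not part of the problem statement:

```python
import math

def update_log_odds(prior_log_odds, p_e_given_h, p_e_given_not_h):
    """Update log-odds belief in H: each hypothesis's surprise signal
    pushes belief away from the hypothesis that emitted it."""
    surprise_h = -math.log2(p_e_given_h)          # pushes away from H
    surprise_not_h = -math.log2(p_e_given_not_h)  # pushes toward H
    return prior_log_odds - surprise_h + surprise_not_h

# Hypothetical 1 : 1 prior odds, i.e. 0 bits of belief in H:
posterior = update_log_odds(0.0, 1/8, 1/4)
print(posterior)                        # -1.0, i.e. odds of 1 : 2 for H
print(2**posterior / (1 + 2**posterior))  # back to probability: 1/3
```

The 3-bit push away from $$H$$ and the 2-bit push back toward it net out to the single bit of evidence that the likelihood ratio gave us directly.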

In other words, whenever you find yourself feeling surprised by something you saw, think of the least surprising explanation for that evidence — and then award that hypothesis a few bits of belief.

Parents:

• Bayes' rule

Bayes’ rule is the core theorem of probability theory saying how to revise our beliefs when we make a new observation.
