# Strictly confused

A hypothesis is “strictly confused” by the data if the hypothesis does much worse at predicting the data than it expected to do. If, on average, you expect to assign around 1% likelihood to the exact observation you see, and you actually see something to which you assigned 0.000001% likelihood, you are strictly confused.

I.e., letting $$H$$ be a hypothesis and $$e_0$$ be the data observed from some set $$E$$ of possible observations, we say that $$H$$ is “strictly confused” when

$$\log \mathbb P(e_0 \mid H) \ll \sum_{e \in E} \mathbb P(e \mid H) \cdot \log \mathbb P(e \mid H)$$
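As a concrete illustration, this comparison can be computed directly for a coin-flip hypothesis. The sketch below is illustrative: the function names and the 20-bit slack threshold are choices of mine, not part of the definition, which only says the actual log-likelihood falls far below the expected log-likelihood.

```python
import math

def log2_prob_sequence(seq, p_heads):
    """Log2 of the probability a coin with bias p_heads (0 < p_heads < 1)
    assigns to an H/T string."""
    return sum(math.log2(p_heads if c == "H" else 1.0 - p_heads) for c in seq)

def expected_log2_prob(n_flips, p_heads):
    """What the hypothesis expects its own log2-probability to be:
    minus the entropy of n_flips independent flips."""
    entropy_per_flip = -(
        p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads)
    )
    return -n_flips * entropy_per_flip

def strictly_confused(seq, p_heads, slack_bits=20.0):
    """True when the actual log2-probability falls more than slack_bits
    below what the hypothesis expected (slack_bits is an arbitrary cutoff)."""
    return log2_prob_sequence(seq, p_heads) < (
        expected_log2_prob(len(seq), p_heads) - slack_bits
    )

print(strictly_confused("HT" * 50, 0.5))  # False: fair coin scores exactly as expected
print(strictly_confused("HT" * 50, 0.9))  # True: far below the expected score
```

A fair coin assigns every 100-flip sequence $$2^{-100},$$ exactly its expected score, so it is never strictly confused; the 90%-heads hypothesis expects about $$-47$$ bits but actually scores about $$-174$$ bits on the alternating sequence.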


# Motivation and examples

In Bayesian reasoning, the main reason to reject a hypothesis is that we find a better hypothesis. Suppose we think a coin is fair, and we flip it 100 times, and the coin comes up “HHHHHHH…”, or all heads. After those 100 flips, the hypothesis “This is a double-headed coin” has a likelihood ratio of $$2^{100} : 1$$ favoring it over the “fair coin” hypothesis, and the “double-headed coin” hypothesis isn’t more improbable than $$2^{-100}$$ a priori.

But this relies on the insight that there’s a simple / a priori plausible alternative hypothesis that does better. What if the coin is producing TTHHTTHHTTHH and we just never happen to think of ‘alternating pairs of tails and heads’ as a hypothesis? It’s possible to do better by thinking of a better hypothesis, but so far as the ‘fair coin’ hypothesis sees the world, TTHHTTHH… is no more or less likely than any other sequence it could encounter; the first eight coinflips have a probability of $$2^{-8},$$ and this would have been true no matter which eight coinflips were observed. After observing 100 coinflips, the fair coin hypothesis will assign them a collective probability of $$2^{-100},$$ and in this sense, no sequence of 100 coinflips is any more ‘surprising’ or ‘confusing’ than any other, from within the perspective of the fair coin hypothesis.

We can’t say that we’re ‘confused’ or ‘surprised’ on seeing a long sequence of coinflips to which we assigned some very low probability on the order of $$2^{-100} \approx 10^{-30},$$ because we expected to assign a probability that low.

On the other hand, suppose we think that a coin is biased to produce 90% heads and 10% tails, and we flip it 100 times and get some fair-looking sequence like “THHTTTHTTTTHTHTHHH…” (courtesy of random.org). Then we expected to assign the observed sequence a probability in the range of $$0.9^{90} \cdot 0.1^{10} \approx 7\cdot 10^{-15},$$ but we actually saw a sequence to which we assigned a probability around $$0.9^{50} \cdot 0.1^{50} \approx 5 \cdot 10^{-53}.$$ We don’t need to consider any other hypotheses to realize that we are very confused. We don’t need to have invented the concept of a ‘fair coin’, or to know that the ‘fair coin’ hypothesis would have assigned a much higher likelihood in the region of $$7 \cdot 10^{-31},$$ to realize that there’s something wrong with the current hypothesis.
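The order-of-magnitude figures above are easy to verify directly; a quick sanity check of the arithmetic:

```python
# Probabilities from the 90%-heads example, computed directly.
typical = 0.9**90 * 0.1**10   # what the 90%-heads hypothesis expects to assign
observed = 0.9**50 * 0.1**50  # what it actually assigns to the fair-looking data
fair = 0.5**100               # what a fair coin assigns to any 100-flip sequence

print(f"{typical:.1e}, {observed:.1e}, {fair:.1e}")  # 7.6e-15, 5.2e-53, 7.9e-31
```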

In the case of the supposed fair coin that produces HHHHHHH, we only do poorly relative to a better hypothesis, ‘all heads’, that makes a superior prediction. In the case of the supposed 90%-heads coin that produces a random-looking sequence, we do worse than we expected to do from inside the 90%-heads hypothesis, so we are doing poorly in an absolute, non-relative sense.

Being strictly confused is a sign telling us to look for some alternative hypothesis, in advance of our having any idea whatsoever what that alternative hypothesis might be.

# Distinction from frequentist p-values

The classical frequentist test for rejecting the null hypothesis involves considering the probability assigned to particular ‘obvious’-seeming partitions of the data, and asking if we ended up inside a low-probability partition.

Suppose you think some coin is fair, and you flip the coin 100 times and see a random-looking sequence “THHTTTHTT…”

Someone comes along and says, “You know, this result is very surprising, given your ‘fair coin’ theory. You really didn’t expect that to happen.”

“How so?” you reply.

They say, “Well, among all sequences of 100 coins, only 1 in 16 such sequences start with a string like THHT TTHTT, a palindromic quartet followed by a palindromic quintet. You confidently predicted that had a 15/16 chance of not happening, and then you were surprised.”

“Okay, look,” you reply, “if you’d written down that particular prediction in advance and not a lot of others, I might be interested. Like, if I’d already thought that way of partitioning the data (namely, ‘palindromic quartet followed by palindromic quintet’ vs. ‘not palindromic quartet followed by palindromic quintet’) was a specially interesting and distinguished one, I might notice that I’d assigned the second partition 15/16 probability and it then failed to happen. As it is, it seems like you’re really reaching.”
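The 1-in-16 figure in the dialogue can be checked by brute force over all equally likely nine-flip prefixes:

```python
from itertools import product

def is_palindrome(s):
    return s == s[::-1]

# Among all 2^9 equally likely nine-flip prefixes, count those that start
# with a palindromic quartet followed by a palindromic quintet.
hits = sum(
    1
    for flips in product("HT", repeat=9)
    if is_palindrome(flips[:4]) and is_palindrome(flips[4:9])
)
print(hits, 2**9, hits / 2**9)  # 32 512 0.0625, i.e. 1/16
```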

We can think of the frequentist tests for rejecting the fair-coin hypothesis as a small set of ‘interesting partitions’ that were written down in advance, which are supposed to have low probability given the fair coin. For example, if a coin produces HHHHH HTHHH HHTHH, a frequentist says, “Partitioning by number of heads, the fair coin hypothesis says that in 15 flips we should get between 3 and 12 heads, inclusive, with a probability of about 99.3%. You are therefore surprised, because this event to which you assigned 99.3% probability failed to happen. And yes, we’re just checking the number of heads and a few other obvious things, not for palindromic quartets followed by palindromic quintets.”
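The probability the fair-coin hypothesis puts on that heads-count partition can be computed exactly:

```python
from math import comb

# Probability, under a fair coin, that 15 flips land in the partition
# "between 3 and 12 heads inclusive" -- the event the observed 13 heads
# falls outside of.
total = 2 ** 15
inside = sum(comb(15, k) for k in range(3, 13))
print(inside, total, round(inside / total, 4))  # 32526 32768 0.9926
```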

Part of the point of being a Bayesian, however, is that we try to reason only on the data we actually observed, and not put that data into particular partitions and reason about those partitions. The partitioning process introduces potential subjectivity, especially in an academic setting fraught with powerful incentives to produce ‘statistically significant’ data: the equivalent of somebody insisting that palindromic quartets and quintets are special, or that counting heads isn’t special.

E.g., if we flip a coin six times and get HHHHHT, this is “statistically significant, p < 0.05” if the researcher decided to flip coins until they got at least one T and then stop, in which case a fair coin has only a 1/32 probability of requiring six or more flips to produce a T. If on the other hand the researcher decided to flip the coin six times and then count the number of tails, the probability of getting one or fewer tails in six flips is 7/64, which is not ‘statistically significant’.
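Both p-value computations can be written out explicitly; the contrast between the two stopping rules on the same data is the whole point:

```python
from math import comb

# Same data (HHHHHT), two stopping rules, two p-values.

# Rule A: flip until the first T.  p-value: probability a fair coin needs
# six or more flips, i.e. the first five flips are all heads.
p_stop_at_first_tail = 0.5 ** 5
print(p_stop_at_first_tail)  # 0.03125  (1/32, "significant" at p < 0.05)

# Rule B: flip exactly six times, then count tails.  p-value: probability
# of one or fewer tails in six fair flips.
p_fixed_six_flips = (comb(6, 0) + comb(6, 1)) / 2 ** 6
print(p_fixed_six_flips)  # 0.109375  (7/64, not "significant")
```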

The Bayesian says, “If I use the Rule of Succession to represent the hypothesis that the coin has an unknown bias between 0 and 1, then the sequence HHHHHT is assigned 1/42 probability by the Rule of Succession and 1/64 probability by ‘fair coin’, so this is evidence with a likelihood ratio of ~1.5 : 1 favoring the hypothesis that the coin is biased; not enough to overcome any significant prior improbability.”
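The Rule of Succession number can be checked by multiplying together Laplace’s step-by-step predictions: after $$h$$ heads in $$n$$ flips, the next flip is heads with probability $$(h+1)/(n+2).$$ A small sketch (the function name is mine):

```python
from fractions import Fraction

def rule_of_succession_prob(seq):
    """Probability the 'unknown bias, uniform prior' hypothesis assigns to
    an exact H/T sequence, via Laplace's rule P(next H) = (h+1)/(n+2)."""
    prob = Fraction(1)
    heads = 0
    for n, flip in enumerate(seq):
        p_heads = Fraction(heads + 1, n + 2)
        prob *= p_heads if flip == "H" else 1 - p_heads
        if flip == "H":
            heads += 1
    return prob

print(rule_of_succession_prob("HHHHHT"))  # 1/42
print(Fraction(1, 2) ** 6)               # 1/64 under the fair coin
```

For HHHHHT the product is $$\frac{1}{2} \cdot \frac{2}{3} \cdot \frac{3}{4} \cdot \frac{4}{5} \cdot \frac{5}{6} \cdot \frac{1}{7} = \frac{1}{42},$$ and $$\frac{1/42}{1/64} \approx 1.5.$$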

The Bayesian arrives at this judgment by considering only the particular, exact data that was observed, and not any larger partitions of data. To compute the probability flow between two hypotheses $$H_1$$ and $$H_2,$$ we only need to know the likelihoods of our exact observation given those two hypotheses, not the likelihoods the hypotheses assign to any partitions into which that observation can be put, etcetera.

Similarly, the Bayesian looks at the sequence HHHHH HTHHH HHTHH and says: this specific, exact data that we observed gives us a likelihood ratio of (1/1680 : 1/32768) ~ (19.5 : 1) favoring “The coin has an unknown bias between 0 and 1” over “The coin is fair”. With that already said, the Bayesian doesn’t see any need to talk about the total probability of the fair coin hypothesis producing data inside a partition of similar results that could have been observed but weren’t.
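The 19.5 : 1 figure follows from a closed form: under a uniform prior on the bias, an exact sequence with $$h$$ heads and $$t$$ tails has probability $$\frac{h! \, t!}{(h+t+1)!}.$$ A quick check (the function name is mine):

```python
from fractions import Fraction
from math import factorial

def uniform_bias_seq_prob(h, t):
    """P(exact sequence with h heads, t tails) under a uniform prior on the
    bias: the integral of p^h (1-p)^t dp, which equals h! t! / (h+t+1)!."""
    return Fraction(factorial(h) * factorial(t), factorial(h + t + 1))

p_biased = uniform_bias_seq_prob(13, 2)  # HHHHH HTHHH HHTHH: 13 heads, 2 tails
p_fair = Fraction(1, 2 ** 15)
print(p_biased, p_fair, round(float(p_biased / p_fair), 1))  # 1/1680 1/32768 19.5
```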

But even though Bayesians usually try to avoid thinking in terms of rejecting a null hypothesis using partitions, saying “I’m strictly confused!” gives a Bayesian a way of saying “Well, I know something’s wrong…” that doesn’t require already having the insight to propose a better alternative, or even the insight to realize that some particular partitioning of the data is worth special attention.

Parents:

• Bayesian reasoning

A probability-theory-based view of the world; a coherent way of changing probabilistic beliefs based on evidence.

1. I propose that this concept be called “unexpected surprise” rather than “strictly confused”:

• “Strictly confused” suggests logical incoherence.

• “Unexpected surprise” can be motivated the following way: let $$s(d) = \textrm{surprise}(d \mid H) = - \log \Pr (d \mid H)$$ be how surprising data $$d$$ is on hypothesis $$H$$. Then one is “strictly confused” if the observed $$s$$ is larger than one would expect assuming $$H$$ holds. This terminology is nice because the average of $$s$$ under $$H$$ is the entropy, or expected surprise, of $$(d \mid H)$$. It also connects with Bayes, since $$\textrm{log-likelihood} = -\textrm{surprise}$$ is the evidential support $$d$$ gives $$H$$.
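A small sketch of the commenter’s framing (the function names are mine): surprise in bits, with entropy as expected surprise.

```python
import math

def surprise_bits(p):
    """Surprise s = -log2 P of an outcome with probability p."""
    return -math.log2(p)

def entropy_bits(dist):
    """Expected surprise of a distribution given as {outcome: probability}."""
    return sum(p * surprise_bits(p) for p in dist.values() if p > 0)

# For a 90%-heads flip, seeing T is far more surprising (~3.3 bits) than
# the ~0.47 bits of surprise the hypothesis expects on an average flip.
flip = {"H": 0.9, "T": 0.1}
print(round(entropy_bits(flip), 3))        # 0.469
print(round(surprise_bits(flip["T"]), 3))  # 3.322
```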

1. The section on “Distinction from frequentist p-values” is, I think, both technically incorrect and a bit uncharitable.

• It’s technically incorrect because the following isn’t true:

The classical frequentist test for rejecting the null hypothesis involves considering the probability assigned to particular ‘obvious’-seeming partitions of the data, and asking if we ended up inside a low-probability partition.

Actually, the classical frequentist test involves specifying an obvious-seeming measure of surprise $$t(d)$$, and seeing whether $$t$$ is higher than expected on $$H$$. This is even more arbitrary than the above.

• On the other hand, it’s uncharitable because it’s widely acknowledged that one should try to choose $$t$$ to be sufficient, which is exactly the condition that the partition induced by $$t$$ is “compatible” with $$\Pr(d \mid H)$$ for different $$H,$$ in the sense that $$\Pr(H \mid d) = \Pr(H \mid t(d))$$ for all the considered $$H$$.

Clearly $$s$$ is sufficient in this sense. But there might be simpler functions of $$d$$ that do the job too (“minimal sufficient statistics”).

Note that $$t$$ being sufficient doesn’t make it non-arbitrary, as it may not be a monotone function of $$s$$.

2. Finally, I think that this concept is clearly “extra-Bayesian”, in the sense that it’s about non-probabilistic (“Knightian”) uncertainty over $$H,$$ and one is considering probabilities attached to unobserved $$d$$ (i.e., not conditioning on the observed $$d$$).

I don’t think being “extra-Bayesian” in this sense is problematic. But I think it should be owned up to.

Actually, “unexpected surprise” reveals a nice connection between Bayesian and sampling-based uncertainty intervals:

• To get a (HPD) credible interval, exclude those $$H$$ that are relatively surprised by the observed $$d$$ (or which are a priori surprising).

• To get a (nice) confidence interval, exclude those $$H$$ that are “unexpectedly surprised” by $$d$$.

• In the paragraph 4th from last, the page says the sequence HHHHHT is assigned a specific probability by the Rule of Succession, but doesn’t explain where that number comes from. I do understand the part about that same sequence being assigned 1/64 by the fair coin hypothesis, but the part about the Rule of Succession isn’t so clear to me.

The second example, in the paragraph 2nd from last, is also confusing to me: the part that says that the sequence HHHHH HTHHH HHTHH gives the Bayesian a 19.5 : 1 chance of the coin being biased vs. it being fair.