Strictly confused

A hy­poth­e­sis is “strictly con­fused” by the data if the hy­poth­e­sis does much worse at pre­dict­ing the data than it ex­pected to do. If, on av­er­age, you ex­pect to as­sign around 1% like­li­hood to the ex­act ob­ser­va­tion you see, and you ac­tu­ally see some­thing to which you as­signed 0.000001% like­li­hood, you are strictly con­fused.

I.e., letting \(H\) be a hypothesis and \(e_0\) be the data observed from some set \(E\) of possible observations, we say that \(H\) is “strictly confused” when

$$ \log \mathbb P(e_0 \mid H) \ll \sum_{e \in E} \mathbb P(e \mid H) \cdot \log \mathbb P(e \mid H)$$
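As a rough illustration of the inequality above, the following sketch compares the log-likelihood of the actual observation with the log-likelihood the hypothesis expected to score. The function names and the toy distribution `H` are hypothetical, and how large a gap counts as "much less than" is left to the reader; this is one way to compute the two sides, not a canonical test.

```python
import math

def expected_log_likelihood(dist):
    """Sum over e of P(e|H) * log P(e|H): the score H expects to get, in nats."""
    return sum(p * math.log(p) for p in dist.values() if p > 0)

def confusion_gap(dist, observed):
    """Expected minus observed log-likelihood.

    A large positive gap means H did much worse than it expected to do.
    Assumes dist[observed] > 0; a zero-probability observation would
    raise ValueError from math.log.
    """
    return expected_log_likelihood(dist) - math.log(dist[observed])

# Hypothetical toy hypothesis over three possible observations.
H = {"a": 0.90, "b": 0.0999, "c": 0.0001}

gap_a = confusion_gap(H, "a")  # negative: H did better than it expected
gap_c = confusion_gap(H, "c")  # large and positive: strictly confused
```

The gap is measured in nats here; dividing by `math.log(2)` would give bits instead.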


Mo­ti­va­tion and examples

In Bayesian rea­son­ing, the main rea­son to re­ject a hy­poth­e­sis is when we find a bet­ter hy­poth­e­sis. Sup­pose we think a coin is fair, and we flip it 100 times, and we see that the coin comes up “HHHHHHH…” or all heads. After do­ing this 100 times, the hy­poth­e­sis “This is a dou­ble-headed coin” has a like­li­hood ra­tio of \(2^{100} : 1\) fa­vor­ing it over the “fair coin” hy­poth­e­sis, and the “dou­ble-headed coin” hy­poth­e­sis isn’t more im­prob­a­ble than \(2^{-100}\) a pri­ori.

But this re­lies on the in­sight that there’s a sim­ple /​ a pri­ori plau­si­ble al­ter­na­tive hy­poth­e­sis that does bet­ter. What if the coin is pro­duc­ing TTHHTTHHTTHH and we just never hap­pen to think of ‘al­ter­nat­ing pairs of tails and heads’ as a hy­poth­e­sis? It’s pos­si­ble to do bet­ter by think­ing of a bet­ter hy­poth­e­sis, but so far as the ‘fair coin’ hy­poth­e­sis sees the world, TTHHTTHH… is no more or less likely than any other pos­si­ble se­quence it could en­counter; the first eight coin­flips have a prob­a­bil­ity of \(2^{-8}\) and this would have been true no mat­ter which eight coin­flips were ob­served. After ob­serv­ing 100 coin­flips, the fair coin hy­poth­e­sis will as­sign them a col­lec­tive prob­a­bil­ity of \(2^{-100},\) and in this sense, no se­quence of 100 coin­flips is any more ‘sur­pris­ing’ or ‘con­fus­ing’ than any other from within the per­spec­tive of the fair coin hy­poth­e­sis.

We can’t say that we’re ‘con­fused’ or ‘sur­prised’ on see­ing a long se­quence of coin­flips to which we as­signed some very low prob­a­bil­ity on the or­der of \(2^{-100} \approx 10^{-30},\) be­cause we ex­pected to as­sign a prob­a­bil­ity that low.

On the other hand, sup­pose we think that a coin is bi­ased to pro­duce 90% heads and 10% tails, and we flip it 100 times and get some fair-look­ing se­quence like “THHTTTHTTTTHTHTHHH…” (cour­tesy of ran­dom.org). Then we ex­pected to as­sign the ob­served se­quence a prob­a­bil­ity in the range of \(0.9^{90} \cdot 0.1^{10} \approx 7\cdot 10^{-15},\) but we ac­tu­ally saw a se­quence we as­signed prob­a­bil­ity around \(0.9^{50} \cdot 0.1^{50} \approx 5 \cdot 10^{-53}.\) We don’t need to con­sider any other hy­pothe­ses to re­al­ize that we are very con­fused. We don’t need to have in­vented the con­cept of a ‘fair coin’, or know that the ‘fair coin’ hy­poth­e­sis would have as­signed a much higher like­li­hood in the re­gion of \(7 \cdot 10^{-31},\) to re­al­ize that there’s some­thing wrong with the cur­rent hy­poth­e­sis.
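The arithmetic in the two coin examples can be checked with a short sketch. It assumes independent flips and works in base-10 logarithms so the results line up with the powers of ten quoted in the text; the helper names are made up for illustration.

```python
import math

def expected_log10(p_heads, n):
    """Expected log10-likelihood of n independent flips under P(heads) = p_heads."""
    q = 1.0 - p_heads
    return n * (p_heads * math.log10(p_heads) + q * math.log10(q))

def observed_log10(p_heads, heads, tails):
    """log10-likelihood of one exact sequence with the given head/tail counts."""
    return heads * math.log10(p_heads) + tails * math.log10(1.0 - p_heads)

# 90%-heads hypothesis, fair-looking data (50 heads, 50 tails).
biased_expected = expected_log10(0.9, 100)     # about -14.1
biased_observed = observed_log10(0.9, 50, 50)  # about -52.3

# Fair-coin hypothesis: every 100-flip sequence scores the same.
fair_expected = expected_log10(0.5, 100)       # about -30.1
fair_observed = observed_log10(0.5, 50, 50)    # about -30.1
```

The 90%-heads hypothesis falls short of its own expectation by roughly 38 orders of magnitude, while the fair-coin hypothesis scores what it expected on any sequence.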

In the case of the supposed fair coin that produces HHHHHHH, we only do poorly relative to a better hypothesis ‘all heads’ that makes a superior prediction. In the case of the supposed 90%-heads coin that produces a random-looking sequence, we do worse than we expected to do from inside the 90%-heads hypothesis, so we are doing poorly in an absolute, non-relative sense.

Be­ing strictly con­fused is a sign that tells us to look for some al­ter­na­tive hy­poth­e­sis in ad­vance of our hav­ing any idea what­so­ever what that al­ter­na­tive hy­poth­e­sis might be.

Distinc­tion from fre­quen­tist p-values

The clas­si­cal fre­quen­tist test for re­ject­ing the null hy­poth­e­sis in­volves con­sid­er­ing the prob­a­bil­ity as­signed to par­tic­u­lar ‘ob­vi­ous’-seem­ing par­ti­tions of the data, and ask­ing if we ended up in­side a low-prob­a­bil­ity par­ti­tion.

Sup­pose you think some coin is fair, and you flip the coin 100 times and see a ran­dom-look­ing se­quence “THHTTTHTT…”

Some­one comes along and says, “You know, this re­sult is very sur­pris­ing, given your ‘fair coin’ the­ory. You re­ally didn’t ex­pect that to hap­pen.”

“How so?” you re­ply.

They say, “Well, among all sequences of 100 coin flips, only 1 in 16 such sequences start with a string like THHT TTHTT, a palindromic quartet followed by a palindromic quintet. You confidently predicted that had a 15/16 chance of not happening, and then you were surprised.”

“Okay, look,” you reply, “if you’d written down that particular prediction in advance and not a lot of others, I might be interested. Like, if I’d already thought that way of partitioning the data (namely, ‘palindrome quartet followed by palindrome quintet’ vs. ‘not palindrome quartet followed by palindrome quintet’) was a specially interesting and distinguished one, I might notice that I’d assigned the second partition 15/16 probability and then it failed to actually happen. As it is, it seems like you’re really reaching.”

We can think of the frequentist tests for rejecting the fair-coin hypothesis as a small set of ‘interesting partitions’ that were written down in advance, which are supposed to have low probability given the fair coin. For example, if a coin produces HHHHH HTHHH HHTHH, a frequentist says, “Partitioning by number of heads, the fair coin hypothesis says that on 15 flips we should get between 3 and 12 heads, inclusive, with a probability of about 99.3%. You are therefore surprised because this event you assigned 99.3% probability failed to happen. And yes, we’re just checking the number of heads and a few other obvious things, not for palindromic quartets followed by palindromic quintets.”
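The probability that a fair coin lands in the ‘between 3 and 12 heads’ partition on 15 flips can be computed exactly from binomial coefficients. The sketch below uses a hypothetical helper name and assumes independent fair flips.

```python
from fractions import Fraction
from math import comb

def prob_heads_between(n, lo, hi):
    """P(lo <= number of heads <= hi) for n independent fair-coin flips."""
    favorable = sum(comb(n, k) for k in range(lo, hi + 1))
    return Fraction(favorable, 2 ** n)

inside = prob_heads_between(15, 3, 12)  # 16263/16384, about 0.9926
outside = 1 - inside                    # 121/16384, about 0.0074

# HHHHH HTHHH HHTHH has 13 heads, so it falls in the 'outside' partition.
observed_heads = "HHHHHHTHHHHHTHH".count("H")
```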

Part of the point of be­ing a Bayesian, how­ever, is that we try to only rea­son on the data we ac­tu­ally ob­served, and not put that data into par­tic­u­lar par­ti­tions and rea­son about those par­ti­tions. The par­ti­tion­ing pro­cess in­tro­duces po­ten­tial sub­jec­tivity, es­pe­cially in an aca­demic set­ting fraught with pow­er­ful in­cen­tives to pro­duce ‘statis­ti­cally sig­nifi­cant’ data—the equiv­a­lent of some­body in­sist­ing that pal­in­dromic quar­tets and quin­tets are spe­cial, or that count­ing heads isn’t spe­cial.

E.g., if we flip a coin six times and get HHHHHT, this is “statistically significant p < 0.05” if the researcher decided to flip coins until they got at least one T and then stop, in which case a fair coin has only a 1/32 probability of requiring six or more steps to produce a T. If on the other hand the researcher decided to flip the coin six times and then count the number of tails, the probability of getting 1 or fewer T in six flips is 7/64 (about 0.11), which is not ‘statistically significant’.
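The two p-values for the same data HHHHHT can be reproduced with exact fractions. This is only a sketch of the two stopping rules described above, with illustrative variable names.

```python
from fractions import Fraction
from math import comb

# Rule A: flip until the first tail, then stop.
# Needing six or more flips means the first five flips were all heads.
p_stop_at_first_tail = Fraction(1, 2) ** 5

# Rule B: flip exactly six times, then count tails.
# The tail event is 'one or fewer tails in six flips'.
p_fixed_six_flips = Fraction(comb(6, 0) + comb(6, 1), 2 ** 6)

significant_a = p_stop_at_first_tail < Fraction(5, 100)  # True
significant_b = p_fixed_six_flips < Fraction(5, 100)     # False
```

Identical observations give different verdicts because each rule defines a different partition of outcomes that could have happened but did not.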

The Bayesian says, “If I use the Rule of Succession to denote the hypothesis that the coin has an unknown bias between 0 and 1, then the sequence HHHHHT is assigned 1/42 probability by the Rule of Succession and 1/64 probability by ‘fair coin’, so this is evidence with a likelihood ratio of ~ 1.5 : 1 favoring the hypothesis that the coin is biased, which is not enough to overcome any significant prior improbability.”

The Bayesian ar­rives at this judg­ment by only con­sid­er­ing the par­tic­u­lar, ex­act data that was ob­served, and not any larger par­ti­tions of data. To com­pute the prob­a­bil­ity flow be­tween two hy­pothe­ses \(H_1\) and \(H_2\) we only need to know the like­li­hoods of our ex­act ob­ser­va­tion given those two hy­pothe­ses, not the like­li­hoods the hy­pothe­ses as­sign to any par­ti­tions into which that ob­ser­va­tion can be put, etcetera.

Similarly, the Bayesian looks at the sequence HHHHH HTHHH HHTHH and says: this specific, exact data that we observed gives us a likelihood ratio of (1/1680 : 1/32768) ~ (19.5 : 1) favoring “The coin has an unknown bias between 0 and 1” over “The coin is fair”. With that already said, the Bayesian doesn’t see any need to talk about the total probability of the fair coin hypothesis producing data inside a partition of similar results that could have been observed but weren’t.
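The exact-sequence likelihoods used in these Bayesian comparisons have a closed form. Assuming a uniform prior over the coin's bias (the setting of the Rule of Succession), a specific sequence with \(h\) heads and \(t\) tails has probability \(h!\,t!/(h+t+1)!\). The sketch below uses hypothetical function names and exact fractions.

```python
from fractions import Fraction
from math import factorial

def succession_likelihood(heads, tails):
    """P(one exact sequence) under a uniform prior on the bias: h! t! / (h+t+1)!"""
    return Fraction(factorial(heads) * factorial(tails),
                    factorial(heads + tails + 1))

def fair_likelihood(n_flips):
    """P(one exact sequence of n_flips) under the fair-coin hypothesis."""
    return Fraction(1, 2 ** n_flips)

def likelihood_ratio(heads, tails):
    """Unknown-bias likelihood divided by fair-coin likelihood for the exact data."""
    return succession_likelihood(heads, tails) / fair_likelihood(heads + tails)

ratio_six = likelihood_ratio(5, 1)       # HHHHHT: (1/42) / (1/64), about 1.5
ratio_fifteen = likelihood_ratio(13, 2)  # 13 heads, 2 tails: about 19.5
```

Only the likelihoods of the observed sequence enter the calculation; no stopping rule or partition of unobserved outcomes is needed.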

But even though Bayesians usually try to avoid thinking in terms of rejecting a null hypothesis using partitions, saying “I’m strictly confused!” gives a Bayesian a way of saying “Well, I know something’s wrong…” that doesn’t require already having the insight to propose a better alternative, or even the insight to realize that some particular partitioning of the data is worth special attention.
