Identifying ambiguous inductions

One of the old fables in machine learning is the story of the “tank classifier”—a neural network that had supposedly been trained to detect enemy tanks hiding in a forest. It turned out that all the photos of enemy tanks had been taken on sunny days and all the photos of the same field without the tanks had been taken on cloudy days, meaning that the neural net had really just trained itself to recognize the difference between sunny and cloudy days (or just the difference between bright and dim pictures). (Source.)

We could view this problem as follows: A human looking at the labeled data might have seen several concepts that someone might be trying to point at—tanks vs. no tanks, cloudy vs. sunny days, or bright vs. dim pictures. A human might then ask, “Which of these possible categories did you mean?” and describe the difference using words; or, if it was easier for them to generate pictures than to talk, generate new pictures that distinguished among the possible concepts that could have been meant. Since learning a simple boundary that separates positive from negative instances in the training data is a form of induction, we could call this problem noticing “inductive ambiguities” or “ambiguous inductions”.

This problem bears some resemblance to numerous setups in computer science where we can query an oracle about how to classify instances and we want to learn the concept boundary using a minimum number of instances. However, identifying an “inductive ambiguity” doesn’t seem to be exactly the same problem, or at least, it’s not obviously the same problem. Suppose we consider the tank-classifier problem. Distinguishing levels of illumination in the picture is a very simple concept, so it would probably be the first one learned; then, treating the problem in classical oracle-query terms, we might imagine the AI presenting the user with various random pixel fields at intermediate levels of illumination. The user, not having any idea what’s going on, classifies these intermediate levels of illumination as ‘not tanks’, and so the AI soon learns that only quite sunny levels of illumination are required.

Perhaps what we want is less like “figure out exactly where the concept boundary lies by querying the edge cases to the oracle, assuming our basic idea about the boundary is correct” and more like “notice when there’s more than one plausible idea that describes the boundary” or “figure out if the user could have been trying to communicate more than one plausible idea using the training dataset”.
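To make this concrete, here is a minimal toy sketch of an ambiguous induction. All features, numbers, and function names are invented for illustration: two simple hypotheses both perfectly separate the training data, and a generated probe instance (a bright picture with no tank) tells them apart.

```python
# Toy illustration of an ambiguous induction: two hypotheses consistent
# with the same training data, plus a probe input on which they disagree.
# Features and numbers are invented for illustration only.

# Each training image is (brightness, has_tank_shape); label 1 = "tank".
# As in the fable, every tank photo happened to be taken on a sunny day.
train = [
    ((0.90, 1), 1), ((0.80, 1), 1), ((0.85, 1), 1),   # sunny, tank
    ((0.20, 0), 0), ((0.30, 0), 0), ((0.25, 0), 0),   # cloudy, no tank
]

def hyp_brightness(x):
    # "It's a tank if the picture is bright."
    return 1 if x[0] > 0.5 else 0

def hyp_tank_shape(x):
    # "It's a tank if a tank-like shape is present."
    return 1 if x[1] == 1 else 0

hypotheses = [hyp_brightness, hyp_tank_shape]

# Both hypotheses fit every training example...
consistent = all(h(x) == y for h in hypotheses for x, y in train)

# ...but a generated probe (bright day, no tank) distinguishes them,
# so the induction is ambiguous and should be referred back to the user.
probe = (0.9, 0)
ambiguous = len({h(probe) for h in hypotheses}) > 1
```

Here `consistent` comes out true and `ambiguous` comes out true: the training data alone cannot decide which concept the user meant.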

Possible approaches

Some possibly relevant approaches that might feed into the notion of “identifying inductive ambiguities”:

  • Conservatism. Can we draw a much narrower, but somewhat more complicated, boundary around the training data?

  • Can we get a concept that more strongly or more tightly predicts the training cases we saw? (Closely related to conservatism—if we suppose there’s a generator for the training cases, then a more conservative generator concentrates more probability density onto the training cases we happened to see.)

  • Can we detect commonalities in the positive training cases that aren’t already present in the concept we’ve learned?

  • This might be a good fit for something like a generative adversarial approach, where we generate random instances of the concept we learned, then ask if we can detect the difference between those random instances and the actual positively labeled training cases.

  • Is there a way to blank out the concept we’ve already learned so that it doesn’t just get learned again, and ask if there’s a different concept that’s learnable instead? That is, whatever algorithm we’re using, is there a good way to tell it “Don’t learn this concept; now try to learn again” and see if it can learn something substantially different?

  • Something something Gricean implication.
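The conservatism-as-likelihood point above can be illustrated with a toy calculation. In this invented setup, images are 4-bit feature vectors, and we compare two uniform generators for the positive cases: a broad concept (“any bright image”) and a narrower one (“bright image containing a tank”). The narrower generator concentrates more probability onto the positives we actually saw.

```python
# Sketch of "a more conservative generator concentrates more probability
# density onto the training cases we happened to see".  The feature
# encoding and concept definitions are invented for illustration.
from itertools import product

images = list(product([0, 1], repeat=4))   # (bright, tank, f3, f4)

broad  = [x for x in images if x[0] == 1]                 # 8 bright images
narrow = [x for x in images if x[0] == 1 and x[1] == 1]   # 4 bright+tank images

def likelihood(concept, cases):
    """Probability a uniform generator over `concept` emits exactly `cases`."""
    p = 1.0 / len(concept)
    out = 1.0
    for c in cases:
        out *= p if c in concept else 0.0
    return out

# The positive training cases we actually saw (all bright AND tank):
train_pos = [(1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 0, 1)]

L_broad  = likelihood(broad, train_pos)    # (1/8)^3
L_narrow = likelihood(narrow, train_pos)   # (1/4)^3
```

Since `L_narrow` exceeds `L_broad`, a likelihood comparison favors the narrower, more conservative concept as a description of the positives.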
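The generative adversarial bullet above can also be sketched. In this synthetic example (all data and parameters invented for illustration), the learned concept is “bright image”, so samples from it have a random tank feature, while the real positives always contain a tank. A tiny logistic-regression discriminator then separates the two well above chance, flagging structure in the positives that the learned concept missed.

```python
# Sketch of the adversarial check: sample from the concept we learned,
# then train a discriminator to tell those samples from the real positive
# training cases.  If it succeeds, the positives share structure the
# learned concept fails to capture.  Data here is synthetic and invented.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Real positives: bright AND tank-shape present.
real = np.column_stack([rng.normal(0.8, 0.05, n), np.ones(n)])

# Samples from the learned "brightness" concept: bright, tank feature random.
fake = np.column_stack([rng.normal(0.8, 0.05, n),
                        rng.integers(0, 2, n).astype(float)])

X = np.vstack([real, fake])
y = np.concatenate([np.ones(n), np.zeros(n)])   # 1 = real positive

# Tiny logistic-regression discriminator trained by gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

pred = 1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5
acc = np.mean(pred == (y == 1))
# Accuracy well above 50%: the discriminator has latched onto the tank
# feature, flagging an inductive ambiguity worth referring to the user.
```

Accuracy near 100% is impossible here (half the generated samples happen to contain tanks and are indistinguishable from real positives), but anything clearly above chance signals an unexplained commonality in the positives.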

Relevance in value alignment

Since inductive ambiguities are meant to be referred to the user for resolution rather than resolved automatically (the whole point is that the necessary data for an automatic resolution isn’t there), they’re instances of “user queries”, and all standard worries about user queries would apply.

The hope for a good algorithm for identifying inductive ambiguities is that it would help catch edge instantiations and unforeseen maximums, and maybe also simple errors of communication.

