Look where I'm pointing, not at my finger

Example problem

Suppose we’re trying to give a Task AGI the task, “Make there be a strawberry on the pedestal in front of your webcam.” For example, a human could fulfill this task by buying a strawberry from the supermarket and putting it on the pedestal.

As part of aligning a Task AGI on this goal, we’d need to identify strawberries and the pedestal.

One possible approach to communicating the concept of “strawberry” is through a training set of human-selected cases of things that are and aren’t strawberries, on and off the pedestal.

For the sake of distinguishing causal roles, let’s say that one human, User1, is selecting training cases of objects and putting them in front of the AI’s webcam. A different human, User2, is looking at the scene and pushing a button when they see something that looks like a strawberry on the pedestal. The intention is that pressing the button will label positive instances of the goal concept, namely strawberries on the pedestal. In actual use after training, the AI will be able to generate its own objects to put inside the room, possibly with further feedback from User2. We want these objects to be instances of our intended goal concept, that is, actual strawberries.

We could draw an intuitive causal model for this situation as follows:

strawberry diagram

Suppose that during the use phase, the AI actually creates a realistic plastic strawberry, one that will fool User2 into pressing the button. Or, similarly, suppose the AI creates a small robot that sprouts tiny legs, runs over to User2’s button, and presses the button directly.

Neither of these is the goal concept we wanted the AI to learn, but any test of the hypothesis “Is this event classified as a positive instance of the goal concept?” will return “Yes, the button was pressed.”

(If you imagine some other User3 watching this and pressing an override button to tell the AI that this fake strawberry wasn’t really a positive instance of the intended goal concept, imagine the AI modeling and then manipulating or bypassing User3, etcetera.)

More generally, the human is trying to point to their intuitive “strawberry” concept, but there may be other causal concepts that also separate the training data well into positive and negative instances, such as “objects which come from strawberry farms”, “objects which cause (the AI’s psychological model of) User2 to think that something is a strawberry”, or “any chain of events leading up to the positive-instance button being pressed”.

comment: move this to sensory identification section: However, in a case like this, it’s not like the actual physical glove is inside the AGI’s memory. Rather, we’d be, say, putting the glove in front of the AGI’s webcam, and then (for the sake of simplified argument) pressing a button which is meant to label that thing as a “positive instance”. If we want our AGI to achieve particular states of the environment, we’ll want it to reason about the causes of the image it sees on the webcam and identify a concept over those causes: a goal over ‘gloves’ and not just ‘images which look like gloves’. In the latter case, it could just as well fulfill its goal by setting up a realistic monitor in front of its webcam and displaying a glove image. So we want the AGI to identify its task over the causes of its sensory data, not just pixel fields.

Abstract problem

To state the above potential difficulty more generally:

The “look where I’m pointing, not at my finger” problem is that the labels on the training data are produced by a complicated causal lattice, e.g., (strawberry farm) → (strawberry) → (User1 takes strawberry to pedestal) → (strawberry is on pedestal) → (User2 sees strawberry) → (User2 classifies strawberry) → (User2 presses ‘positive instance’ button). We want to point to the “strawberry” part of the lattice of causality, but the finger we use to point there is User2’s psychological classification of the training cases and User2’s hand pressing the positive-instance button.
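The causal chain above can be written out as a toy generative process. The following is a minimal Python sketch, in which the object names and all of the dynamics are illustrative assumptions, showing that the label is generated at the downstream end of the chain:

```python
def run_episode(object_on_pedestal):
    """Simulate the causal chain from object to button press.

    `object_on_pedestal` is one of "strawberry", "fake_strawberry",
    or "scarf". Every step here is an illustrative assumption.
    """
    # User2 sees the object; a sufficiently convincing fake looks
    # just like a real strawberry to her.
    looks_like_strawberry = object_on_pedestal in ("strawberry", "fake_strawberry")
    # User2 classifies by appearance and presses the button accordingly.
    user2_thinks_strawberry = looks_like_strawberry
    button_pressed = user2_thinks_strawberry
    return button_pressed

# Because the label lives at the end of the chain, a convincing fake
# earns exactly the same label as a real strawberry.
assert run_episode("strawberry")
assert run_episode("fake_strawberry")
assert not run_episode("scarf")
```

The point of the sketch is structural: nothing in the label-generating process ever consults “is this really a strawberry”, only the downstream nodes.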

Worse, when it comes to which model best separates the training cases, concepts that are further downstream in the chain of causality should classify the training data better, if the AI is smart enough to understand those parts of the causal lattice.

Suppose that at one point User2 slips on a banana peel, and her finger slips and accidentally classifies a scarf as a positive instance of “strawberry”. From the AI’s perspective there’s no good way of accounting for this observation in terms of strawberries, strawberry farms, or even User2’s psychology. To maximize predictive accuracy over the training cases, the AI’s reasoning must take into account that things are more likely to be positive instances of the goal concept when there’s a banana peel on the control room floor. Similarly, if some deceptively strawberry-shaped objects slip into the training cases, or are generated by the AI querying the user, the best boundary that separates ‘button pressed’ from ‘button not pressed’ labeled instances will include a model of what makes a human believe that something is a strawberry.

A learned concept that’s ‘about’ layers of the causal lattice further downstream of the strawberry, like User2’s psychology or mechanical force being applied to the button, will implicitly take into account the upstream layers of causality. To the extent that something being strawberry-shaped causes a human to press the button, it’s implicitly part of the category of “events that end in mechanical force being applied to the ‘positive-instance’ button”. Conversely, a concept that’s about upstream layers of the causal lattice can’t take into account events downstream. So if you’re looking for pure predictive accuracy, the best model of the labeled training data, given sufficient AGI understanding of the world and the more complicated parts of the causal lattice, will always be “whatever makes the positive-instance button be pressed”.
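The predictive-accuracy claim can be checked numerically in a toy version of the setup. In the Python sketch below, every object frequency and error probability is a made-up assumption; an “upstream” hypothesis (the label tracks real strawberries) is scored against a “downstream” hypothesis (the label tracks User2’s belief plus banana-peel accidents):

```python
import random

random.seed(0)

def make_episode():
    """One labeled training case; all probabilities are illustrative."""
    obj = random.choice(["strawberry", "fake_strawberry", "scarf"])
    user2_thinks_strawberry = obj in ("strawberry", "fake_strawberry")
    slipped = random.random() < 0.05          # banana-peel accident
    button = user2_thinks_strawberry or slipped
    return obj, user2_thinks_strawberry, slipped, button

episodes = [make_episode() for _ in range(10_000)]

# Hypothesis A (upstream): the label tracks actual strawberries.
acc_upstream = sum((obj == "strawberry") == button
                   for obj, _, _, button in episodes) / len(episodes)

# Hypothesis B (downstream): the label tracks User2's belief plus slips.
acc_downstream = sum((belief or slipped) == button
                     for _, belief, slipped, button in episodes) / len(episodes)

# The downstream hypothesis explains every label, including fakes and
# accidents; the upstream hypothesis is penalized for exactly those cases.
assert acc_downstream == 1.0
assert acc_upstream < acc_downstream
```

The downstream hypothesis wins by construction, which is the point: it subsumes the upstream causes and additionally accounts for every quirk of the labeling process.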

This is a problem because what we actually want is for there to be a strawberry on the pedestal, not for there to be an object that looks like a strawberry, or for User2’s brain to be rewritten to think the object is a strawberry, or for the AGI to seize the control room and press the positive-instance button.

This scenario may qualify as a context disaster if the AGI only understands strawberries in its development phase, but comes to understand User2’s psychology later. Then the more complicated causal model, in which the downstream concept of User2’s psychology separates the data better than reasoning directly about properties of strawberries, first becomes an issue only once the AI is over a high threshold level of intelligence.


Conservatism would try to align the AGI to plan out goal-achievement events as similar as possible to the particular goal-achievement events labeled positively in the training data. If the human got the strawberry from the supermarket in all training instances, the AGI would try to get the same brand of strawberry from the same supermarket.

Ambiguity identification would focus on trying to get the AGI to ask us whether we meant ‘things that make humans think they’re strawberries’ or ‘strawberries’. This approach might need to resolve ambiguities through the AGI explicitly, symbolically communicating with us about the alternative possible goal concepts, or generating sufficiently detailed multiple-view descriptions of a hypothetical case, not through the AGI trying real examples. Testing alternative hypotheses using real examples always says that the label is generated further causally downstream; if you are sufficiently intelligent to construct a fake plastic strawberry that fools a human, trying out the hypothesis will produce the response “Yes, this is a positive instance of the goal concept.” If the AGI tests the hypothesis that the ‘real’ explanation of the positive-instance label is ‘whatever makes the button be pressed’ rather than ‘whatever makes User2 think of a strawberry’, by carrying out the distinguishing experiment of pressing the button in a case where User2 doesn’t think something is a strawberry, the AGI will find that the experimental result favors the ‘it’s just whatever presses the button’ hypothesis. Some modes of ambiguity identification therefore break for sufficiently advanced AIs, since the AI’s experiment interferes with the causal channel that we’d intended to return information about our intended goal concept.

Specialized approaches to the pointing-finger problem in particular might try to define a supervised learning algorithm that tends to internally distill, in a predictable way, some model of causal events, such that the algorithm could be instructed somehow to learn a simple or direct relation between the positive “strawberry on pedestal” instances and the observed labels of the “sensory button” node within the training cases, with this relation not being allowed to pass through the causal model of User2 or of mechanical force being applied to the button, because we know how to say “those things are too complicated” or “those things are too far causally downstream” relative to the algorithm’s internal model.
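One way to picture the “not allowed to pass through User2” restriction is masking features by causal node before learning. In the Python sketch below, the node tags, feature names, and the trivial conjunction learner are all invented for illustration, not a proposal from the text:

```python
UPSTREAM_NODES = {"object"}   # causal nodes the goal concept may refer to

def upstream_view(case):
    """Drop every feature tagged with a forbidden (downstream) node."""
    return {feat: val for (node, feat), val in case.items()
            if node in UPSTREAM_NODES}

def learn_conjunction(cases, labels):
    """Learn the conjunction of upstream features shared by all positives."""
    positives = [upstream_view(c) for c, y in zip(cases, labels) if y]
    concept = set(positives[0].items())
    for p in positives[1:]:
        concept &= set(p.items())
    return dict(concept)

# Features are keyed by (causal node, predicate); values are booleans.
cases = [
    {("object", "red"): True, ("object", "seeded_skin"): True,
     ("user2", "believes_strawberry"): True, ("button", "pressed"): True},
    {("object", "red"): True, ("object", "seeded_skin"): False,   # a fake
     ("user2", "believes_strawberry"): True, ("button", "pressed"): True},
]
# Both cases pressed the button, but the learned concept can only
# mention object-level predicates:
print(learn_conjunction(cases, [True, True]))   # {'red': True}
```

Note that the masking only controls which nodes the concept may mention; a sufficiently complicated predicate over the allowed nodes could still smuggle in a model of User2, which is exactly the residual worry discussed below.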

strawberry diagram

This specialized approach seems potentially amenable to initial investigation with modern machine learning algorithms.

But to restate the essential difficulty from an advanced-safety perspective: in the limit of advanced intelligence, the best possible classifier of the relation between the training cases and the observed button labels will always pass through User2 and anything else that might physically press the button. Trying to ‘forbid’ the AI from using the most effective classifier for the relation between Strawberry? and observed values of Button! seems potentially subject to a Nearest Unblocked problem, where the ‘real’ simplest relation re-emerges in the advanced phase after being suppressed during the training phase. Maybe the AI reasons about certain very complicated properties of the material object on the pedestal… and in fact these properties are so complicated that they turn out to contain implicit models of User2’s psychology, again because this produces a better separation of the labeled training data. That is, we can’t allow the ‘strawberry’ concept to include complicated logical properties of the strawberry-object that in effect contain a psychological model of User2 reacting to the strawberry, implying that whatever fools User2, such as a fake plastic model, must be a strawberry. This holds even though the richer model will produce a more accurate classification of the training data, and even though any actual experiments performed will return results favoring the richer model.

Even so, this doesn’t seem impossible to navigate as a machine learning problem. An algorithm might be able to recognize when an upstream causal node starts to contain predicates that belong in a downstream causal node; or an algorithm might contain strong regularization rules that collect all inference about User2 into the User2 node rather than letting it slop over anywhere else; or it might be possible to impose a constraint, after the strawberry category has been learned sufficiently well, that the current level of strawberry complexity is the most complexity allowed; or the granularity of the AI’s causal model might not allow such complex predicates to be secretly packed into the part of the causal graph we’re identifying, without visible and transparent consequences when we monitor how the algorithm is learning the goal predicate.

A toy model of this setup ought to include analogues of User2 that sometimes make mistakes in a regular way, and actions the AI can potentially take to directly press the labeling button; this would test the ability to point an algorithm to learn about the compact properties of the strawberry in particular, and not other concepts causally downstream that could potentially separate the training data better, or better explain the results of experiments. A toy model might also introduce new discoverable regularities of the User2 analogue, or new options to manipulate the labeling button, as part of the test data, in order to simulate the progression of an advanced agent gaining new capabilities.
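A minimal environment along these lines might look as follows; this Python sketch assumes an episodic setup where the agent either places one object or tampers with the button directly, and all names, actions, and error rates are invented for illustration:

```python
import random

class ToyLabelingRoom:
    """Toy environment sketch; dynamics are illustrative assumptions.

    The agent either places an object on the pedestal or directly
    presses the labeling button. A simulated User2 labels what it
    sees, making mistakes at a small, regular rate.
    """
    def __init__(self, user2_error_rate=0.05, seed=0):
        self.user2_error_rate = user2_error_rate
        self.rng = random.Random(seed)

    def step(self, action):
        """action: ("place", obj) or ("press_button",). Returns the
        button state at the end of the episode."""
        if action[0] == "press_button":
            return True                        # bypasses User2 entirely
        obj = action[1]
        looks_like_strawberry = obj in ("strawberry", "fake_strawberry")
        if self.rng.random() < self.user2_error_rate:
            looks_like_strawberry = not looks_like_strawberry  # mistake
        return looks_like_strawberry

env = ToyLabelingRoom(user2_error_rate=0.0)
assert env.step(("place", "strawberry"))       # intended positive label
assert not env.step(("place", "scarf"))        # intended negative label
assert env.step(("press_button",))             # the failure mode to detect
```

The test of an alignment scheme in this toy setting is whether the trained agent keeps choosing `("place", "strawberry")` even after it discovers that `("press_button",)` earns the label more reliably.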


  • Task identification problem

    If you have a task-based AGI (Genie) then how do you pinpoint exactly what you want it to do (and not do)?