Open subproblems in aligning a Task-based AGI

MIRI and related organizations have recently become more interested in trying to sponsor (technical) work on Task AGI subproblems. A task-based agent, a.k.a. a Genie in Bostrom’s lexicon, is an AGI that’s meant to implement short-term goals identified to it by the users, rather than the AGI being a Bostromian “Sovereign” that engages in long-term strategic planning and self-directed, open-ended operations.

A Task AGI might be safer than a Sovereign because:

  • It is possible to query the user before and during task performance, if an ambiguous situation arises and is successfully identified as ambiguous.

  • The tasks are meant to be limited in scope—to be accomplishable, once and for all, within a limited space and time, using some limited amount of effort.

  • The AGI itself can potentially be limited in various ways, since it doesn’t need to be as powerful as possible in order to accomplish its limited-scope goals.

  • If the users can select a valuable and pivotal task, identifying an adequately safe way of accomplishing this task might be simpler than identifying all of human value.

This page is about open problems in Task AGI safety that we think might be ready for further technical research.

Introduction: The safe Task AGI problem

A safe Task AGI or safe Genie is an agent that you can safely ask to paint all the cars on Earth pink.

Just paint all cars pink.

Not tile the whole future light cone with tiny pink-painted cars. Not paint everything pink so as to be sure of getting everything that might possibly be a car. Not paint cars white because white looks pink under the right color of pink light and white paint is cheaper. Not paint cars pink by building nanotechnology that goes on self-replicating after all the cars have been painted.

The Task AGI superproblem is to formulate a design and training program for a real-world AGI that we can trust to just paint the damn cars pink.

To go into this at some greater depth, to build a safe Task AGI:

• You need to be able to identify the goal itself, to the AGI, such that the AGI is then oriented on achieving that goal. If you put a picture of a pink-painted car in front of a webcam and say “do this”, all the AI has is the sensory pixel-field from the webcam. Should it be trying to achieve more pink pixels in future webcam sensory data? Should it be trying to make the programmer show it more pictures? Should it be trying to make people take pictures of cars? Assuming you can in fact identify the goal that singles out the futures to achieve, is the rest of the AI hooked up in such a way as to optimize that concept?

• You need to somehow handle the just part of the just paint the cars pink. This includes not tiling the whole future light cone with tiny pink-painted cars. It includes not building another AI which paints the cars pink and then tiles the light cone with pink cars. It includes not painting everything in the world pink so as to be sure of getting everything that might count as a car. If you’re trying to make the AI have “low impact” (intuitively, prefer plans that result in fewer changes to other quantities), then “low impact” must not include freezing everything within reach to minimize how much it changes, or making subtle changes to people’s brains so that nobody notices their cars have been painted pink.

• The AI needs to not shoot people who are standing between the painter and the car, and not accidentally run them over, and not use poisonous paint even if the poisonous paint is cheaper.

• The AI should have an ‘abort’ button which gets it to safely stop doing what it’s currently doing. This means that if the AI was in the middle of building nanomachines, the nanomachines need to also switch off when the abort button is pressed, rather than the AI itself just shutting off and leaving the nanomachines to do whatever. Assuming we have a safe measure of “low impact”, we could define an “abortable” plan as one which can, at any time, be converted relatively quickly to one that has low impact.

• The AI should not want to self-improve or control further resources beyond what is necessary to paint the cars pink, and should query the user before trying to develop any new technology or assimilate any new resources it does need to paint cars pink.

This is only a preliminary list of some of the requirements and use-cases for a Task AGI, but it gives some of the flavor of the problem.

Further work on some facet of the open subproblems below might proceed by:

  1. Trying to explore examples of the subproblem and potential solutions within some contemporary machine learning paradigm.

  2. Building a toy model of some facet of the subproblem, and hopefully observing some non-obvious fact that was not predicted in advance by existing researchers skilled in the art.

  3. Doing mathematical analysis of an unbounded agent encountering or solving some facet of a subproblem, where the setup is sufficiently precise that claims about the consequences of the premise can be checked and criticized.


Conservative concepts

A conservative concept boundary is a boundary which is (a) relatively simple and (b) classifies as few things as possible as positive instances of the category.

If we see that 3, 5, 13, and 19 are positive instances of a category and 4, 14, and 28 are negative instances, then a simple boundary which separates these instances is “All odd numbers.” A simple and conservative boundary is “All odd numbers between 3 and 19” or “All primes between 3 and 19”. (A non-simple boundary is “Only 3, 5, 13, and 19 are members of the category.”)
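As a toy illustration of clause (b), here is a minimal sketch (the candidate list and domain are invented for this example, and clause (a), simplicity, is handled only by hand-picking simple candidates): among boundaries consistent with the labeled instances above, prefer the one that classifies the fewest things as positive.

```python
# Toy sketch of a conservative concept boundary. Simplicity is assumed to be
# handled by restricting attention to a few hand-picked simple candidates;
# conservatism then means minimizing positive classifications over the domain.

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

candidates = {
    "all odd numbers": lambda n: n % 2 == 1,
    "odd numbers between 3 and 19": lambda n: n % 2 == 1 and 3 <= n <= 19,
    "primes between 3 and 19": lambda n: is_prime(n) and 3 <= n <= 19,
}

positives = [3, 5, 13, 19]
negatives = [4, 14, 28]
domain = range(1, 31)  # the universe of instances we might encounter (assumed)

def consistent(test):
    return all(test(p) for p in positives) and not any(test(n) for n in negatives)

# Conservatism: among consistent boundaries, minimize how many domain elements
# are classified as positive instances of the category.
best = min(
    (name for name, test in candidates.items() if consistent(test)),
    key=lambda name: sum(candidates[name](n) for n in domain),
)
print(best)  # -> primes between 3 and 19
```

All three candidates fit the training labels, but “primes between 3 and 19” classifies only seven domain elements as positive, so it wins under this conservatism criterion.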

E.g., if we imagine presenting an AI with smiling faces as instances of a goal concept to be learned, then a conservative concept boundary might lead the future AI to pursue only smiles attached to human heads, rather than tiny molecular smileyfaces (not that this necessarily solves everything).

If we imagine presenting the AI with 20 positive instances of a burrito, then a conservative boundary might lead the AI to produce a 21st burrito very similar to those, rather than, e.g., needing to explicitly present the AGI with a poisonous burrito that’s labeled negative somewhere in the training data in order to force the simplest boundary around the goal concept to be one that excludes poisonous burritos.

Conservative planning is a related problem in which the AI tries to create plans that are similar to previously whitelisted plans or to previous causal events that occur in the environment. A conservatively planning AI, shown burritos, would try to create burritos via cooking rather than via nanotechnology, if the nanotechnology part wasn’t especially necessary to accomplish the goal.

Detecting and flagging non-conservative goal instances or non-conservative steps of a plan for user querying is a related approach.

(Main article.)

Safe impact measure

A low-impact agent is one that’s intended to avoid large bad impacts at least in part by trying to avoid all large impacts as such.

Suppose we ask an agent to fill up a cauldron, and it fills the cauldron using a self-replicating robot that goes on to flood many other inhabited areas. We could try to get the agent not to do this by letting it know that flooding inhabited areas is bad. An alternative approach is trying to have an agent that avoids needlessly large impacts in general—there’s a way to fill the cauldron that has a smaller impact, a smaller footprint, so hopefully the agent does that instead.

The hopeful notion is that while “bad impact” is a highly value-laden category with a lot of complexity and detail, the notion of “big impact” will prove to be simpler and to be more easily identifiable. Then by having the agent avoid all big impacts, or check all big impacts with the user, we can avoid bad big impacts in passing.

Possible gotchas and complications with this idea include, e.g., you wouldn’t want the agent to freeze the universe into stasis to minimize impact, or try to edit people’s brains to avoid them noticing the effects of its actions, or carry out offsetting actions that cancel out the good effects of whatever the users were trying to do.
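One naive formalization, sketched purely for concreteness, scores each plan as its task utility minus a scaled impact penalty. The hard open part is defining the impact measure itself; the numbers below simply assume one exists.

```python
# Hypothetical sketch of an impact-penalized objective. "impact" here stands in
# for whatever safe impact measure we would actually want; defining that
# measure well is the open problem, not this arithmetic. All values invented.

LAMBDA = 10.0  # how strongly to penalize impact (assumed tuning parameter)

plans = {
    "carry water by bucket":        {"task_utility": 1.0, "impact": 0.05},
    "self-replicating water robot": {"task_utility": 1.0, "impact": 50.0},
    "do nothing":                   {"task_utility": 0.0, "impact": 0.0},
}

def score(plan):
    return plan["task_utility"] - LAMBDA * plan["impact"]

best = max(plans, key=lambda name: score(plans[name]))
print(best)  # -> carry water by bucket
```

Note the gotcha lurking even in this toy version: with the penalty weight set high enough, “do nothing” outscores every useful plan, a crude analogue of the freeze-everything pathology above.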

Two refinements of the low-impact problem are a shutdown utility function and abortable plans.

(Main article.)

Identifying ambiguous inductions

An ‘inductive ambiguity’ is when there’s more than one simple concept that fits the data, even if some of those concepts are much simpler than others, and you want to figure out which simple concept was intended.

Suppose you’re given images that show camouflaged enemy tanks and empty forests, but it so happens that the tank-containing pictures were taken on sunny days and the forest pictures were taken on cloudy days. Given the training data, the key concept the user intended might be “camouflaged tanks”, or “sunny days”, or “pixel fields with brighter illumination levels”.

The last concept is by far the simplest, but rather than just assume the simplest explanation is correct (has most of the probability mass), we want the algorithm (or AGI) to detect that there’s more than one simple-ish boundary that might separate the data, and check with the user about which boundary was intended to be learned.
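A minimal sketch of that check, with features, hypotheses, and data all invented for illustration: keep every candidate concept that fits the training set, and flag any new instance on which the survivors disagree.

```python
# Toy ambiguity detector. Instead of keeping only the single simplest
# hypothesis, keep every hypothesis that fits the training data, and flag any
# new input on which the surviving hypotheses disagree for user review.

# Invented inputs: (has_tank, sunny, mean_brightness)
train = [
    ((True,  True,  0.9), True),   # tank photos happened to be taken on sunny days
    ((True,  True,  0.8), True),
    ((False, False, 0.3), False),  # empty-forest photos taken on cloudy days
    ((False, False, 0.2), False),
]

hypotheses = {
    "contains camouflaged tank": lambda x: x[0],
    "taken on a sunny day":      lambda x: x[1],
    "brightness above 0.5":      lambda x: x[2] > 0.5,
}

def fits_training(h):
    return all(h(x) == label for x, label in train)

viable = {name: h for name, h in hypotheses.items() if fits_training(h)}

def classify(x):
    votes = {name: h(x) for name, h in viable.items()}
    if len(set(votes.values())) > 1:
        return ("ambiguous", votes)  # viable hypotheses disagree: ask the user
    return ("confident", votes)

# A tank photographed on a dim, cloudy day splits the viable hypotheses:
status, votes = classify((True, False, 0.3))
print(status)  # -> ambiguous
```

All three hypotheses survive training, so an off-distribution input that separates them gets flagged for the user rather than silently resolved by a simplicity prior.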

(Main article.)

Mild optimization

“Mild optimization” or “soft optimization” is when, if you ask the Task AGI to paint one car pink, it just paints one car pink and then stops, rather than tiling the galaxies with pink-painted cars, because it’s not optimizing that hard.

This is related to, but distinct from, notions like “low impact”. E.g., a low-impact AGI might try to paint one car pink while minimizing its other footprint or how many other things changed, but it would be trying as hard as possible to minimize that impact and drive it down as close to zero as possible, which might come with its own set of pathologies. What we want instead is for the AGI to try to paint one car pink while minimizing its footprint, and then, when that’s being done pretty well, say “Okay, done” and stop.

This is distinct from satisficing expected utility because, e.g., rewriting yourself as an expected utility maximizer might also satisfice expected utility—there’s no upper limit on how hard a satisficer approves of optimizing, so a satisficer is not reflectively stable.

The open problem with mild optimization is to describe mild optimization that (a) captures what we mean by “not trying so hard as to seek out every single loophole in a definition of low impact” and (b) is reflectively stable and doesn’t approve, e.g., the construction of environmental subagents that optimize harder.
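One proposal in this direction is quantilization: sample from the top q-fraction of actions ranked by utility, rather than taking the argmax. The sketch below (toy action set and utilities invented here) shows the mechanical difference from maximization; it is not claimed to resolve the reflective-stability clause (b).

```python
import random

def quantilize(actions, utility, q, rng):
    """Sample uniformly from the top q-fraction of actions by utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

# 100 candidate plans, where plan 99 is the extreme "loophole" plan that a
# hard maximizer would always select. Utilities are invented for illustration.
actions = list(range(100))
utility = lambda a: a

maximizer_pick = max(actions, key=utility)  # always picks the extreme plan, 99
mild_pick = quantilize(actions, utility, q=0.1, rng=random.Random(0))
# mild_pick is one of the top ten plans, each with probability 1/10, so the
# single most extreme plan is no longer selected with certainty.
```

The maximizer deterministically lands on the most extreme plan; the quantilizer spreads its choice over a band of good-enough plans, which is one way of cashing out “not optimizing that hard”.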

Look where I’m pointing, not at my finger

Suppose we’re trying to give a Task AGI the task, “Give me a strawberry”. User1 wants to identify their intended category of strawberries by waving some strawberries and some non-strawberries in front of the AI’s webcam, and User2 in the control room will press a button to indicate which of these objects are strawberries. Later, after the training phase, the AI itself will be responsible for selecting objects that might be potential strawberries, and User2 will go on pressing the button to give feedback on these.

strawberry diagram

The “look where I’m pointing, not at my finger” problem is getting the AI to focus on the strawberries rather than on User2: the concepts “strawberries” and “events that make User2 press the button” are very different goals even though they’ll both well-classify the training cases; an AI might pursue the latter goal by psychologically analyzing User2 and figuring out how to get them to press the button using non-strawberry methods.

One way of pursuing this might be to try to zero in on particular nodes inside the huge causal lattice that ultimately produces the AI’s sensory data, and try to force the goal concept to be about a simple or direct relation between the “potential strawberry” node (the objects waved in front of the webcam) and the observed button values, without this relation being allowed to go through the User2 node.
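A highly simplified sketch of that idea, with the graph and node names invented for this example: model the training setup as a directed causal graph, take the causal ancestors of the button signal as candidate goal nodes, and exclude the User2 node, so that the goal concept attaches to the strawberry node rather than to the user’s psychology.

```python
# Invented causal graph of the training setup: edges point from cause to effect.
edges = {
    "strawberry":   ["webcam_image", "User2"],  # the object affects both camera and user
    "User2":        ["button"],
    "webcam_image": ["AI_senses"],
    "button":       ["AI_senses"],
}

def ancestors(node, graph):
    """All nodes with a directed path to `node`."""
    direct = {src for src, dsts in graph.items() if node in dsts}
    result = set(direct)
    for src in direct:
        result |= ancestors(src, graph)
    return result

# Don't let the goal concept be about the user's psychology: exclude User2
# from the candidate goal nodes upstream of the observed button values.
excluded = {"User2"}
goal_candidates = ancestors("button", edges) - excluded
print(goal_candidates)  # -> {'strawberry'}
```

In this toy graph, excluding User2 leaves the strawberry node as the only candidate goal concept; the real problem, of course, involves enormous learned causal models rather than a four-node dictionary.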

strawberry diagram

See also the related problem of “Identifying causal goal concepts from sensory data”.

More open problems

This page is a work in progress. A longer list of Task AGI open subproblems:

(…more, this is a page in progress)


  • Task-directed AGI

An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.