Ambitious vs. narrow value learning

Sup­pose I’m try­ing to build an AI sys­tem that “learns what I want” and helps me get it. I think that peo­ple some­times use differ­ent in­ter­pre­ta­tions of this goal. At two ex­tremes of a spec­trum of pos­si­ble in­ter­pre­ta­tions:

  • The AI learns my preferences over (very) long-term outcomes. If I were to die tomorrow, it could continue pursuing my goals without me; if humanity were to disappear tomorrow, it could rebuild the kind of civilization we would want; etc. The AI might pursue radically different subgoals than I would on the scale of months and years, if it thinks that those subgoals better achieve what I really want.

  • The AI learns the nar­rower sub­goals and in­stru­men­tal val­ues I am pur­su­ing. It learns that I am try­ing to sched­ule an ap­point­ment for Tues­day and that I want to avoid in­con­ve­nienc­ing any­one, or that I am try­ing to fix a par­tic­u­lar bug with­out in­tro­duc­ing new prob­lems, etc. It does not make any effort to pur­sue wildly differ­ent short-term goals than I would in or­der to bet­ter re­al­ize my long-term val­ues, though it may help me cor­rect some er­rors that I would be able to rec­og­nize as such.

I think that many re­searchers in­ter­ested in AI safety per se mostly think about the former. I think that re­searchers with a more prac­ti­cal ori­en­ta­tion mostly think about the lat­ter.

The am­bi­tious approach

The maximally ambitious approach has a natural theoretical appeal, but it also seems quite hard. It requires understanding human preferences in domains where humans are typically very uncertain, and where our answers to simple questions are often inconsistent, like how we should balance our own welfare with the welfare of others, or which activities we really want to pursue vs. merely enjoy in the moment. (It seems unlikely to me that there is a unified notion of “what I want” in many of these cases.) It also requires extrapolation to radically unfamiliar domains, where we will need to make decisions about issues like population ethics, what kinds of creatures we care about, and unforeseen new technologies.

I have writ­ten about this prob­lem, point­ing out that it is un­clear how you would solve it even with an un­limited amount of com­put­ing power. My im­pres­sion is that most prac­ti­tion­ers don’t think of this prob­lem even as a long-term re­search goal — it’s a qual­i­ta­tively differ­ent pro­ject with­out di­rect rele­vance to the kinds of prob­lems they want to solve.

The nar­row approach

The narrow approach looks relatively tractable and well-motivated by existing problems. We want to build machines that help us do the things we want to do, and to that end they need to be able to understand what we are trying to do and what instrumental values guide our behavior. To the extent that our “preferences” are underdetermined or inconsistent, we are happy if our systems at least do as well as a human, and make the kinds of improvements that humans would reliably consider improvements.

But it’s not clear that anything short of the maximally ambitious approach can solve the problem we ultimately care about. A sufficiently clever machine will be able to make long-term plans that are significantly better than human plans. In the long run, we will want to be able to use AI abilities to make these improved plans, and to generally perform tasks in ways that humans would never think of performing them, going far beyond correcting simple errors that can be easily recognized as such.

In defense of the nar­row approach

I think that the narrow approach probably takes us much further than it at first appears. I’ve written about these arguments before; they are for the most part similar to the reasons that approval-directed agents or directly mimicking human behavior might work, but I’ll quickly summarize them again:

In­stru­men­tal goals

Humans have many clear instrumental goals like “remaining in effective control of the AI systems I deploy,” “acquiring resources and other influence in the world,” or “better understanding the world and what I want.” A value learner may be able to learn robust preferences like these and pursue those instrumental goals using all of its ingenuity. Such AIs would not necessarily be at a significant disadvantage with respect to normal competition, yet the resources they acquired would remain under meaningful human control (if that’s what their users would prefer).

This re­quires learn­ing ro­bust for­mu­la­tions of con­cepts like “mean­ingful con­trol,” but it does not re­quire mak­ing in­fer­ences about cases where hu­mans have con­flict­ing in­tu­itions, nor con­sid­er­ing cases which are rad­i­cally differ­ent from those en­coun­tered in train­ing — AI sys­tems can con­tinue to gather train­ing data and query their users even as the na­ture of hu­man-AI in­ter­ac­tions changes (if that’s what their users would pre­fer).
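
As a rough illustration of what learning a narrow subgoal and querying the user might look like, here is a minimal sketch. It assumes a Boltzmann-rational user model (more useful actions are exponentially more likely), and every name in it, including the goal labels, the utility table, `beta`, and the confidence threshold, is hypothetical and chosen only for illustration; nothing here is specified by this post.

```python
import math

def posterior_over_goals(prior, action_utilities, observed_actions, beta=2.0):
    """Bayesian update over candidate subgoals from observed user actions.

    Assumption of this sketch: P(action | goal) is a softmax (with
    temperature 1/beta) over how useful each action is for that goal.
    """
    log_post = {goal: math.log(p) for goal, p in prior.items()}
    for action in observed_actions:
        for goal, utils in action_utilities.items():
            log_z = math.log(sum(math.exp(beta * u) for u in utils.values()))
            log_post[goal] += beta * utils[action] - log_z
    # Normalize in a numerically stable way.
    top = max(log_post.values())
    weights = {goal: math.exp(lp - top) for goal, lp in log_post.items()}
    total = sum(weights.values())
    return {goal: w / total for goal, w in weights.items()}

def act_or_query(posterior, confidence_threshold=0.9):
    """Assist with the most likely subgoal only if confident; otherwise ask."""
    best_goal = max(posterior, key=posterior.get)
    if posterior[best_goal] >= confidence_threshold:
        return ("assist", best_goal)
    return ("query_user", best_goal)

# Hypothetical example: is the user scheduling for Tuesday or for Friday?
prior = {"schedule_tuesday": 0.5, "schedule_friday": 0.5}
action_utilities = {
    "schedule_tuesday": {"open_tuesday_calendar": 1.0, "open_friday_calendar": 0.0},
    "schedule_friday": {"open_tuesday_calendar": 0.0, "open_friday_calendar": 1.0},
}

after_one = posterior_over_goals(prior, action_utilities, ["open_tuesday_calendar"])
after_two = posterior_over_goals(
    prior, action_utilities, ["open_tuesday_calendar", "open_tuesday_calendar"]
)
print(act_or_query(after_one))
print(act_or_query(after_two))
```

With these made-up numbers, a single observation leaves the learner below the threshold, so it asks the user; a second consistent observation is enough for it to assist. The point of the design is only that uncertainty about a narrow subgoal can be resolved by gathering more data or asking, without any model of the user's long-term values.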

Process

Even if we can’t in­fer hu­man prefer­ences over very dis­tant ob­jects, we might be able to in­fer hu­man prefer­ences well enough to guide a pro­cess of de­liber­a­tion (real or hy­po­thet­i­cal). Us­ing the in­ferred prefer­ences of the hu­man could help elimi­nate some of the er­rors that a hu­man would tra­di­tion­ally make dur­ing de­liber­a­tion. Pre­sum­ably these er­rors run counter to a de­liber­a­tor’s short-term ob­jec­tives, if those ob­jec­tives are prop­erly un­der­stood, and this judg­ment doesn’t re­quire a di­rect un­der­stand­ing of the de­liber­a­tor’s big-pic­ture val­ues.

This kind of er­ror-cor­rec­tion could be used as a com­ple­ment to other kinds of ideal­iza­tion, like pro­vid­ing the hu­man a lot of time, al­low­ing them to con­sult a large com­mu­nity of ad­vi­sors, or al­low­ing them to use au­to­mated tools.

Such a process of error-corrected deliberation could itself be used to provide a more robust definition of values or a more forward-looking criterion of action, such as “an outcome/action is valuable to the extent that I would/did judge it valuable after extensive deliberation.”

Bootstrapping

By in­ter­act­ing with AI as­sis­tants, hu­mans can po­ten­tially form and ex­e­cute very so­phis­ti­cated plans; if so, sim­ply helping them achieve their short-term goals may be all that is needed. For some dis­cus­sion of this idea, see these three posts.

Conclusion

I think that re­searchers in­ter­ested in scal­able AI con­trol have been too quick to dis­miss “nar­row” value learn­ing as un­re­lated to their core challenge. Over­all I ex­pect that the availa­bil­ity of effec­tive nar­row value learn­ing would sig­nifi­cantly sim­plify the AI con­trol prob­lem even for su­per­in­tel­li­gent sys­tems, though at the mo­ment we don’t un­der­stand the re­la­tion­ship very well.

(Thanks to An­dreas Stuh­lmüller and Owain Evans for helpful dis­cus­sion.)