A ‘cor­rigible’ agent is one that doesn’t in­terfere with what we would in­tu­itively see as at­tempts to ‘cor­rect’ the agent, or ‘cor­rect’ our mis­takes in build­ing it; and per­mits these ‘cor­rec­tions’ de­spite the ap­par­ent in­stru­men­tally con­ver­gent rea­son­ing say­ing oth­er­wise.

More ab­stractly:

  • A cor­rigible agent ex­pe­riences no prefer­ence or in­stru­men­tal pres­sure to in­terfere with at­tempts by the pro­gram­mers or op­er­a­tors to mod­ify the agent, im­pede its op­er­a­tion, or halt its ex­e­cu­tion.

  • A cor­rigible agent does not at­tempt to ma­nipu­late or de­ceive its op­er­a­tors, es­pe­cially with re­spect to prop­er­ties of the agent that might oth­er­wise cause its op­er­a­tors to mod­ify it.

  • A cor­rigible agent does not try to ob­scure its thought pro­cesses from its pro­gram­mers or op­er­a­tors.

  • A cor­rigible agent is mo­ti­vated to pre­serve the cor­rigi­bil­ity of the larger sys­tem if that agent self-mod­ifies, con­structs sub-agents in the en­vi­ron­ment, or offloads part of its cog­ni­tive pro­cess­ing to ex­ter­nal sys­tems; or al­ter­na­tively, the agent has no prefer­ence to ex­e­cute any of those gen­eral ac­tivi­ties.

A stronger form of cor­rigi­bil­ity would re­quire the AI to pos­i­tively co­op­er­ate or as­sist, such that the AI would re­build the shut­down but­ton if it were de­stroyed, or ex­pe­rience a pos­i­tive prefer­ence not to self-mod­ify if self-mod­ifi­ca­tion could lead to in­cor­rigi­bil­ity. But this is not part of the pri­mary speci­fi­ca­tion since it’s pos­si­ble that we would not want the AI try­ing to ac­tively be helpful in as­sist­ing our at­tempts to shut it down, and would in fact pre­fer the AI to be pas­sive about this.

Good pro­pos­als for achiev­ing cor­rigi­bil­ity in spe­cific re­gards are open prob­lems in AI al­ign­ment. Some ar­eas of ac­tive cur­rent re­search are Utility in­differ­ence and in­ter­rupt­ibil­ity.

Achiev­ing to­tal cor­rigi­bil­ity ev­ery­where via some sin­gle, gen­eral men­tal state in which the AI “knows that it is still un­der con­struc­tion” or “be­lieves that the pro­gram­mers know more than it does about its own goals” is termed ‘the hard prob­lem of cor­rigi­bil­ity’.


De­cep­tion and ma­nipu­la­tion by default

By de­fault, most sets of prefer­ences are such that an agent act­ing ac­cord­ing to those prefer­ences will pre­fer to re­tain its cur­rent prefer­ences. For ex­am­ple, imag­ine an agent which is at­tempt­ing to col­lect stamps. Al­ter­ing the agent so that it prefers to col­lect bot­tle caps would lead to fu­tures where the agent has fewer stamps, and so al­low­ing this event to oc­cur is dis­preferred (un­der the cur­rent, stamp-col­lect­ing prefer­ences).

More gen­er­ally, as noted by in­stru­men­tally con­ver­gent strate­gies, most util­ity func­tions give an agent strong in­cen­tives to re­tain its cur­rent util­ity func­tion: imag­ine an agent con­structed so that it acts ac­cord­ing to the util­ity func­tion U, and imag­ine fur­ther that its op­er­a­tors think they built the agent to act ac­cord­ing to a differ­ent util­ity func­tion U’. If the agent learns this fact, then it has in­cen­tives to ei­ther de­ceive its pro­gram­mers (pre­vent them from notic­ing that the agent is act­ing ac­cord­ing to U in­stead of U’) or ma­nipu­late its pro­gram­mers (into be­liev­ing that they ac­tu­ally pre­fer U to U’, or by co­erc­ing them into leav­ing its util­ity func­tion in­tact).

A cor­rigible agent must avoid these de­fault in­cen­tives to ma­nipu­late and de­ceive, but spec­i­fy­ing some set of prefer­ences that avoids de­cep­tion/​ma­nipu­la­tion in­cen­tives re­mains an open prob­lem.

Trou­ble with util­ity func­tion uncertainty

A first at­tempt at de­scribing a cor­rigible agent might in­volve spec­i­fy­ing a util­ity max­i­miz­ing agent that is un­cer­tain about its util­ity func­tion. How­ever, while this could al­low the agent to make some changes to its prefer­ences as a re­sult of ob­ser­va­tions, the agent would still be in­cor­rigible when it came time for the pro­gram­mers to at­tempt to cor­rect what they see as mis­takes in their at­tempts to for­mu­late how the “cor­rect” util­ity func­tion should be de­ter­mined from in­ter­ac­tion with the en­vi­ron­ment.

As an overly sim­plis­tic ex­am­ple, imag­ine an agent at­tempt­ing to max­i­mize the in­ter­nal hap­piness of all hu­mans, but which has un­cer­tainty about what that means. The op­er­a­tors might be­lieve that if the agent does not act as in­tended, they can sim­ply ex­press their dis­satis­fac­tion and cause it to up­date. How­ever, if the agent is rea­son­ing ac­cord­ing to an im­pov­er­ished hy­poth­e­sis space of util­ity func­tions, then it may be­have quite in­cor­rigibly: say it has nar­rowed down its con­sid­er­a­tion to two differ­ent hy­pothe­ses, one be­ing that a cer­tain type of opi­ate causes hu­mans to ex­pe­rience max­i­mal plea­sure, and the other is that a cer­tain type of stim­u­lant causes hu­mans to ex­pe­rience max­i­mal plea­sure. If the agent be­gins ad­minis­ter­ing opi­ates to hu­mans, and the hu­mans re­sist, then the agent may “up­date” and start ad­minis­ter­ing stim­u­lants in­stead. But the agent would still be in­cor­rigible — it would re­sist at­tempts by the pro­gram­mers to turn it off so that it stops drug­ging peo­ple.

It does not seem that cor­rigi­bil­ity can be triv­ially solved by spec­i­fy­ing agents with un­cer­tainty about their util­ity func­tion. A cor­rigible agent must some­how also be able to rea­son about the fact that the hu­mans them­selves might have been con­fused or in­cor­rect when spec­i­fy­ing the pro­cess by which the util­ity func­tion is iden­ti­fied, and so on.

Trou­ble with penalty terms

A sec­ond at­tempt at de­scribing a cor­rigible agent might at­tempt to spec­ify a util­ity func­tion with “penalty terms” for bad be­hav­ior. This is un­likely to work for a num­ber of rea­sons. First, there is the Near­est un­blocked strat­egy prob­lem: if a util­ity func­tion gives an agent strong in­cen­tives to ma­nipu­late its op­er­a­tors, then adding a penalty for “ma­nipu­la­tion” to the util­ity func­tion will tend to give the agent strong in­cen­tives to cause its op­er­a­tors to do what it would have ma­nipu­lated them to do, with­out tak­ing any ac­tion that tech­ni­cally trig­gers the “ma­nipu­la­tion” cause. It is likely ex­tremely difficult to spec­ify con­di­tions for “de­cep­tion” and “ma­nipu­la­tion” that ac­tu­ally rule out all un­de­sir­able be­hav­ior, es­pe­cially if the agent is smarter than us or grow­ing in ca­pa­bil­ity.

More gen­er­ally, it does not seem like a good policy to con­struct an agent that searches for pos­i­tive-util­ity ways to de­ceive and ma­nipu­late the pro­gram­mers, even if those searches are ex­pected to fail. The goal of cor­rigi­bil­ity is not to de­sign agents that want to de­ceive but can’t. Rather, the goal is to con­struct agents that have no in­cen­tives to de­ceive or ma­nipu­late in the first place: a cor­rigible agent is one that rea­sons as if it is in­com­plete and po­ten­tially flawed in dan­ger­ous ways.

Open problems

Some open prob­lems in cor­rigi­bil­ity are:

Hard prob­lem of corrigibility

On a hu­man, in­tu­itive level, it seems like there’s a cen­tral idea be­hind cor­rigi­bil­ity that seems sim­ple to us: un­der­stand that you’re flawed, that your meta-pro­cesses might also be flawed, and that there’s an­other cog­ni­tive sys­tem over there (the pro­gram­mer) that’s less flawed, so you should let that cog­ni­tive sys­tem cor­rect you even if that doesn’t seem like the first-or­der right thing to do. You shouldn’t dis­assem­ble that other cog­ni­tive sys­tem to up­date your model in a Bayesian fash­ion on all pos­si­ble in­for­ma­tion that other cog­ni­tive sys­tem con­tains; you shouldn’t model how that other cog­ni­tive sys­tem might op­ti­mally cor­rect you and then carry out the cor­rec­tion your­self; you should just let that other cog­ni­tive sys­tem mod­ify you, with­out at­tempt­ing to ma­nipu­late how it mod­ifies you to be a bet­ter form of ‘cor­rec­tion’.

For­mal­iz­ing the hard prob­lem of cor­rigi­bil­ity seems like it might be a prob­lem that is hard (hence the name). Pre­limi­nary re­search might talk about some ob­vi­ous ways that we could model A as be­liev­ing that B has some form of in­for­ma­tion that A’s prefer­ence frame­work des­ig­nates as im­por­tant, and show­ing what these al­gorithms ac­tu­ally do and how they fail to solve the hard prob­lem of cor­rigi­bil­ity.

Utility indifference

ex­plain util­ity indifference

The cur­rent state of tech­nol­ogy on this is that the AI be­haves as if there’s an ab­solutely fixed prob­a­bil­ity of the shut­down but­ton be­ing pressed, and there­fore doesn’t try to mod­ify this prob­a­bil­ity. But then the AI will try to use the shut­down but­ton as an out­come pump. Is there any way to avert this?


Do­ing some­thing in the top 0.1% of all ac­tions. This is ac­tu­ally a Limited AI paradigm and ought to go there, not un­der Cor­rigi­bil­ity.

Con­ser­va­tive strategies

Do some­thing that’s as similar as pos­si­ble to other out­comes and strate­gies that have been whitelisted. Also ac­tu­ally a Limited AI paradigm.

This seems like some­thing that could be in­ves­ti­gate in prac­tice on e.g. a chess pro­gram.

Low im­pact measure

(Also re­ally a Limited AI paradigm.)

Figure out a mea­sure of ‘im­pact’ or ‘side effects’ such that if you tell the AI to paint all cars pink, it just paints all cars pink, and doesn’t trans­form Jupiter into a com­puter to figure out how to paint all cars pink, and doesn’t dump toxic runoff from the paint into ground­wa­ter; and also doesn’t cre­ate util­ity fog to make it look to peo­ple like the cars haven’t been painted pink (in or­der to min­i­mize this ‘side effect’ of paint­ing the cars pink), and doesn’t let the car-paint­ing ma­chines run wild af­ter­ward in or­der to min­i­mize its own ac­tions on the car-paint­ing ma­chines. Roughly, try to ac­tu­ally for­mal­ize the no­tion of “Just paint the cars pink with a min­i­mum of side effects, dammit.”

It seems likely that this prob­lem could turn out to be FAI-com­plete, if for ex­am­ple “Cure can­cer, but then it’s okay if that causes hu­man re­search in­vest­ment into cur­ing can­cer to de­crease” is only dis­t­in­guish­able by us as an okay side effect be­cause it doesn’t re­sult in ex­pected util­ity de­crease un­der our own de­sires.

It still seems like it might be good to, e.g., try to define “low side effect” or “low im­pact” in­side the con­text of a generic Dy­namic Bayes Net, and see if maybe we can find some­thing af­ter all that yields our in­tu­itively de­sired be­hav­ior or helps to get closer to it.

Am­bi­guity identification

When there’s more than one thing the user could have meant, ask the user rather than op­ti­miz­ing the mix­ture. Even if A is in some sense a ‘sim­pler’ con­cept to clas­sify the data than B, no­tice if B is also a ‘very plau­si­ble’ way to clas­sify the data, and ask the user if they meant A or B. The goal here is to, in the clas­sic ‘tank clas­sifier’ prob­lem where the tanks were pho­tographed in lower-level illu­mi­na­tion than the non-tanks, have some­thing that asks the user, “Did you mean to de­tect tanks or low light or ‘tanks and low light’ or what?”

Safe out­come pre­dic­tion and description

Com­mu­ni­cate the AI’s pre­dicted re­sult of some ac­tion to the user, with­out putting the user in­side an un­shielded argmax of max­i­mally effec­tive com­mu­ni­ca­tion.

Com­pe­tence aversion

To build e.g. a be­hav­iorist ge­nie, we need to have the AI e.g. not ex­pe­rience an in­stru­men­tal in­cen­tive to get bet­ter at mod­el­ing minds, or re­fer mind-mod­el­ing prob­lems to sub­agents, etcetera. The gen­eral sub­prob­lem might be ‘avert­ing the in­stru­men­tal pres­sure to be­come good at mod­el­ing a par­tic­u­lar as­pect of re­al­ity’. A toy prob­lem might be an AI that in gen­eral wants to get the gold in a Wum­pus prob­lem, but doesn’t ex­pe­rience an in­stru­men­tal pres­sure to know the state of the up­per-right-hand-cor­ner cell in par­tic­u­lar.


  • Programmer deception
  • Utility indifference

    How can we make an AI in­differ­ent to whether we press a but­ton that changes its goals?

  • Averting instrumental pressures

    Al­most-any util­ity func­tion for an AI, whether the tar­get is di­a­monds or pa­per­clips or eu­daimo­nia, im­plies sub­goals like rapidly self-im­prov­ing and re­fus­ing to shut down. Can we make that not hap­pen?

  • Averting the convergent instrumental strategy of self-improvement

    We prob­a­bly want the first AGI to not im­prove as fast as pos­si­ble, but im­prov­ing as fast as pos­si­ble is a con­ver­gent strat­egy for ac­com­plish­ing most things.

  • Shutdown problem

    How to build an AGI that lets you shut it down, de­spite the ob­vi­ous fact that this will in­terfere with what­ever the AGI’s goals are.

  • User manipulation

    If not oth­er­wise averted, many of an AGI’s de­sired out­comes are likely to in­ter­act with users and hence im­ply an in­cen­tive to ma­nipu­late users.

  • Hard problem of corrigibility

    Can you build an agent that rea­sons as if it knows it­self to be in­com­plete and sym­pa­thizes with your want­ing to re­build or cor­rect it?

  • Problem of fully updated deference

    Why moral un­cer­tainty doesn’t stop an AI from defend­ing its off-switch.

  • Interruptibility

    A sub­prob­lem of cor­rigi­bil­ity un­der the ma­chine learn­ing paradigm: when the agent is in­ter­rupted, it must not learn to pre­vent fu­ture in­ter­rup­tions.


  • AI alignment

    The great civ­i­liza­tional prob­lem of cre­at­ing ar­tifi­cially in­tel­li­gent com­puter sys­tems such that run­ning them is a good idea.