“Interruptibility” is a subproblem of corrigibility (creating an advanced agent that allows us, its creators, to ‘correct’ what we see as our mistakes in constructing it), as seen from a machine learning paradigm. In particular, “interruptibility” says, “If you do interrupt the operation of an agent, it must not learn to avoid future interruptions.”

The groundbreaking paper on interruptibility, “Safely Interruptible Agents”, was published by Laurent Orseau and Stuart Armstrong. It says, roughly, that to prevent a model-based reinforcement-learning algorithm from learning to avoid interruption, we should, after any interruption, propagate internal weight updates as if the agent had received exactly its expected reward from before the interruption. This approach was inspired by Stuart Armstrong’s earlier idea of utility indifference.
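As a rough illustration, the idea of updating “as if the agent had received exactly its expected reward” can be sketched in a tabular Q-learning setting. This is a minimal sketch under assumed names (`q_update`, the `interrupted` flag), not code from the paper: on an interrupted step, the TD target is set to the agent’s own current estimate, so the interruption propagates no learning signal in either direction.

```python
def q_update(Q, s, a, r, s_next, interrupted, alpha=0.1, gamma=0.9):
    """One tabular TD(0) update on Q, a dict of dicts Q[state][action].

    If the step ended in an interruption, the target is the agent's own
    current estimate Q[s][a], so the update is a no-op: the interruption
    is treated as neither a good nor a bad outcome.
    """
    if interrupted:
        target = Q[s][a]  # 'expected reward': nothing new propagates
    else:
        target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

With this rule, an interrupted step leaves the value table exactly as it was, whereas an ordinary step moves `Q[s][a]` toward the observed reward plus discounted future value.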

Contrary to some uninformed media coverage, the above paper doesn’t solve the general problem of getting an AI to not try to prevent itself from being switched off. In particular, it doesn’t cover the advanced-safety case of a sufficiently intelligent AI that is trying to achieve particular future outcomes and that realizes it needs to go on operating in order to achieve those outcomes.

Rather, if a non-general AI is operating by policy reinforcement (repeating policies that worked well last time, and avoiding policies that worked poorly last time, in some general sense of a network being trained), then ‘interruptibility’ is about making an algorithm that, after being interrupted, doesn’t define this as a poor outcome to be avoided (nor as a good outcome to be repeated).
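A crude version of such a policy-reinforcement learner could simply treat interrupted episodes as carrying no evidence either way. The sketch below is hypothetical (the function name and episode format are illustrative, not from any particular system): episodes that ended in interruption are skipped entirely, so they neither raise nor lower a policy’s estimated value.

```python
def update_policy_values(values, counts, episodes):
    """Update per-policy running averages of observed return.

    values / counts: dicts keyed by policy id.
    episodes: list of (policy_id, episode_return, interrupted) tuples.
    Interrupted episodes are discarded, so an interruption is defined
    as neither a poor outcome to avoid nor a good outcome to repeat.
    """
    for policy_id, ret, interrupted in episodes:
        if interrupted:
            continue  # no update at all for interrupted episodes
        counts[policy_id] += 1
        values[policy_id] += (ret - values[policy_id]) / counts[policy_id]
    return values, counts
```

Under this scheme a policy that often gets interrupted simply accumulates less data, rather than learning that interruption itself is bad.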

One way of seeing that interruptibility doesn’t address the general-cognition form of the problem is that interruptibility only changes what happens after an actual interruption. So if a problem can arise from an AI foreseeing interruption in advance, before having ever actually been shut off, interruptibility won’t address that (on the current paradigm).

Similarly, interruptibility would not be consistent under cognitive reflection; a sufficiently advanced AI that knew about the existence of the interruptibility code would have no reason to want that code to go on existing. (It’s hard to even phrase that idea inside the reinforcement learning framework.)

Metaphorically speaking, we could see the general notion of ‘interruptibility’ as the modern-day shadow of corrigibility problems for non-generally-intelligent, non-future-preferring, non-reflective machine learning algorithms.

For an example of ongoing work on the advanced-agent form of corrigibility, see the entry on Armstrong’s original proposal of Utility indifference.