Hard problem of corrigibility

The “hard problem of corrigibility” is to build an agent which, in an intuitive sense, reasons internally as if from the programmers’ external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, “I am incomplete and there is an outside force trying to complete me; my design may contain errors and there is an outside force that wants to correct them, and this is a good thing; my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I’ve done this calculation showing the expected result of the outside force correcting me, but maybe I’m mistaken about that.”

This is not a default behavior; ordinarily, by the standard problems of corrigibility, an agent that we’ve accidentally built to maximize paperclips (or smiles, etcetera) will not want to let us modify its utility function to something else. If we try to build in analogues of ‘uncertainty about the utility function’ in the most simple and obvious way, the result is an agent that sums over all this uncertainty and plunges straight ahead on the maximizing action. If we say that this uncertainty correlates with some outside physical object (intended to be the programmers), the default result in a sufficiently advanced agent is that you disassemble this object (the programmers) to learn everything about it on a molecular level, update fully on what you’ve learned according to whatever correlation that had with your utility function, and plunge on straight ahead.
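A toy sketch (not from the original text, with hypothetical action names and credences) of why the “most simple and obvious” treatment of utility-function uncertainty fails: an agent that just sums expected utility over its candidate utility functions still picks the maximizing action, and deference to correction never wins unless every candidate utility function already valued it highly.

```python
# Toy illustration: 'uncertainty about the utility function' handled by
# summing over candidates. The agent maximizes the credence-weighted sum
# and "plunges straight ahead" -- no corrigible behavior emerges.

# Hypothetical candidate utility functions with the agent's credences.
candidate_utilities = [
    (0.6, {"make_paperclips": 10.0, "allow_correction": 1.0}),
    (0.4, {"make_paperclips": 8.0, "allow_correction": 2.0}),
]

def expected_utility(action):
    """Credence-weighted sum of the action's utility over all candidates."""
    return sum(p * u[action] for p, u in candidate_utilities)

best = max(["make_paperclips", "allow_correction"], key=expected_utility)
print(best)  # -> make_paperclips: deference loses under every mixture
```

The point of the sketch is structural: nothing in this decision rule represents “my formula for utility might itself be wrong,” so no amount of reweighting the candidates produces an agent that wants to be observed and corrected.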

None of these correspond to what we would intuitively think of as being corrigible by the programmers; what we want is more like something analogous to humility or philosophical uncertainty. The way we want the AI to reason is the internal conjugate of our external perspective on the matter: maybe the formula you have for how your utility function depends on the programmers is wrong (in some hard-to-formalize sense of possible wrongness that isn’t just one more kind of uncertainty to be summed over) and the programmers need to be allowed to actually observe and correct the AI’s behavior, rather than the AI extracting all updates implied by its current formula for moral uncertainty and then ignoring the programmers.

The “hard problem of corrigibility” is interesting because of the possibility that it has a relatively simple core or central principle—rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI. It may be possible to design or train an AI that has all the corrigibility properties in one central swoop—an agent that reasons as if it were incomplete and deferring to an outside force.

“Reason as if in the internal conjugate of an outside force trying to build you, which outside force thinks it may have made design errors, but can potentially correct those errors by directly observing and acting, if not manipulated or disassembled” might be one possible candidate for a relatively simple principle like that (that is, it’s simple compared to the complexity of value). We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.

If this principle is not so simple as to be formalizable and formally sanity-checkable, the prospect of relying on a trained-in version of ‘central corrigibility’ is unnerving even if we think it might only require a manageable amount of training data. It’s difficult to imagine how you would test corrigibility thoroughly enough that you could knowingly rely on, e.g., the AI that seemed corrigible in its infrahuman phase not suddenly developing extreme or unforeseen behaviors when the same allegedly simple central principle was reconsidered at a higher level of intelligence—it seems like it should be unwise to have an AI with a ‘central’ corrigibility principle, but not lots of particular corrigibility principles like a reflectively consistent suspend button or conservative planning. But this ‘central’ tendency of corrigibility might serve as a second line of defense.