Effability principle

A proposed principle of AI alignment stating, “The more insight you have into the deep structure of an AI’s cognitive operations, the more likely you are to succeed in aligning that AI.”

As an example of increased effability, consider the difference between having the idea of expected utility while building your AI, versus never having heard of expected utility. The idea of expected utility is so well-known that it may no longer seem salient as an insight, but the gap between having this idea and lacking it is large.

Staring at the expected utility principle, and at how it decomposes into a utility function and a probability distribution, leads to a potentially obvious-sounding but still rather important insight:

Rather than all behaviors and policies and goals needing to be up for grabs in order for an agent to adapt itself to a changing and unknown world, the agent can have a stable utility function and a changing probability distribution.

E.g., when the agent tries to grab the cheese and discovers that the cheese is too high, we can view this as an update to the agent’s beliefs about how to get cheese, without changing the fact that the agent wants cheese.

Similarly, if we want superhuman performance at playing chess, we can ask for an AI that has a known, stable, understandable preference to win chess games; but a probability distribution that has been refined to greater-than-human accuracy about which policies yield a greater probabilistic expectation of winning chess positions.
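The decomposition above can be sketched in a few lines of code. This is a minimal, purely illustrative toy (all names, actions, and the belief-update rule are hypothetical, not any actual agent design): the utility function over outcomes is a constant that is never modified, while the probability distribution over which action succeeds is revised by evidence.

```python
# Stable utility function over outcomes -- this dict is never modified.
UTILITY = {"cheese": 1.0, "no_cheese": 0.0}

class Agent:
    def __init__(self):
        # Changing part: prior beliefs P(success) for each candidate policy.
        self.p_success = {"jump": 0.5, "climb_shelf": 0.5}

    def expected_utility(self, action):
        p = self.p_success[action]
        return p * UTILITY["cheese"] + (1 - p) * UTILITY["no_cheese"]

    def choose(self):
        # Pick the action with the highest expected utility.
        return max(self.p_success, key=self.expected_utility)

    def observe(self, action, succeeded, strength=0.8):
        # Crude (non-Bayesian, illustrative) update: move the belief
        # about this action toward the observed outcome.
        old = self.p_success[action]
        target = 1.0 if succeeded else 0.0
        self.p_success[action] = old + strength * (target - old)

agent = Agent()
first = agent.choose()                  # tries "jump" first
agent.observe(first, succeeded=False)   # the cheese was too high
second = agent.choose()                 # beliefs changed, so the policy changes
```

After the failed grab, the agent switches from `"jump"` to `"climb_shelf"` purely because its beliefs changed; `UTILITY` — the fact that it wants cheese — stayed fixed throughout.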

Then contrast this to the state of mind where you haven’t decomposed your understanding of cognition into preference-ish parts and belief-ish parts. In this state of mind, for all you know, every aspect of the AI’s behavior, every goal it has, must potentially change in order for the AI to deal with a changing world; otherwise the AI will just be stuck executing the same behaviors over and over… right? Obviously, this notion of an AI with unchangeable preferences is just a fool’s errand; any AI like that would be too stupid to make a major difference for good or bad. (Note: the idea of instrumental convergence is also important here; e.g., scientific curiosity is already an instrumental strategy for ‘make as many paperclips as possible’, rather than an AI needing a separate terminal preference about scientific curiosity in order to ever engage in it.)

(This argument has indeed been encountered in the wild many times.)

Probability distributions and utility functions have now been known for a relatively long time and are understood relatively well; people have made many, many attempts to poke at their structure, imagine potential variations, and show what goes wrong with those variations. An enormous family of coherence theorems is now known, stating that “strategies which are not qualitatively dominated can be viewed as coherent with some consistent probability distribution and utility function.” This suggests that, in a broad sense, as a sufficiently advanced AI’s behavior is more heavily optimized for not qualitatively shooting itself in the foot, that AI will end up exhibiting some aspects of expected-utility reasoning. We have some idea of why a sufficiently advanced AI would have expected-utility-ish things going on somewhere inside it, or at least behave that way so far as we could tell by looking at the AI’s external actions.
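The classic concrete instance of a qualitatively dominated strategy is the money pump. The toy below (a hypothetical setup, not a statement of any particular coherence theorem) shows an agent with circular preferences A ≻ B ≻ C ≻ A paying a small fee for each “upgrade” around the cycle, and ending up holding the same item it started with, strictly poorer:

```python
# Circular (intransitive) preferences: (x, y) means "x is preferred to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts_trade(offered, held):
    # The incoherent agent pays a fee whenever it prefers the offered item.
    return (offered, held) in PREFERS

def money_pump(start_item, offers, fee=1.0):
    """Run a sequence of trade offers against the agent; track its losses."""
    held, money = start_item, 0.0
    for offered in offers:
        if accepts_trade(offered, held):
            held = offered
            money -= fee
    return held, money

# One lap around the preference cycle: the agent accepts every trade,
# ends up holding B again, and is 3 units poorer for it.
held, money = money_pump("B", ["A", "C", "B"])
```

An agent that simply refused all three trades would end in the same state with strictly more money, so the cycling strategy is dominated; the coherence theorems run this reasoning in reverse, showing that strategies which avoid all such dominated behavior can be described by some consistent probability distribution and utility function.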

So we can say, “Look, if you don’t explicitly write in a utility function, the AI is probably going to end up with something like a utility function anyway; you just won’t know where it is. It seems considerably wiser to know what that utility function says and write it in on purpose. Heck, even if you say you explicitly don’t want your AI to have a stable utility function, you’d need to know all the coherence theorems you’re trying to defy by saying that!”

The Effability Principle states (or rather hopes) that as we get marginally more of this general kind of insight into an AI’s operations, we become marginally more likely to be able to align the AI.

The example of expected utility arguably suggests that if there are any more ideas like that lying around which we don’t yet have, our lack of those ideas may entirely doom the AI alignment project, or at least make it far more difficult. We can in principle imagine someone who is just using a big reinforcement learner to try to execute some large pivotal act, who has no idea where the AI is keeping its consequentialist preferences or what those preferences are; and yet this person was so paranoid, put in so much monitoring with the resources they had, deployed so many tripwires and safeguards, and was so conservative in how little they tried to do, that they succeeded anyway. But it doesn’t sound like a good idea to try in real life.

The search for increased effability has generally motivated the “Agent Foundations” research agenda within MIRI. While effability is not the only aspect of AI alignment, one concern is that this kind of deep insight may be a heavily serially-loaded task, in which researchers need to develop one idea after another, in contrast to relatively shallow ideas in AI alignment that require less serial time to create. That is, this kind of research is among the most important kinds of research to start early.

The chief rival to effability is the Supervisability Principle, which, while not directly opposed to effability, tends to focus our understanding of the AI at a much larger grain size. For example, the Supervisability Principle says, “Since the AI’s behaviors are the only thing we can train by direct comparison with something we know to be already aligned, namely human behaviors, we should focus on ensuring the greatest possible fidelity at that point, rather than on any smaller pieces whose alignment cannot be directly determined and tested in the same way.” Note that both principles agree that it’s important to understand certain facts about the AI as well as possible, but they disagree about which parts of the design we should prioritize rendering maximally understandable.