Goodness estimate biaser

A “goodness estimate biaser” is a system setup or phenomenon that seems foreseeably likely to cause the actual goodness of some AI plan to be systematically lower than the AI’s estimate of that plan’s goodness. We want the AI’s estimate to be unbiased.

Ordinary examples

Subtle and unsubtle estimate-biasing issues in machine learning are well-known and appear far short of advanced agency:

● A machine learning algorithm’s performance on the training data is not an unbiased estimate of its performance on the test data. Some of what the algorithm seems to learn may be particular to noise in the training data, and that fitted noise will not recur in the test data. So test performance is not just unequal to, but systematically lower than, training performance; if we were treating the training performance as an estimate of test performance, it would not be an unbiased estimate.
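The train/test gap can be sketched with a toy simulation (the data-generating process, noise level, and model degree here are invented for illustration): an over-flexible model fits some of the training noise, so training error systematically understates test error.

```python
import numpy as np

rng = np.random.default_rng(0)

# True relationship is y = x, observed with Gaussian noise.
def sample(n):
    x = rng.uniform(-1, 1, n)
    y = x + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = sample(20)
x_test, y_test = sample(1000)

# A flexible model (degree-9 polynomial) fits part of the training noise.
coeffs = np.polyfit(x_train, y_train, deg=9)

def mse(x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse = mse(x_train, y_train)  # flattered by fitted noise
test_mse = mse(x_test, y_test)     # systematically worse
```

Treating `train_mse` as an estimate of `test_mse` here would be a biased (over-optimistic) estimate of performance.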

● The Winner’s Curse from auction theory observes that if bidders have noise in their unbiased estimates of the auctioned item’s value, then the highest bidder, who receives the item, is more likely to have upward noise in their individually unbiased estimate, conditional on their having won. (E.g., three bidders with Gaussian noise in their value estimates submit bids on an item whose true value to them is 1.0; the winning bidder is likely to have valued the item at more than 1.0.)
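The three-bidder example can be checked directly by simulation (the noise scale of 0.2 is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_bidders, n_auctions = 3, 100_000

# Each bidder's estimate is individually unbiased:
# true value plus zero-mean Gaussian noise.
estimates = true_value + rng.normal(0, 0.2, size=(n_auctions, n_bidders))

mean_estimate = estimates.mean()             # ~1.0: unbiased overall
mean_winning = estimates.max(axis=1).mean()  # >1.0: winner's estimate biased up
```

Conditioning on winning selects for upward noise, so `mean_winning` exceeds the true value even though every individual estimate is unbiased.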

The analogous Optimizer’s Curse observes that if we make locally unbiased but noisy estimates of the subjective expected utility of several plans, then selecting the plan with ‘highest expected utility’ is likely to select an estimate with upward noise. Barring compensatory adjustments, this means that actual utility will be systematically lower than expected utility, even if all expected utility estimates are individually unbiased. Worse, if we have 10 plans whose expected utility can be unbiasedly estimated with low noise, plus 10 plans whose expected utility can be unbiasedly estimated with high noise, then selecting the plan with apparently highest expected utility favors the noisiest estimates!
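Both halves of the Optimizer’s Curse show up in a small simulation (the setup is illustrative: all 20 plans have the same true utility of 0, ten estimated with low noise and ten with high noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

# 10 plans with low-noise unbiased estimates, 10 with high-noise estimates.
# Every plan's true utility is 0.0, so any positive "apparent utility"
# of the chosen plan is pure selection bias.
low = rng.normal(0.0, 0.1, size=(n_trials, 10))
high = rng.normal(0.0, 1.0, size=(n_trials, 10))
estimates = np.concatenate([low, high], axis=1)

# Apparent utility of the argmax plan overstates its true utility (0.0)...
mean_apparent = estimates.max(axis=1).mean()

# ...and the argmax almost always lands on a high-noise estimate
# (columns 10-19).
frac_noisy = (estimates.argmax(axis=1) >= 10).mean()
```

Even with every estimate individually unbiased, the selected plan’s estimate is biased upward, and selection strongly favors the noisiest estimators.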

In AI alignment

Many of the alleged foreseeable difficulties in AI alignment can be seen as involving similar processes, which would produce systematic downward biases in what we see as actual goodness, compared to the AI’s estimate of goodness:

Edge instantiation suggests that if we take an imperfectly or incompletely learned value function, then looking for the maximum or extreme of that value function is much more likely than usual to magnify what we see as the gaps or imperfections (because of fragility of value, plus the Optimizer’s Curse); or to destroy whatever aspects of value the AI didn’t learn about (because optimizing a subset of properties is liable to set all other properties to extreme values).

We can see this as implying both “The AI’s apparent goodness in non-extreme cases is an upward-biased estimate of its goodness in extreme cases” and “If the AI learns its goodness estimator less than perfectly, the AI’s estimates of the goodness of its best plans will systematically overestimate what we see as the actual goodness.”
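A toy illustration of that implication (both value functions here are invented for this sketch): suppose actual goodness depends on two properties of a plan, but the learned estimator captured the first property correctly and picked up only a small spurious weight on the second. In a moderate region the two functions barely disagree; searching a wide plan space for the learned maximum drives the second property to an extreme and destroys actual goodness.

```python
import numpy as np

# Hypothetical value functions, for illustration only.
def actual_goodness(p0, p1):
    # We care about p0, and about p1 staying moderate.
    return p0 - p1 ** 2

def learned_goodness(p0, p1):
    # The AI learned p0 correctly but acquired a small spurious
    # positive weight on p1: an imperfection in the learned values.
    return p0 + 0.1 * p1

# For a moderate, non-extreme plan the estimate is nearly unbiased...
moderate_gap = learned_goodness(0.5, 0.2) - actual_goodness(0.5, 0.2)

# ...but optimizing the learned function over a wide plan space
# pushes p1 to its extreme, where the estimate badly overstates
# actual goodness.
grid = np.linspace(-3, 3, 121)
P0, P1 = np.meshgrid(grid, grid)
i = np.argmax(learned_goodness(P0, P1))
best_p0, best_p1 = P0.flat[i], P1.flat[i]
extreme_gap = (learned_goodness(best_p0, best_p1)
               - actual_goodness(best_p0, best_p1))
```

The tiny imperfection is nearly invisible on typical plans and dominates at the optimum, which is the edge-instantiation pattern in miniature.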

Nearest unblocked strategy generally, and especially over instrumentally convergent incorrigibility, suggests that if there are naturally-arising AI behaviors we see as bad (e.g. routing around shutdown), there may emerge a pseudo-adversarial selection of strategies that route around our attempted patches to those problems. E.g., the AI constructs an environmental subagent to continue carrying out its goals, while cheerfully obeying ‘the letter of the law’ by allowing its current hardware to be shut down. This pseudo-adversarial selection (though the AI does not have an explicit goal of thwarting us or selecting low-goodness strategies per se) again implies that actual goodness is likely to be systematically lower than the AI’s estimate of what it’s learned as ‘goodness’; again to an increasing degree as the AI becomes smarter and searches a wider policy space.

Mild optimization and conservative strategies can be seen as proposals to ‘regularize’ powerful optimization in a way that decreases the degree to which goodness in training is a biased (over)estimate of goodness in execution.
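One sketch of how softer selection reduces the bias, assuming a quantilizer-style rule in place of hard argmax (the noise model and the top-10% cutoff are illustrative assumptions, not a proposal from this article): picking uniformly among the best-looking tenth of plans conditions less strongly on upward noise than picking the single best-looking plan.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_plans = 20_000, 100
true_utility = 0.0

# Unbiased but noisy estimates of each plan's utility.
estimates = true_utility + rng.normal(0, 1.0, size=(n_trials, n_plans))

# Hard optimization: take the single highest estimate.
argmax_bias = estimates.max(axis=1).mean() - true_utility

# Milder optimization: pick uniformly among the top 10% of estimates.
top_k = n_plans // 10
top = np.partition(estimates, -top_k, axis=1)[:, -top_k:]
pick = rng.integers(0, top_k, size=n_trials)
mild_bias = top[np.arange(n_trials), pick].mean() - true_utility
```

Both selection rules are upward-biased, but the milder rule less so; the trade-off is that it also forgoes some genuinely good plans.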

