Goodhart's Curse

Good­hart’s Curse is a ne­ol­o­gism for the com­bi­na­tion of the Op­ti­mizer’s Curse and Good­hart’s Law, par­tic­u­larly as ap­plied to the value al­ign­ment prob­lem for Ar­tifi­cial In­tel­li­gences.

Good­hart’s Curse in this form says that a pow­er­ful agent neu­trally op­ti­miz­ing a proxy mea­sure U that we hoped to al­ign with true val­ues V, will im­plic­itly seek out up­ward di­ver­gences of U from V.

In other words: pow­er­fully op­ti­miz­ing for a util­ity func­tion is strongly li­able to blow up any­thing we’d re­gard as an er­ror in defin­ing that util­ity func­tion.

Win­ner’s Curse, Op­ti­mizer’s Curse, and Good­hart’s Law

Win­ner’s Curse

The Win­ner’s Curse in auc­tion the­ory says that if mul­ti­ple bid­ders all bid their un­bi­ased es­ti­mate of an item’s value, the win­ner is likely to be some­one whose es­ti­mate con­tained an up­ward er­ror.

That is: If we have lots of bid­ders on an item, and each bid­der is in­di­vi­d­u­ally un­bi­ased on av­er­age, se­lect­ing the win­ner se­lects some­body who prob­a­bly made a mis­take this par­tic­u­lar time and over­bid. They are likely to ex­pe­rience post-auc­tion re­gret sys­tem­at­i­cally, not just oc­ca­sion­ally and ac­ci­den­tally.

For ex­am­ple, let’s say that the true value of an item is $10 to all bid­ders. Each bid­der bids the true value, $10, plus some Gaus­sian noise. Each in­di­vi­d­ual bid­der is as likely to over­bid $2 as to un­der­bid $2, so each in­di­vi­d­ual bid­der’s av­er­age ex­pected bid is $10; in­di­vi­d­u­ally, their bid is an un­bi­ased es­ti­ma­tor of the true value. But the win­ning bid­der is prob­a­bly some­body who over­bid $2, not some­body who un­der­bid $2. So if we know that Alice won the auc­tion, our re­vised guess should be that Alice made an up­ward er­ror in her bid.
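The example above can be checked with a small simulation. This is only a sketch: the number of bidders (10) and the noise standard deviation ($2) are assumptions chosen for illustration, and the function name is hypothetical.

```python
import random

def mean_winning_error(n_bidders, true_value=10.0, noise_sd=2.0,
                       n_auctions=10_000, seed=0):
    """Average of (highest bid - true value) across simulated auctions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_auctions):
        # Every bid is the true value plus zero-mean Gaussian noise,
        # so each bid taken on its own is an unbiased estimate.
        bids = [rng.gauss(true_value, noise_sd) for _ in range(n_bidders)]
        total += max(bids) - true_value
    return total / n_auctions

single_bidder_error = mean_winning_error(n_bidders=1)   # no selection at all
ten_bidder_error = mean_winning_error(n_bidders=10)     # selection on the max
print(f"1 bidder:   mean error of the 'winner' = {single_bidder_error:+.2f}")
print(f"10 bidders: mean error of the winner   = {ten_bidder_error:+.2f}")
```

With a single bidder the average error stays near zero; with ten bidders the winner's average error is clearly positive, even though no individual bidder is biased.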

Op­ti­mizer’s Curse

The Op­ti­mizer’s Curse in de­ci­sion anal­y­sis gen­er­al­izes this ob­ser­va­tion to an agent that es­ti­mates the ex­pected util­ity of ac­tions, and ex­e­cutes the ac­tion with the high­est ex­pected util­ity. Even if each util­ity es­ti­mate is lo­cally un­bi­ased, the ac­tion with seem­ingly high­est util­ity is more likely, in our pos­te­rior es­ti­mate, to have an up­ward er­ror in its ex­pected util­ity.

Worse, the Op­ti­mizer’s Curse means that ac­tions with high-var­i­ance es­ti­mates are se­lected for. Sup­pose we’re con­sid­er­ing 5 pos­si­ble ac­tions which in fact have util­ity $10 each, and our es­ti­mates of those 5 util­ities are Gaus­sian-noisy with a stan­dard de­vi­a­tion of $2. Another 5 pos­si­ble ac­tions in fact have util­ity of -$20, and our es­ti­mate of each of these 5 ac­tions is in­fluenced by un­bi­ased Gaus­sian noise with a stan­dard de­vi­a­tion of $100. We are likely to pick one of the bad five ac­tions whose enor­mously un­cer­tain value es­ti­mates hap­pened to pro­duce a huge up­ward er­ror.
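A rough simulation of this ten-action scenario, using the utilities and standard deviations stated above, illustrates how often a bad action wins. The trial count, seed, and helper names are illustrative assumptions.

```python
import random

# (true utility, standard deviation of our estimate) for each action,
# taken from the paragraph above.
ACTIONS = [(10.0, 2.0)] * 5 + [(-20.0, 100.0)] * 5

def choose_by_estimate(rng):
    """Return (true utility, estimated utility) of the action that looks best."""
    estimates = [rng.gauss(true_u, sd) for true_u, sd in ACTIONS]
    best = max(range(len(ACTIONS)), key=estimates.__getitem__)
    return ACTIONS[best][0], estimates[best]

rng = random.Random(0)
picks = [choose_by_estimate(rng) for _ in range(10_000)]

frac_bad = sum(1 for true_u, _ in picks if true_u < 0) / len(picks)
mean_true = sum(true_u for true_u, _ in picks) / len(picks)
mean_estimated = sum(est for _, est in picks) / len(picks)

print(f"fraction of trials where a bad action was chosen: {frac_bad:.2f}")
print(f"mean estimated utility of the chosen action: {mean_estimated:+.1f}")
print(f"mean true utility of the chosen action:      {mean_true:+.1f}")
```

In this sketch the chosen action usually comes from the high-variance group, so the estimated utility of the choice looks excellent while its true utility is on average negative.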

The Optimizer’s Curse grows worse as a larger policy space is implicitly searched; the more options we consider, the higher the average error in whatever policy is selected. To reason effectively about a large policy space, we need at least one of the following: a good prior over policy goodness, together with knowledge of the variance in our estimators; very precise estimates; noise that is mostly correlated across options, with little uncorrelated noise; or true best points in the policy space whose advantage is bigger than the uncertainty in our estimates.
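The dependence on search-space size can be sketched numerically. In this toy setup, which is an assumption made purely for illustration, every option has the same true value of 0 and estimates carry independent unit-variance Gaussian noise, so the estimate of the selected option is pure error.

```python
import random

def mean_selected_error(n_options, noise_sd=1.0, n_trials=500, seed=0):
    """Average estimation error of the top-ranked option among n_options.

    All options share the same true value (0), so whatever the winning
    estimate shows is entirely noise.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.gauss(0.0, noise_sd) for _ in range(n_options))
    return total / n_trials

errors_by_size = {n: mean_selected_error(n) for n in (1, 10, 100, 1000)}
for n, err in errors_by_size.items():
    print(f"{n:>5} options: mean error of the selected option = {err:+.2f}")
```

The average error of the selected option rises steadily as the number of options grows, even though the noise on each individual estimate never changes.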

The Optimizer’s Curse is not exactly the same as the Winner’s Curse, because the Optimizer’s Curse potentially applies to implicit selection over large search spaces. Perhaps we’re searching by gradient ascent rather than explicitly considering each element of an exponentially vast space of possible policies. We are still implicitly selecting over some effective search space, and this method will still seek out upward errors. If we’re imperfectly estimating the value function to get the gradient, then gradient ascent is implicitly following and amplifying any upward errors in the estimator.

The pro­posers of the Op­ti­mizer’s Curse also de­scribed a Bayesian rem­edy in which we have a prior on the ex­pected util­ities and var­i­ances and we are more skep­ti­cal of very high es­ti­mates. This how­ever as­sumes that the prior it­self is perfect, as are our es­ti­mates of var­i­ance. If the prior or var­i­ance-es­ti­mates con­tain large flaws some­where, a search over a very wide space of pos­si­bil­ities would be ex­pected to seek out and blow up any flaws in the prior or the es­ti­mates of var­i­ance.
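As a sketch of what such a remedy can look like, the code below applies the standard normal-normal posterior mean to the ten-action example given earlier. The prior (mean -5, standard deviation 15) and the assumption that each estimate's noise level is known exactly are illustrative choices, not something the original proposal specifies here; they are exactly the kind of inputs the paragraph above warns must be right.

```python
import random

ACTIONS = [(10.0, 2.0)] * 5 + [(-20.0, 100.0)] * 5  # (true utility, noise sd)
PRIOR_MEAN, PRIOR_SD = -5.0, 15.0  # assumed prior over true utilities

def posterior_mean(estimate, noise_sd):
    """Normal-normal posterior mean: shrink a noisy estimate toward the prior."""
    weight = PRIOR_SD**2 / (PRIOR_SD**2 + noise_sd**2)
    return PRIOR_MEAN + weight * (estimate - PRIOR_MEAN)

def fraction_bad_chosen(use_prior, n_trials=10_000, seed=0):
    """Fraction of trials in which a truly bad action gets the top score."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(n_trials):
        scores = []
        for true_u, sd in ACTIONS:
            est = rng.gauss(true_u, sd)
            scores.append(posterior_mean(est, sd) if use_prior else est)
        best = max(range(len(ACTIONS)), key=scores.__getitem__)
        bad += int(ACTIONS[best][0] < 0)
    return bad / n_trials

frac_bad_raw = fraction_bad_chosen(use_prior=False)
frac_bad_shrunk = fraction_bad_chosen(use_prior=True)
print(f"bad action chosen, raw estimates:    {frac_bad_raw:.2f}")
print(f"bad action chosen, shrunk estimates: {frac_bad_shrunk:.2f}")
```

Here the shrinkage works because the noise levels fed to `posterior_mean` are the true ones. If the high-variance estimates were mislabeled as precise, they would escape the shrinkage and the selection problem would return.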

Good­hart’s Law

Good­hart’s Law is named af­ter the economist Charles Good­hart. A stan­dard for­mu­la­tion is “When a mea­sure be­comes a tar­get, it ceases to be a good mea­sure.” Good­hart’s origi­nal for­mu­la­tion is “Any ob­served statis­ti­cal reg­u­lar­ity will tend to col­lapse when pres­sure is placed upon it for con­trol pur­poses.”

For ex­am­ple, sup­pose we re­quire banks to have ‘3% cap­i­tal re­serves’ as defined some par­tic­u­lar way. ‘Cap­i­tal re­serves’ mea­sured that par­tic­u­lar ex­act way will rapidly be­come a much less good in­di­ca­tor of the sta­bil­ity of a bank, as ac­coun­tants fid­dle with bal­ance sheets to make them legally cor­re­spond to the high­est pos­si­ble level of ‘cap­i­tal re­serves’.

Decades earlier, IBM paid its programmers per line of code produced. If you pay people per line of code produced, the “total lines of code produced” will correlate with real productivity even less than it did before.

Good­hart’s Curse in al­ign­ment theory

Good­hart’s Curse is a ne­ol­o­gism (by Yud­kowsky) for the crossover of the Op­ti­mizer’s Curse with Good­hart’s Law, yield­ing that neu­trally op­ti­miz­ing a proxy mea­sure U of V seeks out up­ward di­ver­gence of U from V.

Sup­pose the hu­mans have true val­ues V. We try to con­vey these val­ues to a pow­er­ful AI, via some value learn­ing method­ol­ogy that ends up giv­ing the AI a util­ity func­tion U.

Even if U is locally an unbiased estimator of V, optimizing U will seek out what we would regard as ‘errors in the definition’, places where U diverges upward from V. Optimizing for a high U may implicitly seek out regions where U - V is high; that is, places where V is lower than U. This may especially include regions of the outcome space or policy space where the value learning system was subject to great variance; that is, places where the value learning worked poorly or ran into a snag.
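One way to make this concrete, under strong simplifying assumptions that are not part of the argument above: suppose that across the options being searched V is normally distributed with mean \mu and variance \sigma_V^2, and that U = V + \varepsilon, where \varepsilon is independent zero-mean Gaussian error with variance \sigma_\varepsilon^2. Then U is an unbiased estimate of V at every option, yet conditioning on the observed value of U gives:

```latex
\begin{align*}
\mathbb{E}[V \mid U = u] &= \mu + \frac{\sigma_V^2}{\sigma_V^2 + \sigma_\varepsilon^2}\,(u - \mu), \\
\mathbb{E}[U - V \mid U = u] &= \frac{\sigma_\varepsilon^2}{\sigma_V^2 + \sigma_\varepsilon^2}\,(u - \mu).
\end{align*}
```

In this toy model the expected overshoot U - V grows linearly with how far the selected U sits above the typical value, and it grows with the share of U's variance that comes from error. Both features match the claim that strong selection on U, especially in high-variance regions, picks out upward divergences.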

Good­hart’s Curse would be ex­pected to grow worse as the AI be­came more pow­er­ful. A more pow­er­ful AI would be im­plic­itly search­ing a larger space and would have more op­por­tu­nity to un­cover what we’d re­gard as “er­rors”; it would be able to find smaller loop­holes, blow up more minor flaws. There is a po­ten­tial con­text dis­aster if new di­ver­gences are un­cov­ered as more of the pos­si­bil­ity space is searched, etcetera.

We could see the ge­nie as im­plic­itly or emer­gently seek­ing out any pos­si­ble loop­hole in the wish: Not be­cause it is an evil ge­nie that knows our ‘truly in­tended’ V and is look­ing for some place that V can be min­i­mized while ap­pear­ing to satisfy U; but just be­cause the ge­nie is neu­trally seek­ing out very large val­ues of U and these are places where it is un­usu­ally likely that U di­verged up­ward from V.

Many fore­see­able difficul­ties of AGI al­ign­ment in­ter­act with Good­hart’s Curse. Good­hart’s Curse is one of the cen­tral rea­sons we’d ex­pect ‘lit­tle tiny mis­takes’ to ‘break’ when we dump a ton of op­ti­miza­tion pres­sure on them. Hence the claim: “AI al­ign­ment is hard like build­ing a rocket is hard: enor­mous pres­sures will break things that don’t break in less ex­treme en­g­ineer­ing do­mains.”

Good­hart’s Curse and meta-util­ity functions

An ob­vi­ous next ques­tion is “Why not just define the AI such that the AI it­self re­gards U as an es­ti­mate of V, caus­ing the AI’s U to more closely al­ign with V as the AI gets a more ac­cu­rate em­piri­cal pic­ture of the world?”

Re­ply: Of course this is the ob­vi­ous thing that we’d want to do. But what if we make an er­ror in ex­actly how we define “treat U as an es­ti­mate of V”? Good­hart’s Curse will mag­nify and blow up any er­ror in this defi­ni­tion as well.

We must dis­t­in­guish:

  • V, the true value func­tion that is in our hearts.

  • T, the ex­ter­nal tar­get that we for­mally told the AI to al­ign on, where we are hop­ing that T re­ally means V.

  • U, the AI’s cur­rent es­ti­mate of T or prob­a­bil­ity dis­tri­bu­tion over pos­si­ble T.

U will con­verge to­ward T as the AI be­comes more ad­vanced. The AI’s epistemic im­prove­ments and learned ex­pe­rience will tend over time to elimi­nate a sub­class of Good­hart’s Curse where the cur­rent es­ti­mate of U-value has di­verged up­ward from T-value, cases where the un­cer­tain U-es­ti­mate was se­lected to be er­ro­neously above the cor­rect for­mal value T.

However, Goodhart’s Curse will still apply to any potential regions where T diverges upward from V, where the formal target diverges from the true value function that is in our hearts. We’d be placing immense pressure toward seeking out what we would retrospectively regard as human errors in defining the meta-rule for determining utilities. (That is, we’d retrospectively regard those as errors if we survived.)

Good­hart’s Curse and ‘moral un­cer­tainty’

“Moral uncertainty” is sometimes offered as a source of solutions in AI alignment; if the AI has a probability distribution over utility functions, it can be risk-averse about things that might be bad. Would this not be safer than having the AI be very sure about what it ought to do?

Trans­lat­ing this idea into the V-T-U story, we want to give the AI a for­mal ex­ter­nal tar­get T to which the AI does not cur­rently have full ac­cess and knowl­edge. We are then hop­ing that the AI’s un­cer­tainty about T, the AI’s es­ti­mate of the var­i­ance be­tween T and U, will warn the AI away from re­gions where from our per­spec­tive U would be a high-var­i­ance es­ti­mate of V. In other words, we’re hop­ing that es­ti­mated U-T un­cer­tainty cor­re­lates well with, and is a good proxy for, ac­tual U-V di­ver­gence.

The idea would be that T is some­thing like a su­per­vised learn­ing pro­ce­dure from la­beled ex­am­ples, and the places where the cur­rent U di­verges from V are things we ‘for­got to tell the AI’; so the AI should no­tice that in these cases it has lit­tle in­for­ma­tion about T.

Good­hart’s Curse would then seek out any flaws or loop­holes in this hoped-for cor­re­la­tion be­tween es­ti­mated U-T un­cer­tainty and real U-V di­ver­gence. Search­ing a very wide space of op­tions would be li­able to se­lect on:

  • Re­gions where the AI has made an epistemic er­ror and poorly es­ti­mated the var­i­ance be­tween U and T;

  • Re­gions where the for­mal tar­get T is solidly es­timable to the AI, but from our own per­spec­tive the di­ver­gence from T to V is high (that is, the U-T un­cer­tainty fails to perfectly cover all T-V di­ver­gences).

The sec­ond case seems es­pe­cially likely to oc­cur in fu­ture phases where the AI is smarter and has more em­piri­cal in­for­ma­tion, and has cor­rectly re­duced its un­cer­tainty about its for­mal tar­get T. So moral un­cer­tainty and risk aver­sion may not scale well to su­per­in­tel­li­gence as a means of warn­ing the AI away from re­gions where we’d ret­ro­spec­tively judge that U/​T and V had di­verged.


As a concrete illustration: You tell the AI that human values are defined relative to human brains in some particular way T. While the AI is young and stupid, the AI knows that it is very uncertain about human brains, hence uncertain about T. Human behavior is produced by human brains, so the AI can regard human behavior as informative about T; the AI is sensitive to spoken human warnings that killing the housecat is bad.

When the AI is more ad­vanced, the AI scans a hu­man brain us­ing molec­u­lar nan­otech­nol­ogy and re­solves all its moral un­cer­tainty about T. As we defined T, the op­ti­mum T turns out to be “feed hu­mans heroin be­cause that is what hu­man brains max­i­mally want”.

Now the AI already knows ev­ery­thing our for­mal defi­ni­tion of T re­quires the AI to know about the hu­man brain to get a very sharp es­ti­mate of U. So hu­man be­hav­iors like shout­ing “stop!” are no longer seen as in­for­ma­tive about T and don’t lead to up­dates in U.

T, as defined, was always mis­al­igned with V. But early on, the mis­al­ign­ment was in a re­gion where the young AI es­ti­mated high var­i­ance be­tween U and T, thus keep­ing the AI out of this low-V re­gion. Later, the AI’s em­piri­cal un­cer­tainty about T was re­duced, and this pro­tec­tive bar­rier of moral un­cer­tainty and risk aver­sion was dis­pel­led.

Un­less the AI’s moral un­cer­tainty is perfectly con­ser­va­tive and never un­der­es­ti­mates the true re­gions of U-V di­ver­gence, there will be some cases where the AI thinks it is morally sure even though from our stand­point the U-V di­ver­gence is large. Then Good­hart’s Curse would se­lect on those cases.

Could we use a very con­ser­va­tive es­ti­mate of util­ity-func­tion un­cer­tainty, or a for­mal tar­get T that is very hard for even a su­per­in­tel­li­gence to be­come cer­tain about?

We would first need to worry that if the util­ity-func­tion un­cer­tainty is un­re­solv­able, that means the AI can’t ever ob­tain em­piri­cally strong ev­i­dence about it. In this case the AI would not up­date its es­ti­mate of T from ob­serv­ing hu­man be­hav­iors, mak­ing the AI again in­sen­si­tive to hu­mans shout­ing “Stop!”

Another proposal would be to rely on risk aversion over unresolvably uncertain probabilities broad enough to contain something similar to the true V as a hypothesis, and hence engender sufficient aversion to low-true-V outcomes. Then we should worry on a pragmatic level that a sufficiently conservative amount of moral uncertainty (so conservative that U-T risk aversion never underestimated the appropriate degree of risk aversion from our V-standpoint) would end up preventing the AI from ever acting. Or that this degree of moral risk aversion would be such a pragmatic hindrance that the programmers might end up bypassing all this inconvenient aversion in some set of safe-seeming cases. Then Goodhart’s Curse would seek out any unforeseen flaws in the coded behavior of ‘safe-seeming cases’.

Con­di­tions for Good­hart’s Curse

The ex­act con­di­tions for Good­hart’s Curse ap­ply­ing be­tween V and a point es­ti­mate or prob­a­bil­ity dis­tri­bu­tion over U, have not yet been writ­ten out in a con­vinc­ing way.

For ex­am­ple, sup­pose we have a mul­ti­vari­ate nor­mal dis­tri­bu­tion in which X and Y di­men­sions are pos­i­tively cor­re­lated, only Y is ob­serv­able, and we are se­lect­ing on Y in or­der to ob­tain more X. While X will re­vert to the mean com­pared to Y, it’s not likely to be zero or nega­tive; pick­ing max­i­mum Y is our best strat­egy for ob­tain­ing max­i­mum X and will prob­a­bly ob­tain a very high X. (Ob­ser­va­tion due to Scott Garrabrant.)
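This observation can be sketched numerically. The correlation (0.5), the unit variances, and the pool of 1000 candidates per trial are assumptions chosen for illustration; the function name is hypothetical.

```python
import math
import random

def select_on_y(n_candidates, rho, rng):
    """Draw correlated (X, Y) pairs and return the pair with the largest Y."""
    best_x, best_y = None, -math.inf
    for _ in range(n_candidates):
        y = rng.gauss(0.0, 1.0)
        # X has unit variance and correlation rho with Y.
        x = rho * y + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        if y > best_y:
            best_x, best_y = x, y
    return best_x, best_y

rng = random.Random(0)
RHO = 0.5  # assumed correlation between the hidden X and the observable Y
picks = [select_on_y(1000, RHO, rng) for _ in range(300)]
mean_x = sum(x for x, _ in picks) / len(picks)
mean_y = sum(y for _, y in picks) / len(picks)

print(f"mean Y of the selected candidate: {mean_y:.2f}")
print(f"mean X of the selected candidate: {mean_x:.2f}")
```

The selected candidate's X falls well short of its Y, which is the reversion to the mean, but it remains far above the population average of zero. In this well-behaved setting, selecting on the proxy still delivers a lot of the thing we wanted.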

Consider also the case of the smile maximizer which we trained to optimize smiles as a proxy for happiness. Tiny molecular smileyfaces are very low in happiness, an apparent manifestation of Goodhart’s Curse. On the other hand, if we optimized for ‘true happiness’ among biological humans, this would produce more smiles than default. It might be only a tiny fraction of possible smiles, on the order of 1e-30, but it would be more smiles than would have existed otherwise. So the relation between V (maximized at ‘true happiness’, zero at tiny molecular smileyfaces) and U (maximized at tiny molecular smileyfaces, but also above average for true happiness) is not symmetric; and this is one hint to the unknown necessary and/or sufficient condition for Goodhart’s Curse to apply.

In the case above, we might hand­wave some­thing like, “U had lots of lo­cal peaks one of which was V, but the U of V’s peak wasn’t any­where near the high­est U-peak, and the high­est U-peak was low in V. V was more nar­row and its more unique peak was non­co­in­ci­den­tally high in U.”

Re­search avenues

Mild op­ti­miza­tion is a pro­posed av­enue for di­rect at­tack on the cen­tral difficulty of Good­hart’s Curse and all the other difficul­ties it ex­ac­er­bates. Ob­vi­ously, if our for­mu­la­tion of mild op­ti­miza­tion is not perfect, Good­hart’s Curse may well se­lect for any place where our no­tion of ‘mild op­ti­miza­tion’ turns out to have a loop­hole that al­lows a lot of op­ti­miza­tion. But in­so­far as some ver­sion of mild op­ti­miza­tion is work­ing most of the time, it could avoid blow­ing up things that would oth­er­wise blow up. See also Tasks.

Similarly, conservative strategies can be seen as a more indirect attack on some forms of Goodhart’s Curse: we try to stick to a conservative boundary drawn around previously whitelisted instances of the goal concept, or to use strategies similar to previously whitelisted strategies. This averts searching a much larger space of possibilities that would be more likely to contain errors somewhere. But Goodhart’s Curse might single out flaws in what constitutes a ‘conservative’ boundary, if our definition is less than absolutely perfect.

