Advanced safety

A proposal meant to produce value-aligned agents is ‘advanced-safe’ if it succeeds, or fails safely, in scenarios where the AI becomes much smarter than its human developers.


A proposal for a value-alignment methodology, or some aspect of that methodology, is termed ‘advanced-safe’ if it is claimed to be robust to scenarios where the agent:

  • Knows more or has better probability estimates than us

  • Learns new facts unknown to us

  • Searches a larger strategy space than we can consider

  • Confronts new instrumental problems we didn’t foresee in detail

  • Gains power quickly

  • Has access to greater levels of cognitive power than in the regime where it was previously tested

  • Wields strategies that wouldn’t make sense to us even if we were told about them in advance


It seems reasonable to expect that the difficulties of dealing with minds smarter than our own, doing things we didn’t imagine, will be qualitatively different from designing a toaster oven not to burn down a house, or from designing an AI system that is dumber than human. This means the concept of ‘advanced safety’ will end up importantly different from the concept of robust pre-advanced AI.

Concretely, it has been argued that several foreseeable difficulties, including programmer deception and unforeseen maximums, won’t materialize before an agent is advanced, or won’t materialize in the same way, or won’t materialize as severely. This means that practice with dumber-than-human AIs may not train us against these difficulties, requiring a separate theory and mental discipline for making advanced AIs safe.

We have observed in practice that many proposals for ‘AI safety’ do not seem to have been thought through against advanced-agent scenarios; thus, there seems to be a practical urgency to emphasizing the concept and the difference.

Key problems of advanced safety that are new or qualitatively different compared to pre-advanced AI safety include:

Non-advanced-safe methodologies may conceivably be useful if a known-algorithm, non-recursive agent can be created that is (a) powerful enough to be relevant and (b) known not to become advanced. Even here there may be grounds for worry that such an agent will find unexpectedly strong strategies in some particular subdomain, exhibiting flashes of domain-specific advancement that break a non-advanced-safe methodology.


As an extreme case, an ‘omni-safe’ methodology allegedly remains value-aligned, or fails safely, even if the agent suddenly becomes omniscient and omnipotent (acquires delta probability distributions on all facts of interest and has all describable outcomes available as direct options). See: real-world agents should be omni-safe.


  • Methodology of unbounded analysis

    What we do and don’t understand how to do, using unlimited computing power, is a critical distinction and important frontier.

  • AI safety mindset

    Asking how AI designs could go wrong, instead of imagining them going right.

  • Optimization daemons

    When you optimize something so hard that it crystallizes into an optimizer, like the way natural selection optimized apes so hard that they turned into human-level intelligences.

  • Nearest unblocked strategy

    If you patch an agent’s preference framework to avoid an undesirable solution, what can you expect to happen?

  • Safe but useless

    Sometimes, at the end of locking down your AI so that it seems extremely safe, you’ll end up with an AI that can’t be used to do anything interesting.

  • Distinguish which advanced-agent properties lead to the foreseeable difficulty

    Say what kind of AI, or threshold level of intelligence, or key type of advancement, first produces the difficulty or challenge you’re talking about.

  • Goodness estimate biaser

    Some of the main problems in AI alignment can be seen as scenarios where actual goodness is likely to be systematically lower than a broken way of estimating goodness.

  • Goodhart's Curse

    The Optimizer’s Curse meets Goodhart’s Law. For example, if our values are V, and an AI’s utility function U is a proxy for V, then optimizing for high U systematically seeks out ‘errors’: points where U − V is high.

  • Context disaster

    Some possible designs cause your AI to behave nicely while developing, and behave a lot less nicely when it’s smarter.

  • Methodology of foreseeable difficulties

    Building a nice AI is likely to be hard enough, and contain enough gotchas that won’t show up in the AI’s early days, that we need to foresee problems coming in advance.

  • Actual effectiveness

    If you want the AI’s so-called ‘utility function’ to actually be steering the AI, you need to think about how it meshes with the AI’s beliefs and how it actually gets expressed in the AI’s actions.


  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.
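
The Goodhart’s Curse entry above (optimizing a proxy U for values V selects for high U − V) admits a direct numerical illustration. The following toy Monte Carlo sketch is hypothetical code, not from the source: true values V are drawn from a standard normal, the proxy U adds independent noise, and we measure the average U − V of the option chosen by maximizing U. The function name `goodhart_gap` and all parameters are made up for illustration.

```python
import random

random.seed(0)

def goodhart_gap(n_options, noise_sd, trials=2000):
    """Average (U - V) of the option chosen by maximizing the proxy U."""
    total = 0.0
    for _ in range(trials):
        # True values V of each option, and a noisy proxy U = V + noise.
        values = [random.gauss(0.0, 1.0) for _ in range(n_options)]
        proxies = [v + random.gauss(0.0, noise_sd) for v in values]
        # Optimize the proxy: pick the option with the highest U.
        best = max(range(n_options), key=proxies.__getitem__)
        # Record how much the winner's proxy overstates its true value.
        total += proxies[best] - values[best]
    return total / trials

# The harder we optimize (the more options searched), the larger the
# systematic overestimate U - V of the selected option becomes.
for n in (1, 10, 100, 1000):
    print(n, round(goodhart_gap(n, noise_sd=1.0), 2))
```

With a single option there is no selection pressure and the gap averages near zero; as the search widens, maximizing U increasingly selects options whose noise happened to be large and positive, which is the Optimizer’s Curse half of the phenomenon.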