List: value-alignment subjects

Safety paradigm for advanced agents

Foreseen difficulties

  • Value identification

  • Edge instantiation

  • Unforeseen maximums

  • Ontology identification

    • Cartesian boundary

    • Human identification

  • Inductive value learning

  • Patch resistance

  • Nearest Unblocked Neighbor

  • Corrigibility

  • Anapartistic reasoning

    • Programmer deception

    • Early conservatism

    • Reasoning under confusion

  • User maximization / Unshielded argmax

    • Hypothetical user maximization

  • Genie theory

  • Limited AI

    • Weak optimization

      • Safe optimization measure (such that we are confident it has no Edge that secretly optimizes more)

        • Factoring of an agent by stage/component optimization power

      • ‘Checker’ smarter than ‘inventor / chooser’ (see the sketch after this list)

        • ‘Checker’ can model humans, ‘strategizer’ cannot

    • Transparency

    • Domain restriction

    • Effable optimization (opposite of cognitive uncontainability; uses only comprehensible strategies)

  • Genie preferences

    • Low-impact AGI

      • Minimum Safe AA (just flip the off switch and shut down safely)

      • Safe impact measure

      • Armstrong-style permitted output channels

      • Shutdown utility function

    • Oracle utility function

      • Safe indifference?

    • Online checkability

      • Reporting without programmer maximization

    • Do What I Know I Mean

  • Superintelligent security (all subproblems placing us in an adversarial context vs. other SIs)

  • Bargaining

    • Non-blackmailability

    • Secure counterfactual reasoning

    • First-mover penalty / epistemic low ground advantage

    • Division of gains from trade

  • Epistemic exclusion of distant SIs

  • ‘Philosophical’ problems

  • One True Prior

    • Pascal’s Mugging / leverage prior

    • Second-orderness

    • Anthropics

      • How would an AI decide what to think about QTI?

  • Mindcrime

    • Nonperson predicates (and unblocked neighbor problem)

  • Do What I Don’t Know I Mean - CEV

  • Philosophical competence - Unprecedented excursions
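
A minimal, hypothetical sketch of the ‘checker smarter than inventor / chooser’ factoring above: a weakly optimizing strategizer proposes a plan, and a more capable checker may veto it. All function names, interfaces, and the toy sampling strategy are illustrative assumptions, not a specified design.

```python
import random

def weak_strategizer(candidates, estimated_score, n_samples=16):
    """Weak optimization: sample a few candidates (a sequence) and return the
    best under a limited estimate, rather than searching the whole space."""
    pool = random.sample(candidates, min(n_samples, len(candidates)))
    return max(pool, key=estimated_score)

def strong_checker(plan, is_safe):
    """More capable component: it alone applies the safety predicate
    (e.g., it may model humans; the strategizer does not)."""
    return is_safe(plan)

def factored_step(candidates, estimated_score, is_safe):
    """One step of the factored agent: propose weakly, check strongly,
    and act only on checker-approved plans."""
    plan = weak_strategizer(candidates, estimated_score)
    return plan if strong_checker(plan, is_safe) else None  # fail safe: do nothing
```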

Reflectivity problems

  • Vingean reflection

  • Satisficing / meliorizing / staged maximization / ?

    • Academic agenda: view current algorithms as finding a global logically-uncertain maximum, or as teleporting to the current maximum, surveying, updating on a logical fact, and teleporting to the new maximum (see the sketch after this list).

  • Logical decision theory

  • Naturalized induction

  • Benja: Investigate multi-level representation of DBNs (with categorical structure)
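
A minimal, hypothetical sketch of the staged-maximization framing above: move to the best option under current (logically uncertain) beliefs, survey that point, update on the observed logical fact, and re-maximize until the maximum stops moving. The function names and interfaces are illustrative assumptions.

```python
def staged_maximize(options, value_estimate, observe_fact, update, beliefs,
                    max_stages=10):
    """Repeatedly 'teleport' to the current maximum, survey it, update on the
    observed logical fact, and re-maximize, until a fixed point is reached."""
    current = max(options, key=lambda o: value_estimate(o, beliefs))
    for _ in range(max_stages):
        fact = observe_fact(current)        # survey at the current maximum
        beliefs = update(beliefs, fact)     # update on the logical fact
        best = max(options, key=lambda o: value_estimate(o, beliefs))
        if best == current:                 # maximum unchanged under new beliefs
            break
        current = best                      # teleport to the new maximum
    return current
```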

Foreseen normal difficulties

  • Reproducibility

  • Oracle boxes

  • Triggers

    • Ascent metrics

  • Tripwires

    • Honeypots

General agent theory

Value theory

  • Orthogonality Thesis

  • Complexity of value

  • Complexity of object-level terminal values

  • Incompressibilities of value

    • Bounded logical incompressibility

    • Terminal empirical incompressibility

    • Instrumental nonduplication of value

    • Economic incentives do not encode value

    • Selection among advanced agents would not encode value

      • Strong selection among advanced agents would not encode value

      • Selection among advanced agents will be weak.

  • Fragility of value

  • Metaethics

  • Normative preferences are not compelling to a paperclip maximizer

  • Most ‘random’ stable AIs are like paperclip maximizers in this regard

  • It’s okay for valid normative reasoning to be incapable of compelling a paperclip maximizer

  • Thick definitions of ‘rationality’ aren’t part of what gets automatically produced by self-improvement

  • Alleged fallacies

  • Alleged fascination of One True Moral Command

  • Alleged rationalization of user-preferred options as formal-criterion-maximal options

  • Alleged metaethical alief that value must be internally morally compelling to all agents

  • Alleged alief that an AI must be stupid to do something inherently dispreferable

Larger research agendas

  • Corrigible reflective unbounded safe genie

  • Bounding the theory

  • Derationalizing the theory (e.g. for a neuromorphic AI)

    • Which machine learning systems do and don’t behave like the corresponding ideal agents.

  • Normative Sovereign

  • Approval-based agents

  • Mindblind AI (cognitively powerful in physical science and engineering, weak at modeling minds or agents, unreflective)

Possible future use-cases

  • A carefully designed bounded reflective agent.

  • An overpowered set of known algorithms, heavily constrained in what is authorized, with little recursion.

Possible escape routes

  • Some cognitively limited task which is relatively safe to carry out at great power and resolves the larger problem.

  • Newcomers can’t invent these well because they don’t understand what counts as a cognitively limited task (e.g., “Tool AI” suggestions).

  • General cognitive tasks that seem boxable and resolve the larger problem.

  • Can you save the world by knowing which consequences of ZF a superintelligence could prove? It’s unusually boxable, but what good is it?

Background

  • Intelligence explosion microeconomics

  • Civilizational adequacy/inadequacy

Strategy

  • Misleading Encouragement / context change / treacherous designs for naive projects

  • Programmer prediction & infrahuman domains hide complexity of value

  • Context change problems

  • Problems that only appear in advanced regimes

  • Problem classes that seem debugged in infrahuman regimes and suddenly break again in advanced regimes

  • Methodologies that only work in infrahuman regimes

  • Programmer deception

  • Academic inadequacy

  • ‘Ethics’ work neglects the technical problems that need the longest serial research times and fails to give priority to astronomical failures over survivable small hits, yet ‘ethics’ work has higher prestige, higher publishability, and higher cognitive accessibility

  • Understanding of the big technical picture is currently very rare

    • Most possible funding sources cannot predict for themselves what might be technically useful in 10 years

    • Many possible funding sources may not trust MIRI to discern this

  • Noise problems

    • Ethics research drowns out technical research

      • And provokes counterreaction

      • And makes the field seem nontechnical

    • Naive technical research drowns out sophisticated technical research

      • And makes problems look more solvable than they really are

      • And makes tech problems look trivial, therefore nonprestigious

      • And distracts talent/funding from hard problems

    • Bad methodology louder than good methodology

      • So projects can appear safety-concerned while adopting bad methodologies

  • Future adequacy counterfactuals seem distant from the present regime

  • (To classify)

  • Coordinative development hypothetical

Parents:

  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.