List: value-alignment subjects

Safety paradigm for advanced agents

Foreseen difficulties

  • Value identification

  • Edge instantiation

  • Unforeseen maximums

  • Ontology identification

    • Cartesian boundary

    • Human identification

  • Inductive value learning

  • Patch resistance

  • Nearest Unblocked Neighbor

  • Corrigibility

  • Anapartistic reasoning

    • Programmer deception

    • Early conservatism

    • Reasoning under confusion

  • User maximization / Unshielded argmax

    • Hypothetical user maximization

  • Genie theory

  • Limited AI

    • Weak optimization

      • Safe optimization measure (such that we are confident it has no Edge that secretly optimizes more)

        • Factoring of an agent by stage/component optimization power

      • ‘Checker’ smarter than ‘inventor / chooser’

        • ‘Checker’ can model humans, ‘strategizer’ cannot

    • Transparency

    • Domain restriction

    • Effable optimization (opposite of cognitive uncontainability; uses only comprehensible strategies)

      • Minimal concepts (simple, though not simplest, containing the fewest whitelisted strategies)

  • Genie preferences

    • Low-impact AGI

      • Minimum Safe AA (just flip the off switch and shut down safely)

      • Safe impact measure

      • Armstrong-style permitted output channels

      • Shutdown utility function

    • Oracle utility function

      • Safe indifference?

    • Online checkability

      • Reporting without programmer maximization

    • Do What I Know I Mean

  • Superintelligent security (all subproblems placing us in an adversarial context vs. other SIs)

  • Bargaining

    • Non-blackmailability

    • Secure counterfactual reasoning

    • First-mover penalty / epistemic low ground advantage

    • Division of gains from trade

  • Epistemic exclusion of distant SIs

  • ‘Philosophical’ problems

  • One True Prior

    • Pascal’s Mugging / leverage prior

    • Second-orderness

    • Anthropics

      • How would an AI decide what to think about QTI?

  • Mindcrime

    • Nonperson predicates (and unblocked neighbor problem)

  • Do What I Don’t Know I Mean - CEV

  • Philosophical competence - Unprecedented excursions

Reflectivity problems

  • Vingean reflection

  • Satisficing / meliorizing / staged maximization / ?

    • Academic agenda: view current algorithms as finding a global logically-uncertain maximum, or as teleporting to the current maximum, surveying, updating on a logical fact, and then teleporting to the new maximum.

  • Logical decision theory

  • Naturalized induction

  • Benja: Investigate multi-level representation of DBNs (with categorical structure)
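
The staged-maximization / meliorizing view above lends itself to a caricature in code. The sketch below is illustrative only, not any specific proposal from this agenda; the function names and the toy objective are invented for the example. It holds a current best option and ‘teleports’ to a candidate only when a re-estimated score strictly improves:

```python
import random

def meliorize(current, propose, estimate, steps=100):
    """Meliorizing / staged-maximization caricature: keep a current best
    option and move to a candidate only when the estimate, after updating
    on a new (logical) fact, scores it strictly higher."""
    best_score = estimate(current)
    for _ in range(steps):
        candidate = propose(current)   # survey a nearby option
        score = estimate(candidate)    # 'update on a logical fact'
        if score > best_score:         # teleport to the new maximum
            current, best_score = candidate, score
    return current

# Hypothetical toy usage: climb toward the maximum of -(x - 7)^2.
result = meliorize(
    current=0,
    propose=lambda x: x + random.choice([-1, 1]),
    estimate=lambda x: -(x - 7) ** 2,
)
```

On this framing, each move is a discrete, checkable update rather than one opaque global search, which is what makes the stage-by-stage analysis possible.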

Foreseen normal difficulties

  • Reproducibility

  • Oracle boxes

  • Triggers

    • Ascent metrics

  • Tripwires

    • Honeypots

General agent theory

Value theory

  • Orthogonality Thesis

  • Complexity of value

  • Complexity of object-level terminal values

  • Incompressibilities of value

    • Bounded logical incompressibility

    • Terminal empirical incompressibility

    • Instrumental nonduplication of value

    • Economic incentives do not encode value

    • Selection among advanced agents would not encode value

      • Strong selection among advanced agents would not encode value

      • Selection among advanced agents will be weak.

  • Fragility of value

  • Metaethics

  • Normative preferences are not compelling to a paperclip maximizer

  • Most ‘random’ stable AIs are like paperclip maximizers in this regard

  • It’s okay for valid normative reasoning to be incapable of compelling a paperclip maximizer

  • Thick definitions of ‘rationality’ aren’t part of what gets automatically produced by self-improvement

  • Alleged fallacies

  • Alleged fascination with a One True Moral Command

  • Alleged rationalization of user-preferred options as formal-criterion-maximal options

  • Alleged metaethical alief that value must be internally morally compelling to all agents

  • Alleged alief that an AI must be stupid to do something inherently dispreferable

Larger research agendas

  • Corrigible reflective unbounded safe genie

  • Bounding the theory

  • Derationalizing the theory (e.g. for a neuromorphic AI)

    • Which machine learning systems do and don’t behave like the corresponding ideal agents.

  • Normative Sovereign

  • Approval-based agents

  • Mindblind AI (cognitively powerful in physical science and engineering, weak at modeling minds or agents, unreflective)

Possible future use-cases

  • A carefully designed bounded reflective agent.

  • An overpowered set of known algorithms, heavily constrained in what is authorized, with little recursion.

Possible escape routes

  • Some cognitively limited task which is relatively safe to carry out at great power, and resolves the larger problem.

  • Newcomers can’t invent these well because they don’t understand what counts as a cognitively limited task (e.g., “Tool AI” suggestions).

  • General cognitive tasks that seem boxable and resolve the larger problem.

  • Can you save the world by knowing which consequences of ZF a superintelligence could prove? It’s unusually boxable, but what good is it?

Background

  • Intelligence explosion microeconomics

  • Civilizational adequacy/inadequacy

Strategy

  • Misleading Encouragement / context change / treacherous designs for naive projects

  • Programmer prediction & infrahuman domains hide complexity of value

  • Context change problems

  • Problems that only appear in advanced regimes

  • Problem classes that seem debugged in infrahuman regimes and suddenly break again in advanced regimes

  • Methodologies that only work in infrahuman regimes

  • Programmer deception

  • Academic inadequacy

  • ‘Ethics’ work neglects the technical problems that need the longest serial research times, and fails to prioritize astronomical failures over survivable small hits; yet ‘ethics’ work has higher prestige, higher publishability, and higher cognitive accessibility

  • Understanding of big technical picture currently very rare

    • Most possible funding sources cannot predict for themselves what might be technically useful in 10 years

    • Many possible funding sources may not regard MIRI as trusted to discern this

  • Noise problems

    • Ethics research drowns out technical research

      • And provokes counterreaction

      • And makes the field seem nontechnical

    • Naive technical research drowns out sophisticated technical research

      • And makes problems look more solvable than they really are

      • And makes tech problems look trivial, therefore nonprestigious

      • And distracts talent/funding from hard problems

    • Bad methodology louder than good methodology

      • So projects can appear safety-concerned while adopting bad methodologies

  • Future adequacy counterfactuals seem distant from the present regime

  • (To classify)

  • Coordinative development hypothetical

Parents:

  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.