List: value-alignment subjects
Safety paradigm for advanced agents
Context Change problems (“Treacherous problems”?)
Priority of astronomical failures (those that destroy error recovery or are immediately catastrophic)
Inductive value learning
Reasoning under confusion
User maximization / Unshielded argmax
Hypothetical user maximization
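The hazard of an unshielded argmax can be shown with a toy sketch (all names and scores here are hypothetical, invented for illustration): an argmax taken over every candidate output selects whatever scores highest under the approval model, including options that exploit the model rather than serve the user, whereas restricting the argmax to a vetted whitelist avoids the degenerate pick.

```python
# Toy illustration (hypothetical candidates and scores): an unshielded
# argmax over all outputs picks whatever maximizes the *model's* approval
# score, including options that exploit the model rather than help the user.

candidates = {
    "honest_answer":     {"modeled_approval": 0.80},
    "flattering_answer": {"modeled_approval": 0.95},
    "exploit_model_bug": {"modeled_approval": 1.00},
}

def unshielded_argmax(cands):
    """Pick the candidate with the highest modeled approval -- no shield."""
    return max(cands, key=lambda k: cands[k]["modeled_approval"])

def whitelisted_argmax(cands, whitelist):
    """Argmax restricted to a vetted whitelist of strategies."""
    return max((k for k in cands if k in whitelist),
               key=lambda k: cands[k]["modeled_approval"])

print(unshielded_argmax(candidates))                      # exploit_model_bug
print(whitelisted_argmax(candidates, {"honest_answer"}))  # honest_answer
```

The point of the sketch is only that the failure lives in the unrestricted quantifier, not in any one bad candidate.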
Safe optimization measure (such that we are confident it has no Edge that secretly optimizes more)
Factoring of an agent by stage/component optimization power
‘Checker’ smarter than ‘inventor / chooser’
‘Checker’ can model humans, ‘strategizer’ cannot
Effable optimization (opposite of cognitive uncontainability; uses only comprehensible strategies)
Minimal concepts (simple, not simplest, that contains fewest whitelisted strategies)
Minimum Safe AA (just flip off switch and shut down safely)
Safe impact measure
Armstrong-style permitted output channels
Shutdown utility function
Oracle utility function
Reporting without programmer maximization
Do What I Know I Mean
Superintelligent security (all subproblems placing us in adversarial context vs. other SIs)
Secure counterfactual reasoning
First-mover penalty / epistemic low ground advantage
Division of gains from trade
Epistemic exclusion of distant SIs
Breaking out of hypotheses
One True Prior
Pascal’s Mugging / leverage prior
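The leverage-prior move can be illustrated with toy arithmetic (all numbers invented for illustration, not a claim about the correct prior): a naive expected-utility calculation is dominated by the mugger's astronomical stakes, while a leverage penalty that scales the prior down by 1/N, where N is the number of lives claimed to be at stake, cancels the astronomical factor.

```python
# Toy arithmetic (illustrative numbers only): naive EV vs. leverage-penalized EV
# for a Pascal's Mugging with astronomically large claimed stakes.

lives_at_stake = 3 ** 33          # N, the mugger's claimed stakes
naive_prior = 1e-10               # credence assigned to the mugger's story
cost_of_paying = 5.0              # utility lost by handing over the money

naive_ev = naive_prior * lives_at_stake - cost_of_paying

leveraged_prior = naive_prior / lives_at_stake   # penalty scales as 1/N
leveraged_ev = leveraged_prior * lives_at_stake - cost_of_paying

print(naive_ev > 0)      # True: naive EV says pay the mugger
print(leveraged_ev > 0)  # False: leverage-penalized EV says don't
```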
How would an AI decide what to think about QTI?
Nonperson predicates (and unblocked neighbor problem)
Do What I Don’t Know I Mean - CEV
Philosophical competence - Unprecedented excursions
Satisficing / meliorizing / staged maximization / ?
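The contrast among these selection rules can be sketched in a few lines (a minimal sketch with hypothetical options and scores): a maximizer returns the global argmax, a satisficer returns any option clearing a threshold, and a meliorizer returns any option strictly better than the current default.

```python
# Minimal sketch (hypothetical setup): three selection rules over one
# candidate set, differing only in how hard they optimize.

def maximize(options, score):
    """Return the global argmax."""
    return max(options, key=score)

def satisfice(options, score, threshold):
    """Return the first option whose score clears the threshold, else None."""
    return next((o for o in options if score(o) >= threshold), None)

def meliorize(options, score, current):
    """Return the first option strictly better than the status quo, else keep it."""
    return next((o for o in options if score(o) > score(current)), current)

plans = ["do_nothing", "modest_plan", "extreme_plan"]
score = {"do_nothing": 0, "modest_plan": 5, "extreme_plan": 100}.get

print(maximize(plans, score))                 # extreme_plan
print(satisfice(plans, score, threshold=3))   # modest_plan
print(meliorize(plans, score, "do_nothing"))  # modest_plan
```

Note that the satisficer and meliorizer both stop at the modest plan; only the maximizer is pulled all the way to the extreme option.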
Academic agenda: view current algorithms as finding a global logically-uncertain maximum; or as teleporting to the current maximum, surveying, updating on a logical fact, and teleporting to the new maximum.
Logical decision theory
Benja: Investigate multi-level representation of DBNs (with categorical structure)
Foreseen normal difficulties
General agent theory
Complexity of object-level terminal values
Incompressibilities of value
Bounded logical incompressibility
Terminal empirical incompressibility
Instrumental nonduplication of value
Economic incentives do not encode value
Selection among advanced agents would not encode value
Strong selection among advanced agents would not encode value
Selection among advanced agents will be weak.
Fragility of value
Normative preferences are not compelling to a paperclip maximizer
Most ‘random’ stable AIs are like paperclip maximizers in this regard
It’s okay for valid normative reasoning to be incapable of compelling a paperclip maximizer
Thick definitions of ‘rationality’ aren’t part of what gets automatically produced by self-improvement
Alleged fascination with a One True Moral Command
Alleged rationalization of user-preferred options as formal-criterion-maximal options
Alleged metaethical alief that value must be internally morally compelling to all agents
Alleged alief that an AI must be stupid to do something inherently dispreferable
Larger research agendas
Corrigible reflective unbounded safe genie
Bounding the theory
Derationalizing the theory (e.g. for a neuromorphic AI)
Which machine learning systems do and don’t behave like the corresponding ideal agents.
Mindblind AI (cognitively powerful in physical science and engineering, weak at modeling minds or agents, unreflective)
Possible future use-cases
A carefully designed bounded reflective agent.
An overpowered set of known algorithms, heavily constrained in what is authorized, with little recursion.
Possible escape routes
Some cognitively limited task which is relatively safe to carry out at great power, and resolves the larger problem.
Newcomers can’t invent these well because they don’t understand what counts as a cognitively limited task (e.g., “Tool AI” suggestions).
General cognitive tasks that seem boxable and resolve the larger problem.
Can you save the world by knowing which consequences of ZF a superintelligence could prove? It’s unusually boxable, but what good is it?
Intelligence explosion microeconomics
Misleading Encouragement / context change / treacherous designs for naive projects
Programmer prediction & infrahuman domains hide complexity of value
Context change problems
Problems that only appear in advanced regimes
Problem classes that seem debugged in infrahuman regimes and suddenly break again in advanced regimes
Methodologies that only work in infrahuman regimes
‘Ethics’ work neglects the technical problems that need the longest serial research times and fails to prioritize astronomical failures over survivable small hits; yet ‘ethics’ work has higher prestige, higher publishability, and higher cognitive accessibility
Understanding of big technical picture currently very rare
Most possible funding sources cannot predict for themselves what might be technically useful in 10 years
Many possible funding sources may not regard MIRI as trusted to discern this
Ethics research drowns out technical research
And provokes counterreaction
And makes the field seem nontechnical
Naive technical research drowns out sophisticated technical research
And makes problems look more solvable than they really are
And makes tech problems look trivial, therefore nonprestigious
And distracts talent/funding from hard problems
Bad methodology louder than good methodology
So projects can appear safety-concerned while adopting bad methodologies
Future adequacy counterfactuals seem distant from the present regime
AI alignment
The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.