List: value-alignment subjects
Safety paradigm for advanced agents
Context Change problems (“Treacherous problems”?)
Priority of astronomical failures (those that destroy error recovery or are immediately catastrophic)
Inductive value learning
Reasoning under confusion
User maximization / Unshielded argmax
Hypothetical user maximization
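The hazard of an unshielded argmax can be shown with a toy sketch (all names and scores here are hypothetical, invented for illustration): an argmax taken over every candidate output selects whatever scores highest under the approval model, including options that exploit the model rather than serve the user, whereas restricting the argmax to a vetted whitelist avoids the degenerate pick.

```python
# Toy illustration (hypothetical candidates and scores): an unshielded
# argmax over all outputs picks whatever maximizes the *model's* approval
# score, including options that exploit the model rather than help the user.

candidates = {
    "honest_answer":     {"modeled_approval": 0.80},
    "flattering_answer": {"modeled_approval": 0.95},
    "exploit_model_bug": {"modeled_approval": 1.00},
}

def unshielded_argmax(cands):
    """Pick the candidate with the highest modeled approval -- no shield."""
    return max(cands, key=lambda k: cands[k]["modeled_approval"])

def whitelisted_argmax(cands, whitelist):
    """Argmax restricted to a vetted whitelist of strategies."""
    return max((k for k in cands if k in whitelist),
               key=lambda k: cands[k]["modeled_approval"])

print(unshielded_argmax(candidates))                      # exploit_model_bug
print(whitelisted_argmax(candidates, {"honest_answer"}))  # honest_answer
```

The point of the sketch is only that the failure lives in the unrestricted quantifier, not in any one bad candidate.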
Safe optimization measure (such that we are confident it has no Edge that secretly optimizes more)
Factoring of an agent by stage/component optimization power
‘Checker’ smarter than ‘inventor / chooser’
‘Checker’ can model humans, ‘strategizer’ cannot
Effable optimization (opposite of cognitive uncontainability; uses only comprehensible strategies)
Minimal concepts (simple, not simplest, that contains fewest whitelisted strategies)
Minimum Safe AA (just flip off switch and shut down safely)
Safe impact measure
Armstrong-style permitted output channels
Shutdown utility function
Oracle utility function
Reporting without programmer maximization
Do What I Know I Mean
Superintelligent security (all subproblems placing us in adversarial context vs. other SIs)
Secure counterfactual reasoning
First-mover penalty / epistemic low ground advantage
Division of gains from trade
Epistemic exclusion of distant SIs
Breaking out of hypotheses
One True Prior
Pascal’s Mugging / leverage prior
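The leverage-prior move can be illustrated with toy arithmetic (all numbers invented for illustration, not a claim about the correct prior): a naive expected-utility calculation is dominated by the mugger's astronomical stakes, while a leverage penalty that scales the prior down by 1/N, where N is the number of lives claimed to be at stake, cancels the astronomical factor.

```python
# Toy arithmetic (illustrative numbers only): naive EV vs. leverage-penalized EV
# for a Pascal's Mugging with astronomically large claimed stakes.

lives_at_stake = 3 ** 33          # N, the mugger's claimed stakes
naive_prior = 1e-10               # credence assigned to the mugger's story
cost_of_paying = 5.0              # utility lost by handing over the money

naive_ev = naive_prior * lives_at_stake - cost_of_paying

leveraged_prior = naive_prior / lives_at_stake   # penalty scales as 1/N
leveraged_ev = leveraged_prior * lives_at_stake - cost_of_paying

print(naive_ev > 0)      # True: naive EV says pay the mugger
print(leveraged_ev > 0)  # False: leverage-penalized EV says don't
```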
How would an AI decide what to think about QTI?
Nonperson predicates (and unblocked neighbor problem)
Do What I Don’t Know I Mean - CEV
Philosophical competence - Unprecedented excursions
Satisficing / meliorizing / staged maximization / ?
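The contrast among these selection rules can be sketched in a few lines (a minimal sketch with hypothetical options and scores): a maximizer returns the global argmax, a satisficer returns any option clearing a threshold, and a meliorizer returns any option strictly better than the current default.

```python
# Minimal sketch (hypothetical setup): three selection rules over one
# candidate set, differing only in how hard they optimize.

def maximize(options, score):
    """Return the global argmax."""
    return max(options, key=score)

def satisfice(options, score, threshold):
    """Return the first option whose score clears the threshold, else None."""
    return next((o for o in options if score(o) >= threshold), None)

def meliorize(options, score, current):
    """Return the first option strictly better than the status quo, else keep it."""
    return next((o for o in options if score(o) > score(current)), current)

plans = ["do_nothing", "modest_plan", "extreme_plan"]
score = {"do_nothing": 0, "modest_plan": 5, "extreme_plan": 100}.get

print(maximize(plans, score))                 # extreme_plan
print(satisfice(plans, score, threshold=3))   # modest_plan
print(meliorize(plans, score, "do_nothing"))  # modest_plan
```

Note that the satisficer and meliorizer both stop at the modest plan; only the maximizer is pulled all the way to the extreme option.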
Academic agenda: view current algorithms as finding a global logically-uncertain maximum; or as teleporting to the current maximum, surveying, updating on a logical fact, and teleporting to the new maximum.
Logical decision theory
Benja: Investigate multi-level representation of DBNs (with categorical structure)
Foreseen normal difficulties
General agent theory
Complexity of object-level terminal values
Incompressibilities of value
Bounded logical incompressibility
Terminal empirical incompressibility
Instrumental nonduplication of value
Economic incentives do not encode value
Selection among advanced agents would not encode value
Strong selection among advanced agents would not encode value
Selection among advanced agents will be weak.
Fragility of value
Normative preferences are not compelling to a paperclip maximizer
Most ‘random’ stable AIs are like paperclip maximizers in this regard
It’s okay for valid normative reasoning to be incapable of compelling a paperclip maximizer
Thick definitions of ‘rationality’ aren’t part of what gets automatically produced by self-improvement
Alleged fascination with a One True Moral Command
Alleged rationalization of user-preferred options as formal-criterion-maximal options
Alleged metaethical alief that value must be internally morally compelling to all agents
Alleged alief that an AI must be stupid to do something inherently dispreferable
Larger research agendas
Corrigible reflective unbounded safe genie
Bounding the theory
Derationalizing the theory (e.g. for a neuromorphic AI)
Which machine learning systems do and don’t behave like the corresponding ideal agents.
Mindblind AI (cognitively powerful in physical science and engineering, weak at modeling minds or agents, unreflective)
Possible future use-cases
A carefully designed bounded reflective agent.
An overpowered set of known algorithms, heavily constrained in what is authorized, with little recursion.
Possible escape routes
Some cognitively limited task which is relatively safe to carry out at great power, and resolves the larger problem.
Newcomers can’t invent these well because they don’t understand what counts as a cognitively limited task (e.g., “Tool AI” suggestions).
General cognitive tasks that seem boxable and resolve the larger problem.
Can you save the world by knowing which consequences of ZF a superintelligence could prove? It’s unusually boxable, but what good is it?
Intelligence explosion microeconomics
Misleading Encouragement / context change / treacherous designs for naive projects
Programmer prediction & infrahuman domains hide complexity of value
Context change problems
Problems that only appear in advanced regimes
Problem classes that seem debugged in infrahuman regimes and suddenly break again in advanced regimes
Methodologies that only work in infrahuman regimes
‘Ethics’ work neglects the technical problems that need the longest serial research times and fails to prioritize astronomical failures over survivable small hits; yet ‘ethics’ work has higher prestige, higher publishability, and higher cognitive accessibility
Understanding of big technical picture currently very rare
Most possible funding sources cannot predict for themselves what might be technically useful in 10 years
Many possible funding sources may not regard MIRI as trusted to discern this
Ethics research drowns out technical research
And provokes counterreaction
And makes the field seem nontechnical
Naive technical research drowns out sophisticated technical research
And makes problems look more solvable than they really are
And makes tech problems look trivial, therefore nonprestigious
And distracts talent/funding from hard problems
Bad methodology louder than good methodology
So projects can appear safety-concerned while adopting bad methodologies
Future adequacy counterfactuals seem distant from the present regime
AI alignment
The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.