List: value-alignment subjects
Safety paradigm for advanced agents
Context Change problems (“Treacherous problems”?)
Priority of astronomical failures (those that destroy error recovery or are immediately catastrophic)
Foreseen difficulties
Cartesian boundary
Human identification
Inductive value learning
Moral uncertainty
Indifference
Anapartistic reasoning
Programmer deception
Early conservatism
Reasoning under confusion
User maximization / Unshielded argmax
Hypothetical user maximization
Limited AI
Weak optimization
Safe optimization measure (such that we are confident it has no Edge that secretly optimizes more)
Factoring of an agent by stage/component optimization power
‘Checker’ smarter than ‘inventor / chooser’ (see the sketch below)
‘Checker’ can model humans, ‘strategizer’ cannot
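As a loose illustration of the ‘checker smarter than chooser’ factoring above, here is a minimal sketch (my own, not from the source; the names `Plan`, `weak_propose`, and `strong_check` are illustrative assumptions):

```python
# Sketch of the 'checker smarter than chooser' factoring: a weak component
# proposes candidate plans, and a separately built, more capable checker
# must approve a plan before anything is executed. All names are illustrative.

from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class Plan:
    description: str
    predicted_impact: float  # checker's estimate of side effects

def factored_decision(
    weak_propose: Callable[[], Iterable[Plan]],
    strong_check: Callable[[Plan], bool],
) -> Optional[Plan]:
    """Return the first proposed plan the stronger checker approves, else None."""
    for plan in weak_propose():
        if strong_check(plan):
            return plan
    return None  # default to inaction if nothing passes the check

# Toy usage: the checker enforces a crude low-impact threshold.
if __name__ == "__main__":
    candidates = [Plan("reroute cooling", 0.9), Plan("log a warning", 0.1)]
    chosen = factored_decision(lambda: candidates,
                               lambda p: p.predicted_impact < 0.5)
    print(chosen)  # Plan(description='log a warning', predicted_impact=0.1)
```

The intent of the factoring is that the stronger optimization power sits in the component that can only veto, while the component that searches for plans stays weak.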
Transparency
Domain restriction
Effable optimization (opposite of cognitive uncontainability; uses only comprehensible strategies)
Minimal concepts (simple, though not necessarily simplest, concepts that contain the fewest whitelisted strategies)
Genie preferences
Low-impact AGI
Minimum Safe AA (just flip the off switch and shut down safely)
Safe impact measure
Armstrong-style permitted output channels
Shutdown utility function
Oracle utility function
Safe indifference?
Online checkability
Reporting without programmer maximization
Do What I Know I Mean
Superintelligent security (all subproblems placing us in adversarial context vs. other SIs)
Bargaining
Non-blackmailability
Secure counterfactual reasoning
First-mover penalty / epistemic low ground advantage
Division of gains from trade
Epistemic exclusion of distant SIs
Distant superintelligences can coerce the most probable environment of your AI
Breaking out of hypotheses
‘Philosophical’ problems
One True Prior
Pascal’s Mugging / leverage prior (see the illustration after this sub-list)
Second-orderness
Anthropics
How would an AI decide what to think about QTI?
Nonperson predicates (and unblocked neighbor problem)
Do What I Don’t Know I Mean - CEV
Philosophical competence - Unprecedented excursions
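As a rough gloss of the leverage-prior response to Pascal’s Mugging mentioned above (my own illustration, not a formula from the source; $N$ is the number of lives the mugger claims are at stake):

$$
\Pr(\text{I occupy a position with leverage over } N \text{ lives}) \;\lesssim\; \frac{1}{N}
\qquad\Longrightarrow\qquad
\mathbb{E}[\text{lives affected by paying}] \;\lesssim\; \frac{1}{N}\cdot N \;=\; 1.
$$

Penalizing the prior in proportion to the claimed leverage keeps the expected value of paying the mugger bounded no matter how large $N$ is made.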
Reflectivity problems
Vingean reflection
Satisficing / meliorizing / staged maximization / ?
Academic agenda: view current algorithms as finding a global logically-uncertain maximum, or as teleporting to the current maximum, surveying, updating on a logical fact, and then teleporting to the new maximum (see the sketch after this sub-list).
Logical decision theory
Naturalized induction
Benja: Investigate multi-level representation of DBNs (with categorical structure)
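A minimal sketch of the staged-maximization picture in the academic-agenda item above (my own toy rendering, not the source’s formalism; `staged_maximize` and `believed_value` are illustrative names):

```python
# Sketch of viewing an algorithm as staged maximization under logical
# uncertainty: jump to the best candidate under current logical beliefs,
# update on a logical fact, then jump to the best candidate under the
# updated beliefs. Names and structure are illustrative assumptions.

from typing import Callable, Dict, List

def staged_maximize(
    candidates: List[str],
    believed_value: Dict[str, float],
    logical_facts: List[Callable[[Dict[str, float]], None]],
) -> List[str]:
    """Record the current argmax after updating on each logical fact in turn."""
    trajectory = [max(candidates, key=lambda c: believed_value[c])]
    for learn in logical_facts:
        learn(believed_value)  # e.g. a proof revising a candidate's believed value
        trajectory.append(max(candidates, key=lambda c: believed_value[c]))
    return trajectory

# Toy usage: a proof reveals that plan "A" is worth less than first believed,
# so the maximum 'teleports' from "A" to "B".
if __name__ == "__main__":
    values = {"A": 10.0, "B": 7.0}
    proof_about_A = lambda v: v.__setitem__("A", 3.0)
    print(staged_maximize(["A", "B"], values, [proof_about_A]))  # ['A', 'B']
```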
Foreseen normal difficulties
Reproducibility
Oracle boxes
Triggers
Ascent metrics
Tripwires
Honeypots
General agent theory
Instrumental convergence
Value theory
Complexity of object-level terminal values
Incompressibilities of value
Bounded logical incompressibility
Terminal empirical incompressibility
Instrumental nonduplication of value
Economic incentives do not encode value
Selection among advanced agents would not encode value
Strong selection among advanced agents would not encode value
Selection among advanced agents will be weak.
Fragility of value
Metaethics
Normative preferences are not compelling to a paperclip maximizer
Most ‘random’ stable AIs are like paperclip maximizers in this regard
It’s okay for valid normative reasoning to be incapable of compelling a paperclip maximizer
Thick definitions of ‘rationality’ aren’t part of what gets automatically produced by self-improvement
Alleged fallacies
Alleged fascination of One True Moral Command
Alleged rationalization of user-preferred options as formal-criterion-maximal options
Alleged metaethical alief that value must be internally morally compelling to all agents
Alleged alief that an AI must be stupid to do something inherently dispreferable
Larger research agendas
Corrigible reflective unbounded safe genie
Bounding the theory
Derationalizing the theory (e.g. for a neuromorphic AI)
Which machine learning systems do and don’t behave like the corresponding ideal agents.
Normative Sovereign
Approval-based agents
Mindblind AI (cognitively powerful in physical science and engineering, weak at modeling minds or agents, unreflective)
Possible future use-cases
A carefully designed bounded reflective agent.
An overpowered set of known algorithms, heavily constrained in what is authorized, with little recursion.
Possible escape routes
Some cognitively limited task which is relatively safe to carry out at great power, and resolves the larger problem.
Newcomers can’t invent these well because they don’t understand what counts as a cognitively limited task (e.g., “Tool AI” suggestions).
General cognitive tasks that seem boxable and resolve the larger problem.
Can you save the world by knowing which consequences of ZF a superintelligence could prove? It’s unusually boxable, but what good is it?
Background
Intelligence explosion microeconomics
Civilizational adequacy/inadequacy
Strategy
Misleading Encouragement / context change / treacherous designs for naive projects
Programmer prediction & infrahuman domains hide complexity of value
Context change problems
Problems that only appear in advanced regimes
Problem classes that seem debugged in infrahuman regimes and suddenly break again in advanced regimes
Methodologies that only work in infrahuman regimes
Programmer deception
Academic inadequacy
‘Ethics’ work neglects the technical problems that need the longest serial research times, and fails to give priority to astronomical failures over survivable small hits; yet ‘ethics’ work has higher prestige, higher publishability, and higher cognitive accessibility
Understanding of big technical picture currently very rare
Most possible funding sources cannot predict for themselves what might be technically useful in 10 years
Many possible funding sources may not regard MIRI as trusted to discern this
Noise problems
Ethics research drowns out technical research
And provokes counterreaction
And makes the field seem nontechnical
Naive technical research drowns out sophisticated technical research
And makes problems look more solvable than they really are
And makes tech problems look trivial, therefore nonprestigious
And distracts talent/funding from hard problems
Bad methodology louder than good methodology
So projects can appear safety-concerned while adopting bad methodologies
Future adequacy counterfactuals seem distant from the present regime
(To classify)
Parents:
- AI alignment
The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.
Ideally we shouldn’t have pages like this. It means that the hierarchy feature failed. Is this just meant to be temporary? Or do you foresee this as a permanent page?
I think one will often still need ‘introductory’ or ‘tutorial’-type pages that walk through the hierarchy as English text, but this exact page was something I whipped up during the recent Experimental Research Retreat as an alternative to just dumping the info, and because I thought I might start filling it in as Arbital pages.
I’m finding this page helpful. Alexei, does your theory think I shouldn’t be?
I definitely think something like this should exist and will be helpful, but I think Arbital should be able to generate something like this automatically. Until it can, we are stuck doing it manually.
Expanding all children in the Children tab on the AI alignment page achieves something similar, but not quite as clean.
Within the “Value Theory” section, I’d propose two subpoints:
Unity of Value Thesis
Necessity of Physical Representation
The ‘Unity of Value Thesis’ is simply what we get if the Complexity of Value Thesis is wrong. And it could be wrong; we just don’t know. For what this could look like, see e.g. https://qualiacomputing.com/2016/11/19/the-tyranny-of-the-intentional-object/
‘Necessity of Physical Representation’ refers to the notion that, ultimately, a proper theory of value must compile to physics. We are made from physical stuff, everything we interact with and value is made from the same physical stuff, and so ethics is ultimately about how to move and arrange the physical stuff in our light-cone. If a theory of value does not operate at this level, it can’t be a final theory of value. See e.g. Tegmark’s argument here: https://arxiv.org/abs/1409.0813