A proposal meant to produce value-aligned agents is ‘advanced-safe’ if it succeeds, or fails safely, in scenarios where the AI becomes much smarter than its human developers.
A proposal for a value-alignment methodology, or some aspect of one, is alleged to be ‘advanced-safe’ if it is claimed to remain robust in scenarios where the agent:
Knows more or has better probability estimates than us
Learns new facts unknown to us
Searches a larger strategy space than we can consider
Confronts new instrumental problems we didn’t foresee in detail
Gains power quickly
Has access to greater levels of cognitive power than in the regime where it was previously tested
It seems reasonable to expect that dealing with minds smarter than our own, doing things we didn’t imagine, will present difficulties qualitatively different from designing a toaster oven not to burn down a house, or from designing an AI system that is dumber than human. If so, the concept of ‘advanced safety’ will end up importantly different from the concept of robustness for pre-advanced AI.
Concretely, it has been argued for several difficulties, including e.g. programmer deception and unforeseen maximums, that they won’t materialize before an agent is advanced, or won’t materialize in the same way, or won’t materialize as severely. This means that practice with dumber-than-human AIs may not train us against these difficulties, requiring a separate theory and mental discipline for making advanced AIs safe.
We have observed in practice that many proposals for ‘AI safety’ do not seem to have been thought through against advanced agent scenarios; thus, there seems to be a practical urgency to emphasizing the concept and the difference.
Key problems of advanced safety that are new or qualitatively different compared to pre-advanced AI safety include:
Non-advanced-safe methodologies may conceivably be useful if an agent can be created that is (a) powerful enough to be relevant and (b) known not to become advanced. Even here there may be grounds for worry that such an agent finds unexpectedly strong strategies in some particular subdomain, exhibiting flashes of domain-specific advancement that break a non-advanced-safe methodology.
As an extreme case, an ‘omni-safe’ methodology allegedly remains value-aligned, or fails safely, even if the agent suddenly becomes omniscient and omnipotent (acquires delta probability distributions on all facts of interest and has all describable outcomes available as direct options). See: real-world agents should be omni-safe.
- Methodology of unbounded analysis
What we do and don’t understand how to do, using unlimited computing power, is a critical distinction and important frontier.
- AI safety mindset
Asking how AI designs could go wrong, instead of imagining them going right.
- Optimization daemons
When you optimize something so hard that it crystallizes into an optimizer, the way natural selection optimized apes so hard they turned into human-level intelligences.
- Nearest unblocked strategy
If you patch an agent’s preference framework to avoid an undesirable solution, what can you expect to happen?
- Safe but useless
Sometimes, at the end of locking down your AI so that it seems extremely safe, you’ll end up with an AI that can’t be used to do anything interesting.
- Distinguish which advanced-agent properties lead to the foreseeable difficulty
Say what kind of AI, or threshold level of intelligence, or key type of advancement, first produces the difficulty or challenge you’re talking about.
- Goodness estimate biaser
Some of the main problems in AI alignment can be seen as scenarios where actual goodness is likely to be systematically lower than a broken way of estimating goodness.
- Goodhart's Curse
The Optimizer’s Curse meets Goodhart’s Law. For example, if our values are V, and an AI’s utility function U is a proxy for V, optimizing for high U seeks out ‘errors’ — that is, high values of U − V.
- Context disaster
Some possible designs cause your AI to behave nicely while developing, and behave a lot less nicely when it’s smarter.
- Methodology of foreseeable difficulties
Building a nice AI is likely to be hard enough, and contain enough gotchas that won’t show up in the AI’s early days, that we need to foresee problems coming in advance.
- Actual effectiveness
If you want the AI’s so-called ‘utility function’ to actually steer the AI, you need to think about how it meshes with the AI’s beliefs, and how its outputs connect to actions.
- AI alignment
The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.
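The Goodhart’s Curse entry above can be illustrated with a minimal statistical sketch (a toy model, not any particular AI design, with arbitrary hypothetical parameters): each option has a true value V, the agent only sees a noisy proxy U = V + noise, and picking the option with the highest U systematically selects for upward errors, so U − V is positive at the optimum even though the noise is unbiased.

```python
import random

random.seed(0)  # deterministic for reproducibility

N_OPTIONS = 1000  # options the optimizer searches over (hypothetical)
N_TRIALS = 500    # repetitions to average out sampling noise

gap_total = 0.0
for _ in range(N_TRIALS):
    # True values V and an unbiased noisy proxy U = V + noise.
    V = [random.gauss(0, 1) for _ in range(N_OPTIONS)]
    U = [v + random.gauss(0, 1) for v in V]
    # Optimize for the proxy: pick the option with the highest U.
    best = max(range(N_OPTIONS), key=lambda i: U[i])
    # Record how much the proxy overestimates true value at the optimum.
    gap_total += U[best] - V[best]

print("average U - V at the proxy-optimum:", gap_total / N_TRIALS)
```

With more options searched (a larger strategy space), the selected-for error grows: the harder the optimization pressure on the proxy, the larger the expected divergence between proxy and true value, which is the qualitative point of the entry.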