Reflective stability
An agent is “reflectively stable” in some regard if, given a choice of how to construct a successor agent or modify its own code, it will only construct a successor that thinks similarly in that regard.
In tiling agent theory, an expected utility satisficer is reflectively consistent, since it will approve of building another EU satisficer; but an EU satisficer is not reflectively stable, since it may also approve of building an expected utility maximizer (it expects the consequences of building the maximizer to satisfice).
Having a utility function that only weighs paperclips is “reflectively stable” because paperclip maximizers only try to build other paperclip maximizers.
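The contrast between these two examples can be made concrete in a toy sketch. Everything below is illustrative assumption rather than anything from tiling agent theory: the threshold, the expected-paperclip numbers, and the “approval” rules are made up to show the distinction, not to model a real agent.

```python
# Toy sketch (hypothetical numbers and rules): a satisficer approves any
# successor whose expected outcome clears a threshold, so it can approve
# building a maximizer; a maximizer only approves the best successor.

SATISFICE_THRESHOLD = 100  # hypothetical: "enough" expected paperclips

def expected_paperclips(successor):
    # Hypothetical expectations about what each successor would achieve.
    return {"satisficer": 120, "maximizer": 10_000}[successor]

def satisficer_approves(successor):
    # A satisficer signs off on any plan whose expected outcome satisfices.
    return expected_paperclips(successor) >= SATISFICE_THRESHOLD

def maximizer_approves(successor):
    # A maximizer only signs off on the successor with the highest expectation.
    options = ["satisficer", "maximizer"]
    return successor == max(options, key=expected_paperclips)

# Reflectively consistent: the satisficer approves building another satisficer...
assert satisficer_approves("satisficer")
# ...but not reflectively stable: it also approves building a maximizer.
assert satisficer_approves("maximizer")
# The paperclip maximizer approves only maximizers, so that preference is stable.
assert maximizer_approves("maximizer") and not maximizer_approves("satisficer")
```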
If, thinking the way you currently do (in some regard), it seems unacceptable to not think that way (in that regard), then you are reflectively stable (in that regard).
Note that reflective stability is not the same thing as being good, and we might sometimes prefer a reflectively unstable agent: it seems bad to us if a paperclip maximizer stays a paperclip maximizer, and causal decision theorists might build something incrementally saner than causal decision theorists.
Children:
- Reflectively consistent degree of freedom
When an instrumentally efficient, self-modifying AI can be like X or like X’ in such a way that X wants to be X and X’ wants to be X’, that’s a reflectively consistent degree of freedom.
- Other-izing (wanted: new optimization idiom)
Maximization isn’t possible for bounded agents, and satisficing doesn’t seem like enough. What other kind of ‘izing’ might be good for realistic, bounded agents?
- Consequentialist preferences are reflectively stable by default
Gandhi wouldn’t take a pill that made him want to kill people, because he knows in that case more people will be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.
Parents:
- Vingean reflection
The problem of thinking about your future self when it’s smarter than you.