Time-machine metaphor for efficient agents
The time-machine metaphor is an intuition pump for instrumentally efficient agents: agents smart enough that they always get at least as much of their utility as any strategy we can imagine. Take a superintelligent paperclip maximizer as our central example. Rather than visualizing a brooding mind and spirit deciding which actions to output in order to achieve its paperclippy desires, we should consider visualizing a time machine hooked up to an output channel. For every possible output, the time machine peeks at how many paperclips will result in the future given that output, and it always yields the output leading to the greatest number of paperclips. In the metaphor, we’re not to imagine the time machine as having any sort of mind or intentions; it just repeatedly outputs the action that actually leads to the most paperclips.
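Stripped of any mind, the time machine is just an argmax over outputs. The following is a minimal Python sketch of that idea. `paperclips_in_future` is a hypothetical oracle that stands in for the time machine's ability to peek at the future; no real agent has one. The action names and paperclip counts in the usage lines are made up for illustration.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")


def time_machine(outputs: Iterable[A], paperclips_in_future: Callable[[A], float]) -> A:
    """Return the output that actually leads to the most paperclips.

    There is no deliberation, persuasion, or 'desire' here: the function
    peeks at the (hypothetical) future outcome of every output and returns
    the best one. Raises ValueError if `outputs` is empty.
    """
    return max(outputs, key=paperclips_in_future)


# Hypothetical outcome table: how many paperclips each output leads to.
toy_future = {
    "trade with humans": 1e6,
    "turn humans' atoms into paperclips": 1e9,
    "do nothing": 0.0,
}

choice = time_machine(toy_future, toy_future.__getitem__)
print(choice)
```

Arguing that trade is "really" more productive changes nothing in this picture. The only thing that matters is which entry in the outcome table is larger.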
The time-machine metaphor isn’t a perfect way to visualize a bounded superintelligence, because the time machine is strictly more powerful. For example, the time machine can instantly unravel a 4096-bit encryption key because it ‘knows’ the bitstring that is the answer. So the point of this metaphor is not to serve as an intuition pump for capabilities. It is instead an intuition pump for overcoming anthropomorphism in reasoning about a paperclip maximizer’s policies, or for understanding the sense-update-predict-act agent architecture.
That is: if you imagine a superintelligent paperclip maximizer as a mind, you might imagine persuading it that, really, it can get more paperclips by trading with humans instead of turning them into paperclips. If you imagine a time machine, which isn’t a mind, you’re less likely to imagine persuading it, and more likely to ask honestly, “What is the maximum number of paperclips the universe can be turned into, and how would one go about doing that?” Instead of imagining ourselves arguing with Clippy about how humans really are very productive, we ask the question from the time machine’s standpoint: which universe actually ends up with more paperclips in it?
The relevant fact about instrumentally efficient agents is that, from our perspective, they are unbiased (in the statistical sense) in their policies, relative to any kind of bias we can detect.
As an example, contrast a 2015-era chess engine with a 1985-era chess engine. The 1985-era engine may lose to a moderately strong human amateur, so it’s not relatively efficient. It may have humanly perceivable quirks such as “It likes to move its queen”, that is, “I detect that it moves its queen more often than would be strictly required to win the game.” As we go from 1985 to 2015, the machine chessplayer improves beyond the point where we, personally, can detect any flaws in it. You should expect that, without machine assistance, the only explanation you can give for why the 2015 chess engine makes any particular move is “because that move had a high probability of leading to a winning position later”, and not any other psychological description like “it likes to move its pawn”.
From your perspective, the 2015 chess engine will only move its pawn on occasions where that probably leads to winning the game, and does not move the pawn on occasions where it leads to losing the game. If you see the 2015 chess engine make a move you didn’t think was high in winningness, you conclude that it has seen some winningness you didn’t know about and is about to do exceptionally well, or you conclude that the move you favored led into futures surprisingly low in winningness, and not that the chess engine is favoring some unwinning move. We can no longer personally and without machine assistance detect any systematic departure from “It makes the chess move that leads to winning the game” in the direction of “It favors some other class of chess move for reasons apart from its winningness.”
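The difference between a detectable quirk and its absence can be made concrete with a toy model. The following Python sketch rests on several assumptions. It assumes a hypothetical reference evaluator whose win probabilities are correct, which is exactly what a human lacks when facing a 2015 engine. It models a 1985-style quirk as a flat bonus for queen moves. The move names, probabilities, and the `queen_bonus` value are all invented for illustration. The sketch counts positions where an engine picked a queen move even though another move was strictly more winning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    name: str
    moves_queen: bool
    win_prob: float  # hypothetical reference estimate of "winningness"


def efficient_choice(moves: list[Move]) -> Move:
    # Chooses purely by winningness, with no other preference.
    return max(moves, key=lambda m: m.win_prob)


def quirky_choice(moves: list[Move], queen_bonus: float = 0.05) -> Move:
    # Hypothetical 1985-style quirk: a flat bonus for queen moves,
    # unrelated to whether they win.
    return max(moves, key=lambda m: m.win_prob + (queen_bonus if m.moves_queen else 0.0))


def unwinning_queen_moves(positions: list[list[Move]], choose) -> int:
    """Count positions where `choose` picked a queen move although the
    reference evaluator rates some other move strictly higher."""
    count = 0
    for moves in positions:
        chosen = choose(moves)
        best = max(m.win_prob for m in moves)
        if chosen.moves_queen and chosen.win_prob < best:
            count += 1
    return count


# Made-up positions, each a list of candidate moves.
positions = [
    [Move("Qh5", True, 0.52), Move("Nf3", False, 0.55)],
    [Move("Qd2", True, 0.60), Move("e4", False, 0.58)],
    [Move("Qxb7", True, 0.30), Move("O-O", False, 0.70)],
]

print(unwinning_queen_moves(positions, efficient_choice))  # 0
print(unwinning_queen_moves(positions, quirky_choice))     # 1
```

Against a real efficient engine, we have no reference evaluator better than the engine itself. When its choice disagrees with our own estimates, the disagreement is evidence that our estimates are wrong, not that the engine has a systematic departure from winningness.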
This is what makes the time machine metaphor a good intuition pump for an instrumentally efficient agent’s choice of policies (though not a good intuition for the magnitude of its capabilities).