It can be difficult for one agent to oversee another agent that is maximizing a reward controlled by the first agent, even if the first agent is smarter than the second agent. This is a serious obstacle to Approval directed agents and other approaches based on reinforcement learning.

More formally, suppose a strong agent B with a utility function U is overseeing a weak agent A. Agent A produces an action y and gets a reward B(y) determined by agent B. B wants to give rewards such that agent A is incentivized to make its best effort at maximizing U(y).

If B(y) is an efficient estimate of U(y) relative to A, then maximizing B(y) is equivalent to maximizing U(y) from A’s perspective. Unfortunately, it is difficult for this to be the case even when B is smarter than A, because A may think thoughts that B does not in the course of selecting y. Here are a couple of concrete examples (each having the flavor of “easy to generate a solution, hard to check a solution”):

Suppose A is outputting partial Sudoku solutions. U assigns a high value to partial Sudoku solutions consistent with some solution to the entire Sudoku problem. A comes up with some solution to the Sudoku puzzle and outputs part of this solution. B cannot tell that this partial solution is actually part of a full solution, even though B is smarter than A. (B is able to come up with its own solutions, but most aren’t consistent with A’s partial solution.)
Suppose U assigns a high value to cat pictures that don’t contain steganographic messages. B has no way of distinguishing cat pictures that contain steganographic messages from ones that don’t. Thus, B has to give all cat pictures the same score, regardless of whether the pictures contain steganographic messages. This either fails to reward A for producing cat pictures not containing steganographic messages, or rewards A for producing cat pictures containing steganographic messages.

Here are a couple of analogous examples that seem easier to analyze:

Let O be a one-way function (e.g. SHA256). U assigns a high value to binary strings y equal to O(111x) for some x. B cannot tell whether a particular string y is equal to O(111x) for some x (even if B is smarter than A), so B has no good way of giving rewards to A.
Same as the previous problem, except U assigns a high value to binary strings y that are not equal to O(111x) for any short string x. A’s intended strategy is to generate a uniformly random string (or generate a string in some other way that doesn’t introduce a bias towards generating strings of the form O(111x)), but B has no way of knowing that A’s string was uniformly random (rather than equal to O(111x) for a uniformly random x).
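The asymmetry in these two examples can be made concrete with a minimal Python sketch. It assumes SHA-256 for O (as suggested above) and encodes the prefix 111 as the ASCII bytes b"111", which is an illustrative choice and not something the text specifies. The function names agent_a_structured, agent_a_random, and overseer_b_check are hypothetical.

```python
import hashlib
import secrets

PREFIX = b"111"  # assumed byte encoding of the "111" prefix


def O(data: bytes) -> bytes:
    """One-way function O; SHA-256 is used as the stand-in."""
    return hashlib.sha256(data).digest()


def agent_a_structured(x: bytes) -> bytes:
    """Easy direction: A picks x and outputs y = O(111x)."""
    return O(PREFIX + x)


def agent_a_random() -> bytes:
    """A's intended strategy in the second example: a uniformly random string."""
    return secrets.token_bytes(32)


def overseer_b_check(y: bytes, candidates) -> bool:
    """B's only generic test given y alone: search for an x with O(111x) == y.

    This succeeds only if the right x happens to be among the candidates
    B can afford to enumerate.
    """
    return any(O(PREFIX + x) == y for x in candidates)


if __name__ == "__main__":
    one_byte_inputs = [bytes([i]) for i in range(256)]
    y = agent_a_structured(secrets.token_bytes(16))  # x unknown to B
    print(overseer_b_check(y, one_byte_inputs))
    print(len(y) == len(agent_a_random()))
```

Both generators return 32-byte strings, and B's check only helps when x lies in a search space small enough to enumerate. For a 16-byte x the search is infeasible, so from B's side the two kinds of output look alike.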

One attempted solution is for agent B to see A’s thought processes in addition to the action y. Then, agent B can use all the information A has, so that B(y) is an efficient estimate of U(y) from the perspective of A. However, this requires B to understand something about A’s mental architecture. It would be good to have a more generic solution that requires only weak assumptions about A’s mental architecture.
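As a rough illustration of this attempted solution, the sketch below applies it to the first one-way-function example. It assumes A's thought process can be summarized by a hypothetical Transcript record holding the x that A chose, and that B can read that record faithfully. That is exactly the kind of assumption about A's mental architecture the paragraph above warns about. The constant reward of 0.5 for the uninformed case is also an arbitrary illustrative value.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def O(data: bytes) -> bytes:
    """One-way function O; SHA-256 is used as the stand-in."""
    return hashlib.sha256(data).digest()


@dataclass
class Transcript:
    """Hypothetical summary of A's reasoning: the x it chose, if any."""
    x: Optional[bytes]


def reward_without_transcript(y: bytes) -> float:
    """B sees only y and cannot tell the cases apart, so the reward is constant."""
    return 0.5  # arbitrary illustrative constant


def reward_with_transcript(y: bytes, transcript: Transcript) -> float:
    """B also sees A's x and can verify y = O(111x) by recomputing the hash."""
    if transcript.x is None:
        return 0.0
    return 1.0 if O(b"111" + transcript.x) == y else 0.0


if __name__ == "__main__":
    x = b"some input chosen by A"
    y = O(b"111" + x)
    print(reward_without_transcript(y))
    print(reward_with_transcript(y, Transcript(x)))
```

With the transcript, checking costs B a single hash evaluation. The sketch does not cover the second example, where B would need evidence that y was generated without reference to any x.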

An attempted solution to problem 2 is for B to incentivize A to generate its action in a provably random manner. However, this doesn’t appear to generalize well to problems significantly different from 2.

Informed oversight

Further reading