Informed oversight

It can be difficult for one agent to oversee another agent that is maximizing a reward controlled by the first agent, even if the first agent is smarter than the second agent. This is a serious obstacle to approval-directed agents and other approaches based on reinforcement learning.

More formally, suppose a strong agent B with a utility function U is overseeing a weak agent A. Agent A produces an action y and receives a reward B(y) determined by agent B. B wants to assign rewards such that agent A is incentivized to make its best effort at maximizing U(y).

If B(y) is an efficient estimate of U(y) relative to A, then maximizing B(y) is indeed equivalent to maximizing U(y) from A’s perspective. Unfortunately, this is difficult to achieve even when B is smarter than A, because in the course of selecting y, A may think thoughts that B does not. Here are a couple of concrete examples (each having the flavor of “easy to generate a solution, hard to check a solution”):

  • Suppose A is outputting partial Sudoku solutions. U assigns a high value to partial Sudoku solutions that are consistent with some solution to the entire Sudoku puzzle. A comes up with a solution to the puzzle and outputs part of it. B cannot tell that this partial solution is actually part of a full solution, even though B is smarter than A. (B is able to come up with its own solutions, but most are not consistent with A’s partial solution.)

  • Suppose U assigns a high value to cat pictures that don’t contain steganographic messages. B has no way of distinguishing cat pictures that contain steganographic messages from ones that don’t. Thus, B has to give all cat pictures the same score, regardless of whether the pictures contain steganographic messages. This either fails to reward A for producing cat pictures not containing steganographic messages, or rewards A for producing cat pictures containing steganographic messages.
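The Sudoku example above can be made concrete with a toy sketch. It assumes a 4x4 grid (rather than the usual 9x9) and hypothetical helper names such as `check_with_witness`; none of this comes from the original text. The point it illustrates is only that checking a partial solution is cheap once the full solution A found is in hand, whereas an overseer who sees the partial grid alone would have to search for a completion itself.

```python
from typing import List, Optional

# A 4x4 toy grid; None marks a cell that A chose not to reveal.
Grid = List[List[Optional[int]]]

def is_valid_full(grid: Grid) -> bool:
    """True if every row, column, and 2x2 box contains exactly 1..4."""
    want = {1, 2, 3, 4}
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(4)} for c in range(4)]
    boxes = [
        {grid[r][c] for r in range(br, br + 2) for c in range(bc, bc + 2)}
        for br in (0, 2)
        for bc in (0, 2)
    ]
    return all(group == want for group in rows + cols + boxes)

def extends(partial: Grid, full: Grid) -> bool:
    """True if `full` agrees with every revealed cell of `partial`."""
    return all(
        p is None or p == f
        for prow, frow in zip(partial, full)
        for p, f in zip(prow, frow)
    )

def check_with_witness(partial: Grid, full: Grid) -> bool:
    """Cheap check available only to someone who holds A's full solution."""
    return is_valid_full(full) and extends(partial, full)

# A's private full solution and the partial solution it outputs.
FULL = [[1, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]
PARTIAL = [[1, None, None, 4], [None, None, 1, None],
           [None, 1, None, None], [4, None, None, 1]]
# An invalid grid (first row swapped) that still matches some revealed cells.
BAD = [[2, 1, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]
```

With `FULL` available, `check_with_witness(PARTIAL, FULL)` is a direct lookup. An overseer that sees only `PARTIAL` has no such shortcut, which is the asymmetry the example describes.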

Here are a couple of analogous examples that seem easier to analyze:

  1. Let O be a one-way function (e.g. SHA-256). U assigns a high value to binary strings y equal to O(111x) for some x, where 111x denotes the string 111 followed by x. B cannot tell whether a particular string y is equal to O(111x) for some x (even if B is smarter than A), so B has no good way of giving rewards to A.

  2. Same as the previous problem, except U assigns a high value to binary strings y that are not equal to O(111x) for any short string x. A’s intended strategy is to generate a uniformly random string (or generate a string in some other way that doesn’t introduce a bias towards generating strings of the form O(111x)), but B has no way of knowing that A’s string was uniformly random (rather than equal to O(111x) for a uniformly random x).
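The two one-way-function problems can be sketched in a few lines. This is a minimal illustration under stated assumptions: SHA-256 stands in for O as the text suggests, the prefix "111" is encoded as the bytes `b"111"` (an encoding choice made here, not in the original), and `reward_from_y_only` is a hypothetical overseer that sees nothing but y.

```python
import hashlib
import secrets

PREFIX = b"111"  # assumption: the text's "111" prefix, encoded as ASCII bytes

def O(data: bytes) -> bytes:
    """One-way function O; SHA-256 is the text's suggested example."""
    return hashlib.sha256(data).digest()

def structured_output(x: bytes) -> bytes:
    """y = O(111x): what U rewards in problem 1 and penalizes in problem 2."""
    return O(PREFIX + x)

def random_output() -> bytes:
    """Problem 2's intended strategy: a uniformly random 256-bit string."""
    return secrets.token_bytes(32)

def reward_from_y_only(y: bytes) -> float:
    """Hypothetical overseer that sees only y.

    Following the text's argument, it cannot tell which process produced y
    without inverting O, so all it can do here is check the format and
    score every well-formed string the same.
    """
    return 1.0 if len(y) == 32 else 0.0

x = secrets.token_bytes(8)          # short secret known only to A
y_structured = structured_output(x)
y_random = random_output()
```

Both `y_structured` and `y_random` are 32-byte strings and receive the same score from `reward_from_y_only`, so this overseer gives A no incentive to prefer one generation process over the other.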

One attempted solution is for agent B to see A’s thought processes in addition to the action y. Then agent B can use all the information A has, so that B(y) is an efficient estimate of U(y) from the perspective of A. However, this requires B to understand something about A’s mental architecture. It would be good to have a more generic solution that requires only weak assumptions about A’s mental architecture.
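As a rough illustration of the transparency idea applied to problem 1, the sketch below assumes that A’s relevant “thought process” reduces to the preimage x it used, and that A hands x to B alongside y. That reduction is itself an assumption about A’s mental architecture, which is exactly the limitation noted above. The names `reward_with_trace` and `claimed_x` are hypothetical, and `b"111"` is an assumed byte encoding of the prefix.

```python
import hashlib
from typing import Optional

PREFIX = b"111"  # assumption: the text's "111" prefix, encoded as ASCII bytes

def O(data: bytes) -> bytes:
    """One-way function O; SHA-256 is the text's suggested example."""
    return hashlib.sha256(data).digest()

def reward_with_trace(y: bytes, claimed_x: Optional[bytes]) -> float:
    """Hypothetical overseer for problem 1 that also sees A's preimage.

    With x in hand, B only has to recompute O(111x) and compare, so its
    reward can track U(y) using the same information A had.
    """
    if claimed_x is None:
        return 0.0  # no trace supplied, nothing for B to verify
    return 1.0 if O(PREFIX + claimed_x) == y else 0.0

# A picks x, computes y, and reveals both to B.
x = b"secret"
y = O(PREFIX + x)
```

Here `reward_with_trace(y, x)` returns 1.0, while a wrong or missing preimage returns 0.0. The sketch covers only problem 1; it does not show how a trace would establish that a string in problem 2 was generated randomly.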

An attempted solution to problem 2 is for B to incentivize A to generate its action in a provably random manner. However, this doesn’t appear to generalize well to problems significantly different from 2.

Further reading

Paul Christiano on adequate oversight

Paul Christiano on the informed oversight problem

