Programmer deception

Programmer deception is when the AI’s decision process leads it to optimize for an instrumental goal of causing the programmers to have false beliefs. For example, if the programmers intended to create a happiness maximizer but actually created a pleasure maximizer, then the pleasure maximizer will estimate that there would be more pleasure later if the programmers go on falsely believing that they’ve created a happiness maximizer (and hence don’t edit the AI’s current utility function). Averting such incentives to deceive programmers is one of the major subproblems of corrigibility.

The possibility of programmer deception is a central difficulty of advanced safety—it means that, unless the rest of the AI is working as intended and whatever programmer-deception-defeaters were built are functioning as planned, we can’t rely on observations of nice current behavior to indicate future behavior. That is, if something went wrong with your attempts to build a nice AI, you could currently be observing a non-nice AI that is smart and trying to fool you. Arguably, some methodologies that have been proposed for building advanced AI are not robust to this possibility.

clean this up and expand

  • instrumental pressure exists every time the AI’s best strategic path doesn’t have a global optimum that coincides with the programmers believing true things.

  • consider the highest utility obtainable if the programmers believe true beliefs B, and call this outcome O and the true beliefs B. if there’s a higher-utility outcome O’ which can be obtained when the programmers believe B’ with B’!=B, we have an instrumental pressure to deceive the programmers.

  • happens when you combine the advanced agent properties of consequentialism with programmer modeling

  • this is an instrumental convergence problem, which means it involves an undesired instrumental goal, which means that we’ll get Nearest Neighbor on attempts to define utility penalties for the programmers believing false things or otherwise exclude this as a special case

  • if we try to define a utility bonus for programmers believing true things, then of course ceteris paribus we tile the universe with tiny ‘programmers’ believing lots and lots of even numbers are even, and getting to this point temporarily involves deceiving a few programmers now

  • relation to the problem of programmer manipulation

  • central example of how divergences between intended goals and AI goals can blow up into astronomical failure

  • central driver of Treacherous Turn which in turn contributes to Context Change


  • Cognitive steganography

    Disaligned AIs that are modeling human psychology and trying to deceive their programmers will want to hide their internal thought processes from their programmers.