Safe training procedures for human-imitators

How do we train a reinforcement learning system to imitate a human producing complex outputs such as strings? Existing approaches are not entirely satisfactory.

Concretely, suppose there is some set of questions, each of which a human can answer with a string. We have a set of (question, answer) pairs as training data. The task is to train a model that, given a question, produces an answer similar to the one a human would give. Here are some existing approaches to this problem:

Generative adversarial models

In a generative adversarial model, one agent (the imitator) outputs an answer and another agent (the distinguisher) tries to determine if the answer was produced by the imitator or by an actual human. The imitator gets a score for fooling the distinguisher.
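As a rough illustration of the scoring rule just described, here is a minimal sketch in which both agents are plain Python functions. The names `imitator_score` and `distinguisher_score`, the type aliases, and the choice of a log-likelihood objective for the distinguisher are assumptions made for this example; the text above only says that the imitator is scored for fooling the distinguisher.

```python
import math
from typing import Callable

# Hypothetical type aliases: both agents are ordinary functions in this sketch.
Imitator = Callable[[str], str]              # question -> answer
Distinguisher = Callable[[str, str], float]  # (question, answer) -> P(answer is human)


def imitator_score(question: str, imitator: Imitator,
                   distinguisher: Distinguisher) -> float:
    """Score the imitator by how strongly the distinguisher believes its answer is human."""
    answer = imitator(question)
    return distinguisher(question, answer)


def distinguisher_score(question: str, human_answer: str, imitator: Imitator,
                        distinguisher: Distinguisher) -> float:
    """Log-likelihood of the correct labels (assumed objective for this sketch)."""
    eps = 1e-9  # avoids log(0) when the distinguisher is fully confident
    p_human = distinguisher(question, human_answer)
    p_fake = distinguisher(question, imitator(question))
    return math.log(p_human + eps) + math.log(1.0 - p_fake + eps)
```

In a real system both functions would be learned models and the scores would drive their training; here they only make the two incentives explicit.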


Advantages:

  1. It’s possible to tell when the imitator is not imitating the human well. Example: humans always solve Sudoku puzzles, but the imitator cannot. The distinguisher knows that humans always solve them, and it can check whether a candidate answer actually solves the puzzle, so it can tell the imitator apart from the human.

  2. Whenever the human can be imitated efficiently, the imitator is able to get a good score by doing this.
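The Sudoku example relies on the fact that checking a solution is cheap even when finding one is not. A minimal sketch of such a distinguisher follows. The function names, the grid encoding (9x9 lists with 0 for blanks), and the specific probabilities returned are assumptions for illustration, under the example's premise that humans always solve the puzzle.

```python
from typing import List

Grid = List[List[int]]  # 9x9 grid; entries 1..9, with 0 marking a blank in the puzzle


def is_valid_solution(puzzle: Grid, candidate: Grid) -> bool:
    """True if `candidate` is a complete, correct grid that agrees with the puzzle's clues."""
    if len(candidate) != 9 or any(len(row) != 9 for row in candidate):
        return False
    # The candidate must keep every clue given in the puzzle.
    for r in range(9):
        for c in range(9):
            if puzzle[r][c] != 0 and puzzle[r][c] != candidate[r][c]:
                return False
    digits = set(range(1, 10))
    rows = [set(row) for row in candidate]
    cols = [set(candidate[r][c] for r in range(9)) for c in range(9)]
    boxes = [
        set(candidate[br + dr][bc + dc] for dr in range(3) for dc in range(3))
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # Every row, column, and 3x3 box must contain each digit exactly once.
    return all(group == digits for group in rows + cols + boxes)


def distinguisher(puzzle: Grid, candidate: Grid) -> float:
    """Assumed probability that `candidate` came from the human.

    An unsolved grid cannot be human (by the example's premise); a solved grid
    gives this simple check no further evidence, so it returns 0.5.
    """
    return 0.5 if is_valid_solution(puzzle, candidate) else 0.0
```

An imitator that cannot solve Sudoku is caught every time by this check, which is the sense in which poor imitation is detectable.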


Disadvantages:

  1. The imitator may differ from human behavior in important but hard-to-detect ways. Example: humans always produce photos of cat drawings. The imitator produces photos of cat drawings that contain steganographic messages. The distinguisher cannot detect these messages, and therefore cannot tell the imitator apart from the human.
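To make the steganography worry concrete, here is a small sketch of one well-known hiding technique, least-significant-bit embedding. It is an illustrative stand-in, not a method named in the text above; the function names and the representation of an image as a flat list of 8-bit pixel values are assumptions.

```python
from typing import List


def embed_bits(pixels: List[int], bits: List[int]) -> List[int]:
    """Overwrite the lowest bit of the first len(bits) pixels with the message bits."""
    if len(bits) > len(pixels):
        raise ValueError("message is longer than the image")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | (bit & 1)  # clear the low bit, then set it
    return stego


def extract_bits(pixels: List[int], n_bits: int) -> List[int]:
    """Read the message back from the lowest bit of the first n_bits pixels."""
    return [p & 1 for p in pixels[:n_bits]]
```

Each pixel changes by at most 1, so the picture still looks like the same cat drawing. If the message bits look like noise and the low bits of genuine photos do too, a distinguisher has little to go on, yet the imitator's output now carries information the human's never would.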


Retrodiction

Given a question, the imitator outputs some representation of a probability distribution over answers. It is scored by the estimated log probability that this distribution assigns to the human’s actual answer. This is called “retrodiction” because it must be possible to show, in hindsight, that the actual answer had high probability.
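One simple way to represent such a distribution over strings is a model that, given the question and the answer so far, returns a distribution over the next symbol. The sketch below scores a human answer under a model of that kind. The autoregressive representation, the `END` token, and all names are assumptions chosen for illustration; here the log probability is exact, whereas the text above allows it to be estimated.

```python
import math
from typing import Callable, Dict

END = "<end>"  # hypothetical end-of-answer token

# (question, prefix of the answer so far) -> distribution over the next symbol
NextSymbolModel = Callable[[str, str], Dict[str, float]]


def log_prob_of_answer(model: NextSymbolModel, question: str, answer: str) -> float:
    """Log probability the model assigns to `answer`, followed by END."""
    total = 0.0
    prefix = ""
    for symbol in list(answer) + [END]:
        p = model(question, prefix).get(symbol, 0.0)
        if p <= 0.0:
            return float("-inf")  # the model ruled this answer out entirely
        total += math.log(p)
        if symbol != END:
            prefix += symbol
    return total


def uniform_ab_model(question: str, prefix: str) -> Dict[str, float]:
    """Toy model: 'a', 'b', and END are always equally likely."""
    return {"a": 1 / 3, "b": 1 / 3, END: 1 / 3}
```

For example, `log_prob_of_answer(uniform_ab_model, "any question", "ab")` is the sum of three terms of log(1/3): one for each character and one for ending the answer.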

Practical examples of this approach include variational autoencoders and approximate model counting. In theory, importance sampling may be a good general framework for retrodiction.
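To show how importance sampling can produce the estimated probability used for scoring, here is a minimal sketch for a toy latent-variable model with a discrete latent z. The probability the model assigns to the observed answer x is the sum over z of p(z) p(x | z); the sketch estimates it by sampling z from a proposal q and averaging the weights p(z) p(x | z) / q(z). The function name, the list-based representation, and the numbers in the usage note are assumptions for illustration.

```python
import math
import random
from typing import Sequence


def importance_sampled_log_prob(prior: Sequence[float],
                                likelihood: Sequence[float],
                                proposal: Sequence[float],
                                n_samples: int = 10_000,
                                seed: int = 0) -> float:
    """Estimate log p(x) for a fixed observed x.

    prior[z]      = p(z)
    likelihood[z] = p(x | z) for the observed x
    proposal[z]   = q(z), the distribution we actually sample from
    """
    rng = random.Random(seed)
    latents = rng.choices(range(len(prior)), weights=proposal, k=n_samples)
    # Each weight is an unbiased single-sample estimate of p(x).
    weights = [prior[z] * likelihood[z] / proposal[z] for z in latents]
    mean_weight = sum(weights) / n_samples
    return math.log(mean_weight) if mean_weight > 0.0 else float("-inf")
```

With `prior = [0.5, 0.3, 0.2]` and `likelihood = [0.1, 0.6, 0.3]`, the exact probability is 0.05 + 0.18 + 0.06 = 0.29, and the estimate should be close to log 0.29 for any proposal that covers all three latents. The average weight is an unbiased estimate of p(x), so by Jensen's inequality its logarithm underestimates log p(x) in expectation. The better the proposal matches the posterior over z, the lower the variance of the estimate.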


Advantages:

  1. No steganography problem.


Disadvantages:

  1. The computational complexity properties are unclear. We have no guarantee of the form “if it’s efficient to imitate a human doing something, then it’s efficient to create a probabilistic model that provably assigns a high probability to the human’s behavior”.

  2. It’s hard to tell when the model is performing badly in an absolute sense, which the distinguisher could do in the Sudoku example.

