Safe training procedures for human-imitators

How do we train a reinforcement learning system to imitate a human producing complex outputs such as strings? Existing approaches are not entirely satisfactory.

Concretely, suppose there is some set of questions, each of which a human can answer with a string. We have a set of (question, answer) pairs as training data, and we want to train a model that, given a question, will produce an answer similar to the answer a human would give. Here are some existing approaches to this problem:

Generative adversarial models

In a generative adversarial model, one agent (the imitator) outputs an answer and another agent (the distinguisher) tries to determine whether the answer was produced by the imitator or by an actual human. The imitator gets a score for fooling the distinguisher.
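As a minimal sketch of this dynamic (all names and the toy answer distribution are hypothetical, and the distinguisher here is a simple frequency estimator rather than a learned model):

```python
import random

# Toy setup: the "human" answers a fixed question by sampling from this
# (hypothetical) distribution over strings.
HUMAN_ANSWERS = {"yes": 0.7, "no": 0.3}

def human_answer():
    return random.choices(list(HUMAN_ANSWERS),
                          weights=list(HUMAN_ANSWERS.values()))[0]

class Distinguisher:
    """Estimates how likely an answer is to have come from the human,
    using empirical frequencies of observed human answers (with smoothing)."""
    def __init__(self):
        self.counts = {}
        self.total = 0

    def observe_human(self, ans):
        self.counts[ans] = self.counts.get(ans, 0) + 1
        self.total += 1

    def p_human(self, ans):
        return (self.counts.get(ans, 0) + 1) / (self.total + 2)

class Imitator:
    """Keeps a weight per candidate answer; answers that fool the
    distinguisher (score above 1/2) get their weight multiplied up."""
    def __init__(self, candidates):
        self.weights = {c: 1.0 for c in candidates}

    def answer(self):
        cands = list(self.weights)
        return random.choices(cands,
                              weights=[self.weights[c] for c in cands])[0]

    def reinforce(self, ans, score):
        self.weights[ans] *= 0.5 + score

random.seed(0)
dist = Distinguisher()
imit = Imitator(["yes", "no", "maybe"])
for _ in range(2000):
    dist.observe_human(human_answer())
    ans = imit.answer()
    imit.reinforce(ans, dist.p_human(ans))

# The imitator ends up favoring the answer the human gives most often.
best = max(imit.weights, key=imit.weights.get)
```

Note that this toy version mode-collapses onto the single most human-like answer; in a real adversarial setup, where the distinguisher also adapts, the imitator is pushed toward matching the whole distribution.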


Advantages:

  1. It’s possible to tell when the imitator is not imitating the human well. Example: humans always solve Sudoku puzzles; the imitator cannot. The distinguisher knows that humans always solve Sudoku puzzles and can check candidate solutions to see whether they actually solve the puzzle, which distinguishes the imitator from the human.

  2. Whenever the human can be imitated efficiently, the imitator is able to get a good score by doing so.
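The Sudoku example can be made concrete. A hypothetical sketch (using a 4x4 Sudoku variant for brevity): since humans always return valid solutions, the distinguisher can flag any invalid solution as the imitator's.

```python
def valid_sudoku4(grid):
    """Check that every row, column, and 2x2 box of a 4x4 grid is {1,2,3,4}."""
    target = {1, 2, 3, 4}
    rows = grid
    cols = [[grid[r][c] for r in range(4)] for c in range(4)]
    boxes = [[grid[r][c] for r in (br, br + 1) for c in (bc, bc + 1)]
             for br in (0, 2) for bc in (0, 2)]
    return all(set(group) == target for group in rows + cols + boxes)

def distinguisher_says_human(answer_grid):
    # Any invalid "solution" is certainly not from the human.
    return valid_sudoku4(answer_grid)

human_solution = [[1, 2, 3, 4],
                  [3, 4, 1, 2],
                  [2, 1, 4, 3],
                  [4, 3, 2, 1]]

imitator_attempt = [[1, 2, 3, 4],
                    [3, 4, 1, 2],
                    [2, 1, 4, 3],
                    [4, 3, 1, 2]]  # last row repeats values within columns
```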


Disadvantages:

  1. The imitator may differ from human behavior in important but hard-to-detect ways. Example: humans always produce photos of cat drawings. The imitator produces photos of cat drawings that contain steganographic messages. The distinguisher cannot detect these messages, and therefore cannot tell the imitator apart from the human.
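A hypothetical sketch of the steganography problem, with "photos" reduced to lists of pixel brightness values: the imitator hides a message in the least-significant bits, and a distinguisher that only examines coarse statistics cannot tell the outputs apart.

```python
def embed_bits(pixels, bits):
    """Set the least-significant bit of each pixel to carry one message bit."""
    return [(p & ~1) | b for p, b in zip(pixels, bits)]

def extract_bits(pixels):
    return [p & 1 for p in pixels]

def coarse_statistic(pixels):
    # A weak distinguisher feature: mean brightness, rounded.
    return round(sum(pixels) / len(pixels))

human_photo = [200, 201, 198, 202, 199, 200, 201, 200]
message = [1, 0, 1, 1, 0, 0, 1, 0]
imitator_photo = embed_bits(human_photo, message)

# The hidden message is recoverable by a cooperating receiver...
recovered = extract_bits(imitator_photo)
# ...but the coarse distinguisher sees no difference between the photos.
indistinguishable = coarse_statistic(human_photo) == coarse_statistic(imitator_photo)
```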


Retrodiction

Given a question, the imitator outputs some representation of a distribution, and is scored according to the estimated log probability that this distribution assigns to the human’s actual answer. This is called “retrodiction” because it must be possible to show, in hindsight, that the actual answer had high probability.
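A minimal sketch of retrodiction scoring, assuming the distribution is represented explicitly as independent per-character probabilities (the representation and the numbers are hypothetical):

```python
import math

def log_prob_of_answer(char_dist, answer):
    """char_dist maps each character to its probability; characters are
    treated as independent, and unseen characters get a tiny floor value."""
    return sum(math.log(char_dist.get(ch, 1e-9)) for ch in answer)

# The imitator's output: an explicit distribution over characters.
char_dist = {"y": 0.4, "e": 0.3, "s": 0.2, "n": 0.05, "o": 0.05}

# Score the model against two possible human answers.
score_likely = log_prob_of_answer(char_dist, "yes")
score_unlikely = log_prob_of_answer(char_dist, "no")
```

The model is rewarded exactly to the extent that the human's actual answer was assigned high probability, so here an answer of "yes" yields a better score than "no".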

Practical examples of this approach include variational autoencoders and approximate model counting. Perhaps importance sampling is a good general theory of retrodiction.
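As a hedged illustration of how importance sampling could serve retrodiction, here is a toy latent-variable model (all distributions hypothetical) in which a proposal distribution, playing the encoder's role as in a variational autoencoder, yields an unbiased estimate of the probability the model assigns to the observed answer:

```python
import random

prior = {"formal": 0.5, "casual": 0.5}            # p(z): latent "style"
likelihood = {                                     # p(answer | z)
    "formal": {"yes": 0.9, "yep": 0.1},
    "casual": {"yes": 0.2, "yep": 0.8},
}
proposal = {"formal": 0.8, "casual": 0.2}          # q(z | answer), the "encoder"

def importance_estimate(answer, n_samples=10_000, seed=0):
    """Estimate p(answer) = sum_z p(z) p(answer|z) by sampling z ~ q and
    averaging the importance weights p(z) p(answer|z) / q(z)."""
    rng = random.Random(seed)
    zs = list(proposal)
    q_weights = [proposal[z] for z in zs]
    total = 0.0
    for _ in range(n_samples):
        z = rng.choices(zs, weights=q_weights)[0]
        total += prior[z] * likelihood[z].get(answer, 0.0) / proposal[z]
    return total / n_samples

# Exact marginal for comparison: 0.5 * 0.9 + 0.5 * 0.2 = 0.55.
exact = sum(prior[z] * likelihood[z]["yes"] for z in prior)
estimate = importance_estimate("yes")
```

The estimate converges to the true marginal probability, giving a certificate, in hindsight, that the observed answer was assigned high probability.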


Advantages:

  1. No steganography problem: the imitator is scored only on the probability it assigns to the human’s actual answers, so it gains nothing by hiding extra information in its outputs.


Disadvantages:

  1. The computational complexity properties are unclear. We have no guarantee of the form “if it’s efficient to imitate a human doing something, then it’s efficient to create a probabilistic model that provably assigns a high probability to the human’s behavior”.

  2. It’s hard to tell when the model is performing badly in an absolute sense (as in the Sudoku example).

