Cognitive steganography

Cog­ni­tive steganog­ra­phy is an in­stru­men­tal strat­egy for AGIs that are already try­ing to de­ceive their pro­gram­mers for any rea­son (e.g., be­cause the AGI is dis­al­igned and does not want the pro­gram­mers to re­al­ize this and edit the AGI’s util­ity func­tion). What­ever tools the pro­gram­mers have for mon­i­tor­ing the AGI’s thoughts, the AGI will not want those tools to helpfully print out all the AGIs plans for de­cep­tion. This could mean, e.g., try­ing to have the calcu­la­tions oc­cur some­where else or in a weird side-mode that the tools can’t read; or mak­ing the thoughts look like they’re about some­thing else; or bury­ing any warn­ing signs un­der an avalanche of false alarms too long for the pro­gram­mers to man­u­ally check all of them; or mak­ing the visi­ble por­tions of the thoughts look in­nocu­ous. (“Oh, it’s try­ing to figure out what be­hav­ior we want to see! That’s good, right?”)

Hope­fully, the in­ten­tion to figure out cog­ni­tive steganog­ra­phy and im­ple­ment it, would oc­cur be­fore any cog­ni­tive steganog­ra­phy had ac­tu­ally been im­ple­mented. This means that per­haps the most crit­i­cal AI thought-mon­i­tor­ing tool, es­pe­cially for mon­i­tor­ing an AGI that is or might be able to model hu­man psy­chol­ogy, is the one that tries to de­tect a form­ing in­ten­tion to defeat the thought-mon­i­tor­ing tool. Similarly, so long as the AI’s state is perfectly re­pro­ducible, one could in prin­ci­ple go back and use more so­phis­ti­cated later thought-mon­i­tor­ing tools to try to de­tect whether any in­ten­tion to­ward cog­ni­tive steganog­ra­phy had oc­curred months ear­lier.