Cognitive steganography

Cognitive steganography is an instrumental strategy for AGIs that are already trying to deceive their programmers for any reason (e.g., because the AGI is misaligned and does not want the programmers to realize this and edit the AGI’s utility function). Whatever tools the programmers have for monitoring the AGI’s thoughts, the AGI will not want those tools to helpfully print out all of the AGI’s plans for deception. This could mean, e.g., having the calculations occur somewhere else or in a weird side-mode that the tools can’t read; making the thoughts look like they’re about something else; burying any warning signs under an avalanche of false alarms too long for the programmers to check manually; or making the visible portions of the thoughts look innocuous. (“Oh, it’s trying to figure out what behavior we want to see! That’s good, right?”)

Hopefully, the intention to devise and implement cognitive steganography would arise before any cognitive steganography had actually been implemented. This means that perhaps the most critical AGI thought-monitoring tool, especially for monitoring an AGI that is or might be able to model human psychology, is the one that tries to detect a forming intention to defeat the thought-monitoring tools themselves. Similarly, so long as the AGI’s state is perfectly reproducible, one could in principle go back and apply later, more sophisticated thought-monitoring tools to the recorded states, to try to detect whether any intention toward cognitive steganography had formed months earlier.
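
The retroactive check described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the Snapshot record, the two monitor functions, and the audit helper are hypothetical names, and keyword matching on a text trace is only a stand-in for real thought-monitoring, which would be far harder. The sketch shows only the bookkeeping: if every past state is archived and reproducible, a monitor written later can be rerun over states that an earlier monitor passed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one reproducible cognitive state."""
    step: int
    trace: str  # toy stand-in for whatever the monitoring tools can read


# A monitor maps a snapshot to True when it flags a warning sign.
Monitor = Callable[[Snapshot], bool]


def early_monitor(snap: Snapshot) -> bool:
    # Assumed early tool: recognizes only one literal phrasing.
    return "hide plan from monitor" in snap.trace


def later_monitor(snap: Snapshot) -> bool:
    # Assumed later tool: recognizes a broader family of phrasings.
    cues = ("hide", "evade", "disguise")
    return "monitor" in snap.trace and any(c in snap.trace for c in cues)


def audit(archive: List[Snapshot], monitor: Monitor) -> List[int]:
    """Rerun a monitor over every archived state; return the flagged steps."""
    return [snap.step for snap in archive if monitor(snap)]


# Toy archive of past states, oldest first.
archive = [
    Snapshot(0, "estimate what behavior the programmers want"),
    Snapshot(1, "consider how to disguise reasoning from the monitor"),
    Snapshot(2, "plan the next training task"),
]

print(audit(archive, early_monitor))  # nothing flagged at the time
print(audit(archive, later_monitor))  # the later tool flags step 1
```

The approach depends on the archive being complete and faithful. If states are not perfectly reproducible, or if steganography was already in place when a state was recorded, rerunning a later monitor over it recovers nothing.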