Strong cognitive uncontainability


Suppose somebody from the 10th century were asked how somebody from the 20th century might cool their house. While they would be able to understand the problem and offer some solutions, maybe even clever solutions (“Locate your house someplace with cooler weather”, “Divert water from the stream to flow through your living room”), the 20th century’s actual solution of ‘air conditioning’ is not available to them as a strategy. This is not just because they don’t think fast enough or aren’t clever enough, but because an air conditioner takes advantage of physical laws they don’t know about. Even if they somehow randomly imagined an air conditioner’s exact blueprint, they wouldn’t expect that design to operate as an air conditioner until they were told about the relation of pressure to temperature, how electricity can power a compressor motor, and so on.

By definition, a strongly uncontainable agent can conceive strategies that go through causal domains you can’t currently model, and it has options accessing those strategies; therefore it may execute high-value solutions such that, even being told the exact strategy, you would not assign those solutions high expected efficacy without being told further background facts.

At least in this sense, the 20th century is ‘strongly cognitively uncontainable’ relative to the 10th century: We can solve the problem of how to cool homes using a strategy that would not be recognizable in advance to a 10th-century observer.

Arguably, most real-world problems, if we today addressed them using the full power of modern science and technology (i.e., if we were willing to spend a lot of money on tech and maybe run a prediction market on the relevant facts), would have best solutions that couldn’t be verified in the 10th century.

We can imagine a cognitively powerful agent being strongly uncontainable in some domains but not others. Since every cognitive agent is containable on formal games of tic-tac-toe (at least so far as we can imagine, and so long as there isn’t a real-world opponent to manipulate), strong uncontainability cannot be a universal property of an agent across all formal and informal domains.
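
The tic-tac-toe claim can be checked by exhaustive search. The following is a minimal sketch, not part of the original argument: the names `winner` and `game_value` and the string encoding of the board are assumptions made for illustration. It computes the minimax value of the empty board and finds that it is a draw, so a player who follows the minimax policy cannot be beaten inside the formal game, however capable the opponent is.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board whose cells are indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def game_value(board, player):
    """Minimax value from X's point of view: +1 X wins, 0 draw, -1 O wins.

    `board` is a 9-character string of 'X', 'O' and ' ' (an illustrative
    encoding); `player` is the side to move.
    """
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    if ' ' not in board:
        return 0  # board full with no winner: draw
    other = 'O' if player == 'X' else 'X'
    values = [game_value(board[:i] + player + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == ' ']
    # X picks the best outcome for X, O picks the worst outcome for X.
    return max(values) if player == 'X' else min(values)


if __name__ == "__main__":
    # Value of the empty board with X to move under perfect play.
    print(game_value(' ' * 9, 'X'))
```

The search covers every reachable position, which is only feasible because the game's option set is tiny. That smallness is the feature that makes the domain containable; the sketch says nothing about richer domains, or about an opponent who acts outside the formal game.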

General arguments

Arguments in favor of strong uncontainability tend to revolve around either:

  • The richness and partial unknownness of a particular domain. (E.g., human psychology seems very complicated and has a lot of unknown pathways, and previously discovered exploits often seemed very surprising; therefore we should expect strong uncontainability on the domain of human psychology.)

  • Outside-view induction on previous ability advantages derived from cognitive advantages. (The 10th century couldn’t contain the 20th century even though all parties involved were biological Homo sapiens; what makes us think we’re the first generation to have the real true laws of the universe in our minds?)

Arguments against strong uncontainability tend to revolve around:

  • The apparent knownness of a particular domain. (E.g., since we have observed the rules of chemistry with great precision and know their origin in the underlying molecular dynamics, we can believe that even an arbitrarily smart agent should not be able to turn lead into gold using non-radioactive chemical reagents.)

  • Backward reasoning from the Fermi Paradox, which gives us weak evidence bounding the capabilities of the most powerful agents possible in our universe. (E.g., even though there might be surprises remaining in the question of how to standardly model physics, any surprise yielding Faster-Than-Light travel to a previously un-traveled point makes the Fermi Paradox harder to explain.)

Key propositions

  • Can Oracles be contained inside a computational sandbox? That is, is there some restriction of input-output channels and of other environmental interactions such that:

      • The richness of the ‘human psychology’ domain is averted;

      • Remaining causal interactions with the outside universe have an option set too small and flat to contain interesting options.

  • How solid is our current knowledge of the physical universe?

      • To what extent should we expect an advanced agency (e.g. machine superintelligences a million years later) to be boundable using our present physical understanding?

      • Can we reasonably rule out unknown physical domains being accessed by a computationally sandboxed AI?

      • What is the highest reasonable probability that could, under optimal conditions, be assigned to having genuinely contained an AI inside a computational sandbox, if it is not allowed any rich output channels? Is it more like 20% or 80%?

  • Are there useful domains conceptually closed to humans’ internal understanding?

      • Will a machine superintelligence have ‘power we know not’ in the sense that it can’t be explained to us even after we’ve seen it (except in the trivial sense that we could simulate another mind understanding it using external storage and Turing-like rules), as with a chimpanzee encountering an air conditioner?

