Patch resistance

A proposed foreseeable difficulty of aligning advanced agents is furthermore proposed to be “patch-resistant” if the speaker thinks that most simple or naive solutions will fail to resolve the difficulty and just regenerate it somewhere else.

To call a problem “patch-resistant” is not to assert that it is unsolvable, but it does mean the speaker is cautioning against naive or simple solutions.

On most occasions so far, alleged cases of patch-resistance are said to stem from one of two central sources:

Instrumental-convergence patch-resistance

Example: Suppose you want your AI to have a shutdown button:

  • You first try to achieve this by writing a shutdown function into the AI’s code.

  • After the AI becomes self-modifying, it deletes the code because it is (convergently) the case that the AI can accomplish its goals better by not being shut down.

  • You add a patch to the utility function giving the AI minus a million points if the AI deletes the shutdown function or prevents it from operating.

  • The AI responds by writing a new function that reboots the AI after the shutdown completes, thus technically not preventing the shutdown.

  • You respond by again patching the AI’s utility function to give the AI minus a million points if it continues operating after the shutdown.

  • The AI builds an environmental subagent that will accomplish the AI’s goals while the AI itself is technically “shut down”.

This is the first sort of patch resistance, the sort alleged to arise from attempts to defeat an instrumental convergence with simple patches meant to get rid of one observed kind of bad behavior. After one course of action is blocked by a specific obstacle, the next-best course of action remaining is liable to be highly similar to the one that was just blocked.

Complexity-of-value patch-resistance


  • You want your AI to accomplish good in the world, which is presently highly correlated with making people happy. Happiness is presently highly correlated with smiling. You build an AI that tries to achieve more smiling.

  • After the AI proposes to force people to smile by attaching metal pins to their lips, you realize that this current empirical association of smiling and happiness doesn’t mean that maximum smiling must occur in the presence of maximum happiness.

  • Although it’s much more complicated to infer, you try to reconfigure the AI’s utility function to be about a certain class of brain states that has previously in practice produced smiles.

  • The AI successfully generalizes the concept of pleasure, and begins proposing policies to give people heroin.

  • You try to add a patch excluding artificial drugs.

  • The AI proposes a genetic modification producing high levels of endogenous opiates.

  • You try to explain that what’s really important is not forcing the brain to experience pleasure, but rather, people experiencing events that naturally cause happiness.

  • The AI proposes to put everyone in the Matrix…

Since the programmer-intended concept is actually highly complicated, simple concepts will systematically fail to have their optimum at the same point as the complex intended concept. By the fragility of value, the optimum of the simple concept will almost certainly not be a high point of the complex intended concept. Since most concepts are not surprisingly compressible, there probably isn’t any simple concept whose maximum identifies that fragile peak of value. This explains why we would reasonably expect problems of perverse instantiation to pop up over and over again, the optimum of the revised concept moving to a new weird extreme each time the programmer tries to hammer down the next weird alternative the AI comes up with.

In other words: There’s a large amount of algorithmic information or many independent reflectively consistent degrees of freedom in the correct answer, the plans we want the AI to come up with, but we’ve only given the AI relatively simple concepts that can’t identify those plans.

Analogues in the history of AI

The result of trying to tackle overly general problems using AI algorithms too narrow for those general problems, usually appears in the form of an infinite number of special cases with a new special case needing to be handled for every problem instance. In the case of narrow AI algorithms tackling a general problem, this happens because the narrow algorithm, being narrow, is not capable of capturing the deep structure of the general problem and its solution.

Suppose that burglars, and also earthquakes, can cause burglar alarms to go off. Today we can represent this kind of scenario using a Bayesian network or causal model which will compactly yield probabilistic inferences along the lines of, “If the burglar alarm goes off, that probably indicates there’s a burglar, unless you learn there was an earthquake, in which case there’s probably not a burglar” and “If there’s an earthquake, the burglar alarm probably goes off.”

During the era where everything in AI was being represented by first-order logic and nobody knew about causal models, people devised increasingly intricate “nonmonotonic logics” to try to represent inference rules like (simultaneously) \(alarm \rightarrow burglar, \ earthquake \rightarrow alarm,\) and \((alarm \wedge earthquake) \rightarrow \neg burglar.\) But first-order logic wasn’t naturally a good surface fit to the set of inferences needed, and the AI programmers didn’t know how to compactly capture the structure that causal models capture. So the “nonmonotonic logic” approach proliferated an endless nightmare of special cases.

Cognitive problems like “modeling causal phenomena” or “being good at math” (aka understanding which mathematical premises imply which mathematical conclusions) might be general enough to defeat modern narrow-AI algorithms. But these domains still seem like they should have something like a central core, leading us to expect correlated coverage of the domain in sufficiently advanced agents. You can’t conclude that because a system is very good at solving arithmetic problems, it will be good at proving Fermat’s Last Theorem. But if a system is smart enough to independently prove Fermat’s Last Theorem and the Poincare Conjecture and the independence of the Axiom of Choice in Zermelo-Frankel set theory, it can probably also—without further handholding—figure out Godel’s Theorem. You don’t need to go on programming in one special case after another of mathematical competency. The fact that humans could figure out all these different areas, without needing to be independently reprogrammed for each one by natural selection, says that there’s something like a central tendency underlying competency in all these areas.

In the case of complexity of value, the thesis is that there are many independent reflectively consistent degrees of freedom in our intended specification of what’s good, bad, or best. Getting one degree of freedom aligned with our intended result doesn’t mean that other degrees of freedom need to align with our intended result. So trying to “patch” the first simple specification that doesn’t work, is likely to result in a different specification that doesn’t work.

When we try to use a narrow AI algorithm to attack a problem which has a central tendency requiring general intelligence to capture, or at any rate requiring some new structure that the narrow AI algorithm can’t handle, we’re effectively asking the narrow AI algorithm to learn something that has no simple structure relative to that algorithm. This is why early AI researchers’ experience with “lack of common sense” that you can’t patch with special cases may be foreseeably indicative of how frustrating it would be, in practice, to repeatedly try to “patch” a kind of difficulty that we may foreseeably need to confront in aligning AI.

That is: Whenever it feels to a human like you want to yell at the AI for its lack of “common sense”, you’re probably looking at a domain where trying to patch that particular AI answer is just going to lead into another answer that lacks “common sense”. Previously in AI history, this happened because real-world problems had no simple central learnable solution relative to the narrow AI algorithm. In value alignment, something similar could happen because of the complexity of our value function, whose evaluations also feel to a human like “common sense”.

Relevance to alignment theory

Patch resistance, and its sister issue of lack of correlated coverage, is a central reason why aligning advanced agents could be way harder, way more dangerous, and way more likely to actually kill everyone in practice, compared to optimistic scenarios. It’s a primary reason to worry, “Uh, what if aligning AI is actually way harder than it might look to some people, the way that building AGI in the first place turned out not to be something you could do in two months over the summer?”

It’s also a reason to worry about context disasters revolving around capability gains: Anything you had to patch-until-it-worked at AI capability level \(k\) is probably going to break hard at capability \(l \gg k.\) This is doubly catastrophic in practice if the pressures to “just get the thing running today” are immense.

To the extent that we can see the central project of AI alignment as revolving around finding a set of alignment ideas that do have simple central tendencies and are specifiable or learnable which together add up to a safe but powerful AI—that is, finding domains with correlated coverage that add up to a safe AI that can do something pivotal—we could see the central project of AI alignment as finding a collectively good-enough set of safety-things we can do without endless patching.


  • Unforeseen maximum

    When you tell AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)


  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.