Unforeseen maximum

An unforeseen maximum of a utility function (or other preference framework) is when, e.g., you tell the AI to produce smiles, thinking that the AI will make people happy in order to produce smiles. But unforeseen by you, the AI has an alternative for making even more smiles, which is to convert all matter within reach into tiny molecular smileyfaces.

In other words, you’re proposing to give the AI a goal \(U\), because you think \(U\) has a maximum around some nice options \(X.\) But it turns out there’s another option \(X'\) you didn’t imagine, with \(X' >_U X,\) and \(X'\) is not so nice.

Unforeseen maximums are argued to be a foreseeable difficulty of AGI alignment, if you try to identify nice policies by giving a simple criterion \(U\) that, so far as you can see, seems like it’d be best optimized by doing nice things.

Slightly more semiformally, we could say that “unforeseen maximum” is realized as a difficulty when:

  1. A programmer thinking about a utility function \(U\) considers policy options \(\pi_i \in \Pi_N\) and concludes that of these options the policy with highest \(\mathbb E [ U | \pi_i ]\) is \(\pi_1,\) and hence a \(U\)-maximizer will probably do \(\pi_1.\)

  2. The programmer also thinks that their own criterion of goodness \(V\) will be promoted by \(\pi_1,\) that is, \(\mathbb E [ V | \pi_1 ] > \mathbb E [ V ]\) or “$\pi_1$ is beneficial”. So the programmer concludes that it’s a great idea to build an AI that optimizes for \(U.\)

  3. Alas, the AI is searching a policy space \(\Pi_M,\) which although it does contain \(\pi_1\) as an option, also contains an attainable option \(\pi_0\) which programmer didn’t consider, with \(\mathbb E [ U | \pi_0 ] > \mathbb E [ U | \pi_1 ].\) This is a problem if \(\pi_0\) produces much less \(V\)-benefit than \(\pi_1\) or is outright detrimental.

That is:

$$\underset{\pi_i \in \Pi_N}{\operatorname {argmax}} \ \mathbb E [ U | \pi_i ] = \pi_1$$

$$\underset{\pi_k \in \Pi_M}{\operatorname {argmax}} \ \mathbb E [ U | \pi_k ] = \pi_0$$

$$\mathbb E [ V | \pi_0 ] \ll \mathbb E [ V | \pi_1 ]$$

Example: Schmidhuber’s compression goal.

Juergen Schmidhuber of IDSIA, during the 2009 Singularity Summit, gave a talk proposing that the best and most moral utility function for an AI was the gain in compression of sensory data over time. Schmidhuber gave examples of valuable behaviors he thought this would motivate, like doing science and understanding the universe, or the construction of art and highly aesthetic objects.

Yudkowsky in Q&A suggested that this utility function would instead motivate the construction of external objects that would internally generate random cryptographic secrets, encrypt highly regular streams of 1s and 0s, and then reveal the cryptographic secrets to the AI.

Translating into the above schema:

  1. Schmidhuber, considering the utility function \(U\) of “maximize gain in sensory compression”, thought that option \(\pi_1\) of “do art and science” would be the attainable maximum of \(U\) within all options \(\Pi_N\) that Schmidhuber considered.

  2. Schmidhuber also considered the option \(\pi_1\) “do art and science” to achieve most of the attainable value under his own criterion of goodness \(V\).

  3. However, while the AI’s option space \(\Pi_M\) would indeed include \(\pi_1\) as an option, it would also include the option \(\pi_0\) of “have an environmental object encrypt streams of 1s or 0s and then reveal the key” which would score much higher under \(U\), and much lower under \(V.\)

Relation to other foreseeable difficulties

Context disaster implies an unforeseen maximum may come as a surprise, or not show up during the development phase, because during the development phase the AI’s options are restricted to some \(\Pi_L \subset \Pi_M\) with \(\pi_0 \not\in \Pi_L.\)

Indeed, the pseudo-formalization of a “type-1 context disaster” is isomorphic to the pseudoformalization of “unforeseen maximum”, except that in a context disaster, \(\Pi_N\) and \(\Pi_M\) are identified with “AI’s options during development” and “AI’s options after a capability gain”. (Instead of “Options the programmer is thinking of” and “Options the AI will consider”.)

The two concepts are conceptually distinct because, e.g:

  • A context disaster could also apply to a decision criterion learned by training, not just a utility function envisioned by the programmer.

  • It’s an unforeseen maximum but not a context disaster if the programmer is initially reasoning, not that the AI has already been observed to be beneficial during a development phase, but rather that the AI ought to be beneficial when it optimizes \(U\) later because of the supposed nice maximum at \(\pi_1\).

If we hadn’t observed what seem like clear-cut cases of some actors in the field being blindsided by unforeseen maxima in imagination, we’d worry less about actors being blindsided by context disasters over observations.

Edge instantiation suggests that the real maxima of non-$V$ utility functions will be “strange, weird, and extreme” relative to our own \(V\)-views on preferable options.

Missing the weird alternative suggests that people may psychologically fail to consider alternative agent options \(\pi_0\) that are very low in \(V,\) because the human search function looks for high-$V$ and normal policies. In other words, that Schmidhuber didn’t generate “encrypt streams of 1s or 0s and then reveal the key” because this policy was less attractive to him than “do art and science” and because it was weird.

Nearest unblocked strategy suggests that if you try to add a penalty term to exclude \(\pi_0\), the next-highest \(U\)-ranking option will often be some similar alternative \(\pi_{0.01}\) which still isn’t nice.

fragile value asserts that our true criterion of goodness \(V\) is narrowly peaked within the space of all achievable outcomes for a superintelligence, such that we rapidly fall off in \(V\) as we move away from the peak. Complexity of value says that \(V\) and its corresponding peak have high algorithmic complexity. Then the peak outcomes identified by any simple object-level \(U\) will systematically fail to find \(V\). It’s like trying to find a 1000-byte program which will approximately reproduce the text of Shakespeare’s Hamlet; algorithmic information theory says that you just shouldn’t expect to find a simple program like that.

apple pie problem raises the concern that some people may have psychological trouble accepting the “But \(\pi_0\)” critique even after it is pointed out, because of their ideological attachment to a noble goal \(U\) (probably actually noble!) that would be even more praiseworthy if \(U\) could also serve as a complete utility function for an AGI (which it unfortunately can’t).

Implications and research avenues

Conservatism in goal concepts can be seen as trying to directly tackle the problem of unforeseen maxima. More generally, AI approaches which work on “whitelisting conservative boundaries around approved policy spaces” instead of “search the widest possible policy space, minus some blacklisted parts”.

The Task paradigm for advanced agents concentrates on trying to accomplish some single pivotal act which can be accomplished by one or more tasks of limited scope. Combined with other measures, this might make it easier to identify an adequate safe plan for accomplishing the limited-scope task, rather than needing to identify the fragile peak of \(V\) within some much larger landscape. The Task AGI formulation is claimed to let us partially “narrow down” the scope of the necessary \(U\), the part of \(V\) that’s relevant to the task, and the searched policy space \(\Pi\) to what is only adequate. This might reduce or meliorate, though not by itself eliminate, unforeseen maxima.

Mild optimization can be seen as “not trying so hard, not shoving all the way to the maximum”—the hope is that when combined with a Task paradigm plus other measures like conservative goals and strategies, this will produce less optimization pressure toward weird edges and unforeseen maxima. (This method is not adequate on its own because an arbitrary adequate-$U$ policy may still not be high-$V$, ceteris paribus.)

Imitation-based agents try to maximize similarity to a reference human’s immediate behavior, rather than trying to optimize a utility function.

The prospect of being tripped up by unforeseen maxima, is one of the contributing motivations for giving up on hand-coded object-level utilities in favor of meta-level preference frameworks that learn a utility function or decision rule. (Again, this doesn’t seem like a full solution by itself, only one ingredient to be combined with other methods. If the utility function is a big complicated learned object, that by itself is not a good reason to relax about the possibility that its maximum will be somewhere you didn’t foresee, especially after a capabilities boost.)

Missing the weird alternative and the apple pie problem suggest that it may be unusually difficult to explain to actors why \(\pi_0 >_U \pi_1\) is a difficulty of their favored utility function \(U\) that allegedly implies nice policy \(\pi_1.\) That is, for psychological reasons, this difficulty seems unusually likely to actually trip up sponsors of AI projects or politically block progress on alignment.


  • Missing the weird alternative

    People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.