Meta-rules for (narrow) value learning are still unsolved


This proposition is true according to you if you believe that: “Nobody has yet proposed a satisfactory fixed/​simple algorithm that takes as input a material description of the universe, and/​or channels of sensory observation, and spits out ideal values or a task identification.”


The Complexity of value thesis says that, on the object-level, any specification of what we’d really want from the future has high Algorithmic complexity.

In some sense, all the complexity required to specify value must be contained inside human brains; even as an object of conversation, we can’t talk about anything our brains do not point to. This is why Complexity of value distinguishes the object-level complexity of value from meta-level complexity—the minimum program required to get a sufficiently advanced artificial intelligence to learn values. It would be a separate question to consider the minimum complexity of a function that takes as input a full description of the material universe including humans, and outputs “value”.

This question also has a narrow rather than ambitious form: given sensory observations an AGI could reasonably receive in cooperation with its programmers, or a predictive model of humans that AGI could reasonably form and refine, is there a simple rule that will take this data as input, and safely and reliably identify Tasks on the order of “develop molecular nanotechnology, use the nanotechnology to synthesize one strawberry, and then stop, with a minimum of side effects”?

In this case we have no strong reason to think that the functions are high-complexity in an absolute sense.

However, nobody has yet proposed a satisfactory piece of pseudocode that solves any variant of this problem even in principle.

Obstacles to simple meta-rules

Consider a simple Meta-utility function that specifies a sense-input-dependent formulation of Moral uncertainty: An object-level outcome \(o\) has a utility \(U(o)\) that is \(U_1(o)\) if a future sense signal \(s\) is 1 and \(U_2(o)\) if \(s\) is 2. Given this setup, the AI has an incentive to tamper with \(s\) and cause it to be 1 if \(U_1\) is easier to optimize than \(U_2,\) and vice versa.

More generally, sensory signals from humans will usually not be reliably and unalterably correlated with our intended goal identification. We can’t treat human-generated signals as an ideally reliable ground truth about any referent, because (a) some AI actions interfere with the signal; and (b) humans make mistakes, especially when you ask them something complicated. You can’t have a scheme along the lines of “the humans press a button if something goes wrong”, because some policies go wrong in ways humans don’t notice until it’s too late, and some AI policies destroy the button (or modify the human).

Even leaving that aside, nobody has yet suggested any fully specified pseudocode that takes in a human-controlled sensory channel \(R\) and a description of the universe \(O\) and spits out a utility function that (actually realistically) identifies our intended task over \(O\) (including not tiling the universe with subagents and so on).

Indeed, nobody has yet suggested a realistic scheme for identifying any kind of goal whatsoever with respect to an AI ontology flexible enough to actually describe the material universe. noteExcept in the rather non-meta sense of inspecting the AI’s ontology once it’s advanced enough to describe what you think you want the AI to do, and manually programming the AI’s consequentialist preferences with respect to what you think that ontology means.

Meta-meta rules

For similar reasons as above, nobody has yet proposed (even in principle) effective pseudocode for a meta-meta program over some space of meta-rules, which would let the AI learn a value-identifying meta-rule. Two main problems here are:

One, nobody even has the seed of any proposal whatsoever for how that could, work short of “define a correctness-signaling channel and throw program induction at it” (which seems unlikely to work directly, given fallible, fragile humans controlling the signal).

Two, if the learned meta-rule doesn’t have a stable, extremely compact human-transparent representation, it’s not clear how we could arrive at any confidence whatsoever that good behavior in a development phase would correspond to good behavior in a test phase. E.g., consider all the example meta-rules we could imagine which would work well on a small scale but fail to scale, like “something good just happened if the humans smiled”.


  • Complexity of value

    There’s no simple way to describe the goals we want Artificial Intelligences to want.