Correlated coverage

“Correlated coverage” occurs within a domain when (going to some lengths to avoid words like “competent” or “correct”) an advanced agent handling some large number of domain problems the way we want means that the AI is likely to handle all problems in the domain the way we want.

To see the difference between correlated and uncorrelated coverage, consider humans as general epistemologists versus the Complexity of value problem.

In Complexity of value, there’s Humean freedom and multiple fixed points when it comes to “Which outcomes rank higher than which other outcomes?” All the terms in Frankena’s list of desiderata have their own Humean freedom as to the details. An agent can decide 1000 issues the way we want, issues that happen to shadow 12 terms in our complex values, so that covering the answers we want pins down 12 degrees of freedom; and then it turns out there’s a 13th degree of freedom that isn’t shadowed in those 1000 issues, because later problems are not drawn from the same barrel as prior problems. In that case the answer on the 1001st issue, which does turn on that 13th degree of freedom, isn’t pinned down by correlation with the coverage of the first 1000 issues. Coverage on the first 1000 queries may not correlate with coverage on the 1001st query.
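To make the degrees-of-freedom picture concrete, here is a toy numerical sketch (my illustration, not anything from the original text; a linear model is standing in for anything as rich as human values). A thousand training queries that happen to exercise only 12 of 13 dimensions pin those 12 down exactly, while the 13th lands wherever the fitting procedure happens to put it, so the answer on a 1001st query that turns on that dimension is not pinned down by the training data at all:

```python
import numpy as np

rng = np.random.default_rng(0)

# The weighting over 13 latent value "terms" that we would want an agent to learn.
true_weights = rng.normal(size=13)

# 1000 training issues that only ever exercise the first 12 terms:
# the 13th feature never varies in the training distribution.
X_train = np.zeros((1000, 13))
X_train[:, :12] = rng.normal(size=(1000, 12))
y_train = X_train @ true_weights

# Least-squares fit: the first 12 weights are pinned down by the data,
# but any value at all for the 13th weight fits the training set equally well.
# (lstsq returns the minimum-norm solution, which simply sets it to zero.)
learned_weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# A 1001st issue that does turn on the 13th degree of freedom.
x_query = np.zeros(13)
x_query[12] = 1.0

print("max training error:", np.abs(X_train @ learned_weights - y_train).max())
print("answer we wanted on issue 1001:", true_weights @ x_query)
print("answer the fit gives on issue 1001:", learned_weights @ x_query)
```

The training error comes out essentially zero even though the answer on the held-out query is wrong, which is the sense in which coverage on the first 1000 queries fails to correlate with coverage on the 1001st.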

When it comes to Epistemology, there’s something like a central idea: Bayesian updating plus a simplicity prior. Although not every human can solve every epistemic question, there’s nonetheless a sense in which humans, having been optimized to run across the savanna and figure out which plants were poisonous and which of their political opponents might be plotting against them, were later able to figure out General Relativity despite not having been explicitly selected on for solving that problem. If we include human subagents in our notion of what problems, in general, human beings can be said to cover, then any question of fact where we can get a correct answer by building a superintelligence to solve it for us is in some sense “covered” by humans as general epistemologists.
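To spell out what “Bayesian updating plus simplicity prior” means in miniature, here is a minimal sketch on a made-up hypothesis space (the hypotheses and their description lengths below are invented for illustration): weight each hypothesis by two to the minus its description length, multiply by the likelihood of the observed evidence, and renormalize.

```python
import numpy as np

# Toy hypothesis space about a coin's bias; the description lengths ("bits")
# are invented stand-ins for each hypothesis's complexity.
hypotheses = {
    "fair coin, p = 0.5":  {"p": 0.5, "bits": 1},
    "mild bias, p = 0.6":  {"p": 0.6, "bits": 4},
    "heavy bias, p = 0.9": {"p": 0.9, "bits": 6},
}

# Simplicity prior: weight each hypothesis by 2^(-description length).
prior = np.array([2.0 ** -h["bits"] for h in hypotheses.values()])
prior /= prior.sum()

# Observed evidence: 7 heads in 10 flips.
heads, flips = 7, 10
likelihood = np.array([
    h["p"] ** heads * (1.0 - h["p"]) ** (flips - heads)
    for h in hypotheses.values()
])

# Bayesian updating: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

for name, prob in zip(hypotheses, posterior):
    print(f"P({name} | data) = {prob:.3f}")
```

The whole procedure fits in a dozen lines, which is the sense in which epistemology has a simple central core in a way that, as argued above, human values do not.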

Human neurology is big and complicated and involves many different brain areas, and we had to go through a long process of bootstrapping our epistemology by discovering and choosing to adopt cultural rules about science. Even so, the fact that there’s something like a central tendency or core or simple principle of “Bayesian updating plus simplicity prior” means that when natural selection built brains to figure out who was plotting what, it accidentally built brains that could figure out General Relativity.

We can see other parts of value alignment in the same light, trying to find places, problems to tackle, where there may be correlated coverage:

The reason to work on ideas like Safe impact measure is the hope that there’s something like a core idea for “Try not to impact unnecessarily large amounts of stuff” in a way that there isn’t a core idea for “Try not to do anything that decreases value.”

The hope that anapartistic reasoning could be a general solution to Corrigibility says, “Maybe there’s a core central idea that covers everything we mean by an agent B letting agent A correct it; like, if we really honestly wanted to let someone else correct us and not mess with their safety measures, it seems like there’s a core thing for us to want that doesn’t go through all the Humean degrees of freedom in humane value.” This doesn’t mean there’s a short program that encodes all of anapartistic reasoning, but it does mean there’s more reason to hope that if you get 100 problems right, and then the next 1000 problems come out right without further tweaking, and it looks like there’s a central core idea behind it and that core looks like anapartistic reasoning, maybe you’re done.

Do What I Know I Mean similarly incorporates a hope that, even if it’s not simple and there isn’t a short program that encodes it, there’s something like a core or a center to the notion of “Agent X does what Agent Y asks while modeling Agent Y and trying not to do things whose consequences it isn’t pretty sure Agent Y will be okay with” where we can get correlated coverage of the problem with less complexity than it would take to encode values directly.

From the standpoint of the AI safety mindset, understanding the notion of correlated coverage and its complementary problem of patch resistance is what leads to traversing the gradient from:

  • “Oh, we’ll just hardwire the AI’s utility function to tell it not to kill people.”

To:

  • “Of course there’ll be an extended period where we have to train the AI not to do various sorts of bad things.”

To:

  • “‘Bad impact’ isn’t a compact category, and the training data may not capture everything that could be a bad impact, especially if the AI becomes smarter than it was during the phase in which it was trained. But maybe the notion of being low impact in general (rather than blacklisting particular bad impacts) has a simple enough core to be passed on by training or specification in a way that generalizes across sharp capability gains.”
