Correlated coverage

“Correlated coverage” occurs within a domain when (going to some lengths to avoid words like “competent” or “correct”) an advanced agent handling some large number of domain problems the way we want means that the agent is likely to handle all problems in the domain the way we want.

To see the difference between correlated coverage and non-correlated coverage, consider humans as general epistemologists, versus the Complexity of value problem.

In Complexity of value, there's Humean freedom and there are multiple fixed points when it comes to “Which outcomes rank higher than which other outcomes?” All the terms in Frankena's list of desiderata have their own Humean freedom as to the details. An agent can decide 1000 issues the way we want, where those issues happen to shadow 12 terms in our complex values, so that covering the answers we want pins down 12 degrees of freedom; and then it turns out there's a 13th degree of freedom that isn't shadowed in the 1000 issues, because later problems are not drawn from the same barrel as prior problems. In that case the answer on the 1001st issue, which does turn on that 13th degree of freedom, isn't pinned down by correlation with the coverage of the first 1000 issues. Coverage on the first 1000 queries may not correlate with coverage on the 1001st query.
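
As a toy sketch of this failure (the linear “values,” the feature count, and all numbers below are invented for illustration, not part of the original argument), a learner can match the desired answers on 1000 training issues while a dimension those issues never exercise stays completely unconstrained:

```python
import random

random.seed(0)

# Toy "values": a weighted sum over 13 hidden degrees of freedom.
true_weights = [1.0] * 13

def true_score(issue):
    return sum(w * x for w, x in zip(true_weights, issue))

# 1000 training issues in which the 13th dimension never varies,
# so it isn't shadowed by any training problem.
train = [[random.random() for _ in range(12)] + [0.0] for _ in range(1000)]

# A learner matching all 1000 desired answers can still assign an
# arbitrary weight to the unexercised 13th dimension -- here, 0.0.
learned_weights = [1.0] * 12 + [0.0]

def learned_score(issue):
    return sum(w * x for w, x in zip(learned_weights, issue))

# Coverage on the first 1000 issues is perfect...
assert all(abs(true_score(i) - learned_score(i)) < 1e-9 for i in train)

# ...but the 1001st issue turns on the 13th degree of freedom.
query = [0.5] * 12 + [1.0]
print(abs(true_score(query) - learned_score(query)))  # -> 1.0
```

Nothing in the training data distinguishes the learner's weights from the true ones, which is exactly the sense in which coverage of the first 1000 queries fails to correlate with coverage of the 1001st.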

When it comes to Epistemology, there's something like a central idea: Bayesian updating plus a simplicity prior. Although not every human can solve every epistemic question, there's nonetheless a sense in which humans, having been optimized to run across the savanna and figure out which plants were poisonous and which of their political opponents might be plotting against them, were later able to figure out General Relativity despite not having been explicitly selected on for solving that problem. If we include human subagents in our notion of what problems, in general, human beings can be said to cover, then any question of fact where we can get a correct answer by building a superintelligence to solve it for us is in some sense “covered” by humans as general epistemologists.

Human neurology is big and complicated and involves many different brain areas, and we had to go through a long process of bootstrapping our epistemology by discovering and choosing to adopt cultural rules about science. Even so, the fact that there's something like a central tendency or core or simple principle of “Bayesian updating plus simplicity prior” means that when natural selection built brains to figure out who was plotting what, it accidentally built brains that could figure out General Relativity.
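
The “Bayesian updating plus simplicity prior” core can be sketched in a few lines. In this toy version (the hypotheses, their “description lengths,” and the coin-flip data are all invented for illustration), each hypothesis gets prior weight proportional to 2^-length and is then updated on the observed evidence:

```python
# Toy hypotheses about a coin's bias, each tagged with a "description
# length" in bits; prior weight ~ 2^-length is a crude simplicity prior.
hypotheses = {
    "fair (p=0.5)":    {"p": 0.5,  "length": 1},
    "biased (p=0.9)":  {"p": 0.9,  "length": 3},
    "biased (p=0.99)": {"p": 0.99, "length": 5},
}

def posterior(flips):
    """Bayesian update of the simplicity prior on a list of flips (1 = heads)."""
    scores = {}
    for name, h in hypotheses.items():
        prior = 2.0 ** -h["length"]
        likelihood = 1.0
        for flip in flips:
            likelihood *= h["p"] if flip == 1 else 1.0 - h["p"]
        scores[name] = prior * likelihood
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# Two heads: the simpler "fair" hypothesis still dominates.
print(posterior([1, 1]))
# Twenty heads: the evidence overwhelms the simplicity penalty.
print(posterior([1] * 20))
```

The point of the sketch is that one short rule, applied uniformly, covers an open-ended range of questions; no hypothesis-specific patching is needed as new evidence arrives.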

We can see other parts of value alignment in the same light, looking for places and problems to tackle where there may be correlated coverage:

The reason to work on ideas like Safe impact measure is that we might hope there's something like a core idea for “Try not to impact unnecessarily large amounts of stuff” in a way that there isn't a core idea for “Try not to do anything that decreases value.”

The hope that anapartistic reasoning could be a general solution to Corrigibility says, “Maybe there's a core central idea that covers everything we mean by an agent B letting agent A correct it: if we really honestly wanted to let someone else correct us and not mess with their safety measures, it seems like there's a core thing for us to want that doesn't go through all the Humean degrees of freedom in humane value.” This doesn't mean there's a short program that encodes all of anapartistic reasoning, but it does mean there's more reason to hope that if you get 100 problems right, and the next 1000 problems then come out right without further tweaking, and there looks to be a central core idea behind that performance, and the core thing looks like anapartistic reasoning, maybe you're done.

Do What I Know I Mean similarly incorporates a hope that, even if it's not simple and there isn't a short program that encodes it, there's something like a core or a center to the notion of “Agent X does what Agent Y asks while modeling Agent Y and trying not to do things whose consequences it isn't pretty sure Agent Y will be okay with” where we can get correlated coverage of the problem with less complexity than it would take to encode values directly.

From the standpoint of the AI safety mindset, understanding the notion of correlated coverage and its complementary problem of patch resistance is what leads to traversing the gradient from:

  • “Oh, we'll just hardwire the AI's utility function to tell it not to kill people.”

  • “Of course there'll be an extended period where we have to train the AI not to do various sorts of bad things.”

  • “‘Bad impact’ isn't a compact category, and the training data may not capture everything that could be a bad impact, especially if the AI gets smarter than it was during the phase in which it was trained. But maybe the notion of being low impact in general (rather than blacklisting particular bad impacts) has a simple-enough core to be passed on by training or specification in a way that generalizes across sharp capability gains.”

