Don't try to solve the entire alignment problem

On first encountering the alignment problem for advanced agents, aka “robust and beneficial AGI”, aka “Friendly AI”, a very common approach is to try to come up with a single idea that solves all of AI alignment. A simple design concept; a simple utility function; a simple development strategy; one guideline for everyone to adhere to; or a large diagram full of boxes with lines to other boxes; that is allegedly sufficient to realize around as much benefit from beneficial superintelligences as can possibly be realized.

Without knowing the details of your current idea, this article can’t tell you why it’s wrong—though frankly we’ve got a strong prior against it at this point. But some very standard advice would be:

  • Glance over what current discussants think of as standard challenges and difficulties of the overall problem, i.e., why people think alignment might be hard, and what standard questions a new approach would face.

  • Consider focusing your attention down on a single subproblem of alignment, and trying to make progress there—not necessarily solve it completely, but contribute nonobvious knowledge about the problem that wasn’t there before. (If you have a broad new approach that solves all of alignment, maybe you could walk through exactly how it solves one crisply identified subproblem?)

  • Check out the flaws in previous proposals that people currently think won’t work. E.g., various versions of utility indifference.

A good initial goal is not “persuade everyone in the field to agree with a new idea” but rather “come up with a contribution to an open discussion that is sufficiently crisply stated that, if it were in fact wrong, it would be possible for somebody else to shoot it down today.” I.e., an idea such that if you’re wrong, this can be pointed out in the form of a crisply derivable consequence of a crisply specified idea, rather than it taking 20 years to see what happens. For there to be sustained progress, propositions need to be stated modularly enough and crisply enough that there can be a conversation about them that goes beyond “does not / does too”—ideas need to be stated in forms that have sufficiently clear and derivable consequences that if there’s a problem, people can see the problem and agree on it.

Alternatively, poke a clearly demonstrable flaw in some solution currently being critiqued. Since most proposals in alignment theory get shot down, trying to participate in the critiquing process has a great advantage over trying to invent solutions, in that you’ll probably have started with the true premise “proposal X is broken or incomplete” rather than the false premise “proposal X works and solves everything”.

Psychologizing a little about why people might try to solve all of alignment theory in one shot, one might recount Robyn Dawes’s advice that:

  • Research shows that people come up with better solutions when they discuss the problem as thoroughly as possible before discussing any answers.

  • Dawes has observed that people seem more likely to violate this principle as the problem becomes more difficult.

…and finally remark that building a nice machine intelligence correctly on the first try must be pretty darned difficult, since so many people solve it in the first 15 seconds.

It’s possible that everyone working in this field is just missing the obvious and that there is some simple idea which solves all the problems. But realistically, you should be aware that everyone in this field has already heard a dozen terrible Total Solutions, and probably hasn’t had anything fun happen as a result of discussing them, resulting in some amount of attentional fatigue. (Similarly: If not everyone believes you, or even if it’s hard to get people to listen to your solution instead of talking with people they already know, that’s not necessarily because of some deep-seated psychological problem on their part, such as being uninterested in outsiders’ ideas. Even if you’re not an obvious crank, people are still unlikely to take the time out to engage with you unless you signal awareness of what they think are the usual issues and obstacles. It’s not so different here from other fields.)


  • AI safety mindset

    Asking how AI designs could go wrong, instead of imagining them going right.