Meta-rules for (narrow) value learning are still unsolved


This propo­si­tion is true ac­cord­ing to you if you be­lieve that: “No­body has yet pro­posed a satis­fac­tory fixed/​sim­ple al­gorithm that takes as in­put a ma­te­rial de­scrip­tion of the uni­verse, and/​or chan­nels of sen­sory ob­ser­va­tion, and spits out ideal val­ues or a task iden­ti­fi­ca­tion.”


The Com­plex­ity of value the­sis says that, on the ob­ject-level, any speci­fi­ca­tion of what we’d re­ally want from the fu­ture has high Al­gorith­mic com­plex­ity.

In some sense, all the com­plex­ity re­quired to spec­ify value must be con­tained in­side hu­man brains; even as an ob­ject of con­ver­sa­tion, we can’t talk about any­thing our brains do not point to. This is why Com­plex­ity of value dis­t­in­guishes the ob­ject-level com­plex­ity of value from meta-level com­plex­ity—the min­i­mum pro­gram re­quired to get a suffi­ciently ad­vanced ar­tifi­cial in­tel­li­gence to learn val­ues. It would be a sep­a­rate ques­tion to con­sider the min­i­mum com­plex­ity of a func­tion that takes as in­put a full de­scrip­tion of the ma­te­rial uni­verse in­clud­ing hu­mans, and out­puts “value”.

This ques­tion also has a nar­row rather than am­bi­tious form: given sen­sory ob­ser­va­tions an AGI could rea­son­ably re­ceive in co­op­er­a­tion with its pro­gram­mers, or a pre­dic­tive model of hu­mans that AGI could rea­son­ably form and re­fine, is there a sim­ple rule that will take this data as in­put, and safely and re­li­ably iden­tify Tasks on the or­der of “de­velop molec­u­lar nan­otech­nol­ogy, use the nan­otech­nol­ogy to syn­the­size one straw­berry, and then stop, with a min­i­mum of side effects”?

In this case we have no strong rea­son to think that the func­tions are high-com­plex­ity in an ab­solute sense.

How­ever, no­body has yet pro­posed a satis­fac­tory piece of pseu­docode that solves any var­i­ant of this prob­lem even in prin­ci­ple.

Ob­sta­cles to sim­ple meta-rules

Con­sider a sim­ple Meta-util­ity func­tion that speci­fies a sense-in­put-de­pen­dent for­mu­la­tion of Mo­ral un­cer­tainty: An ob­ject-level out­come \(o\) has a util­ity \(U(o)\) that is \(U_1(o)\) if a fu­ture sense sig­nal \(s\) is 1 and \(U_2(o)\) if \(s\) is 2. Given this setup, the AI has an in­cen­tive to tam­per with \(s\) and cause it to be 1 if \(U_1\) is eas­ier to op­ti­mize than \(U_2,\) and vice versa.

More gen­er­ally, sen­sory sig­nals from hu­mans will usu­ally not be re­li­ably and un­alter­ably cor­re­lated with our in­tended goal iden­ti­fi­ca­tion. We can’t treat hu­man-gen­er­ated sig­nals as an ideally re­li­able ground truth about any refer­ent, be­cause (a) some AI ac­tions in­terfere with the sig­nal; and (b) hu­mans make mis­takes, es­pe­cially when you ask them some­thing com­pli­cated. You can’t have a scheme along the lines of “the hu­mans press a but­ton if some­thing goes wrong”, be­cause some poli­cies go wrong in ways hu­mans don’t no­tice un­til it’s too late, and some AI poli­cies de­stroy the but­ton (or mod­ify the hu­man).

Even leav­ing that aside, no­body has yet sug­gested any fully speci­fied pseu­docode that takes in a hu­man-con­trol­led sen­sory chan­nel \(R\) and a de­scrip­tion of the uni­verse \(O\) and spits out a util­ity func­tion that (ac­tu­ally re­al­is­ti­cally) iden­ti­fies our in­tended task over \(O\) (in­clud­ing not tiling the uni­verse with sub­agents and so on).

In­deed, no­body has yet sug­gested a re­al­is­tic scheme for iden­ti­fy­ing any kind of goal what­so­ever with re­spect to an AI on­tol­ogy flex­ible enough to ac­tu­ally de­scribe the ma­te­rial uni­verse. noteEx­cept in the rather non-meta sense of in­spect­ing the AI’s on­tol­ogy once it’s ad­vanced enough to de­scribe what you think you want the AI to do, and man­u­ally pro­gram­ming the AI’s con­se­quen­tial­ist prefer­ences with re­spect to what you think that on­tol­ogy means.

Meta-meta rules

For similar rea­sons as above, no­body has yet pro­posed (even in prin­ci­ple) effec­tive pseu­docode for a meta-meta pro­gram over some space of meta-rules, which would let the AI learn a value-iden­ti­fy­ing meta-rule. Two main prob­lems here are:

One, no­body even has the seed of any pro­posal what­so­ever for how that could, work short of “define a cor­rect­ness-sig­nal­ing chan­nel and throw pro­gram in­duc­tion at it” (which seems un­likely to work di­rectly, given fal­lible, frag­ile hu­mans con­trol­ling the sig­nal).

Two, if the learned meta-rule doesn’t have a sta­ble, ex­tremely com­pact hu­man-trans­par­ent rep­re­sen­ta­tion, it’s not clear how we could ar­rive at any con­fi­dence what­so­ever that good be­hav­ior in a de­vel­op­ment phase would cor­re­spond to good be­hav­ior in a test phase. E.g., con­sider all the ex­am­ple meta-rules we could imag­ine which would work well on a small scale but fail to scale, like “some­thing good just hap­pened if the hu­mans smiled”.


  • Complexity of value

    There’s no sim­ple way to de­scribe the goals we want Ar­tifi­cial In­tel­li­gences to want.