Consequentialist cognition

Consequentialist reasoning selects policies on the basis of their predicted consequences—it does action \(X\) because \(X\) is forecast to lead to preferred outcome \(Y\). Whenever we reason that an agent which prefers outcome \(Y\) over \(Y'\) will therefore do \(X\) instead of \(X',\) we’re implicitly assuming that the agent has the cognitive ability to do consequentialism at least about \(X\)s and \(Y\)s. It does means-end reasoning; it selects means on the basis of their predicted ends plus a preference over ends.

E.g.: When we infer that a paperclip maximizer would try to improve its own cognitive abilities given the means to do so, the background assumptions include:

  • That the paperclip maximizer can forecast the consequences of the policies “self-improve” and “don’t try to self-improve”;

  • That the forecasted consequences are respectively “more paperclips eventually” and “fewer paperclips eventually”;

  • That the paperclip maximizer preference-orders outcomes on the basis of how many paperclips they contain;

  • That the paperclip maximizer outputs the immediate action it predicts will lead to more future paperclips.

(Technically, since the forecasts of our actions’ consequences will usually be uncertain, a coherent agent needs a utility function over outcomes and not just a preference ordering over outcomes.)
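The four background assumptions above, plus the utility-function caveat, can be put into a minimal sketch. The forecast table, probabilities, and paperclip counts below are invented for illustration; a real agent would compute its forecasts from a world-model:

```python
# A minimal sketch of consequentialist choice under uncertainty.
# The forecast table is an invented stand-in for a real world-model.
forecast = {
    "self-improve":       [(0.9, 1000), (0.1, 0)],  # (probability, paperclips)
    "don't self-improve": [(1.0, 100)],
}

def utility(paperclips):
    # Preference over outcomes: strictly more paperclips is better.
    return paperclips

def expected_utility(action):
    # Score an action by the forecast consequences of taking it.
    return sum(p * utility(outcome) for p, outcome in forecast[action])

def choose(actions):
    # Output the immediate action predicted to lead to the best future.
    return max(actions, key=expected_utility)

print(choose(["self-improve", "don't self-improve"]))  # prints "self-improve"
```

The means-end structure is the whole content here: nothing in `choose` mentions self-improvement as such; the action is selected only through its forecast consequences.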

The related idea of “backward chaining” is one particular way of solving the cognitive problems of consequentialism: start from a desired outcome/event/future, figure out what intermediate events are likely to have the consequence of bringing about that outcome, and repeat this question until it arrives back at a particular plan/policy/action.
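A toy version of that loop, assuming a hand-written table of “outcome → an event that tends to cause it” (every entry is invented for illustration):

```python
# Map each outcome to an intermediate event that brings it about.
causes = {
    "have milk":            "at supermarket",
    "at supermarket":       "have rental car",
    "have rental car":      "have prize money",
    "have prize money":     "win chess tournament",
    "win chess tournament": "make winning chess moves",
}

def backward_chain(goal, primitive_actions):
    # Walk from the desired outcome back to something directly doable,
    # then reverse the chain into execution order.
    chain = [goal]
    while chain[-1] not in primitive_actions:
        chain.append(causes[chain[-1]])
    return list(reversed(chain))

plan = backward_chain("have milk", {"make winning chess moves"})
# plan begins with the primitive action and ends with the goal
```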

Many narrow AI algorithms are consequentialists over narrow domains. A chess program that searches far ahead in the game tree is a consequentialist; it outputs chess moves based on the expected result of those moves and the opponent’s replies to them, into the distant future of the board.
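That kind of game-tree consequentialism can be sketched as minimax search, with the opponent assumed to pick the reply that is worst for us. The three-position “game” below is invented:

```python
# position -> list of (move, resulting position); terminal positions
# map directly to game results. The tiny tree is invented.
game_tree = {
    "start": [("a", "p1"), ("b", "p2")],
    "p1":    [("c", "win"), ("d", "loss")],
    "p2":    [("e", "draw"), ("f", "draw")],
}
values = {"win": 1, "draw": 0, "loss": -1}

def minimax(pos, our_turn):
    # Forecast the end-of-game result from this position, assuming
    # both sides select moves purely by their consequences.
    if pos in values:
        return values[pos]
    results = [minimax(nxt, not our_turn) for _, nxt in game_tree[pos]]
    return max(results) if our_turn else min(results)

def best_move(pos):
    # Output the move whose forecast final consequence is best.
    return max(game_tree[pos], key=lambda mv: minimax(mv[1], False))[0]
```

Move “a” dangles a possible win, but the forecast says the opponent will answer with “d”; the program takes the guaranteed draw instead, purely because of the predicted distant future of the board.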

We can see one of the critical aspects of human intelligence as cross-domain consequentialism. Rather than only forecasting consequences within the boundaries of a narrow domain, we can trace chains of events that leap from one domain to another. Making a chess move wins a chess game that wins a chess tournament that wins prize money that can be used to rent a car that can drive to the supermarket to get milk. An Artificial General Intelligence that could learn many domains, and engage in consequentialist reasoning that leaped across those domains, would be a sufficiently advanced agent to be interesting from most perspectives on interestingness. It would start to be a consequentialist about the real world.


Some systems are pseudoconsequentialist—they in some ways behave as if outputting actions on the basis of their leading to particular futures, without using an explicit cognitive model and explicit forecasts.

For example, natural selection has a lot of the power of a cross-domain consequentialist; it can design whole organisms around the consequence of reproduction (or rather, inclusive genetic fitness). It’s a fair approximation to say that spiders weave webs because the webs will catch prey that the spider can eat. Natural selection doesn’t actually have a mind or an explicit model of the world; but millions of years of selecting DNA strands that did in fact previously construct an organism that reproduced gives an effect sort of like outputting an organism design on the basis of its future consequences. (Although if the environment changes, the difference suddenly becomes clear: natural selection doesn’t immediately catch on when humans start using birth control. Our DNA goes on having been selected on the basis of the old future of the ancestral environment, not the new future of the actual world.)
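Selection-on-past-consequences, and the way environmental change exposes the difference from real consequentialism, can be seen in a toy selection loop. All the numbers here are invented, and nothing in the loop forecasts anything:

```python
import random

random.seed(0)

def fitness(trait, environment):
    # Past reproductive success of a trait value in [0, 1]: best when
    # the trait matches the environment.
    return 1.0 - abs(trait - environment)

ANCESTRAL_ENV = 0.9
population = [random.random() for _ in range(200)]
for generation in range(50):
    # Keep the half that reproduced best, copy them with mutation.
    # No model, no forecast: only the record of what worked before.
    population.sort(key=lambda t: fitness(t, ANCESTRAL_ENV), reverse=True)
    parents = population[:100]
    population = [min(1.0, max(0.0, t + random.gauss(0.0, 0.02)))
                  for t in parents for _ in range(2)]

mean_trait = sum(population) / len(population)

# The population now "fits" the old environment. If the environment
# shifts (say, to 0.1), selection hasn't caught on yet:
old_fit = sum(fitness(t, 0.9) for t in population) / len(population)
new_fit = sum(fitness(t, 0.1) for t in population) / len(population)
```

After the loop, the population behaves as if designed for the ancestral environment; against the shifted environment, its fitness collapses immediately, because the designs were selected on the old future.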

Similarly, a reinforcement-learning system learning to play Pong might not actually have an explicit model of “What happens if I move the paddle here?”—it might just be re-executing policies that had the consequence of winning last time. But there’s still a future-to-present connection, a pseudo-backwards-causation, based on the Pong environment remaining fairly constant over time, so that we can sort of regard the Pong player’s moves as happening because it will win the Pong game.
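A sketch of that model-free flavor of learning: the agent below never represents “what happens if I move here”; it only strengthens whichever action was followed by reward. The two-action “Pong” stub and all constants are invented:

```python
import random

random.seed(1)

def play(action):
    # Hidden environment rule the agent never sees as a model:
    # "up" wins the point, "down" loses it.
    return 1.0 if action == "up" else -1.0

value = {"up": 0.0, "down": 0.0}   # learned action values; no world-model
alpha, epsilon = 0.1, 0.2

for episode in range(200):
    if random.random() < epsilon:
        action = random.choice(["up", "down"])   # occasional exploration
    else:
        action = max(value, key=value.get)       # re-execute what worked
    reward = play(action)
    value[action] += alpha * (reward - value[action])

best = max(value, key=value.get)   # the policy the agent has settled on
```

The learned policy tracks the winning move only because the environment stayed constant while the values were learned: the same future-to-present connection described in the paragraph above.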

Ubiquity of consequentialism

Consequentialism is an extremely basic idiom of optimization:

  • You don’t go to the airport because you really like airports; you go to the airport so that, in the future, you’ll be in Oxford.

  • An air conditioner is an artifact selected from possibility space such that the future consequence of running the air conditioner will be cold air.

  • A butterfly, by virtue of its DNA having been repeatedly selected to have previously brought about the past consequence of replication, will, under stable environmental conditions, bring about the future consequence of replication.

  • A rat that has previously learned a maze is executing a policy that previously had the consequence of reaching the reward pellets at the end: a series of turns, or a behavioral rule, that was neurally reinforced in virtue of the future conditions to which it led the last time it was executed. This policy will, given a stable maze, have the same consequence next time.

  • Faced with a superior chessplayer, we enter a state of Vingean uncertainty in which we are more sure about the final consequence of the chessplayer’s moves—that it wins the game—than we are about any of the particular moves made. To put it another way, the main abstract fact we know about the chessplayer’s next move is that the consequence of the move will be winning.

  • As a chessplayer becomes strongly superhuman, its play becomes instrumentally efficient, in the sense that no abstract description of the moves takes precedence over the consequence of the moves. A weak computer chessplayer might be described in terms like “it likes to move its pawn” or “it tries to grab control of the center”, but as its play improves past the human level, we can no longer detect any divergence from “it makes the moves that will win the game later” that we could describe in terms like “it tries to control the center (whether or not that’s really the winning move)”. In other words, as a chessplayer becomes more powerful, no description we can give of its moves takes priority over our belief that the moves will have a certain consequence.

Anything that Aristotle would have considered as having a “final cause”, or teleological explanation, without being entirely wrong about that, is something we can see through the lens of cognitive consequentialism or pseudoconsequentialism. A plan, a design, a reinforced behavior, or selected genes: most of the complex order on Earth derives from one or more of these.

Interaction with advanced safety

Consequentialism or pseudoconsequentialism, over various domains, is an advanced agent property that is a key requisite or key threshold in several issues of AI alignment and advanced safety:

  • You get unforeseen maxima because the AI connected up an action you didn’t think of with a future state it wanted.

  • It seems foreseeable that some issues will be patch-resistant because of the nearest unblocked strategy effect: after one road to the future is blocked off, the next-best road to that future is often a very similar one that wasn’t blocked.

  • Reasoning about convergent instrumental strategies generally relies on at least pseudoconsequentialism—they’re strategies that lead up to, or would be expected to lead up to, improved achievement of other future goals.

  • This means that, by default, lots and lots of the worrisome or problematic convergent strategies like “resist being shut off” and “build subagents” and “deceive the programmers” arise from some degree of consequentialism, combined with some degree of grasping the relevant domains.

Above all: the human ability to think of a future and plan ways to get there, or think of a desired result and engineer technologies to achieve it, is the source of humans having enough cognitive capability to be dangerous. Most of the magnitude of the impact of an AI, such that we’d want to align it in the first place, would come in a certain sense from that AI being a sufficiently good consequentialist, or solving the same cognitive problems that consequentialists solve.

Subverting consequentialism?

Since consequentialism seems tied up in so many issues, some of the proposals for making alignment easier have in some way tried to retreat from, limit, or subvert consequentialism. E.g.:

  • Oracles are meant to “answer questions” rather than output actions that lead to particular goals.

  • Imitation-based agents are meant to imitate the behavior of a reference human as perfectly as possible, rather than selecting actions on the basis of their consequences.

But since consequentialism is so close to the heart of why an AI would be sufficiently useful in the first place, getting rid of it tends not to be that straightforward.

Since ‘consequentialism’, ‘linking up actions to consequences’, and ‘figuring out how to get to a consequence’ are so close to what would make advanced AIs useful in the first place, it shouldn’t be surprising if some attempts to subvert consequentialism in the name of safety run squarely into an unresolvable safety-usefulness tradeoff.

Another concern is that consequentialism may to some extent be a convergent or default outcome of optimizing anything hard enough. E.g., although natural selection is a pseudoconsequentialist process, it optimized for reproductive capacity so hard that it eventually spit out some powerful organisms that were explicit cognitive consequentialists (aka humans).

We might similarly worry that optimizing any internal aspect of a machine intelligence hard enough would start to embed consequentialism somewhere—policies/designs/answers selected from a sufficiently general space that “do consequentialist reasoning” is embedded in some of the most effective answers.

Or perhaps a machine intelligence might need to be consequentialist in some internal aspects in order to be smart enough to do sufficiently useful things—maybe you just can’t get a sufficiently advanced machine intelligence, sufficiently early, unless it is, e.g., choosing on a consequentialist basis what thoughts to think about, or engaging in consequentialist engineering of its internal elements.

In the same way that expected utility is the only coherent way of making certain choices, or in the same way that natural selection, optimizing hard enough on reproduction, started spitting out explicit cognitive consequentialists, we might worry that consequentialism is in some sense central enough that it will be hard to subvert—hard enough that we can’t easily get rid of instrumental convergence on problematic strategies just by getting rid of the consequentialism while preserving the AI’s usefulness.

This doesn’t say that the research avenue of subverting consequentialism is automatically doomed to be fruitless. It does suggest that this is a deeper, more difficult, and stranger challenge than, “Oh, well then, just build an AI with all the consequentialist aspects taken out.”

