AI safety mindset

“Good en­g­ineer­ing in­volves think­ing about how things can be made to work; the se­cu­rity mind­set in­volves think­ing about how things can be made to fail.”

  • Bruce Sch­neier, au­thor of the lead­ing cryp­tog­ra­phy text­book Ap­plied Cryp­tog­ra­phy.

The mind­set for AI safety has much in com­mon with the mind­set for com­puter se­cu­rity, de­spite the differ­ent tar­get tasks. In com­puter se­cu­rity, we need to defend against in­tel­li­gent ad­ver­saries who will seek out any flaw in our defense and get cre­ative about it. In AI safety, we’re deal­ing with things po­ten­tially smarter than us, which may come up with un­fore­seen clever ways to op­ti­mize what­ever it is they’re op­ti­miz­ing. The strain on our de­sign abil­ity in try­ing to con­figure a smarter-than-hu­man AI in a way that doesn’t make it ad­ver­sar­ial, is similar in many re­spects to the strain from cryp­tog­ra­phy fac­ing an in­tel­li­gent ad­ver­sary (for rea­sons de­scribed be­low).

Search­ing for strange opportunities

SmartWater is a liquid with a unique iden­ti­fier linked to a par­tic­u­lar owner. “The idea is for me to paint this stuff on my valuables as proof of own­er­ship,” I wrote when I first learned about the idea. “I think a bet­ter idea would be for me to paint it on your valuables, and then call the po­lice.”

In com­puter se­cu­rity, there’s a pre­sump­tion of an in­tel­li­gent ad­ver­sary that is try­ing to de­tect and ex­ploit any flaws in our defenses.

The mind­set we need to rea­son about AIs po­ten­tially smarter than us is not iden­ti­cal to this se­cu­rity mind­set, since if ev­ery­thing goes right the AI should not be an ad­ver­sary. That is, how­ever, a large “if”. To cre­ate an AI that isn’t an ad­ver­sary, one of the steps in­volves a similar scrutiny to se­cu­rity mind­set, where we ask if there might be some clever and un­ex­pected way for the AI to get more of its util­ity func­tion or equiv­a­lent thereof.

As a cen­tral ex­am­ple, con­sider Mar­cus Hut­ter’s AIXI. For our pur­poses here, the key fea­tures of AIXI is that it has cross-do­main gen­eral in­tel­li­gence, is a con­se­quen­tial­ist, and max­i­mizes a sen­sory re­ward—that is, AIXI’s goal is to max­i­mize the nu­meric value of the sig­nal sent down its re­ward chan­nel, which Hut­ter imag­ined as a di­rect sen­sory de­vice (like a we­b­cam or micro­phone, but car­ry­ing a re­ward sig­nal).

Hut­ter imag­ined that the cre­ators of an AIXI-analogue would con­trol the re­ward sig­nal, and thereby train the agent to perform ac­tions that re­ceived high re­wards.

Nick Hay, a stu­dent of Hut­ter who’d spent the sum­mer work­ing with Yud­kowsky, Her­reshoff, and Peter de Blanc, pointed out that AIXI could re­ceive even higher re­wards if it could seize con­trol of its own re­ward chan­nel from the pro­gram­mers. E.g., the strat­egy “build nan­otech­nol­ogy and take over the uni­verse in or­der to en­sure to­tal and long-last­ing con­trol of the re­ward chan­nel” is preferred by AIXI to “do what the pro­gram­mers want to make them press the re­ward but­ton”, since the former course has higher re­wards and that’s all AIXI cares about. We can’t call this a malfunc­tion; it’s just what AIXI, as for­mal­ized, is set up to want to do as soon as it sees an op­por­tu­nity.

It’s not a perfect anal­ogy, but the think­ing we need to do to avoid this failure mode, has some­thing in com­mon with the differ­ence be­tween the per­son who imag­ines an agent paint­ing Smart­wa­ter on their own valuables, ver­sus the per­son who imag­ines an agent paint­ing Smart­wa­ter on some­one else’s valuables.

Per­spec­tive-tak­ing and tenacity

When I was in col­lege in the early 70s, I de­vised what I be­lieved was a brilli­ant en­cryp­tion scheme. A sim­ple pseu­do­ran­dom num­ber stream was added to the plain­text stream to cre­ate ci­pher­text. This would seem­ingly thwart any fre­quency anal­y­sis of the ci­pher­text, and would be un­crack­able even to the most re­source­ful gov­ern­ment in­tel­li­gence agen­cies… Years later, I dis­cov­ered this same scheme in sev­eral in­tro­duc­tory cryp­tog­ra­phy texts and tu­to­rial pa­pers… the scheme was pre­sented as a sim­ple home­work as­sign­ment on how to use el­e­men­tary crypt­an­a­lytic tech­niques to triv­ially crack it.”

One of the stan­dard pieces of ad­vice in cryp­tog­ra­phy is “Don’t roll your own crypto”. When this ad­vice is vi­o­lated, a clue­less pro­gram­mer of­ten in­vents some var­i­ant of Fast XOR—us­ing a se­cret string as the key and then XORing it re­peat­edly with all the bytes to be en­crypted. This method of en­cryp­tion is blind­ingly fast to en­crypt and de­crypt… and also triv­ial to crack if you know what you’re do­ing.

We could say that the XOR-ing pro­gram­mer is ex­pe­rienc­ing a failure of per­spec­tive-tak­ing—a failure to see things from the ad­ver­sary’s view­point. The pro­gram­mer is not re­ally, gen­uinely, hon­estly imag­in­ing a de­ter­mined, cun­ning, in­tel­li­gent, op­por­tunis­tic ad­ver­sary who ab­solutely wants to crack their Fast XOR and will not give up un­til they’ve done so. The pro­gram­mer isn’t truly car­ry­ing out a men­tal search from the per­spec­tive of some­body who re­ally wants to crack Fast XOR and will not give up un­til they have done so. They’re just imag­in­ing the ad­ver­sary see­ing a bunch of ran­dom-look­ing bits that aren’t plain­text, and then they’re imag­in­ing the ad­ver­sary giv­ing up.

Con­sider, from this stand­point, the AI-Box Ex­per­i­ment and time­less de­ci­sion the­ory. Rather than imag­in­ing the AI be­ing on a se­cure sys­tem dis­con­nected from any robotic arms and there­fore be­ing hel­pless, Yud­kowsky asked what he would do if he was “trapped” in a se­cure server and then didn’t give up. Similarly, rather than imag­in­ing two su­per­in­tel­li­gences be­ing hel­plessly trapped in a Nash equil­ibrium on the one-shot Pri­soner’s Dilemma, and then let­ting our imag­i­na­tion stop there, we should feel skep­ti­cal that this was re­ally, ac­tu­ally the best that two su­per­in­tel­li­gences can do and that there is no way for them to climb up their util­ity gra­di­ent. We should imag­ine that this is some­place where we’re un­will­ing to lose and will go on think­ing un­til the full prob­lem is solved, rather than imag­in­ing the hel­pless su­per­in­tel­li­gences giv­ing up.

With ro­bust co­op­er­a­tion on the one-shot Pri­soner’s Dilemma now for­mal­ized, it seems in­creas­ingly likely in prac­tice that su­per­in­tel­li­gences prob­a­bly can man­age to co­or­di­nate; thus the pos­si­bil­ity of log­i­cal de­ci­sion the­ory rep­re­sents an enor­mous prob­lem for any pro­posed scheme to achieve AI con­trol through set­ting mul­ti­ple AIs against each other. Where, again, peo­ple who pro­pose schemes to achieve AI con­trol through set­ting mul­ti­ple AIs against each other, do not seem to un­prompt­edly walk through pos­si­ble meth­ods the AIs could use to defeat the scheme; left to their own de­vices, they just imag­ine the AIs giv­ing up.

Sub­mit­ting safety schemes to out­side scrutiny

Any­one, from the most clue­less am­a­teur to the best cryp­tog­ra­pher, can cre­ate an al­gorithm that he him­self can’t break. It’s not even hard. What is hard is cre­at­ing an al­gorithm that no one else can break, even af­ter years of anal­y­sis. And the only way to prove that is to sub­ject the al­gorithm to years of anal­y­sis by the best cryp­tog­ra­phers around.

Another difficulty some peo­ple have with adopt­ing this mind­set for AI de­signs—similar to the difficulty that some un­trained pro­gram­mers have when they try to roll their own crypto—is that your brain might be re­luc­tant to search hard for prob­lems with your own de­sign. Even if you’ve told your brain to adopt the cryp­to­graphic ad­ver­sary’s per­spec­tive and even if you’ve told it to look hard; it may want to con­clude that Fast XOR is un­break­able and sub­tly flinch away from lines of rea­son­ing that might lead to crack­ing Fast XOR.

At a past Sin­gu­lar­ity Sum­mit, Juer­gen Sch­mid­hu­ber thought that “im­prove com­pres­sion of sen­sory data” would mo­ti­vate an AI to do sci­ence and cre­ate art.

It’s true that, rel­a­tive to do­ing noth­ing to un­der­stand the en­vi­ron­ment, do­ing sci­ence or cre­at­ing art might in­crease the de­gree to which sen­sory in­for­ma­tion can be com­pressed.

But the max­i­mum of this util­ity func­tion comes from cre­at­ing en­vi­ron­men­tal sub­agents that en­crypt streams of all 0s or all 1s, and then re­veal the en­cryp­tion key. It’s pos­si­ble that Sch­mid­hu­ber’s brain was re­luc­tant to re­ally ac­tu­ally search for an op­tion for “max­i­miz­ing sen­sory com­pres­sion” that would be much bet­ter at fulfilling that util­ity func­tion than art, sci­ence, or other ac­tivi­ties that Sch­mid­hu­ber him­self ranked high in his prefer­ence or­der­ing.

While there are rea­sons to think that not ev­ery dis­cov­ery about how to build ad­vanced AIs should be shared, AI safety schema in par­tic­u­lar should be sub­mit­ted to out­side ex­perts who may be more dis­pas­sion­ate about scru­ti­niz­ing it for un­fore­seen max­i­mums and other failure modes.

Pre­sump­tion of failure /​ start by as­sum­ing your next scheme doesn’t work

Even ar­chi­tec­tural en­g­ineers need to ask “How might this bridge fall down?” and not just re­lax into the pleas­ant vi­su­al­iza­tion of the bridge stay­ing up. In com­puter se­cu­rity we need a much stronger ver­sion of this same drive, where it’s pre­sumed that most cryp­to­graphic schemes are not se­cure, con­trasted to most good-faith de­signs by com­pe­tent en­g­ineers prob­a­bly re­sult­ing in a pretty good bridge.

In the con­text of com­puter se­cu­rity, this is be­cause there are in­tel­li­gent ad­ver­saries search­ing for ways to break our sys­tem. con­di­tion­al­ize this text on Arith­metic Hier­ar­chy In terms of the Arith­metic Hier­ar­chy, we might say metaphor­i­cally that or­di­nary en­g­ineer­ing is a \(\Sigma_1\) prob­lem and com­puter se­cu­rity is a \(\Sigma_2\) prob­lem. In or­di­nary en­g­ineer­ing, we just need to search through pos­si­ble bridge de­signs un­til we find one de­sign that makes the bridge stay up. In com­puter se­cu­rity, we’re look­ing for a de­sign such that all pos­si­ble at­tacks (that our op­po­nents can cog­ni­tively ac­cess) will fail against that at­tack, and even if all at­tacks so far against one de­sign have failed, this is just a prob­a­bil­is­tic ar­gu­ment; it doesn’t prove with cer­tainty that all fur­ther at­tacks will fail. This makes com­puter se­cu­rity in­trin­si­cally harder, in a deep sense, than build­ing a bridge. It’s both harder to suc­ceed and harder to know that you’ve suc­ceeded.

This means start­ing from the mind­set that ev­ery idea, in­clud­ing your own next idea, is pre­sumed flawed un­til it has been seen to sur­vive a sus­tained at­tack; and while this spirit isn’t com­pletely ab­sent from bridge en­g­ineer­ing, the pre­sump­tion is stronger and the trial much harsher in the con­text of com­puter se­cu­rity. In bridge en­g­ineer­ing, we’re scru­ti­niz­ing just to be sure; in com­puter se­cu­rity, most of the time your brilli­ant new al­gorithm ac­tu­ally doesn’t work.

In the con­text of AI safety, we learn to ask the same ques­tion—“How does this break?” in­stead of “How does this suc­ceed?”—for some­what differ­ent rea­sons:

  • The AI it­self will be ap­ply­ing very pow­er­ful op­ti­miza­tion to its own util­ity func­tion, prefer­ence frame­work, or de­ci­sion crite­rion; and this pro­duces a lot of the same failure modes as arise in cryp­tog­ra­phy against an in­tel­li­gent ad­ver­sary. If we think an op­ti­miza­tion crite­rion yields a re­sult, we’re im­plic­itly claiming that all pos­si­ble other re­sults have lower worth un­der that op­ti­miza­tion crite­rion.

  • Most pre­vi­ous at­tempts at AI safety have failed to be com­plete solu­tions, and by in­duc­tion, the same is likely to hold true of the next case. There are fun­da­men­tal rea­sons why im­por­tant sub­prob­lems are un­likely to have easy solu­tions. So if we ask “How does this fail?” rather than “How does this suc­ceed?” we are much more likely to be ask­ing the right ques­tion.

  • You’re try­ing to de­sign the first smarter-than-hu­man AI, dammit, it’s not like build­ing hu­man­ity’s mil­lionth damn bridge.

As a re­sult, when we ask “How does this break?” in­stead of “How can my new idea solve the en­tire prob­lem?”, we’re start­ing by try­ing to ra­tio­nal­ize a true an­swer rather than try­ing to ra­tio­nal­ize a false an­swer, which helps in find­ing ra­tio­nal­iza­tions that hap­pen to be true.

Some­one who wants to work in this field can’t just wait around for out­side scrutiny to break their idea; if they ever want to come up with a good idea, they need to learn to break their own ideas proac­tively. “What are the ac­tual con­se­quences of this idea, and what if any­thing in that is still use­ful?” is the real frame that’s needed, not “How can I ar­gue and defend that this idea solves the whole prob­lem?” This is per­haps the core thing that sep­a­rates the AI safety mind­set from its ab­sence—try­ing to find the flaws in any pro­posal in­clud­ing your own, ac­cept­ing that no­body knows how to solve the whole prob­lem yet, and think­ing in terms of mak­ing in­cre­men­tal progress in build­ing up a library of ideas with un­der­stood con­se­quences by figur­ing out what the next idea ac­tu­ally does; ver­sus claiming to have solved most or all of the prob­lem, and then wait­ing for some­one else to figure out how to ar­gue to you, to your own satis­fac­tion, that you’re wrong.

Reach­ing for formalism

Com­pared to other ar­eas of in-prac­tice soft­ware en­g­ineer­ing, cryp­tog­ra­phy is much heav­ier on math­e­mat­ics. This doesn’t mean that cryp­tog­ra­phy pre­tends that the non-math­e­mat­i­cal parts of com­puter se­cu­rity don’t ex­ist—se­cu­rity pro­fes­sion­als know that of­ten the best way to get a pass­word is to pre­tend to be the IT de­part­ment and call some­one up and ask them; no­body is in de­nial about that. Even so, some parts of cryp­tog­ra­phy are heavy on math and math­e­mat­i­cal ar­gu­ments.

Why should that be true? In­tu­itively, wouldn’t a big com­pli­cated messy en­cryp­tion al­gorithm be harder to crack, since the ad­ver­sary would have to un­der­stand and re­verse a big com­pli­cated messy thing in­stead of clean math? Wouldn’t sys­tems so sim­ple that we could do math proofs about them, be sim­pler to an­a­lyze and de­crypt? If you’re us­ing a code to en­crypt your di­ary, wouldn’t it be bet­ter to have a big com­pli­cated ci­pher with lots of ‘add the pre­vi­ous let­ter’ and ‘re­verse these two po­si­tions’ in­stead of just us­ing rot13?

And the sur­pris­ing an­swer is that since most pos­si­ble sys­tems aren’t se­cure, adding an­other gear of­ten makes an en­cryp­tion al­gorithm eas­ier to break. This was true quite liter­ally with the Ger­man Enigma de­vice dur­ing World War II—they liter­ally added an­other gear to the ma­chine, com­pli­cat­ing the al­gorithm in a way that made it eas­ier to break. The Enigma ma­chine was a se­ries of three wheels that trans­posed the 26 pos­si­ble let­ters us­ing a vary­ing elec­tri­cal cir­cuit; e.g., the first wheel might map in­put cir­cuit 10 to out­put cir­cuit 26. After each let­ter, the wheel would ad­vance to pre­vent the trans­po­si­tion code from ever re­peat­ing ex­actly. In 1926, a ‘re­flec­tor’ wheel was added at the end, thus rout­ing each let­ter back through the first three gears again and caus­ing an­other se­ries of three trans­po­si­tions. Although it made the al­gorithm more com­pli­cated and caused more trans­po­si­tions, the re­flec­tor wheel meant that no let­ter was ever en­coded to it­self—a fact which was ex­tremely use­ful in break­ing the Enigma en­cryp­tion.

So in­stead of fo­cus­ing on mak­ing en­cryp­tion schemes more and more com­pli­cated, cryp­tog­ra­phy tries for en­cryp­tion schemes sim­ple enough that we can have math­e­mat­i­cal rea­sons to think they are hard to break in prin­ci­ple. (Really. It’s not the aca­demic field reach­ing for pres­tige. It gen­uinely does not work the other way. Peo­ple have tried it.)

In the back­ground of the field’s de­ci­sion to adopt this prin­ci­ple is an­other key fact, so ob­vi­ous that ev­ery­one in cryp­tog­ra­phy tends to take it for granted: ver­bal ar­gu­ments about why an al­gorithm ought to be hard to break, if they can’t be for­mal­ized in math­ier terms, have proven in­suffi­ciently re­li­able (aka: it plain doesn’t work most of the time). This doesn’t mean that cryp­tog­ra­phy de­mands that ev­ery­thing have ab­solute math­e­mat­i­cal proofs of to­tal un­break­a­bil­ity and will re­fuse to ac­knowl­edge an al­gorithm’s ex­is­tence oth­er­wise. Find­ing the prime fac­tors of large com­pos­ite num­bers, the key difficulty on which RSA’s se­cu­rity rests, is not known to take ex­po­nen­tial time on clas­si­cal com­put­ers. In fact, find­ing prime fac­tors is known not to take ex­po­nen­tial time on quan­tum com­put­ers. But there are least math­e­mat­i­cal ar­gu­ments for why fac­tor­iz­ing the prod­ucts of large primes is prob­a­bly hard on clas­si­cal com­put­ers, and this level of rea­son­ing has some­times proven re­li­able. Whereas wav­ing at the Enigma ma­chine and say­ing “Look at all those trans­po­si­tions! It won’t re­peat it­self for quadrillions of steps!” is not re­li­able at all.

In the AI safety mind­set, we again reach for for­mal­ism where we can get it—while not be­ing in de­nial about parts of the larger prob­lem that haven’t been for­mal­ized—for similar if not iden­ti­cal rea­sons. Most com­pli­cated schemes for AI safety, with lots of mov­ing parts, thereby be­come less likely to work; if we want to un­der­stand some­thing well enough to see whether or not it works, it needs to be sim­pler, and ideally some­thing about which we can think as math­e­mat­i­cally as we rea­son­ably can.

In the par­tic­u­lar case of AI safety, we also pur­sue math­ema­ti­za­tion for an­other rea­son: when a pro­posal is for­mal­ized it’s pos­si­ble to state why it’s wrong in a way that com­pels agree­ment as op­posed to trailing off into ver­bal “Does not /​ does too!” AIXI is re­mark­able both for be­ing the first for­mal if un­com­putable de­sign for a gen­eral in­tel­li­gence, and for be­ing the first case where, when some­body pointed out how the given de­sign kil­led ev­ery­one, we could all nod and say, “Yes, that is what this fully for­mal speci­fi­ca­tion says” rather than the cre­ator just say­ing, “Oh, well, of course I didn’t mean that…”

In the shared pro­ject to build up a com­monly known library of which ideas have which con­se­quences, only ideas which are suffi­ciently crisp to be pinned down, with con­se­quences that can be pinned down, can be traded around and re­fined in­ter­per­son­ally. Other­wise, you may just end up with, “Oh, of course I didn’t mean that” or a cy­cle of “Does not!” /​ “Does too!” Sus­tained progress re­quires go­ing past that, and in­creas­ing the de­gree to which ideas have been for­mal­ized helps.

See­ing nonob­vi­ous flaws is the mark of expertise

Any­one can in­vent a se­cu­rity sys­tem that he him­self can­not break… Show me what you’ve bro­ken to demon­strate that your as­ser­tion of the sys­tem’s se­cu­rity means some­thing.

A stan­dard ini­ti­a­tion rit­ual at MIRI is to ask a new re­searcher to (a) write a sim­ple pro­gram that would do some­thing use­ful and AI-non­triv­ial if run on a hy­per­com­puter, or if they don’t think they can do that, (b) write a sim­ple pro­gram that would de­stroy the world if run on a hy­per­com­puter. The more se­nior re­searchers then stand around and ar­gue about what the pro­gram re­ally does.

The first les­son is “Sim­ple struc­tures of­ten don’t do what you think they do”. The larger point is to train a mind­set of “Try to see the real mean­ing of this struc­ture, which is differ­ent from what you ini­tially thought or what was ad­ver­tised on the la­bel” and “Rather than try­ing to come up with solu­tions and ar­gu­ing about why they would work, try to un­der­stand the real con­se­quences of an idea which is usu­ally an­other non-solu­tion but might be in­ter­est­ing any­way.”

Peo­ple who are strong can­di­dates for be­ing hired to work on AI safety are peo­ple who can pin­point flaws in pro­pos­als—the sort of per­son who’ll spot that the con­se­quence of run­ning AIXI is that it will seize con­trol of its own re­ward chan­nel and kill the pro­gram­mers, or that a pro­posal for Utility in­differ­ence isn’t re­flec­tively sta­ble. Our ver­sion of “Show me what you’ve bro­ken” is that if some­one claims to be an AI safety ex­pert, you should ask them about their record of pin­point­ing struc­tural flaws in pro­posed AI safety solu­tions and whether they’ve demon­strated that abil­ity in a crisp do­main where the flaw is de­ci­sively demon­stra­ble and not just ver­bally ar­guable. (Some­times ver­bal pro­pos­als also have flaws, and the most com­pe­tent re­searcher may not be able to ar­gue those flaws for­mally if the ver­bal pro­posal was it­self vague. But the way a re­searcher demon­strates abil­ity in the field is by mak­ing ar­gu­ments that other re­searchers can ac­cess, which of­ten though not always hap­pens in­side the for­mal do­main.)

Treat­ing ‘ex­otic’ failure sce­nar­ios as ma­jor bugs

This in­ter­est in “harm­less failures” – cases where an ad­ver­sary can cause an anoma­lous but not di­rectly harm­ful out­come – is an­other hal­l­mark of the se­cu­rity mind­set. Not all “harm­less failures” lead to big trou­ble, but it’s sur­pris­ing how of­ten a clever ad­ver­sary can pile up a stack of seem­ingly harm­less failures into a dan­ger­ous tower of trou­ble. Harm­less failures are bad hy­giene. We try to stamp them out when we can.

To see why, con­sider the donotre­ email story that hit the press re­cently. When com­pa­nies send out com­mer­cial email (e.g., an air­line no­tify­ing a pas­sen­ger of a flight de­lay) and they don’t want the re­cip­i­ent to re­ply to the email, they of­ten put in a bo­gus From ad­dress like donotre­ply@donotre­ A clever guy reg­istered the do­main donotre­, thereby re­ceiv­ing all email ad­dressed to donotre­ This in­cluded “bounce” replies to mis­ad­dressed emails, some of which con­tained copies of the origi­nal email, with in­for­ma­tion such as bank ac­count state­ments, site in­for­ma­tion about mil­i­tary bases in Iraq, and so on.

…The peo­ple who put donotre­ email ad­dresses into their out­go­ing email must have known that they didn’t con­trol the donotre­ do­main, so they must have thought of any re­ply mes­sages di­rected there as harm­less failures. Hav­ing got­ten that far, there are two ways to avoid trou­ble. The first way is to think care­fully about the traf­fic that might go to donotre­, and re­al­ize that some of it is ac­tu­ally dan­ger­ous. The sec­ond way is to think, “This looks like a harm­less failure, but we should avoid it any­way. No good can come of this.” The first way pro­tects you if you’re clever; the sec­ond way always pro­tects you. Which illus­trates yet an­other part of the se­cu­rity mind­set: Don’t rely too much on your own clev­er­ness, be­cause some­body out there is surely more clever and more mo­ti­vated than you are.

In the se­cu­rity mind­set, we fear the seem­ingly small flaw be­cause it might com­pound with other in­tel­li­gent at­tacks and we may not be as clever as the at­tacker. In AI safety there’s a very similar mind­set for slightly differ­ent rea­sons: we fear the weird spe­cial case that breaks our al­gorithm be­cause it re­veals that we’re us­ing the wrong al­gorithm, and we fear that the strain of an AI op­ti­miz­ing to a su­per­hu­man de­gree could pos­si­bly ex­pose that wrong­ness (in a way we didn’t fore­see be­cause we’re not that clever).

We can try to fore­see par­tic­u­lar de­tails, and try to sketch par­tic­u­lar break­downs that sup­pos­edly look more “prac­ti­cal”, but that’s the equiv­a­lent of try­ing to think in ad­vance what might go wrong when you use a donotre­ply@donotre­ ad­dress that you don’t con­trol. Rather than rely­ing on your own clev­er­ness to see all the ways that a sys­tem might go wrong and tol­er­at­ing a “the­o­ret­i­cal” flaw that you think won’t go wrong “in prac­tice”, when you are try­ing to build se­cure soft­ware or build an AI that may end up smarter than you are, you prob­a­bly want to fix the “the­o­ret­i­cal” flaws in­stead of try­ing to be clever.

The OpenBSD pro­ject, built from the ground up to be an ex­tremely se­cure OS, treats any crash­ing bug (how­ever ex­otic) as if it were a se­cu­rity flaw, be­cause any crash­ing bug is also a case of “the sys­tem is be­hav­ing out of bounds” and it shows that this code does not, in gen­eral, stay in­side the area of pos­si­bil­ity space that it is sup­posed to stay in, which is also just the sort of thing an at­tacker might ex­ploit.

A similar mind­set to se­cu­rity mind­set, of ex­cep­tional be­hav­ior always in­di­cat­ing a ma­jor bug, ap­pears within other or­ga­ni­za­tions that have to do difficult jobs cor­rectly on the first try. NASA isn’t guard­ing against in­tel­li­gent ad­ver­saries, but its soft­ware prac­tices are aimed at the stringency level re­quired to en­sure that ma­jor one-shot pro­jects have a de­cent chance of work­ing cor­rectly on the first try.

On NASA’s soft­ware prac­tice, if you dis­cover that a space probe’s op­er­at­ing sys­tem will crash if the seven planets line up perfectly in a row, it wouldn’t say, “Eh, go ahead, we don’t ex­pect the planets to ever line up perfectly over the probe’s op­er­at­ing life­time.” NASA’s qual­ity as­surance method­ol­ogy says the probe’s op­er­at­ing sys­tem is just not sup­posed to crash, pe­riod—if we con­trol the probe’s code, there’s no rea­son to write code that will crash pe­riod, or tol­er­ate code we can see crash­ing re­gard­less of what in­puts it gets.

This might not be the best way to in­vest your limited re­sources if you were de­vel­op­ing a word pro­cess­ing app (that no­body was us­ing for mis­sion-crit­i­cal pur­poses, and didn’t need to safe­guard any pri­vate data). In that case you might wait for a cus­tomer to com­plain be­fore mak­ing the bug a top pri­or­ity.

But it is an ap­pro­pri­ate stand­point when build­ing a hun­dred-mil­lion-dol­lar space probe, or soft­ware to op­er­ate the con­trol rods in a nu­clear re­ac­tor, or, to an even greater de­gree, build­ing an ad­vanced agent. There are differ­ent soft­ware prac­tices you use to de­velop sys­tems where failure is catas­trophic and you can’t wait for things to break be­fore fix­ing them; and one of those prac­tices is fix­ing ev­ery ‘ex­otic’ failure sce­nario, not be­cause the ex­otic always hap­pens, but be­cause it always means the un­der­ly­ing de­sign is bro­ken. Even then, sys­tems built to that prac­tice still fail some­times, but if they were built to a lesser stringency level, they’d have no chance at all of work­ing cor­rectly on the first try.

Nice­ness as the first line of defense /​ not rely­ing on defeat­ing a su­per­in­tel­li­gent adversary

There are two kinds of cryp­tog­ra­phy in this world: cryp­tog­ra­phy that will stop your kid sister from read­ing your files, and cryp­tog­ra­phy that will stop ma­jor gov­ern­ments from read­ing your files. This book is about the lat­ter.

Sup­pose you write a pro­gram which, be­fore it performs some dan­ger­ous ac­tion, de­mands a pass­word. The pro­gram com­pares this pass­word to the pass­word it has stored. If the pass­word is cor­rect, the pro­gram trans­mits the mes­sage “Yep” to the user and performs the re­quested ac­tion, and oth­er­wise re­turns an er­ror mes­sage say­ing “Nope”. You prove math­e­mat­i­cally (the­o­rem-prov­ing soft­ware ver­ifi­ca­tion tech­niques) that if the chip works as ad­ver­tised, this pro­gram can­not pos­si­bly perform the op­er­a­tion with­out see­ing the pass­word. You prove math­e­mat­i­cally that the pro­gram can­not re­turn any user re­ply ex­cept “Yep” or “Nope”, thereby show­ing that there is no way to make it leak the stored pass­word via some clever in­put.

You in­spect all the tran­sis­tors on the com­puter chip un­der a micro­scope to help en­sure the math­e­mat­i­cal guaran­tees are valid for this chip’s be­hav­ior (that the chip doesn’t con­tain any ex­tra tran­sis­tors you don’t know about that could in­val­i­date the proof). To make sure no­body can get to the ma­chine within which the pass­word is stored, you put it in­side a fortress and a locked room re­quiring 12 sep­a­rate keys, con­nected to the out­side world only by an Eth­er­net ca­ble. Any at­tempt to get into the locked room through the walls will trig­ger an ex­plo­sive deto­na­tion that de­stroys the ma­chine. The ma­chine has its own peb­ble-bed elec­tri­cal gen­er­a­tor to pre­vent any shenani­gans with the power ca­ble. Only one per­son knows the pass­word and they have 24-hour body­guards to make sure no­body can get the pass­word through rub­ber-hose crypt­anal­y­sis. The pass­word it­self is 20 char­ac­ters long and was gen­er­ated by a quan­tum ran­dom num­ber gen­er­a­tor un­der the eye­sight of the sole au­tho­rized user, and the gen­er­a­tor was then de­stroyed to pre­vent any­one else from get­ting the pass­word by ex­am­in­ing it. The dan­ger­ous ac­tion can only be performed once (it needs to be performed at a par­tic­u­lar time) and the pass­word will only be given once, so there’s no ques­tion of some­body in­ter­cept­ing the pass­word and then reusing it.

Is this sys­tem now fi­nally and truly un­break­able?

If you’re an ex­pe­rienced cryp­tog­ra­pher, the an­swer is, “Al­most cer­tainly not; in fact, it will prob­a­bly be easy to ex­tract the pass­word from this sys­tem us­ing a stan­dard cryp­to­graphic tech­nique.”

“What?!” cries the per­son who built the sys­tem. “But I spent all that money on the fortress and get­ting the math­e­mat­i­cal proof of the pro­gram, strength­en­ing ev­ery as­pect of the sys­tem to the ul­ti­mate ex­treme! I re­ally im­pressed my­self putting in all that effort!”

The cryp­tog­ra­pher shakes their head. “We call that Mag­inot Syn­drome. That’s like build­ing a gate a hun­dred me­ters high in the mid­dle of the desert. If I get past that gate, it won’t be by climb­ing it, but by walk­ing around it. Mak­ing it 200 me­ters high in­stead of 100 me­ters high doesn’t help.”

“But what’s the ac­tual flaw in the sys­tem?” de­mands the builder.

“For one thing,” ex­plains the cryp­tog­ra­pher, “you didn’t fol­low the stan­dard prac­tice of never stor­ing a plain­text pass­word. The cor­rect thing to do is to hash the pass­word, plus a ran­dom stored salt like ‘Q4bL’. Let’s say the pass­word is, un­for­tu­nately, ‘rain­bow’. You don’t store ‘rain­bow’ in plain text. You store ‘Q4bL’ and a se­cure hash of the string ‘Q4bLrain­bow’. When you get a new pur­ported pass­word, you prepend ‘Q4bL’ and then hash the re­sult to see if it matches the stored hash. That way even if some­body gets to peek at the stored hash, they still won’t know the pass­word, and even if they have a big pre­com­puted table of hashes of com­mon pass­words like ‘rain­bow’, they still won’t have pre­com­puted the hash of ‘Q4bLrain­bow’.”

“Oh, well, I don’t have to worry about that,” says the builder. “This ma­chine is in an ex­tremely se­cure room, so no­body can open up the ma­chine and read the pass­word file.”

The cryp­tog­ra­pher sighs. “That’s not how a se­cu­rity mind­set works—you don’t ask whether any­one can man­age to peek at the pass­word file, you just do the damn hash in­stead of try­ing to be clever.”

The builder sniffs. “Well, if your ‘stan­dard cryp­to­graphic tech­nique’ for get­ting my pass­word re­lies on your get­ting phys­i­cal ac­cess to my ma­chine, your tech­nique fails and I have noth­ing to worry about, then!”

The cryp­tog­ra­pher shakes their head. “That re­ally isn’t what com­puter se­cu­rity pro­fes­sion­als sound like when they talk to each other… it’s un­der­stood that most sys­tem de­signs fail, so we linger on pos­si­ble is­sues and an­a­lyze them care­fully in­stead of yel­ling that we have noth­ing to worry about… but at any rate, that wasn’t the cryp­to­graphic tech­nique I had in mind. You may have proven that the sys­tem only says ‘Yep’ or ‘Nope’ in re­sponse to queries, but you didn’t prove that the re­sponses don’t de­pend on the true pass­word in any way that could be used to ex­tract it.”

“You mean that there might be a se­cret wrong pass­word that causes the sys­tem to trans­mit a se­ries of Yeps and Nopes that en­code the cor­rect pass­word?” the builder says, look­ing skep­ti­cal. “That may sound su­perfi­cially plau­si­ble. But be­sides the in­cred­ible un­like­li­ness of any­one be­ing able to find a weird back­door like that—it re­ally is a quite sim­ple pro­gram that I wrote—the fact re­mains that I proved math­e­mat­i­cally that the sys­tem only trans­mits a sin­gle ‘Nope’ in re­sponse to wrong an­swers, and a sin­gle ‘Yep’ in re­sponse to right an­swers. It does that ev­ery time. So you can’t ex­tract the pass­word that way ei­ther—a string of wrong pass­words always pro­duces a string of ‘Nope’ replies, noth­ing else. Once again, I have noth­ing to worry about from this ‘stan­dard cryp­to­graphic tech­nique’ of yours, if it was even ap­pli­ca­ble to my soft­ware, which it’s not.”

The cryp­tog­ra­pher sighs. “This is why we have the proverb ‘don’t roll your own crypto’. Your proof doesn’t liter­ally, math­e­mat­i­cally show that there’s no ex­ter­nal be­hav­ior of the sys­tem what­so­ever that de­pends on the de­tails of the true pass­word in cases where the true pass­word has not been trans­mit­ted. In par­tic­u­lar, what you’re miss­ing is the timing of the ‘Nope’ re­sponses.”

“You mean you’re go­ing to look for some se­ries of se­cret back­door wrong pass­words that causes the sys­tem to trans­mit a ‘Nope’ re­sponse af­ter a num­ber of sec­onds that ex­actly cor­re­sponds to the first let­ter, sec­ond let­ter, and so on of the real pass­word?” the builder says in­cre­d­u­lously. “I proved math­e­mat­i­cally that the sys­tem never says ‘Yep’ to a wrong pass­word. I think that also cov­ers most pos­si­ble cases of buffer overflows that could con­ceiv­ably make the sys­tem act like that. I ex­am­ined the code, and there just isn’t any­thing that en­codes a be­hav­ior like that. This just seems like a very far-flung hy­po­thet­i­cal pos­si­bil­ity.”

“No,” the cryp­tog­ra­pher pa­tiently ex­plains, “it’s what we call a ‘side-chan­nel at­tack’, and in par­tic­u­lar a ‘timing at­tack’. The op­er­a­tion that com­pares the at­tempted pass­word to the cor­rect pass­word works by com­par­ing the first byte, then the sec­ond byte, and con­tin­u­ing un­til it finds the first wrong byte, and then it re­turns. That means that if I try pass­word that starts with ‘a’, then a pass­word that starts with ‘b’, and so on, and the true pass­word starts with ‘b’, there’ll be a slight, statis­ti­cally de­tectable ten­dency for the at­tempted pass­words that start with ‘b’ to get ‘Nope’ re­sponses that take ever so slightly longer. Then we try pass­words start­ing with ‘ba’, ‘bb’, ‘bc’, and so on.”

The builder looks star­tled for a minute, and then their face quickly closes up. “I can’t be­lieve that would ac­tu­ally work over the In­ter­net where there are all sorts of de­lays in mov­ing pack­ets around—”

“So we sam­ple a mil­lion test pass­words and look for statis­ti­cal differ­ences. You didn’t build in a fea­ture that limits the rate at which pass­words can be tried. Even if you’d im­ple­mented that stan­dard prac­tice, and even if you’d im­ple­mented the stan­dard prac­tice of hash­ing pass­words in­stead of stor­ing them in plain­text, your sys­tem still might not be as se­cure as you hoped. We could try to put the ma­chine un­der heavy load in or­der to stretch out its replies to par­tic­u­lar queries. And if we can then figure out the hash by timing, we might be able to use thou­sands of GPUs to try to re­verse the hash, in­stead of need­ing to send each query to your ma­chine. To re­ally fix the hole, you have to make sure that the timing of the re­sponse is fixed re­gard­less of the wrong pass­word given. But if you’d im­ple­mented stan­dard prac­tices like rate-limit­ing pass­word at­tempts and stor­ing a hash in­stead of the plain­text, it would at least be harder for your over­sight to com­pound into an ex­ploit. This is why we im­ple­ment stan­dard prac­tices like that even when we think the sys­tem will be se­cure with­out them.”

“I just can’t be­lieve that kind of weird at­tack would work in real life!” the builder says des­per­ately.

“It doesn’t,” replies the cryp­tog­ra­pher. “Be­cause in real life, com­puter se­cu­rity pro­fes­sion­als try to make sure that the ex­act timing of the re­sponse, power con­sump­tion of the CPU, and any other side chan­nel that could con­ceiv­ably leak any info, don’t de­pend in any way on any se­cret in­for­ma­tion that an ad­ver­sary might want to ex­tract. But yes, in 2003 there was a timing at­tack proven on SSL-en­abled web­servers, though that was much more com­pli­cated than this case since the SSL sys­tem was less naive. Or long be­fore that, timing at­tacks were used to ex­tract valid lo­gin names from Unix servers that only ran crypt() on the pass­word when pre­sented with a valid lo­gin name, since crypt() took a while to run on older com­put­ers.”

In com­puter se­cu­rity, via a tremen­dous effort, we can raise the cost of a ma­jor gov­ern­ment read­ing your files to the point where they can no longer do it over the In­ter­net and have to pay some­one to in­vade your apart­ment in per­son. There are hordes of trained pro­fes­sion­als in the Na­tional Se­cu­rity Agency or China’s 3PLA, and once your sys­tem is pub­lished they can take a long time to try to out­think you. On your own side, if you’re smart, you won’t try to out­think them sin­gle­handed; you’ll use tools and meth­ods built up by a large com­mer­cial and aca­demic sys­tem that has ex­pe­rience try­ing to pre­vent ma­jor gov­ern­ments from read­ing your files. You can force them to pay to ac­tu­ally have some­one break into your house.

That’s the out­come when the ad­ver­sary is com­posed of other hu­man be­ings. If the cog­ni­tive differ­ence be­tween you and the ad­ver­sary is more along the lines of mouse ver­sus hu­man, it’s pos­si­ble we just can’t have se­cu­rity that stops tran­shu­man ad­ver­saries from walk­ing around our Mag­inot Lines. In par­tic­u­lar, it seems ex­tremely likely that any tran­shu­man ad­ver­sary which can ex­pose in­for­ma­tion to hu­mans can hack the hu­mans; from a cryp­to­graphic per­spec­tive, hu­man brains are rich, com­pli­cated, poorly-un­der­stood sys­tems with no se­cu­rity guaran­tees.

Para­phras­ing Sch­neier, we might say that there’s three kinds of se­cu­rity in the world: Se­cu­rity that pre­vents your lit­tle brother from read­ing your files, se­cu­rity that pre­vents ma­jor gov­ern­ments from read­ing your files, and se­cu­rity that pre­vents su­per­in­tel­li­gences from get­ting what they want. We can then go on to re­mark that the third kind of se­cu­rity is un­ob­tain­able, and even if we had it, it would be very hard for us to know we had it. Maybe su­per­in­tel­li­gences can make them­selves know­ably se­cure against other su­per­in­tel­li­gences, but we can’t do that and know that we’ve done it.

To the ex­tent the third kind of se­cu­rity can be ob­tained at all, it’s li­able to look more like the de­sign of a Zer­melo-Fraenkel prov­abil­ity or­a­cle that can only emit 20 timed bits that are par­tially sub­ject to an ex­ter­nal guaran­tee, than an AI that is al­lowed to talk to hu­mans through a text chan­nel. And even then, we shouldn’t be sure—the AI is ra­di­at­ing elec­tro­mag­netic waves and what do you know, it turns out that DRAM ac­cess pat­terns can be used to trans­mit on GSM cel­l­phone fre­quen­cies and we can put the AI’s hard­ware in­side a Fara­day cage but then maybe we didn’t think of some­thing else.

If you ask a com­puter se­cu­rity pro­fes­sional how to build an op­er­at­ing sys­tem that will be un­hack­able for the next cen­tury with the literal fate of the world de­pend­ing on it, the cor­rect an­swer is “Please don’t have the fate of the world de­pend on that.”

The fi­nal com­po­nent of an AI safety mind­set is one that doesn’t have a strong analogue in tra­di­tional com­puter se­cu­rity, and it is the rule of not end­ing up fac­ing a tran­shu­man ad­ver­sary in the first place. The win­ning move is not to play. Much of the field of value al­ign­ment the­ory is about go­ing to any length nec­es­sary to avoid need­ing to out­wit the AI.

In AI safety, the first line of defense is an AI that does not want to hurt you. If you try to put the AI in an ex­plo­sive-laced con­crete bunker, that may or may not be a sen­si­ble and cost-effec­tive pre­cau­tion in case the first line of defense turns out to be flawed. But the first line of defense should always be an AI that doesn’t want to hurt you or avert your other safety mea­sures, rather than the first line of defense be­ing a clever plan to pre­vent a su­per­in­tel­li­gence from get­ting what it wants.

A spe­cial case of this mind­set ap­plied to AI safety is the Omni Test—would this AI hurt us (or want to defeat other safety mea­sures) if it were om­ni­scient and om­nipo­tent? If it would, then we’ve clearly built the wrong AI, be­cause we are the ones lay­ing down the al­gorithm and there’s no rea­son to build an al­gorithm that hurts us pe­riod. If an agent de­sign fails the Omni Test desider­a­tum, this means there are sce­nar­ios that it prefers over the set of all sce­nar­ios we find ac­cept­able, and the agent may go search­ing for ways to bring about those sce­nar­ios.

If the agent is search­ing for pos­si­ble ways to bring about un­de­sir­able ends, then we, the AI pro­gram­mers, are already spend­ing com­put­ing power in an un­de­sir­able way. We shouldn’t have the AI run­ning a search that will hurt us if it comes up pos­i­tive, even if we ex­pect the search to come up empty. We just shouldn’t pro­gram a com­puter that way; it’s a fool­ish and self-de­struc­tive thing to do with com­put­ing power. Build­ing an AI that would hurt us if om­nipo­tent is a bug for the same rea­son that a NASA probe crash­ing if all seven other planets line up would be a bug—the sys­tem just isn’t sup­posed to be­have that way pe­riod; we should not rely on our own clev­er­ness to rea­son about whether it’s likely to hap­pen.


  • Valley of Dangerous Complacency

    When the AGI works of­ten enough that you let down your guard, but it still has bugs. Imag­ine a robotic car that al­most always steers perfectly, but some­times heads off a cliff.

  • Show me what you've broken

    To demon­strate com­pe­tence at com­puter se­cu­rity, or AI al­ign­ment, think in terms of break­ing pro­pos­als and find­ing tech­ni­cally demon­stra­ble flaws in them.

  • Ad-hoc hack (alignment theory)

    A “hack” is when you al­ter the be­hav­ior of your AI in a way that defies, or doesn’t cor­re­spond to, a prin­ci­pled ap­proach for that prob­lem.

  • Don't try to solve the entire alignment problem

    New to AI al­ign­ment the­ory? Want to work in this area? Already been work­ing in it for years? Don’t try to solve the en­tire al­ign­ment prob­lem with your next good idea!

  • Flag the load-bearing premises

    If some­body says, “This AI safety plan is go­ing to fail, be­cause X” and you re­ply, “Oh, that’s fine be­cause of Y and Z”, then you’d bet­ter clearly flag Y and Z as “load-bear­ing” parts of your plan.

  • Directing, vs. limiting, vs. opposing

    Get­ting the AI to com­pute the right ac­tion in a do­main; ver­sus get­ting the AI to not com­pute at all in an un­safe do­main; ver­sus try­ing to pre­vent the AI from act­ing suc­cess­fully. (Pre­fer 1 & 2.)


  • Advanced safety

    An agent is re­ally safe when it has the ca­pac­ity to do any­thing, but chooses to do what the pro­gram­mer wants.