Aligning an AGI adds significant development time


The votable proposition is true if, comparing reasonably attainable development paths for…

  • Project Path 1: An aligned advanced AI created by a responsible project that is hurrying where it can, but still being careful enough to maintain a success probability greater than 25%

  • Project Path 2: An unaligned unlimited superintelligence produced by a project cutting all possible corners

…where otherwise both projects have access to the same ideas or discoveries in the field of AGI capabilities and similar computation resources; then, as the default / ordinary / modal case after conditioning on all of the said assumptions:

Project Path 1 will require at least 50% longer serial time to complete than Project Path 2, or two years longer, whichever is less.
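
As a minimal sketch of the threshold stated above, assuming serial times are measured in years (the function and parameter names are illustrative and do not come from the proposition itself):

```python
def proposition_holds(path1_years: float, path2_years: float) -> bool:
    """Check the stated margin between the two project paths.

    Assumption: the required extra serial time is 50% of Path 2's
    serial time or two years, whichever is less.
    """
    required_extra_years = min(0.5 * path2_years, 2.0)
    return path1_years - path2_years >= required_extra_years
```

Under this reading, if Path 2 takes 3 years the required margin is 1.5 years, while if Path 2 takes 10 years the margin is capped at 2 years rather than 5.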


This page was written to address multiple questioners who seem to have accepted the Orthogonality thesis, but still mostly disbelieve it would take significantly longer to develop aligned AGI than unaligned AGI, if I’ve understood correctly.

At present this page is an overview of possible places of disagreement, and may later be selectively rather than fully expanded.


Propositions feeding into this one include:

If the questioner believes the negation of either of these, it would imply easy specifiability of a decision function suitable for an unlimited superintelligence. That could greatly reduce the need for, e.g.:

It’s worth checking whether any of these time-costly development principles seem to the questioner not to follow as important from the basic idea that value alignment is necessary and not trivially solvable.

Outside view

To the best of my knowledge, it is normal / usual / unsurprising for at least 50% increased development time to be required by strong versus minimal demands on any one of:

  • (3a) safety of any kind

  • (3b) robust behavior in new one-shot contexts that can’t be tested in advance

  • (3c) robust behavior when experiencing strong forces

  • (3d) reliable avoidance of a single catastrophic failure

  • (3e) resilience in the face of strong optimization pressures that can potentially lead the system to traverse unusual execution paths

  • (3f) conformance to complicated details of a user’s desired system behavior

comment: It would indeed be unusual (some project managers might call it extraordinary good fortune) if a system demanding two or more of these properties did not require at least 50% more development time compared to a system that didn’t.

Obvious-seeming-to-me analogies include:

  • Launching a space probe that cannot be corrected once launched, a deed which usually calls for extraordinary additional advance checking and testing

  • Launching the simplest working rocket that will experience uncommonly great accelerations and forces, compared to building the simplest working airplane

  • It would be far less expensive to design rockets if “the rocket explodes” were not a problem; most of the cost of a rocket is having the rocket not explode

  • NASA managing to write almost entirely bug-free code for some projects at 100x the cost per line of code, using means that involved multiple reviews and careful lines of organizational approval for every aspect and element of the system

  • The OpenBSD project to produce a secure operating system, which needed to constrain its code to be more minimal than larger Linux projects, and probably added a lot more than 50% time per function point to approve each element of the code

  • The difference in effort put forth by an amateur writing an encryption system they think is secure, versus the cryptographic ecosystem trying to ensure a channel is secure

  • The real premium on safety for hospital equipment, as opposed to the bureaucratic premium on it, is probably still over 50% because it does involve legitimate additional testing to try to not kill the patient

  • Surgeons probably legitimately require at least 50% longer to operate on humans than they would require to perform operations of analogous complexity on large plants it was okay to kill 10% of the time

  • Even in the total absence of regulatory overhead, it seems legitimately harder to build a nuclear power plant that usually does not melt down, compared to a coal power plant (confirmable by the Soviet experience?)

Some of the standard ways in which systems with strong versus minimal demands on (3*)-properties *usually* require additional development time:

  • (4a) Additional work for:

      • Whole extra modules

      • Universally enforced properties

      • Lots of little local function points

  • (4b) Needing a more extended process of interactive shaping in order to conform to a complicated target

  • (4c) Legitimately requiring longer organizational paths to approve ideas, changes and commits

  • (4d) Longer and deeper test phases; on whole systems, on local components, and on function points

  • (4e) Not being able to deploy a fast or easy solution (that you could use at some particular choice point if you didn’t need to worry about the rocket exploding)

Outside view on AI problems

Another reference class that feels relevant to me is that things having to do with AI are often more difficult than expected. E.g. the story of computer vision being assigned to 2 undergrads over the summer. This seems like a relevant case in point of “uncorrected intuition has a directional bias in underestimating the amount of work required to implement things having to do with AI, and you should correct that directional bias by revising your estimate upward”.

Given a sufficiently advanced Artificial General Intelligence, we might perhaps get narrow problems on the order of computer vision for free. But the whole point of Orthogonality is that you do not get AI alignment for free with general intelligence. Likewise, identifying value-laden concepts or executing value-laden behaviors doesn’t come free with identifying natural empirical concepts. We have separate basic AI work to do for alignment. So the analogy to underestimating a narrow AI problem, in the early days before anyone had confronted that problem, still seems relevant.

comment: I can’t see how, after imagining oneself in the shoes of the early researchers tackling computer vision and ‘commonsense reasoning’ and ‘natural-language processing’, after the entirety of the history of AI, anyone could reasonably stagger back in shocked and horrified surprise upon encountering the completely unexpected fact of a weird new AI problem being… kinda hard.

Inside view

While it is possible to build new systems that aren’t 100% understood, and have them work, the successful designs were usually greatly overengineered. Some Roman bridges are still standing two millennia later, which probably wasn’t in the design requirements, so in that sense they turned out to be hugely overengineered, but we can’t blame their builders. “What takes good engineering is building bridges that just barely stay up.”

If we’re trying for an aligned Task AGI without a really deep understanding of how to build exactly the right AGI with no extra parts or extra problems—which must certainly be lacking on any scenario involving relatively short timescales—then we have to do lots of safety things in order to have any chance of surviving, because we don’t know in advance which part of the system will nearly fail. We don’t know in advance that the O-Rings are the part of the Space Shuttle that’s going to suddenly behave unexpectedly, and we can’t put in extra effort to armor only that part of the process. We have to overengineer everything to catch the small number of aspects that turn out not to be so “overengineered” after all.

This suggests that even if one doesn’t believe my particular laundry list below, whoever walks through this problem, conditional on their eventual survival, will have shown up with some laundry list of precautions, including costly precautions; and they will (correctly) not imagine themselves able to survive based on “minimum necessary” precautions.

Some specific extra time costs that I imagine might be required:

  • The AGI can only deploy internal optimization on pieces of itself that are small enough to be relatively safe and not vital to fully understand

      • In other words, the cautious programmers must in general do extra work to obtain functionality that a corner-cutting project could get in virtue of the AGI having further self-improved

  • Everything to do with real value alignment (as opposed to the AI having a reward button or being reinforcement-trained to ‘obey orders’ on some channel) is an additional set of function points

  • You have to build new pieces of the system for transparency and monitoring.

      • Including e.g. costly but important notions like “There’s actually a separatish AI over here that we built to inspect the first AI and check for problems, including having this separate AI trained on different data for safety-related concepts”

  • There’s a lot of trusted function points where you can’t just toss in an enormous deepnet because that wouldn’t meet the transparency or effability requirements at that function point

  • When somebody proposes a new optimization thingy, it has to be rejiggered to ensure e.g. that it meets the top-to-bottom taskishness requirement, and everyone has to stare at it to make sure it doesn’t blow up the world somehow

  • You can’t run jobs on AWS because you don’t trust Amazon with the code and you don’t want to put your AI in close causal contact with the Internet

  • Some of your system designs rely on all ‘major’ events being monitored and all unseen events being ‘minor’, and the major monitored events go through a human in the loop. The humans in the loop are then a rate-limiting factor and you can’t just ‘push the lever all the way up’ on that process.

      • E.g., maybe only ‘major’ goals can recruit subgoals across all known domains and ‘minor’ goals always operate within a single domain using limited cognitive resources.

  • Deployment involves a long conversation with the AI about ‘what do you expect to happen after you do X?’, and during that conversation other programmers are slowing down the AI to look at passively transparent interpretations of the AI’s internal thoughts

  • The project has a much lower threshold for saying “wait, what the hell just happened, we need to stop, melt, and catch fire, not just try different patches until it seems to run again”

  • The good project perhaps does a tad more testing

Independently of the particular list above, this doesn’t feel to me like a case where the conclusion is highly dependent on Eliezer-details. Anyone with a concrete plan for aligning an AI will walk in with a list of plans and methods for safety, some of which require close inspection of parts, and constrain allowable designs, and just plain take more work. One of the important ideas is going to turn out to take 500% more work than expected, or to require solving a deep AI problem, and this isn’t going to shock them either.

Meta view

I genuinely have some trouble imagining what objection is standing in the way of accepting “ceteris paribus, alignment takes at least 50% more time”, having granted Orthogonality and alignment not being completely trivial. I did not expect the argument to bog down at this particular step. I wonder if I’m missing some basic premise or misunderstanding the questioner’s entire thesis.

If I’m not misunderstanding, or if I consider the thesis as-my-ears-heard-it at face value, then I can only imagine the judgment “alignment probably doesn’t take that much longer” being produced by ignoring what I consider to be basic principles of cognitive realism. Despite the dangers of psychologizing, for purposes of oversharing, I’m going to say what feels to me like it would need to be missing:

  • (5a) Even if one feels intuitively optimistic about a project, one ought to expect in advance to run into difficulties not immediately obvious. You should not be in a state of mind where tomorrow’s surprises are a lot more likely to be unpleasant than pleasant; this is predictable updating. The person telling you your hopeful software project is going to take longer than 2 weeks should not need to argue you into acknowledging in advance that some particular delay will occur. It feels like the ordinary skill of “standard correction for optimistic bias” is not being applied.

  • (5b) It feels like this is maybe being put into a mental bucket of “futuristic scenarios” rather than “software development projects”, and is being processed as pessimistic future versus normal future, or something. Instead of: “If I ask a project manager for a mission-critical deep feature that impacts every aspect of the software project and needs to be implemented to a high standard of reliability, can that get done in just 10% more time than a project that’s eliminating that feature and cutting all the corners?”

  • (5c) I similarly recall the old experiment in which students named their “best case” scenarios where “everything goes as well as it reasonably could”, or named their “average case” scenarios; and the two elicitations produced indistinguishable results; and reality was usually slightly worse than the “worst case” scenario. I wonder if the “normal case” for AI alignment work required is being evaluated along much the same lines as “the best case / the case if every individual event goes as well as I imagine by default”.

AI alignment could be easy in theory and still take 50% more development time in practice. That is a very ordinary thing to have happen when somebody asks the project manager to make sure a piece of highly novel software actually implements an “easy” property the first time the software is run under new conditions that can’t be fully tested in advance.

“At least 50% more development time for the aligned AI project, versus the corner-cutting project, assuming both projects otherwise have access to the same stock of ideas and methods and computational resources” seems to me like an extremely probable and normal working premise to adopt. What am I missing?

comment: I have a sense of “Why am I not up fifty points in the polls?” and “What experienced software manager on the face of the Earth (assuming they didn’t go mentally haywire on hearing the words ‘Artificial Intelligence’, and considered this question as if it were engineering), even if they knew almost nothing else about AI alignment theory, would not be giving a rather skeptical look to the notion that carefully crafting a partially superhuman intelligence to be safe and robust would only take 1.5 times as long compared to cutting all the corners?”

