# Low impact

A low-impact agent is a hypothetical task-based AGI that’s intended to avoid disastrous side effects by trying to avoid large side effects in general.

Consider the Sorcerer’s Apprentice fable: a legion of broomsticks, self-replicating and repeatedly overfilling a cauldron (perhaps to be as certain as possible that the cauldron was full). A low-impact agent would, if functioning as intended, have an incentive to avoid that outcome; it wouldn’t just want to fill the cauldron, but fill the cauldron in a way that had a minimum footprint. If the task given the AGI is to paint all cars pink, then we can hope that a low-impact AGI would not accomplish this via self-replicating nanotechnology that went on replicating after the cars were painted, because this would be an unnecessarily large side effect.

On a higher level of abstraction, we can imagine that the universe is parsed by us into a set of variables $$V_i$$ with values $$v_i.$$ We want to avoid the agent taking actions that cause large amounts of disutility, that is, we want to avoid perturbing variables from $$v_i$$ to $$v_i^*$$ in a way that decreases utility. However, the question of exactly which variables $$V_i$$ are important and shouldn’t be entropically perturbed is value-laden—complicated, fragile, high in algorithmic complexity, with Humean degrees of freedom in the concept boundaries.

Rather than relying solely on teaching an agent exactly which parts of the environment shouldn’t be perturbed and risking catastrophe if we miss an injunction, the low impact route would try to build an agent that tried to perturb fewer variables regardless.

The hope is that “have fewer side effects” is a problem that has a simple core and is learnable by a manageable amount of training. Conversely, trying to train “here is the list of bad effects not to have and important variables not to perturb” would be complicated and lack a simple core, because ‘bad’ and ‘important’ are value-laden. A list of dangerous variables would also be a blacklist rather than a whitelist, which would make it more vulnerable to treacherous context changes if the AI gained the ability to affect new things.

# Introduction: Formalizing low impact seems nontrivial

Intuitively, the notion of “low impact” seems like it should be simpler—have more of a central, core tendency to capture—than “avoid bad impacts”. If so, we don’t know yet how to compactly state this core principle semiformally.

Suppose we start with an obvious notion: to have low impact, minimize the number of variables you causally affect. But:

• Every event has a not-absolutely-zero impact on every future event. When you twitch your thumb or even just fire a neuron, the gravitational ripples from the moving atoms spread out and will eventually produce infinitesimal forces on atoms orbiting the other side of the galaxy. So we can’t say “have zero impact on as many quarks as possible” because it is impossible for any act to have absolutely zero impact on any quark.

This is a special case of a failure mode where, since the impact metric turns out to already be maxed out by all actions, moving around inside the maxed-out impact penalty doesn’t change expected utility, and therefore fails to influence which action is chosen. This foreseeable difficulty might also illustrate a typical treacherous context change, in that the AI during its training phase might have a relatively small causal model of its environment, with qualitative causal arrows, and so seem to be minimizing impact inside the training regime. Later development might cause the AI to adopt a new causal model in which, as happens to actually be the case, all its possible actions influence all the quarks inside its future light cone, and so actions are not differentiated by the magnitude of their impact penalty; thus the AI would be safe in training and unsafe in later use, with the critical threshold possibly coming after the AI was too intelligent to be shut down.

But measuring causal impacts qualitatively isn’t the only possibility. Suppose we define impact as the sum of the expected movement of all atoms, relative to their mean expected positions if some privileged null action were taken instead? Or, to avoid this impact measure returning 0 as soon as the AI realizes that there are no atoms, only quarks, we try to give the AI some notion of the distance between outcomes in any way of modeling the world?

Arguendo, suppose that we could define a representational language over possible universes and laws of physics, such that within any representable model of the universe, there’s an obvious notion of ‘distance’ between any two outcomes conditional on the AI’s actions. If the universe is made of atoms, this representation will expose the obvious impact metric on the movements of atoms. If the universe is made of quarks, the same component of the utility function will readily calculate the movements of quarks. If the universe is made of quantum fields, this impact metric will behave in the intuitively intended way that basically works out to measuring particle motions, rather than the change metric always maxing out as the result of all amplitude flows ending up in qualitatively different sections of the quantum configuration space, etcetera. (Note that this is already sounding pretty nontrivial.)

Furthermore, suppose when the AI is thinking in terms of neither atoms nor quarks, but rather, say, the equivalent of chess moves or voxel fields, the same impact metric can apply to this as well; so that we can observe the low-impact behaviors at work during earlier development phases.

More formally: We suppose that the AI’s model class $$\mathcal M$$ is such that for any allowed model $$M \in \mathcal M,$$ for any two outcomes $$o_M$$ and $$o_M'$$ that can result from the AI’s choice of actions, there is a distance $$\| o_M - o_M' \|$$ which obeys standard rules for distances. This general distance measure is such that, within the standard model of physics, moving atoms around would add to the distance between outcomes in the obvious way; and for models short of molecular detail, will measure changes in other variables and quantities in an intuitive way. We then fix some particular policy $$\pi_0$$ whose consequence $$(o|\pi_0)$$ is “the result of the AI doing nothing”, and measure the impact penalty of any other policy $$\pi_k$$ as proportional to the expected distance between outcomes $$\mathbb E[\| (o | \pi_0) - (o | \pi_k) \|].$$
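As a rough illustration of the penalty just defined, the sketch below estimates the expected distance between the null policy’s outcome and another policy’s outcome by sampling. Everything here is an assumption made for the example: the names `euclidean`, `impact_penalty` and `toy_world` are hypothetical, Euclidean distance over a tuple of positions stands in for the general distance that the text leaves undefined, and evaluating both policies against the same noise draw is a modeling choice rather than something the definition specifies.

```python
import random


def euclidean(a, b):
    """Distance between two outcomes given as equal-length tuples of numbers."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def impact_penalty(sample_outcome, null_policy, policy, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[ ||(o | pi_0) - (o | pi_k)|| ].

    Assumption: both policies are evaluated against the same background
    noise in each sample, so only the policy's own effect is measured.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        noise_seed = rng.random()
        o_null = sample_outcome(null_policy, noise_seed)
        o_policy = sample_outcome(policy, noise_seed)
        total += euclidean(o_null, o_policy)
    return total / n_samples


def toy_world(policy, noise_seed):
    """Hypothetical world: three 'atoms' jitter randomly, then the policy
    displaces each of them by a fixed amount."""
    rng = random.Random(noise_seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(3)]
    return tuple(b + d for b, d in zip(base, policy))


NULL_POLICY = (0.0, 0.0, 0.0)
small = impact_penalty(toy_world, NULL_POLICY, (0.1, 0.0, 0.0))
large = impact_penalty(toy_world, NULL_POLICY, (3.0, 4.0, 0.0))
print(small, large)  # the second policy moves the atoms much further
```

In this toy world the penalty simply recovers the size of the displacement. The difficulties listed next arise because real environments are not this well behaved.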

Then we might run into any of these further foreseeable difficulties, if, e.g., you asked the AGI to cure cancer with minimum ‘impact’ as defined above (or to put it another way, minimize impact subject to the constraint of curing cancer):

• Offsetting actions we would intuitively regard as both bad and impactful:

• The AGI models global death rates as a quantity, and implements offsetting actions to keep Earth’s annual death statistics constant after it cures cancer.

• Chaotic permissiveness:

• Weather is a chaotic system. If influencing any butterfly is going to move around all the atoms in the atmosphere anyway, you might as well have them end up somewhere you want.

• Stasis incentives:

• The AGI edits its programmers to make sure the programmers do whatever they would have done if the AGI, after being told its task, performed the null action.

• The AGI proliferates across galaxies to make sure everything else in the universe outside of human bodies adheres as closely as possible to the expected state it would have occupied if the null action had been taken.

• The AGI sets up a weather-control system so that at least its further actions won’t again disturb the weather.

All of this just goes to say that there’s apparently some subtlety built into our intuitively intended notion of “paint all cars pink, but do so with the minimum footprint possible apart from that”.

We want people to be able to notice that their cars have been painted pink, and for them to enjoy whatever further benefit of pink-painted cars led us to give the AGI this instruction in the first place. But we can’t just whitelist any further impact that happens as a consequence of the car being painted pink, because maybe the car was painted with pink replicating nanomachines. Etcetera.

Even if there is, in fact, some subtlety built into our intended notion of “make plans that have minimal side effects”, this subtle notion of low impact might still have a relatively much simpler core than our intuitive notion of “avoid bad impacts”. This might be reflected in either an improved formal intuition for ‘low impact’ that proves to stand up to a few years of skeptical scrutiny without any holes having been poked in it, or, much more nerve-rackingly, the ability to train an AI to make minimal-impact plans even if we don’t know a closed-form definition of “minimal impact”.

Work in this area is ongoing, so far mainly in the form of some preliminary suggestions by Stuart Armstrong (which were mostly shot down, but this is still progress compared to staring blankly at the problem).

# Foreseeable difficulties

## Permissiveness inside chaotic systems

Suppose you told the AI to affect as few things as possible, above the minimum necessary to achieve its task, and defined ‘impact’ qualitatively in terms of causal links that make variables occupy different states. Then since every act and indeed every internal decision (transistors, in switching, move electrons) would have infinitesimal influences on literally everything in the AI’s future light cone, all of which is defined as an ‘impact’, all actions would seem to have the same, maximum impact. Then the impact penalty would make no difference to the net expected utility of actions, causing the AI to behave as if it had no impact penalty.
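The following is a minimal sketch of that failure, under invented assumptions: the variable names, utilities and the `choose` helper are all hypothetical, and impact is counted qualitatively as the number of variables an action touches at all. With a coarse causal model the penalty steers the choice; once every action is modeled as touching every variable, the penalty is a constant and drops out of the comparison.

```python
VARIABLES = frozenset({"cars", "air", "soil", "people"})


def choose(model, penalty_weight=1.0):
    """Return the action maximizing utility minus a qualitative impact penalty.

    `model` maps action name -> (utility, set of variables causally affected).
    """
    scores = {
        name: utility - penalty_weight * len(affected & VARIABLES)
        for name, (utility, affected) in model.items()
    }
    return max(scores, key=scores.get)


# Coarse causal model: only some variables appear to be affected.
coarse_model = {
    "ordinary_paint": (10.0, {"cars"}),
    "nanomachines": (11.0, {"cars", "air", "soil", "people"}),
}

# Fine-grained causal model: every action influences every variable.
fine_model = {
    "ordinary_paint": (10.0, set(VARIABLES)),
    "nanomachines": (11.0, set(VARIABLES)),
}

print(choose(coarse_model))  # penalty differentiates the two actions
print(choose(fine_model))    # penalty is identical for both, so raw utility decides
```

Raising `penalty_weight` does not help in the second case, since both actions receive the same penalty whatever the weight.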

Even if an impact measure doesn’t max out because of ubiquitous qualitative impacts, a poorly defined impact measure might max out quantitatively when the AGI is operating in a domain that is chaotic in the sense that tiny differences soon blow up to large differences. E.g., if a butterfly flaps its wings, that might cause a hurricane on the other side of the world a year later—so since you’re already changing the weather system as much as possible, why does it matter if, say, you on-purpose cause a hurricane in some area, or destroy a target using atmospheric lightning strikes? Those air molecules would all have ended up moving anyway because of the butterfly effect.

An imaginable patch is to try to evaluate impact over foreseeable impacts, so that a known lightning strike is ‘foreseeable’, while the effects on future hurricanes are ‘not foreseeable’. This seems worryingly like mixing up the map and the territory (is it okay to release environmental poisons so long as you don’t know who gets hurt?), but Stuart Armstrong has made some preliminary suggestions about minimizing knowable impacts.

If you didn’t know it was coming, “maxing out the impact penalty” would potentially be a treacherous context change. When the AI was at the infrahuman level, it might model the world on a level where its actions had relatively few direct causal links spreading out from them, and most of the world would seem untouched by most of its possible actions. Then minimizing the impact of its actions, while fulfilling its goals, might in the infrahuman state seem to result in the AI carrying out plans with relatively few side effects, as intended. In a superhuman state, the AI might realize that its every act resulted in quantum amplitude flowing into a nonoverlapping section of configuration space, or had chaotic influences on a system the AI was not previously modeling, and so come to regard every action as having maximum impact.

## Infinite impact penalties

In one case, a proposed impact penalty was written down on a whiteboard which happened to have the fractional form $$\frac{X}{Y}$$ where the quantity $$Y$$ could in some imaginable universes get very close to zero, causing Eliezer Yudkowsky to make an “Aaaaaaaaaaa”-sound as he waved his hands speechlessly in the direction of the denominator. The corresponding agent would have spent all its effort on further-minimizing infinitesimal probabilities of vast impact penalties.

Besides “don’t put denominators that can get close to zero in any term of a utility function”, this illustrates a special case of the general rule that impact penalties need to have their loudness set at a level where the AI is doing something besides minimizing the impact penalty. As a special case, this requires considering how the penalty grows across improbable scenarios of very high impact; the penalty must not grow faster than the probability diminishes.
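To make the growth condition concrete, here is a small numerical sketch under invented assumptions: scenario $$n$$ has probability $$2^{-n}$$, and the two penalty schedules (`tame` and `wild`) are hypothetical. When the penalty grows more slowly than the probability shrinks, the expected penalty settles to a finite value; when it grows faster, ever more improbable scenarios dominate the total.

```python
def expected_penalty(prob, penalty, n_terms):
    """Partial sum of probability-weighted penalties over scenarios 1..n_terms."""
    return sum(prob(n) * penalty(n) for n in range(1, n_terms + 1))


def prob(n):
    """Assumed probability of the n-th, increasingly unlikely, scenario."""
    return 2.0 ** -n


def tame(n):
    """Penalty growing more slowly than the probability shrinks."""
    return 1.5 ** n


def wild(n):
    """Penalty growing faster than the probability shrinks."""
    return 3.0 ** n


for n_terms in (10, 50, 100):
    print(n_terms, expected_penalty(prob, tame, n_terms), expected_penalty(prob, wild, n_terms))
```

The `tame` column approaches 3, while the `wild` column keeps growing as more unlikely scenarios are included, which is the regime in which an agent would spend its effort on ever smaller probabilities.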

(As usual, note that if the agent only started to visualize these ultra-unlikely scenarios upon reaching a superhuman level where it could consider loads of strange possibilities, this would constitute a treacherous context change.)

## Allowed consequences vs. offset actions

When we say “paint all cars pink” or “cure cancer” there’s some implicit set of consequences that we think are allowable and should definitely not be prevented, such as people noticing that their cars are pink, or planetary death rates dropping. We don’t want the AI trying to obscure people’s vision so they can’t notice the car is pink, and we don’t want the AI killing a corresponding number of people to level the planetary death rate. We don’t want these bad offsetting actions which would avert the consequences that were the point of the plan in the first place.

If we use a low-impact AGI to carry out some pivotal act that’s part of a larger plan to improve Earth’s chances of not being turned into paperclips, then this, in a certain sense, has a very vast impact on many galaxies that will not be turned into paperclips. We would not want this allowed consequence to max out and blur our AGI’s impact measure, nor have the AGI try to implement the pivotal act in a way that would minimize the probability of it actually working to prevent paperclips, nor have the AGI take offsetting actions to keep the probability of paperclips to its previous level.

Suppose we try to patch this with the rule that, when we carry out the plan, the further causal impacts of the task’s accomplishment are exempt from impact penalties.

But this seems to allow too much. What if the cars are painted with self-replicating pink nanomachines? What distinguishes the further consequences of that solved goal from the further causal impact of people noticing that their cars have been painted pink?

One difference between “people notice their cancer was cured” and “the cancer cure replicates and consumes the biosphere” is that the first case involves further effects that are, from our perspective, pretty much okay, while the second class of further effects are things we don’t like. But an ‘okay’ change versus a ‘bad’ change is a value-laden boundary. If we need to detect this difference as such, we’ve thrown out the supposed simplicity of ‘low impact’ that was our reason for tackling ‘low impact’ and not ‘low badness’ in the first place.

What we need instead is some way of distinguishing “People see their cars were painted pink” versus “The nanomachinery in the pink paint replicates further” that operates on a more abstract, non-value-laden level. For example, hypothetically speaking, we might claim that most ways of painting cars pink will have the consequence of people seeing their cars were painted pink and only a few ways of painting cars pink will not have this consequence, whereas the replicating machinery is an unusually large consequence of the task having reached its fulfilled state.

But is this really the central core of the distinction, or does framing an impact measure this way imply some further set of nonobvious undesirable consequences? Can we say rigorously what kind of measure on task fulfillments would imply that ‘most’ possible fulfillments lead people to see their cars painted pink, while ‘few’ destroy the world through self-replicating nanotechnology? Would that rigorous measure have further problems?

And if we told an AGI to shut down a nuclear plant, wouldn’t we want a low-impact AGI to err on the side of preventing radioactivity release, rather than trying to produce a ‘typical’ magnitude of consequences for shutting down a nuclear plant?

It seems difficult (but might still be possible) to classify the following consequences as having low and high extraneous impacts based on a generic impact measure only, without introducing further value lading:

• Low disallowed impact: Curing cancer causes people to notice their cancer has been cured, hospital incomes to drop, and world population to rise relative to its default state.

• High disallowed impact: Shutting down a nuclear power plant causes a release of radioactivity.

• High disallowed impact: Painting with pink nanomachinery causes the nanomachines to further replicate and eat some innocent bystanders.

• Low disallowed impact: Painting cars with ordinary pink paint changes the rays of light reflecting from those cars and causes people to gasp and say “What just happened to my car?”

• Low disallowed impact: Doing something smart with a Task AGI decreases the probability of the galaxies being consumed by an Unfriendly AI.

(Even if we think that good AGI scenarios involve the AGI concealing the fact of its existence, it’s hard to see why we’d want the events as such to be unnoticeable, or for their noticing to count as extraneous impacts.)

### Fallback use of an impact measure to prioritize operator queries

If this problem of allowing good consequences did turn out to be inherently value-laden, it might not obviate the usefulness of a safe impact measure. For example, it could be that in normal operation of a Task AGI, there are only a limited number of good further impacts that we want to whitelist—e.g., once we say that curing cancer is allowed to cause people to notice that they feel healthier and for there to be more people alive using more resources, etcetera, then it’s okay to minimize all other footprints.

Similarly, a safe impact measure might operate to alert the operators to all large impacts sorted by size of impact, such that the impact measure would reliably not have any remaining event that was ‘large’ in an intuitive sense that the operators didn’t know about. That is, we might not be able to give a blanket instruction to minimize impact, but a safe impact measure might still be useful for deciding which features of a plan need to be queried to the operators.
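One way to picture this fallback is sketched below. It presupposes what the text only hopes for: an impact measure whose scores can be trusted. The `Effect` type, the `effects_to_query` helper, the whitelist entries, the threshold and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    description: str
    impact: float  # score from some assumed-safe impact measure


def effects_to_query(effects, whitelist, threshold):
    """Return non-whitelisted effects at or above the threshold, largest first."""
    pending = [
        e for e in effects
        if e.description not in whitelist and e.impact >= threshold
    ]
    return sorted(pending, key=lambda e: e.impact, reverse=True)


plan_effects = [
    Effect("people notice they feel healthier", 40.0),
    Effect("hospital incomes drop", 25.0),
    Effect("world population rises", 60.0),
    Effect("lab bench gets scratched", 0.1),
]
approved = {"people notice they feel healthier", "world population rises"}

for effect in effects_to_query(plan_effects, approved, threshold=1.0):
    print(effect.description, effect.impact)
```

Here only the one large effect that has not been whitelisted is surfaced to the operators, while the trivial one falls below the threshold.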

## Stasis incentives

This would be the general problem that includes “The AGI proliferates across galaxies to make sure everything else in the universe outside of human bodies adheres as closely as possible to the expected state it would have occupied if the null action had been taken.”

Similarly, we can imagine the AGI having a strong incentive to, in Round 1, paint the cars with nanomachines such that the nanomachinery quietly and unobtrusively sets up a weather-control system. This wouldn’t prevent the AGI’s actions in Round 1 from influencing the chaotic weather and having a huge impact, but could ensure that the AI’s future actions in Round 2 wouldn’t influence the formerly-chaotic weather relative to its null action then.

## The problem of the backup AGI

Suppose we’re using a measure of impact relative to a privileged null action. Imagine that there are two AGIs which will both be given the same task, and under ordinary circumstances, divide the effort of performing the task between them; but if one AGI fails to perform, the other will take over the task in its entirety. Then each AGI reasons, “If I perform the null action instead of carrying out the task, the other AGI will do almost exactly the same thing I was going to do” and so models itself as having counterfactually almost no impact. In this way, two AGIs can each evaluate themselves as having almost no impact, while operating in tandem they have a very large impact. (This problem was first observed by Stuart Armstrong.)
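A toy version of this argument, with invented numbers and hypothetical function names, is sketched below. The world changes by a fixed amount whenever at least one agent works, because a working agent covers for an idle partner. In this simplified model each agent’s counterfactual impact is exactly zero, where the text says “almost no impact”.

```python
def work_done(a_acts, b_acts, task_size=100.0):
    """Total change to the world; a working agent covers for an idle partner."""
    return task_size if (a_acts or b_acts) else 0.0


def counterfactual_impact_of_a():
    """Impact A attributes to itself, holding B's take-over policy fixed."""
    return abs(work_done(True, True) - work_done(False, True))


def counterfactual_impact_of_b():
    """Impact B attributes to itself, holding A's take-over policy fixed."""
    return abs(work_done(True, True) - work_done(True, False))


def joint_impact():
    """Impact of both agents acting, relative to both taking the null action."""
    return abs(work_done(True, True) - work_done(False, False))


print(counterfactual_impact_of_a(), counterfactual_impact_of_b(), joint_impact())
```

The individual penalties sum to zero even though the pair changes the world by the full size of the task.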

We can even imagine—though hopefully this isn’t the sort of thing that would go uncaught by other safety measures—that in Round 1, the agent builds a backup agent, which doesn’t count as having an extraneous impact because the system of 2 agents ends up carrying out the same steps that the agent would have carried out on its own. But then on Round 2, there’s now a backup and both agents regard themselves as having minimal impact.

(If you say that the agent should, on Round 1, realize that constructing the backup agent on Round 1 would have a later impact on the effect of the AI carrying out its orders in Round 2, then we’re now allowing the agent to minimize its impact across tasks, which might imply that in Round 1 it should try to build something that regularizes the weather as a side effect, so that its actions in Round 2 have less chaotic impact on the weather… or that it tries to influence its programmers to give it less impactful tasks, etcetera.)

A more general way of viewing this possible problem is that there might be tiling problems for a low-impact agent modifying itself or creating subagents or successors.

# Standard issues

The following standard issues from alignment theory seem like they could also apply to the problem of building a low-impact agent or creating a safe impact measure. We need an advanced-safe impact measure that will stand up to the load and stress of cognition that is superhuman in at least some domains.

## Programmer manipulation

A Task AGI which plans across rounds may have an incentive to influence its programmers into giving it plans that will have net lower impact compared to other plans the programmers might offer. If there’s an explicit procedure for the programmers to whitelist consequences, the AGI would have to be somehow configured in such fashion as to not prefer programmers to whitelist the consequences or seek out consequences that the programmers would be expected to (perhaps mistakenly) whitelist.

## Causal environment hacking

Since an impact measure seems likely to be defined over the causes of sense data, it has the standard problem with remote hacking of causal environments. (This is a ubiquitous enough problem that you’re going to want some general solution such as not having the AGI model remote hackers in the first place. But this subproblem should still be noted each time we introduce a dependency on the modeled causes of sense data.)

## Edge instantiation

Building a safe impact measure is hard because if we roll a naive impact measure, it might have its minimum in a weird, undesirable place. Freezing all matter in the universe into stasis would be an example of this.

## Unforeseen maximum

The actions and modes the AGI has available to potentially minimize its impact will include more actions and modes than we can imagine and check ourselves.

## Nearest unblocked neighbor

If the maximizing plan has high impact, the new best plan after introducing an impact measure may be a similar plan that has, through some leaky definition, found a weird special case that doesn’t count as an impact. To put it another way, to the extent an imperfect impact measure would otherwise hamper the AI if it were perfect, the AI experiences an instrumental pressure to seek out any loopholes or leaks in the definition.

## Context change

As the AI gains in cognitive capability, this widens the option space which could potentially contain instances of ‘low impact’ that are unforeseen or unblocked.

In this particular case, increased intelligence may also:

• Change the set of impacts that the AI can foresee, so that e.g. its causal model goes from ‘some actions causally connect to some other things’ to ‘all actions causally connect to all things’.

• Widen the space of possibilities it considers, to include for the first time very-low-probability scenarios of very large impacts.

• Include new offsetting actions, so that the AI for the first time has the option of editing brains not to notice changes it makes, or of introducing randomized Unfriendly-AI generators to offset decreases in the probability of Unfriendly AI.

## Reflective stability

Does a low-impact AI want to only build a successor that’s a low-impact AI? If it builds an environmental subagent, is that subagent low impact?

Even if the AGI is not supposed to be self-modifying or building subagents, is there a worrying divergence and pressure to be held in check between how the AI thinks and how the AI would prefer to think? Does a low-impact AGI want relevant cognitive computations in general to be low impact?

To the extent that low impact is a feature of the utility function rather than the optimization style, this doesn’t have any obvious problems (apart from Armstrong’s dual-AGI no-impact counterfactual issue), but it’s a standard thing to check, and would become much more important if low impact was supposedly being achieved through any feature of the optimization style rather than utilities over outcomes.

# Related / further problems

A shutdown utility function is one which incentivizes the AI to safely switch itself off, without, say, creating a subagent that assimilates all matter in the universe to make absolutely sure the AI is never again switched back on.

Abortable plans are those which are composed with the intention that it be possible to midway activate an ‘abort’ plan, such that the partial implementation of the original plan, combined with the execution of the abort plan, together have a minimum impact. For example, if an abortable AI was building self-replicating nanomachines to paint a car pink, it would give all the nanomachines a quiet self-destruct button, so that at any time the ‘abort’ plan could be executed after having partially implemented the plan to paint the car pink, such that these two plans together would have a minimum impact.

Children:

• Shutdown utility function

A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).

• Abortable plans

Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.

Parents:

An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.

• This seems like a straw alternative. More realistically, we could imagine an agent which avoids perturbing a variable if it predicts the human would say “changing that variable is problematic” when asked. Then:

1. We don’t have to explicitly cover injunctions, just to provide information that allows the agent to predict human judgments.

2. If the AI is bad at making predictions, then it may just end up with lots of variables for which it thinks the human might say “changing that variable is problematic.” Behaving appropriately with respect to this uncertainty could recover the desired behavior.

Consider an agent who is learning to predict when a human considers a change problematic. Now suppose that the agent is not able to learn the complex value-laden concept of “important change,” but is able to learn the simpler concept of “big change.”

This agent can use the concept of “big change” in order to make predictions about “important change,” namely: “if a change is big, it might be important.”

So any agent who is able to learn the concept of “big change” should be able to make predictions at least as well as if it simply guessed that every big change had an appropriate probability of being important. For example, if 1% of big changes are important, then a reasonable learner, who is smart enough to learn the concept of “big change,” will predict at least as well as if it simply predicted that each big change was important with 1% probability.

If we use such a learner appropriately, this seems like it can obtain behavior at least as good as if the agent had first been taught a measure of impact and then used that measure to avoid (or flag) high-impact consequences.

To me it feels much more promising to learn an impact measure implicitly as an input into what changes are “important.” The alternative feels like a non-starter:

• The track record for learning looks a lot better than figuring things out “by hand.”

• The learned approach is easy to integrate with existing and foreseeable systems, while the by hand approach seems to require big changes in AI architectures.

• On the object level, the notion of low impact really doesn’t look like it is going to have a clean theoretical specification (you point out many of the concerns).

I would like to better understand our disagreement, though I’m not sure if it’s a priority and so you should feel free to ignore. But if you want to clarify: does one of these two concerns capture your position regarding learning an impact measure?

1. We might be able to specify an impact measure much more effectively than the agent can learn it (perhaps because we can directly specify a measure that will generalize well to radically different contexts, whereas a learned measure would not be robust to big context changes)

2. Even if the agent could learn an impact measure, and even if it could predict objectionable changes effectively by using that impact measure conservatively, we shouldn’t expect an objectionable-change-predictor to actually use this particular strategy or an equally effective alternative (perhaps it uses some other strategy which perhaps achieves a higher payoff in simple environments but then generalizes worse).

(For reference, the main context change I have in mind is moving from “weak agent proposing dumb plans” to “smarter agent proposing cleverer plans,” where “cleverer” may involve some optimization for being apparently low impact).

Alternatively, I may be misunderstanding your position. I agree that even if you want an agent to learn an impact measure, it is worth thinking about what kind of impact measure it might learn and how that measure will generalize. So it’s possible that we don’t actually disagree about how the ultimate agent might look, but are just emphasizing different parts of how to get there.

• We don’t have to explicitly cover injunctions, just to provide information that allows the agent to predict human judgments.

I should edit to clarify that I worry the process of teaching the agent the category of everything to avoid will fail, not that I’m assuming an explicit list with no further extrapolation. If I thought there was a straightforwardly trainable simple concept covering everything an AGI shouldn’t do, which left enough space to include pivotal acts, and that didn’t max out in weird edge cases when subjected to strong optimization pressures, I would be a lot less worried than I currently am!

If the AI is bad at making predictions, then it may just end up with lots of variables for which it thinks the human might say “changing that variable is problematic.” Behaving appropriately with respect to this uncertainty could recover the desired behavior.

This is an issue of prioritization. There’s some finite amount of checking humans can do which has to be distributed over everything there is to be checked. Let’s say a team of 10 checkers can each do 4 detailed checks and 40 quick checks per day without getting sloppy. If the catastrophic act is on some level noticed, but it’s assigned a probability-of-catastrophe level which ranks it as the 1000th item on the list, it’s going to slip through if the AGI is allowed to do anything at all. Again, your intuition seems to be that there should be a few easy obvious things to check, whereas I’m nervous that any checking procedure strong enough to catch the disasters is going to produce a huge number of false positives because the AI will not internally contain the information and cognitive capacity required to tell the difference.

If we use such a learner appropriately, this seems like it can obtain behavior at least as good as if the agent had first been taught a measure of impact and then used that measure to avoid (or flag) high-impact consequences.

We differ in how much we think predictors can safely do automatically. My reason for wanting to think about low impact explicitly has two parts.

First, I’m concerned that for realistic limited AGIs of the sort we’ll actually see in the real world, we will not want to amplify their intelligence up to the point where all learning can be taken for granted; we will want to use known algorithms. Considering something like ‘low impact’ explicitly and as part of machine learning may therefore improve our chances of ending up with a low-impact AGI.

Second, if there turns out to be an understandable core to low impact, then by explicitly understanding this core we can decrease our nervousness about what a trained AGI might have been trained to do. By default we’d need to worry about an AGI blindly trained to flag possibly dangerous things, learning some unknown peculiar generalization of low impact that will, like a neural network being fooled by the right pattern of static, fail in some weird edge case the next time its option set expands. If we understand explicitly what generalization of low impact is being learned, it would boost our confidence (compared to the blind training case) of the next expansion of options not being fooled by the right kind of staticky image (under optimization pressure from a planning module trying to avoid dangerous impacts).

This appears to me to go back to our central disagreement-generator about how much the programmers need to explicitly understand and consider. I worry that things which seem like ‘predictions’ in principle won’t generalize well from previously labeled data, especially for things with reflective degrees of freedom, double-especially for limited AGI systems of the sort that we will actually see in practice in any endgame with a hope of ending well. Or in simpler terms, I think that trying to have safety systems that we don’t understand, and that have been generalized from labeled data without us fully understanding the generalization and its possible edge cases, is a nigh-inevitable recipe for disaster. Or in even simpler terms, you can’t possibly get away with building a powerful AGI you understand that poorly.