A standard agent:

1. Observes reality

2. Uses its observations to build a model of reality

3. Uses its model to forecast the effects of possible actions or policies

4. Chooses among policies on the basis of its utility function over the consequences of those policies

5. Carries out the chosen policy

(…and then observes the actual results of its actions, and updates its model, and considers new policies, etcetera.)
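To make the loop concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of any real design: the `ToyAgent` class, the one-integer "world", the three candidate policies, and the utility function are all hypothetical. The only point is to show the five steps as separate, separable pieces of one program.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyAgent:
    """Hypothetical toy agent; the 'world' is a single integer."""
    utility: Callable[[int], float]   # utility over forecast outcomes
    policies: List[int]               # candidate actions: amounts to add
    model: Dict[str, int] = field(default_factory=lambda: {"state": 0})

    def observe(self, world_state: int) -> int:
        return world_state                      # step 1: observe reality

    def update_model(self, observation: int) -> None:
        self.model["state"] = observation       # step 2: build a model

    def forecast(self, policy: int) -> int:
        return self.model["state"] + policy     # step 3: forecast effects

    def choose(self) -> int:
        # step 4: pick the policy whose forecast outcome has highest utility
        return max(self.policies, key=lambda p: self.utility(self.forecast(p)))


def run(agent: ToyAgent, world_state: int, steps: int) -> List[int]:
    history = []
    for _ in range(steps):
        agent.update_model(agent.observe(world_state))
        action = agent.choose()
        world_state += action                   # step 5: carry out the policy
        history.append(world_state)
    return history


# Assumed utility: the agent prefers world states close to 5.
agent = ToyAgent(utility=lambda s: -abs(s - 5), policies=[-1, 0, 1])
trajectory = run(agent, world_state=0, steps=7)
print(trajectory)
```

In this sketch, a pseudoagent corresponds to a program that keeps some of these methods and drops or alters others, for instance keeping `forecast` but replacing `choose`.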

It’s conceivable that a cognitively powerful program could carry out some, but not all, of these activities. We could call this an “advanced pseudoagent” or “advanced nonagent”.

# Example: Planning Oracle

Imagine that we have an Oracle agent which outputs a plan $$\pi_0$$ which is meant to maximize lives saved or eudaimonia etc., assuming that the human operators decide to carry out the plan. By hypothesis, the agent does not assess the probability that the plan will be carried out, or try to maximize the probability that the plan will be carried out.

We could look at this as modifying step 4 of the loop: rather than this pseudoagent selecting the output whose expected consequences optimize its utility function, it selects the output that optimizes utility assuming some other event occurs (the humans deciding to carry out the plan).

We could also look at the whole Oracle schema as interrupting step 5 of the loop. If the Oracle works as intended, its purpose is not to immediately output optimized actions into the world; rather, it is meant to output plans for humans to carry out. This, though, is more of a metaphorical or big-picture property. If not for the modification of step 4, where the Oracle calculates $$\mathbb E [U \mid \operatorname{do}(\pi_0), \text{HumansObeyPlan}]$$ instead of $$\mathbb E [U \mid \operatorname{do}(\pi_0)],$$ the Oracle’s outputted plans would just be its actions within the agent schema above. (And it would optimize the general effects of its plan-outputting actions, including the problem of getting the humans to carry out the plans.)
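The difference between the two selection rules can be illustrated with a toy calculation. This is a sketch under invented assumptions: the two plans, their probabilities of being obeyed, and their utilities are made-up numbers, and `agent_score` and `oracle_score` are hypothetical names. The first scores a plan by the unconditional expectation, which rewards plans that humans are likely to follow; the second scores a plan only by its expected utility given that the humans obey.

```python
from typing import Dict, NamedTuple


class Plan(NamedTuple):
    p_obey: float        # assumed P(HumansObeyPlan | do(plan))
    u_if_obeyed: float   # assumed E[U | do(plan), HumansObeyPlan]
    u_if_ignored: float  # assumed E[U | do(plan), humans do not obey]


# Invented numbers, chosen only so that the two rules disagree.
PLANS: Dict[str, Plan] = {
    "candid": Plan(p_obey=0.3, u_if_obeyed=10.0, u_if_ignored=0.0),
    "persuasive": Plan(p_obey=0.9, u_if_obeyed=8.0, u_if_ignored=0.0),
}


def agent_score(plan: Plan) -> float:
    """E[U | do(plan)]: averages over whether the humans obey."""
    return plan.p_obey * plan.u_if_obeyed + (1 - plan.p_obey) * plan.u_if_ignored


def oracle_score(plan: Plan) -> float:
    """E[U | do(plan), HumansObeyPlan]: ignores the probability of obedience."""
    return plan.u_if_obeyed


agent_choice = max(PLANS, key=lambda name: agent_score(PLANS[name]))
oracle_choice = max(PLANS, key=lambda name: oracle_score(PLANS[name]))
print(agent_choice, oracle_choice)
```

With these numbers the unconditional rule prefers the more persuasive plan, while the conditional rule prefers the plan that does best if actually followed. The sketch says nothing about how one would build a system that reliably computes the second quantity.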

# Example: Imitation-based agents

Imitation-based agents would modify steps 3 and 4 of the loop by “trying to output an action indistinguishable from the output of the human imitated” rather than forecasting consequences or optimizing over them. The exceptions would be cases where forecasting consequences is important for guessing what the human would do, or where the agent is internally imitating a human mode of thought that involves imagining consequences and choosing between them. “Imitation-based agents” might justly be called pseudoagents in this schema.

(But the “pseudoagent” terminology is relatively new, and a bit awkward, and it won’t be surprising if we all go on saying “imitation-based agents” or “act-based agents”. The point of having terms like ‘pseudoagent’ or ‘advanced nonagent’ is to have a name for the general concept, not to reserve and guard the word ‘agent’ for only 100% real pure agents.)
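As a rough sketch of how imitation replaces steps 3 and 4, the toy code below never forecasts or scores consequences. It only records what a human did in each observed situation and samples from that record. The demonstration data and the names `fit_imitator` and `imitate` are hypothetical, and a lookup table is a deliberately crude stand-in for whatever learned model a real imitator would use.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def fit_imitator(demos: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Record every human action seen for each observation."""
    table: Dict[str, List[str]] = defaultdict(list)
    for observation, action in demos:
        table[observation].append(action)
    return dict(table)


def imitate(table: Dict[str, List[str]], observation: str,
            rng: random.Random) -> str:
    """Sample an action in proportion to how often the human chose it.

    There is no utility function and no forecast here: the output is
    chosen for resemblance to the human, not for its consequences.
    """
    return rng.choice(table[observation])


# Invented demonstrations of (observation, human action) pairs.
demos = [
    ("door closed", "open door"),
    ("door closed", "open door"),
    ("door closed", "knock"),
    ("door open", "walk through"),
]
table = fit_imitator(demos)
rng = random.Random(0)
samples = [imitate(table, "door closed", rng) for _ in range(200)]
```

A table like this cannot capture the case discussed above, where imitating the human well requires imitating the human's own consequentialist reasoning; a sufficiently capable imitator would need to model that reasoning internally.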

# Safety benefits and difficulties

Advanced pseudoagents and nonagents are usually proposed in the hope of averting some advanced safety issue that seems to arise from the agenty part of “advanced agency”, while preserving other advanced cognitive powers that seem useful.

A proposal like this can fail to the extent that it’s not pragmatically possible to unentangle one aspect of agency from another; or to the extent that removing that much agency would make the AI safe but useless.

Some hypothetical examples that would, if they happened, constitute cases of failed safety or unworkable tradeoffs in pseudoagent compromises:

• Somebody proposes to obtain an Oracle merely by giving the AI only a text output channel, and only taking what it says as suggestions, thereby interrupting the loop between the agent’s policies and its acting in the world. If this is all that changes, then from the Oracle’s perspective it’s still an agent, its text output is its motor channel, and it still immediately outputs whatever act it expects to maximize subjective expected utility, treating the humans as part of the environment to be optimized. It’s an agent that somebody is trying to use as part of a larger process with an interrupted agent loop, but the AI design itself is a pure agent.

• Somebody advocates for designing an AI that only computes and outputs probability estimates, and never searches for any expected-utility-maximizing policies, let alone outputs them. It turns out that this AI cannot manage its internal and reflective operations well, because it can’t use consequentialism to select the best thought to think next. As a result, the AI design fails to bootstrap, or fails to work sufficiently well before competing AI designs that use internal consequentialism do. (Safe but useless, much like a rock.)

• Somebody advocates that an imitative agent design will avoid invoking the advanced safety issues that seem like they should be associated with consequentialist reasoning, because the imitation-based pseudoagent never does any consequentialist reasoning or planning; it only tries to produce an output extremely similar to its training set of observed human outputs. But it turns out (arguendo) that the pseudoagent, to imitate the human, has to imitate consequentialist reasoning, and so the implied dangers end up pretty much the same.

• An agent is supposed to just be an extremely powerful policy-reinforcement learner instead of an expected utility optimizer. After a huge amount of optimization and mutation on a very general representation for policies, it turns out that the best policies, the ones most reinforced by the highest rewards, are computing consequentialist models internally. The actual result is that the AI is doing consequentialist reasoning that is obscured and hidden, since it takes place outside the designed and easily visible high-level loop of the AI.

Coming up with a proposal for an advanced pseudoagent that still did something pivotal and was actually safer would reasonably require: (a) understanding how to slice up agent properties along their natural joints; (b) understanding which advanced-agency properties lead to which expected safety problems and how; and (c) understanding which internal cognitive functions would be needed to carry out some particular pivotal task; adding up to (d) seeing an exploitable way to pry apart the advanced-AI joints.

What’s often proposed in practice is more along the lines of:

• “We just need to build AIs without emotions so they won’t have drives that make them compete with us.” (Can you translate that into the language of utility functions and consequentialist planning, please?)

• “Let’s just build an AI that answers human questions.” (It’s doing a lot more than that internally, so how are the internal operations organized? Also, what do you do with a question-answering AI that averts the consequences of somebody else building a more agenty AI?)

Coming up with a sensible proposal for a pseudoagent is hard. The reason for talking about “agents” when talking about future AIs isn’t that the speaker wants to give AIs lots of power and have them wandering the world doing whatever they like under their own drives (for this entirely separate concept, see autonomous AGI). The reason we talk about observe-model-predict-act expected-utility consequentialists is that this seems to carve a lot of important concepts at their joints. Some alternative proposals exist, but they often have a feel of “carving against the joints” or trying to push through an unnatural arrangement, and aren’t as natural or as simple to describe.
