Goodness estimate biaser

A “goodness estimate biaser” is a system setup or phenomenon that seems foreseeably likely to cause the actual goodness of some AI plan to be systematically lower than the AI’s estimate of that plan’s goodness. We want the AI’s estimate to be unbiased.

Ordinary examples

Subtle and unsubtle estimate-biasing issues in machine learning are well-known and appear far short of advanced agency:

● A machine learning algorithm’s performance on the training data is not an unbiased estimate of its performance on the test data. Some of what the algorithm seems to learn may be particular to noise in the training data. That noise will not recur in the test data, so whatever part of the fit tracks it will not carry over. Test performance is therefore not just unequal to, but systematically lower than, training performance; if we treated training performance as an estimate of test performance, it would not be an unbiased estimate.

● The Winner’s Curse from auction theory observes that if bidders have noise in their unbiased estimates of the auctioned item’s value, then the highest bidder, who receives the item, is more likely to have upward noise in their individually unbiased estimate, conditional on their having won. (E.g., three bidders with Gaussian noise in their value estimates submit bids on an item whose true value to them is 1.0; the winning bidder is likely to have valued the item at more than 1.0.)
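The Winner’s Curse example above can be checked with a small simulation. This is only a sketch under stated assumptions: each bidder simply bids their own estimate, the noise is Gaussian with a standard deviation of 0.2 (a value chosen here for illustration, not taken from the text), and the function name and parameters are hypothetical.

```python
import random
import statistics

def simulate_winners_curse(n_bidders=3, true_value=1.0, noise_sd=0.2,
                           n_auctions=20000, seed=0):
    """Return (mean of all estimates, mean of the winning estimates).

    Assumption: every bidder bids exactly their own noisy estimate,
    so the winner is whoever drew the highest estimate.
    """
    rng = random.Random(seed)
    all_estimates = []
    winning_estimates = []
    for _ in range(n_auctions):
        # Each estimate is individually unbiased: centered on the true value.
        estimates = [rng.gauss(true_value, noise_sd) for _ in range(n_bidders)]
        all_estimates.extend(estimates)
        winning_estimates.append(max(estimates))
    return statistics.fmean(all_estimates), statistics.fmean(winning_estimates)

mean_all, mean_winner = simulate_winners_curse()
print(f"mean estimate over all bidders: {mean_all:.3f}")
print(f"mean estimate of the winner:    {mean_winner:.3f}")
```

Under these assumptions the average over all bidders should sit close to 1.0, while the average winning estimate should sit noticeably above it. For three independent Gaussian estimates, the expected maximum is about 0.85 standard deviations above the mean.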

The analogous Optimizer’s Curse observes that if we make locally unbiased but noisy estimates of the subjective expected utility of several plans, then selecting the plan with ‘highest expected utility’ is likely to select an estimate with upward noise. Barring compensatory adjustments, this means that actual utility will be systematically lower than expected utility, even if all expected utility estimates are individually unbiased. Worse, if we have 10 plans whose expected utility can be unbiasedly estimated with low noise, plus 10 plans whose expected utility can be unbiasedly estimated with high noise, then selecting the plan with apparently highest expected utility favors the noisiest estimates!
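The claim about mixing low-noise and high-noise plans can be sketched the same way. The simulation below assumes, purely for illustration, that all 20 plans have the same actual utility of 0, that the noise is Gaussian, and that the two noise levels are 0.1 and 1.0; none of these numbers come from the text, and the function name is hypothetical.

```python
import random
import statistics

def simulate_optimizers_curse(n_low=10, n_high=10, sd_low=0.1, sd_high=1.0,
                              true_utility=0.0, n_trials=20000, seed=0):
    """Return (mean overestimate of the chosen plan,
               fraction of trials in which a high-noise plan was chosen)."""
    rng = random.Random(seed)
    gaps = []
    picked_high = 0
    for _ in range(n_trials):
        # Each plan is an (estimate, is_high_noise) pair; every estimate
        # is unbiased, only its noise level differs.
        plans = [(rng.gauss(true_utility, sd_low), False) for _ in range(n_low)]
        plans += [(rng.gauss(true_utility, sd_high), True) for _ in range(n_high)]
        # Select the plan with the highest apparent expected utility.
        best_estimate, is_high = max(plans)
        gaps.append(best_estimate - true_utility)
        picked_high += is_high
    return statistics.fmean(gaps), picked_high / n_trials

mean_gap, frac_high = simulate_optimizers_curse()
print(f"mean (estimate - actual) for the chosen plan: {mean_gap:.3f}")
print(f"fraction of choices that were high-noise:     {frac_high:.3f}")
```

In this toy setup the chosen plan is almost always one of the high-noise plans, and its estimate overshoots its actual utility by well over one high-noise standard deviation, even though no individual estimate is biased.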

In AI alignment

We can see many of the foreseeable difficulties alleged in AI alignment as involving similar processes, which would produce systematic downward biases in what we see as actual goodness compared to the AI’s estimate of goodness:

● Edge instantiation suggests that if we take an imperfectly or incompletely learned value function, then looking for the maximum or extreme of that value function is much more likely than usual to magnify what we see as its gaps or imperfections (because of fragility of value, plus the Optimizer’s Curse), or to destroy whatever aspects of value the AI didn’t learn about (because optimizing a subset of properties is liable to set all other properties to extreme values).

We can see this as implying both “The AI’s apparent goodness in non-extreme cases is an upward-biased estimate of its goodness in extreme cases” and “If the AI learns its goodness estimator less than perfectly, the AI’s estimates of the goodness of its best plans will systematically overestimate what we see as the actual goodness.”

● Nearest unblocked strategy generally, and especially over instrumentally convergent incorrigibility, suggests that if there are naturally-arising AI behaviors we see as bad (e.g. routing around shutdown), there may emerge a pseudo-adversarial selection of strategies that route around our attempted patches to those problems. E.g., the AI constructs an environmental subagent to continue carrying on its goals, while cheerfully obeying ‘the letter of the law’ by allowing its current hardware to be shut down. The selection is only pseudo-adversarial, since the AI has no explicit goal of thwarting us or of selecting low-goodness strategies per se. Even so, it again implies that actual goodness is likely to be systematically lower than the AI’s estimate of what it has learned as ‘goodness’, and increasingly so as the AI becomes smarter and searches a wider policy space.

Mild optimization and conservative strategies can be seen as proposals to ‘regularize’ powerful optimization in a way that decreases the degree to which goodness in training is a biased (over)estimate of goodness in execution.
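A toy model can illustrate both the widening gap under wider search and the effect of a milder selection rule. Everything here is an assumption made for illustration: actual goodness and estimation error are independent standard Gaussians, ‘search width’ is the number of candidate plans, and the milder rule is picking uniformly at random from the top 10% of estimates. That rule is a stand-in chosen here, not a method this page specifies, and the sketch does not model the pseudo-adversarial selection described above.

```python
import random
import statistics

def overestimate(n_candidates, top_fraction=None, n_trials=500, seed=0):
    """Mean of (estimated - actual goodness) for the selected plan.

    top_fraction=None: pick the single highest estimate (hard maximization).
    top_fraction=q:    pick uniformly at random among the top q of estimates
                       (a hypothetical stand-in for a milder selection rule).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        plans = []
        for _ in range(n_candidates):
            actual = rng.gauss(0.0, 1.0)             # actual goodness
            estimate = actual + rng.gauss(0.0, 1.0)  # unbiased, noisy estimate
            plans.append((estimate, actual))
        plans.sort(reverse=True)  # highest estimate first
        if top_fraction is None:
            estimate, actual = plans[0]
        else:
            k = max(1, int(n_candidates * top_fraction))
            estimate, actual = rng.choice(plans[:k])
        gaps.append(estimate - actual)
    return statistics.fmean(gaps)

results = {}
for n in (10, 100, 1000):
    results[n] = (overestimate(n), overestimate(n, top_fraction=0.1))
    hard, mild = results[n]
    print(f"N={n:5d}  hard-max gap={hard:.2f}  top-10% gap={mild:.2f}")
```

In this model the overestimate under hard maximization keeps growing as more candidates are searched, while the top-10% rule keeps it roughly flat. The milder rule also selects plans with lower actual goodness on average, so the sketch shows a reduction in bias, not a free improvement.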