Relevant limited AI

It is an open problem to propose a limited AI that would be relevant to the value achievement dilemma—an agent cognitively constrained along some dimensions that render it much safer, but still able to perform some task useful enough to prevent catastrophe.

Basic difficulty

Consider an Oracle AI so constrained that it may output only proofs, in HOL, of input theorems. These proofs are checked by a simple and secure-seeming verifier running in a sandbox, whose exact code is unknown to the Oracle. The verifier outputs 1 if the proof is valid and 0 otherwise, then discards the proof data. Suppose also that the Oracle runs in a shielded box, and so on.
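The one-bit protocol described above can be sketched in a few lines. This is only a toy illustration under loud assumptions: the proof system here is a tiny modus-ponens checker, not HOL, and every name in it (check_proof, one_bit_channel, the example oracles) is hypothetical. It shows the shape of the interface, in which the only thing that leaves the sandbox is a single bit; it makes no claim about how a real verifier would be secured.

```python
from typing import Callable, Iterable, List, Tuple, Union

# Toy formulas (an assumption for illustration, not HOL):
# an atom is a str; an implication is the tuple ("->", antecedent, consequent).
Formula = Union[str, Tuple[str, "Formula", "Formula"]]
Proof = List[Formula]


def check_proof(theorem: Formula, proof: Proof, axioms: Iterable[Formula]) -> bool:
    """Accept a proof iff every step is an axiom or follows by modus ponens
    from earlier steps, and the final step is the theorem itself."""
    axiom_list = list(axioms)
    derived: List[Formula] = []
    for step in proof:
        is_axiom = step in axiom_list
        by_modus_ponens = any(
            ("->", premise, step) in derived for premise in derived
        )
        if not (is_axiom or by_modus_ponens):
            return False
        derived.append(step)
    return bool(derived) and derived[-1] == theorem


def one_bit_channel(
    oracle: Callable[[Formula], Proof],
    theorem: Formula,
    axioms: Iterable[Formula],
) -> int:
    """Ask the (untrusted) oracle for a proof, verify it, and return only 1 or 0.
    The proof is never returned to the caller; malformed output counts as 0."""
    try:
        proof = list(oracle(theorem))
        verdict = 1 if check_proof(theorem, proof, axioms) else 0
    except Exception:
        verdict = 0
    return verdict  # the proof goes out of scope here and is discarded


# Hypothetical example: axioms "p" and "p -> q"; the theorem is "q".
AXIOMS = ["p", ("->", "p", "q")]


def honest_oracle(theorem: Formula) -> Proof:
    return ["p", ("->", "p", "q"), "q"]


def bluffing_oracle(theorem: Formula) -> Proof:
    return ["q"]  # asserts the theorem without deriving it


honest_bit = one_bit_channel(honest_oracle, "q", AXIOMS)
bluff_bit = one_bit_channel(bluffing_oracle, "q", AXIOMS)
```

In this sketch the honest oracle yields 1 and the bluffing oracle yields 0. The design point is that the caller learns only that a theorem is provable, which is exactly why the usefulness of such a system is in question.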

It’s possible that this Provability Oracle has been so constrained that it is cognitively containable (it has no classes of options we don’t know about). If the verifier is unhackable, it gives us trustworthy knowledge that a theorem is provable. But this limited system is not obviously useful in a way that enables humanity to extricate itself from its larger dilemma. Nobody has yet stated a plan which could save the world if only we had a superhuman capacity to detect which theorems were provable in Zermelo-Fraenkel set theory.

Saying “The solution is for humanity to only build Provability Oracles!” does not resolve the value achievement dilemma because humanity does not have the coordination ability to ‘choose’ to develop only one kind of AI over the indefinite future, and the Provability Oracle has no obvious use that prevents non-Oracle AIs from ever being developed. Thus our larger value achievement dilemma would remain unsolved. It’s not obvious how the Provability Oracle would even constitute significant strategic progress.

Open problem

Describe a cognitive task or real-world task for an AI to carry out that makes great progress on the value achievement dilemma if executed correctly, and that can be done with a limited AI, in that the task:

  1. Has a real-world solution state that is exceptionally easy to pinpoint using a utility function, thereby avoiding some of edge instantiation, unforeseen maximums, context change, programmer maximization, and the other pitfalls of advanced safety, if there is otherwise a trustworthy solution for low-impact AI; or

  2. Seems exceptionally implementable using a known-algorithm non-self-improving agent, thereby averting problems of stable self-modification, if there is otherwise a trustworthy solution for a known-algorithm non-self-improving agent; or

  3. Constrains the agent’s option space so drastically that the strategy space is no longer rich (and the agent is hence containable), while still containing a trustworthy, otherwise unfindable solution to some challenge that resolves the larger dilemma.

Additional difficulties

(Fill in this section later; all the things that go wrong when somebody eagerly says something along the lines of “We just need AI that does X!”)

