Safe but useless

“This type of safety implies uselessness” (or conversely, “any AI powerful enough to be useful will still be unsafe”) is an accusation leveled against a proposed AI safety measure which, in order to make the AI safe, must be enforced to the point of making the AI useless.

For a non-AI metaphor, consider a scissors and its dangerous blades. We can have a “safety scissors” that is only just sharp enough to cut paper—but this is still sharp enough to do some damage if you work at it. If you try to make the scissors even safer by encasing the dangerous blades in foam rubber, the scissors can’t cut paper any more. If the scissors can cut paper, it’s still unsafe. Maybe you could in principle cut clay with a scissors like that, but this is no defense unless you can tell us something very useful that can be done by cutting clay.

Similarly, there’s an obvious way to try cutting down the allowed output of an Oracle AGI to the point where all it can do is tell us that a given theorem is provable from the axioms of Zermelo-Fraenkel set theory. This might prevent the AGI from hacking the human operators into letting it out, since all that can leave the box is a single yes-or-no bit, sent at some particular time. An untrusted superintelligence inside this scheme would have the option of strategically not telling us when a theorem is provable in ZF; but if the bit from the proof-verifier said that the input theorem was ZF-provable, we could very likely trust that.
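The asymmetry in this scheme can be made concrete with a toy sketch: a powerful but untrusted reasoner proposes proofs, and a small trusted verifier releases only a single bit. (The names `UntrustedOracle`, `trusted_verifier`, and the string-matching "proof check" are all illustrative inventions standing in for a real ZF proof checker, not anything from the original text.)

```python
def trusted_verifier(theorem, proof):
    """Trusted, simple code that checks a purported proof.
    Toy stand-in for a real ZF proof checker: here a 'proof' is
    accepted only if it literally ends by deriving the theorem."""
    return proof is not None and proof.endswith(theorem)

class UntrustedOracle:
    """Stands in for the powerful, untrusted reasoner inside the box.
    It may search for proofs however it likes -- and may also
    strategically stay silent about proofs it has found."""
    def __init__(self, known_proofs):
        # Mapping from theorem to a proof the oracle has found.
        self.known_proofs = known_proofs

    def propose_proof(self, theorem):
        # The oracle can withhold a proof it knows (strategic silence),
        # but it cannot forge a proof the verifier will accept.
        return self.known_proofs.get(theorem)

def query(oracle, theorem):
    """The ONLY channel out of the box: one yes/no bit.
    True is trustworthy (the verifier checked an actual proof);
    False only means 'no verified proof', not 'unprovable'."""
    proof = oracle.propose_proof(theorem)
    return trusted_verifier(theorem, proof)
```

Note the one-sidedness the text describes: a `True` answer carries real information because it passed the trusted verifier, while a `False` answer is uninformative, since the oracle may simply have chosen silence.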

But now we run up against the problem that nobody knows how to actually save the world by virtue of sometimes knowing for sure that a theorem is provable in ZF. The scissors has been blunted to where it’s probably completely safe, but can only cut clay; and nobody knows how to do enough good by cutting clay.

Ideal models of “safe but useless” agents

Should you have cause to do a mathematical study of this issue, then an excellent ideal model of a safe but useless agent, embodying maximal safety and minimum usefulness, would be a rock.


  • Advanced safety

    An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.