Querying the AGI user
Suppose an advanced agent checks something Potentially Bad with its user to find out whether the thing is Considered Bad by that user. We then need to worry about the following generic issues:
Can the AI tell which things are Potentially Bad in a way that includes all things that are Actually Bad?
Can the user reliably tell which Potentially Bad things are Actually Bad?
Does the AI, emergently or deliberately, seek out Potentially Bad things that the user will not label as Considered Bad, thereby potentially optimizing for Actually Bad things that the user mislabels as Not Bad? (E.g., if the agent learns to avoid new tries similar to those already labeled bad, we’re excluding the Considered Bad space, but what’s left may still contain Actually Bad things via nearest unblocked strategy or a similar phenomenon.)
Is the criterion for Potentially Bad so broad, and are Actually Bad things so hard to reliably prioritize within that space, that 10% of the time an Actually Bad thing will not be among the top 1,000 Potentially Bad things the user can afford the time to check?
Can the AI successfully communicate to the user the details of what set off the flag for Potential Badness, or even communicate to the user exactly what was flagged as Potentially Bad, if this is an important part of the user making the decision?
Do the AI’s communication goals risk?
Are the details of Potential Badness or even the subject of Potential Badness so inscrutable as to be impenetrable? (E.g., AlphaGo trying to explain to a human why a Go move is potentially bad, or for that matter, a Go professional trying to explain to an amateur why a Go move is potentially bad—we might just be left with blind trust, at which point we might as well just tell the AI not to do Potentially Bad things rather than asking it to pointlessly check with the user.)
Does the AI, emergently or instrumentally, optimize for the user not labeling things as Potentially Bad, thereby potentially leading to user deception?
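The review-budget worry above (an Actually Bad thing falling outside the top items the user has time to check) can be made concrete with a toy simulation. This is only a sketch under invented assumptions: exactly one Actually Bad item per batch, prioritization scores drawn from unit-variance Gaussians, and a hypothetical `signal` parameter standing in for how well the AI's ranking separates the bad item from benign ones. None of these names or modeling choices come from the text.

```python
import random


def miss_rate(n_flagged, budget, signal, trials=300, seed=0):
    """Estimate how often the one Actually Bad item is ranked outside the
    top `budget` Potentially Bad items the user can afford to check.

    Assumptions (illustrative only): benign items score ~ N(0, 1), the
    Actually Bad item scores ~ N(signal, 1), and the user reviews the
    `budget` highest-scoring items.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        bad_score = rng.gauss(signal, 1.0)
        # Count the benign flagged items that outrank the bad one.
        outranked_by = sum(
            1 for _ in range(n_flagged - 1) if rng.gauss(0.0, 1.0) > bad_score
        )
        # The bad item is missed if at least `budget` items sit above it.
        if outranked_by >= budget:
            misses += 1
    return misses / trials


if __name__ == "__main__":
    for signal in (0.0, 1.0, 3.0):
        rate = miss_rate(n_flagged=200, budget=20, signal=signal)
        print(f"signal={signal}: estimated miss rate {rate:.2f}")
```

In this toy model, a broad flag with weak prioritization (low `signal`) leaves the bad item outside the reviewed set most of the time, while a budget that covers every flagged item never misses. The model says nothing about the other issues in the list, such as the user mislabeling an item they did review.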
- Task-directed AGI
An advanced AI that’s meant to pursue a series of limited-scope goals given to it by the user. In Bostrom’s terminology, a Genie.