Ad-hoc hack (alignment theory)

An “ad-hoc hack” is when you modify or patch the algorithm of the AI with regards to something that would ordinarily have simple, principled, or nailed-down structure, or where it seems like that part ought to have some simple answer instead. E.g., instead of defining a von Neumann-Morgenstern coherent utility function, you try to solve some problem by introducing something that’s almost a VNM utility function but has a special case in line 3 which activates only on Tuesday. This seems unusually likely to break other things, e.g. reflective consistency, or anything else that depends on the coherence or simplicity of utility functions. Such hacks should be avoided in advanced-agent designs whenever possible, for analogous reasons to why they would be avoided in cryptography or designing a space probe. It may be interesting and productive anyway to look for a weird hack that seems to produce the desired behavior, because then you understand at least one system that produces the behavior you want—even if it would be unwise to actually build an AGI like that, the weird hack might give us the inspiration to find a simpler or more coherent system later. But then we should also be very suspicious of the hack, and look for ways that it fails or produces weird side effects.

An example of a productive weird hack was Benya Fallenstein’s Parametric Polymorphism proposal for tiling agents. You wouldn’t want to build a real AGI like that, but it was helpful for showing what could be done—which properties could definitely be obtained together within a tiling agent, even if by a weird route. This in turn helped suggest relatively less hacky proposals later.


  • AI safety mindset

    Asking how AI designs could go wrong, instead of imagining them going right.