Ad-hoc hack (alignment theory)

An “ad-hoc hack” is when you modify or patch the algorithm of an AI with respect to something that would ordinarily have simple, principled, or nailed-down structure, or where it seems like that part ought to have some simple answer. E.g., instead of defining a von Neumann–Morgenstern coherent utility function, you try to solve some problem by introducing something that’s almost a VNM utility function but has a special case in line 3 which activates only on Tuesday. This seems unusually likely to break other things, e.g. reflective consistency, or anything else that depends on the coherence or simplicity of utility functions. Such hacks should be avoided in advanced-agent designs whenever possible, for reasons analogous to why they would be avoided in cryptography or in designing a space probe.

It may nonetheless be interesting and productive to look for a weird hack that seems to produce the desired behavior, because then you understand at least one system that produces the behavior you want. Even if it would be unwise to actually build an AGI like that, the weird hack might inspire a simpler or more coherent system later. But then we should also be very suspicious of the hack, and look for ways that it fails or produces weird side effects.
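As a toy illustration (entirely hypothetical code, not any real agent design), a minimal Python sketch of the Tuesday special case above, showing how the patched utility function makes the agent’s preferences flip by day of the week:

```python
from datetime import date

# A "clean" utility function: each outcome has one fixed value, so the
# agent's preference ordering over outcomes is stable.
def clean_utility(outcome: str) -> float:
    values = {"paperclip": 1.0, "staple": 2.0}
    return values[outcome]

# The ad-hoc hack: the same table, plus a special case that activates
# only on Tuesday.  The agent's preference between the two outcomes now
# depends on the date, which is exactly the kind of incoherence that
# breaks properties built on a stable utility function (e.g. a Monday
# version of the agent would want to edit out the Tuesday clause).
def hacked_utility(outcome: str, today: date) -> float:
    if today.weekday() == 1 and outcome == "staple":  # 1 = Tuesday
        return 0.0
    return clean_utility(outcome)

monday = date(2024, 1, 1)   # 2024-01-01 was a Monday
tuesday = date(2024, 1, 2)  # 2024-01-02 was a Tuesday

# On Monday the agent prefers staples; on Tuesday the preference reverses.
print(hacked_utility("staple", monday) > hacked_utility("paperclip", monday))    # True
print(hacked_utility("staple", tuesday) > hacked_utility("paperclip", tuesday))  # False
```

The point of the sketch is only that the hacked function is no longer a single coherent ordering over outcomes, so anything that assumed such an ordering (reflective consistency, tiling arguments) can silently break.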

An example of a productive weird hack was Benya Fallenstein’s Parametric Polymorphism proposal for tiling agents. You wouldn’t want to build a real AGI like that, but it was helpful for showing what could be done: which properties could definitely be obtained together within a tiling agent, even if by a weird route. This in turn helped suggest relatively less hacky proposals later.


  • AI safety mindset

    Asking how AI designs could go wrong, instead of imagining them going right.