Theory of (advanced) agents

Many issues in AI alignment depend on what we think we can factually say about the general design space of cognitively powerful agents, or on which background assumptions yield which implications about advanced agents. E.g., the Orthogonality Thesis is a claim about the general design space of powerful AIs. The design space of advanced agents is very wide, and only very weak statements seem likely to be true about the whole design space; but we can still try to say 'If X then Y' and refute claims of the form 'No need for if-X, Y happens anyway!'


  • Instrumental convergence

    Some strategies help achieve most possible simple goals, e.g., acquiring more computing power or more material resources. By default, unless averted, we can expect advanced AIs to pursue such strategies.

  • Orthogonality Thesis

    Will smart AIs automatically become benevolent, or automatically become hostile? Or do different AI designs imply different goals?

  • Advanced agent properties

    How smart does a machine intelligence need to be for its niceness to become an issue? "Advanced" is a broad term covering the cognitive abilities that would require us to start considering AI alignment.
  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.