Known-algorithm non-self-improving agent
“Known-algorithm non-self-improving” (KANSI) is a strategic scenario and class of possibly-attainable AI designs, in which the first pivotal powerful AI is constructed out of known, human-understood algorithms and does not engage in extensive self-modification. Its advantage might be achieved by, e.g., being run on a very large cluster of computers. Suppose you could build an AI capable of cracking protein folding and building nanotechnology by running correctly structured algorithms akin to deep learning on Google’s or Amazon’s computing cluster, and the builders were sufficiently paranoid/sensible to have people continuously monitoring the AI’s processes and all the problems it was trying to solve, and to prevent the AI from engaging in self-modification or self-improvement. This would fall into the KANSI class of scenarios, and it would imply that huge classes of problems in reflective stability, ontology identification, limiting potentially dangerous capabilities, etcetera, would be much simpler.
Restricting ‘good’ or approved AI development to KANSI designs would mean deliberately foregoing whatever capability gains might be possible through self-improvement. It’s not known whether a KANSI AI could be first to some pivotal level of capability. This would depend on unknown background settings about how much capability can be gained, at what stage, by self-modification. Depending on these background variables, making a KANSI AI be first to a capability threshold might or might not be something that could be accomplished by any reasonable level of effort and coordination. This is one reason among several why MIRI does not, e.g., restrict its attention to KANSI designs.
Just intending to build a non-self-improving AI out of known algorithms is insufficient to ensure KANSI as a property; this might require further solutions along the lines of Corrigibility. E.g., humans can’t modify their own brain functions, but because we’re general consequentialists and we don’t always think the way we want to think, we created quite simple innovations like calculators, out of environmental objects in a world that didn’t have any built-in calculators, so that we could think about arithmetic in a different way than we did by default. A KANSI design with a large divergence between how it thinks and how it wants to think might behave similarly, or require constant supervision to detect most cases of the AI starting to behave similarly—and then some cases might slip through the cracks. Since our present study and understanding of reflective stability is very primitive, reflective stability is plausibly still among the things we should be studying even if we want to build a KANSI agent, just so that the KANSI agent isn’t too wildly divergent between how it thinks about X and how it would prefer to think about X if given the choice.
Parents:
- Strategic AGI typology
What broad types of advanced AIs, corresponding to which strategic scenarios, might it be possible or wise to create?
Eliezer seems to have, and this page seems to reflect, strong intuitions about “self-modification” beyond what you would expect from synonymy with “AI systems doing AI design and implementation.” In my view of the world, there is no meaningful distinction between these things, and this post sounds confused. I think it would be worth pushing more on this divergence.
AI work is already done with the aid of powerful computational tools. It seems clear that these tools will become more powerful over time, and that at some point human involvement won’t be helpful for further AI progress. (It’s not clear how discontinuous progress will be on those tools. I think it will probably be reasonably smooth. I’m open to the possibility of abrupt progress but it’s not clear to me how that really changes the picture.) Improvements in tools could yield either more or less human understanding and effective control of the AI systems they improve, depending on the character of those tools.
If you can solve the control/alignment problem with a “KANSI” agent, then it’s not clear to me how the introduction of “self-modification” changes the character of the problem.
Here is my understanding of Eliezer’s picture (translated into my worldview): we might be able to build AI systems that are extremely good at helping us build capable AI systems, but not nearly as good at helping us solve AI alignment/control or building alignable/controllable AI. In this case, we will either need to have a very generally scalable solution to alignment/control in place (which we can apply to new AI systems as they are developed, without further help from the designers of those new AI systems), or else we may simply be doomed (if no such very scalable solution is possible, e.g. because the only way to solve alignment is to build a certain kind of AI system).
Interestingly, this difficulty is not directly related to the fact that the tools are themselves AI systems which pose an alignment/control problem. Instead, the difficulty comes from the uneven capabilities of these systems (from the human perspective), namely that they are very good at AI design but not very good at helping with AI control.
This is at odds with what is written above, so it seems like I don’t yet see the real picture. But I’ll press on anyway.
One approach to this scenario is to refrain from getting help from our AI-designer AI systems and instead to stick with weak AI systems, proceeding along a slower development trajectory. The world could successfully follow such a trajectory only by coordinating pretty well, which might be achieved either with political progress or with a sudden world takeover.
This overall picture makes sense to me. But, it doesn’t seem meaningfully distinct from the rest of the broad category “maybe we could build highly inefficient AI systems and then coordinate to avoid competitive pressures to use more efficient alternatives.” As usual, this approach seems clearly doomed to me, only accessible or desirable if the world becomes convinced that the AI situation is extraordinarily dire.
The distinction arises because maybe, even once we are coordinating to do AI development slowly, AI systems may design new AI systems of their own accord (and those systems may not be well-controlled). But this seems to be saying: if we mess up the alignment/control problem, then we may find ourselves with a new AI which is not aligned/controlled. But so what? We’ve already lost the game once our AI is doing things we don’t want it to; it’s not like we are losing any more.
To make the distinction really relevant, it seems to me you need an extreme view of takeoff speed. Then maybe the possibility of self-modification can turn a local failure into a catastrophe. Translated into my worldview, the story would be something like: once we are developing AI slowly, our project is vulnerable to more reckless competitors. Even if we successfully coordinate to stop all external competitors, our AI project may itself spawn some competitors internally. Despite our apparent strategic advantage, these internal competitors will rapidly become powerful enough to jeopardize the project (or else conceal themselves while they grow more powerful). And so we want to do additional research to ensure that no such internal competitor will emerge.
I don’t think this really meshes with Eliezer’s view, I’m just laying out my understanding of the view so that it can be corrected.
This indeed is the class of worrisome scenarios, and one should consider that (a) Eliezer thinks that aligning the rocket is harder than fueling it in general, and (b) that this was certainly true of, e.g., Eurisko, which was able to get some amount of self-improvement but with all control issues being kicked squarely back to Douglas Lenat. We can also see natural selection’s creation of humans in the same light, etcetera. On my view it seems extremely probable that, whatever we have in the way of AI algorithms (short of full FAI) creating other AI algorithms, they’ll be helping out not at all with alignment and control and things like reflective stability and so on.
The case where KANSI becomes important is where we get to the level where AGI becomes possible, at a point where there are no huge foregone advantages from the kinds of AI-created AI algorithms to which existing transparency or control work doesn’t generalize. You can define a neural network undergoing gradient descent as “improving itself,” but relative to current systems this doesn’t change the algorithm to the point where we no longer understand what’s going on. KANSI is relevant in the scenario where we first reach possible-advanced-AGI levels at a point where an organization with lots of resources and maybe a realistically-sized algorithmic lead, which foregoes the class of AI-improving-AI benefits that would make important subprocesses very hard to understand, is not at a disadvantage relative to a medium-sized organization with fewer resources. This is the level where we can put a big thing together out of things vaguely analogous to deep belief networks or whatever, and just run our current algorithms or minor variations on them, and have the AI’s representation be reasonably transparent and known so that we can monitor the AI’s thoughts—without some huge amount of work having gone into making transparency reflectively stable and corrigible through self-improvement, or getting the AI to help us out with that, etcetera, because we’re just taking known algorithms and running them on a vast amount of computing power.
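A minimal sketch (my illustration, not from the original discussion) of why gradient descent is “self-modification” only in a weak sense: the parameters change on every step, but the update rule itself is a fixed, human-written algorithm that remains fully inspectable throughout.

```python
# Gradient descent "modifies" the model's parameters, but the update rule --
# the known, human-understood algorithm -- never changes. In KANSI terms,
# the only thing that evolves over time is a vector of numbers.

def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Apply a fixed, known update rule; only `theta` changes."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # same rule on every step
    return theta

# Toy example: minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
result = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```

The point of the sketch is that a monitor who understood the update rule at step 0 still understands it at step 100; nothing about the system’s *procedure* has drifted out of human comprehension.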
You often say this, but I’m obviously not yet convinced.
As I see it the biggest likely gap is that you can empirically validate work in AI, but maybe cannot validate work on alignment/control except by consulting a human. This is problematic if either human feedback ends up being a major cost/obstacle (e.g. because AI systems are extremely cheap/fast, or because they are too far beyond humans for humans to provide meaningful oversight), or if task definitions that involve human feedback end up being harder by virtue of being mushier goals that don’t line up as well with the actual structure of reality.
These objections are more plausible for establishing that control work is a comparative advantage of humans. In that context I would accept them as plausible arguments, though I think there is a pretty good chance of working around them.
But those considerations don’t seem to imply that AI will help out “not at all.” It seems pretty plausible that you are drawing on some other intuitions that I haven’t considered.
Another possible gap is that control may just be harder than capabilities. But in that case the development of AI wouldn’t really change the game, it would just make the game go faster, so this doesn’t seem relevant to the present discussion. (If humans can solve the control problem anyway, humans+AI systems would have a comparable chance.)
Another possible gap is that there are many more iterations of AI design, and a failure at any time cascades into future iterations. I’ve pointed out that there can’t be many big productivity improvements before any earlier thinking about AI is thoroughly obsolete, but I’m certainly willing to grant that forcing control to keep up for a while does make the problem materially harder (more so the more that our solutions to the control problem are closely tied to details of the AI systems we are building). I agree that sticking with the same AI designs for longer can in some respects make the control problem easier. But it seems like you are talking about a difference-in-kind for safety work, rather than another way to slightly improve safety at the expense of efficacy.
Note: I’m saying that if you can solve the AI control/alignment problem for the AI systems in year N, then the involvement of those AI systems in subsequent AI design doesn’t exert a significant additional pressure that makes it harder to solve the control/alignment problem in year N+1. It seems like this is the relevant question in the context of the OP.
One can imagine an agent that is smart about finding and training itself on new features. You seed it with one set of features, but over time it replaces that set with much better features fitting the data. To me it even seems possible that something like that could get to AGI level. This is not “self-modification” in the classic sense, so I’m wondering where that falls in this classification scheme.
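A hypothetical sketch (names and scoring are my illustrative assumptions) of the kind of agent described above: the search procedure is a fixed, known algorithm, but the feature set it operates on is repeatedly replaced by better-scoring candidates, so what the agent “runs on” changes even though no code is rewritten.

```python
# Greedy feature replacement: the loop below is a fixed, human-written
# algorithm, yet over time the agent discards its seed features in favor
# of candidates that fit the data better. This is "self-improvement" at
# the level of learned content, not at the level of the algorithm itself.
import random

def improve_features(features, score, propose, rounds=50):
    """Swap in any proposed feature set that strictly improves the score."""
    best = score(features)
    for _ in range(rounds):
        candidate = propose(features)  # e.g. mutate the current features
        s = score(candidate)
        if s > best:                   # keep only strict improvements
            features, best = candidate, s
    return features

# Toy example: "features" are weights on two signals; the score rewards
# weighting signal 0 (informative) and ignoring signal 1 (noise).
random.seed(0)
score = lambda f: f[0] - abs(f[1])
propose = lambda f: [w + random.uniform(-0.5, 0.5) for w in f]
learned = improve_features([0.0, 1.0], score, propose)
```

By construction the returned feature set never scores worse than the seed, which is the sense in which the agent is “smart about finding and training itself on new features” while still being a known algorithm throughout.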