Don't try to solve the entire alignment problem

On first encountering the alignment problem for advanced agents, aka “robust and beneficial AGI”, aka “Friendly AI”, a very common approach is to try to come up with a single idea that solves all of AI alignment. A simple design concept; a simple utility function; a simple development strategy; one guideline for everyone to adhere to; or a large diagram full of boxes with lines to other boxes; that is allegedly sufficient to realize around as much benefit from beneficial superintelligences as can possibly be realized.

Without knowing the details of your current idea, this article can’t tell you why it’s wrong—though frankly we’ve got a strong prior against it at this point. But some very standard advice would be:

  • Glance over what current discussants think of as standard challenges and difficulties of the overall problem, i.e., why people think alignment might be hard, and what standard questions a new approach would face.

  • Consider focusing your attention down on a single subproblem of alignment, and trying to make progress there—not necessarily solve it completely, but contribute nonobvious knowledge about the problem that wasn’t there before. (If you have a broad new approach that solves all of alignment, maybe you could walk through exactly how it solves one crisply identified subproblem?)

  • Check out the flaws in previous proposals that people currently think won’t work. E.g., various versions of utility indifference.

A good initial goal is not “persuade everyone in the field to agree with a new idea” but rather “come up with a contribution to an open discussion that is sufficiently crisply stated that, if it were in fact wrong, it would be possible for somebody else to shoot it down today.” I.e., an idea such that if you’re wrong, this can be pointed out in the form of a crisply derivable consequence of a crisply specified idea, rather than it taking 20 years to see what happens. For there to be sustained progress, propositions need to be stated modularly enough and crisply enough that there can be a conversation about them that goes beyond “does not / does too”—ideas need to be stated in forms that have sufficiently clear and derivable consequences that if there’s a problem, people can see the problem and agree on it.

Alternatively, poke a clearly demonstrable flaw in some solution currently being critiqued. Since most proposals in alignment theory get shot down, trying to participate in the critiquing process has a great advantage over trying to invent solutions, in that you’ll probably have started with the true premise “proposal X is broken or incomplete” rather than the false premise “proposal X works and solves everything”.

Psychologizing a little about why people might try to solve all of alignment theory in one shot, one might recount Robyn Dawes’s advice that:

  • Research shows that people come up with better solutions when they discuss the problem as thoroughly as possible before discussing any answers.

  • Dawes has observed that people seem more likely to violate this principle as the problem becomes more difficult.

…and finally remark that building a nice machine intelligence correctly on the first try must be pretty darned difficult, since so many people solve it in the first 15 seconds.

It’s possible that everyone working in this field is just missing the obvious and that there is some simple idea which solves all the problems. But realistically, you should be aware that everyone in this field has already heard a dozen terrible Total Solutions, and probably hasn’t had anything fun happen as a result of discussing them, resulting in some amount of attentional fatigue. (Similarly: If not everyone believes you, or even if it’s hard to get people to listen to your solution instead of talking with people they already know, that’s not necessarily because of some deep-seated psychological problem on their part, such as being uninterested in outsiders’ ideas. Even if you’re not an obvious crank, people are still unlikely to take the time out to engage with you unless you signal awareness of what they think are the usual issues and obstacles. It’s not so different here from other fields.)


  • AI safety mindset

    Asking how AI designs could go wrong, instead of imagining them going right.