Solomonoff induction: Intro Dialogue (Math 2)

(A dialogue between Ashley, a computer scientist who’s never heard of Solomonoff’s theory of inductive inference, and Blaine, who thinks it is the best thing since sliced bread.)

Ashley: Good evening, Msr. Blaine.

Blaine: Good evening, Msr. Ashley.

Ashley: I’ve heard there’s this thing called “Solomonoff’s theory of inductive inference”.

Blaine: The rumors have spread, then.

Ashley: Yeah, so, what the heck is that about?

Blaine: Invented in the 1960s by the mathematician Ray Solomonoff, the key idea in Solomonoff induction is to do sequence prediction by using Bayesian updating on a prior composed of a mixture of all computable probability distributions--

Ashley. Wait. Back up a lot. Before you try to explain what Solomonoff induction is, I’d like you to try to tell me what it does, or why people study it in the first place. I find that helps me organize my listening. Right now I don’t even know why I should be interested in this.

Blaine: Um, okay. Let me think for a second…

Ashley: Also, while I can imagine things that “sequence prediction” might mean, I haven’t yet encountered it in a technical context, so you’d better go a bit further back and start more at the beginning. I do know what “computable” means and what a “probability distribution” is, and I remember the formula for Bayes’s Rule although it’s been a while.

Blaine: Okay. So… one way of framing the usual reason why people study this general field in the first place, is that sometimes, by studying certain idealized mathematical questions, we can gain valuable intuitions about epistemology. That’s, uh, the field that studies how to reason about factual questions, how to build a map of reality that reflects the territory--

Ashley: I have some idea what ‘epistemology’ is, yes. But I think you might need to start even further back, maybe with some sort of concrete example or something.

Blaine: Okay. Um. So one anecdote that I sometimes use to frame the value of computer science to the study of epistemology, is Edgar Allen Poe’s argument in 1833 that chess was uncomputable.

Ashley: That doesn’t sound like a thing that actually happened.

Blaine: I know, but it totally did happen and not in a metaphorical sense either! Edgar Allen Poe wrote an essay explaining why no automaton would ever be able to play chess, and he specifically mentioned “Mr. Babbage’s computing engine” as an example. You see, in the nineteenth century, there was for a time this sensation known as the Mechanical Turk—supposedly a machine, an automaton, that could play chess. At the grandmaster level, no less. Now today, when we’re accustomed to the idea that it takes a reasonable powerful computer to do that, we can know immediately that the Mechanical Turk must have been a fraud and that there must have been a concealed operator inside—a midget, as it turned out. Today we know that this sort of thing is hard to build into a machine. But in the 19th century, even that much wasn’t known. So when Edgar Allen Poe, who besides being an author was also an accomplished magician, set out to write an essay about the Mechanical Turk, he spent the second half of the essay dissecting what was known about the Turk’s appearance to (correctly) figure out where the human operator was hiding. But Poe spent the first half of the essay arguing that no automaton—nothing like Mr. Babbage’s computing engine—could possibly play chess, which was how he knew a priori that the Turk had a concealed human operator.

Ashley: And what was Poe’s argument?

Blaine: Poe observed that in an algebraical problem, each step followed from the previous step of necessity, which was why the steps in solving an algebraical problem could be represented by the deterministic motions of gears in something like Mr. Babbage’s computing engine. But in a chess problem, Poe said, there are many possible chess moves, and no move follows with necessity from the position of the board; and even if you did select one move, the opponent’s move would not follow with necessity, so you couldn’t represent it with the determined motion of automatic gears. Therefore, Poe said, whatever was operating the Mechanical Turk must have the nature of Cartesian mind, rather than the nature of deterministic matter, and this was knowable a priori. And then he started figuring out where the required operator was hiding.

Ashley: That’s some amazingly impressive reasoning for being completely wrong.

Blaine: I know! Isn’t it great?

Ashley: I mean, that sounds like Poe correctly identified the hard part of playing computer chess, the branching factor of moves and countermoves, which is the reason why no simple machine could do it. And he just didn’t realize that a deterministic machine could deterministically check many possible moves in order to figure out the game tree. So close, and yet so far.

Blaine: More than a century later, in 1950, Claude Shannon published the first paper ever written on computer chess. And in passing, Shannon gave the formula for playing perfect chess if you had unlimited computing power, the algorithm you’d use to extrapolate the entire game tree. We could say that Shannon gave a short program that would solve chess if you ran it on a hypercomputer, where a hypercomputer is an ideal computer that can run any finite computation immediately. And then Shannon passed on to talking about the problem of locally guessing how good a board position was, so that you could play chess using only a small search. I say all this to make a point about the value of knowing how to solve problems using hypercomputers, even though hypercomputers don’t exist. Yes, there’s often a huge gap between the unbounded solution and the practical solution. It wasn’t until 1997, forty-seven years after Shannon’s paper giving the unbounded solution, that Deep Blue actually won the world chess championship--

Ashley: And that wasn’t just a question of faster computing hardware running Shannon’s ideal search algorithm. There were a lot of new insights along the way, most notably the alpha-beta pruning algorithm and a lot of improvements in positional evaluation.

Blaine: Right! But I think some people overreact to that forty-seven year gap, and act like it’s worthless to have an unbounded understanding of a computer program, just because you might still be forty-seven years away from a practical solution. But if you don’t even have a solution that would run on a hypercomputer, you’re Poe in 1833, not Shannon in 1950. The reason I tell the anecdote about Poe is to illustrate that Poe was confused about computer chess in a way that Shannon was not. When we don’t know how to solve a problem even given infinite computing power, the very work we are trying to do is in some sense murky to us. When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.

Ashley: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we’re trying to do.

Blaine: Yes, but the point is that you can’t even get started on that if you’re arguing about how playing chess has the nature of Cartesian mind rather than matter. At that point you’re not 50 years away from winning the chess championship, you’re 150 years away, because it took an extra 100 years to move humanity’s understanding to the point where Claude Shannon could trivially see how to play perfect chess using a large-enough computer. I’m not trying to exalt the unbounded solution by denigrating the work required to get a bounded solution. I’m not saying that when we have an unbounded solution we’re practically there and the rest is a matter of mere lowly efficiency. I’m trying to compare having the unbounded solution to the horrific confusion of not understanding what we’re trying to do.

Ashley: Okay. I think I understand why, on your view, it’s important to know how to solve problems using infinitely fast computers, or hypercomputers as you call them. When we can say how to answer a question using infinite computing power, that means we crisply understand the question itself, in some sense; while if we can’t figure out how to solve a problem using unbounded computing power, that means we’re confused about the problem, in some sense. I mean, anyone who’s ever tried to teach the more doomed sort of undergraduate to write code knows what it means to be confused about what it takes to compute something.

Blaine: Right.

Ashley: So what does this have to do with “Solomonoff induction”?

Blaine: Ah! Well, suppose I asked you how to do epistemology using infinite computing power?

Ashley: My good fellow, I would at once reply, “Beep. Whirr. Problem ‘do epistemology’ not crisply specified.” At this stage of affairs, I do not think this reply indicates any fundamental confusion on my part; rather I think it is you who must be clearer.

Blaine: Given unbounded computing power, how would you reason in order to construct an accurate map of reality?

Ashley: That still strikes me as rather underspecified.

Blaine: Perhaps. But even there I would suggest that it’s a mark of intellectual progress to be able to take vague and underspecified ideas like ‘do good epistemology’ and turn them into crisply specified problems. Imagine that I went up to my friend Cecil, and said, “How would you do good epistemology given unlimited computing power and a short Python program?” and Cecil at once came back with an answer—a good and reasonable answer, once it was explained. Cecil would probably know something quite interesting that you do not presently know.

Ashley: I confess to being rather skeptical of this hypothetical. But if that actually happened—if I agreed, to my own satisfaction, that someone had stated a short Python program that would ‘do good epistemology’ if run on an unboundedly fast computer—then I agree that I’d probably have learned something quite interesting about epistemology.

Blaine: What Cecil knows about, in this hypothetical, is Solomonoff induction. In the same way that Claude Shannon answered “Given infinite computing power, how would you play perfect chess?”, Ray Solomonoff answered “Given infinite computing power, how would you perfectly find the best hypothesis that fits the facts?”

Ashley: Suddenly, I find myself strongly suspicious of whatever you are about to say to me.

Blaine: That’s understandable.

Ashley: In particular, I’ll ask at once whether “Solomonoff induction” assumes that our hypotheses are being given to us on a silver platter along with the exact data we’re supposed to explain, or whether the algorithm is organizing its own data from a big messy situation and inventing good hypotheses from scratch.

Blaine: Great question! It’s the second one.

Ashley: Really? Okay, now I have to ask whether Solomonoff induction is a recognized concept in good standing in the field of academic computer science, because that does not sound like something modern-day computer science knows how to do.

Blaine: I wouldn’t say it’s a widely known concept, but it’s one that’s in good academic standing. The method isn’t used in modern machine learning because it requires an infinitely fast computer and isn’t easily approximated the way that chess is.

Ashley: This really sounds very suspicious. Last time I checked, we hadn’t begun to formalize the creation of good new hypotheses from scratch. I’ve heard about claims to have ‘automated’ the work that, say, Newton did in inventing classical mechanics, and I’ve found them all to be incredibly dubious. Which is to say, they were rigged demos and lies.

Blaine: I know, but--

Ashley: And then I’m even more suspicious of a claim that someone’s algorithm would solve this problem if only they had infinite computing power. Having some researcher claim that their Good-Old-Fashioned-AI semantic network would be intelligent if run on a computer so large that, conveniently, nobody can ever test their theory, is not going to persuade me.

Blaine: Do I really strike you as that much of a charlatan? What have I ever done to you, that you would expect me to try pulling a scam like that?

Ashley: That’s fair. I shouldn’t accuse you of planning that scam when I haven’t seen you say it. But I’m pretty sure the problem of “coming up with good new hypotheses in a world full of messy data” is AI-complete. And even Mentif-

Blaine: Do not say the name, or he will appear!

Ashley: Sorry. Even the legendary first and greatest of all AI crackpots, He-Who-Googles-His-Name, could assert that his algorithms would be all-powerful on a computer large enough to make his claim unfalsifiable. So what?

Blaine: That’s a very sensible reply and this, again, is exactly the kind of mental state that reflects a problem that is confusing rather than just hard to implement. It’s the sort of confusion Poe might feel in 1833, or close to it. In other words, it’s just the sort of conceptual issue we would have solved at the point where we could state a short program that could run on a hypercomputer. Which Ray Solomonoff did in 1964.

Ashley: Okay, let’s hear about this supposed general solution to epistemology.

Blaine: First, try to solve the following puzzle. 1, 3, 4, 7, 11, 18, 29…?

Ashley: Let me look at those for a moment… 47.

Blaine: Congratulations on engaging in, as we snooty types would call it, ‘sequence prediction’.

Ashley: I’m following you so far.

Blaine: The smarter you are, the more easily you can find the hidden patterns in sequences and predict them successfully. You had to notice the resemblance to the Fibonacci rule to guess the next number. Someone who didn’t already know about Fibonacci, or who was worse at mathematical thinking, would have taken longer to understand the sequence or maybe never learned to predict it at all.

Ashley: Still with you.

Blaine: It’s not a sequence of numbers per se… but can you see how the question, “The sun has risen on the last million days. What is the probability that it rises tomorrow?” could be viewed as a kind of sequence prediction problem?

Ashley: Only if some programmer neatly parses up the world into a series of “Did the Sun rise on day X starting in 4.5 billion BCE, 0 means no and 1 means yes? 1, 1, 1, 1, 1…” and so on. Which is exactly the sort of shenanigan that I see as cheating. In the real world, you go outside and see a brilliant ball of gold touching the horizon, not a giant “1″.

Blaine: Suppose I have a robot running around with a webcam showing it a 1920x1080 pixel field that refreshes 60 times a second with 32-bit colors. I could view that as a giant sequence and ask the robot to predict what it will see happen when it rolls out to watch a sunrise the next day.

Ashley: I can’t help but notice that the ‘sequence’ of webcam frames is absolutely enormous, like, the sequence is made up of 66-megabit ‘numbers’ appearing 3600 times per minute… oh, right, computers much bigger than the universe. And now you’re smiling evilly, so I guess that’s the point. I also notice that the sequence is no longer deterministically predictable, that it is no longer a purely mathematical object, and that the sequence of webcam frames observed will depend on the robot’s choices. This makes me feel a bit shaky about the analogy to predicting the mathematical sequence 1, 1, 2, 3, 5.

Blaine: I’ll try to address those points in order. First, Solomonoff induction is about assigning probabilities to the next item in the sequence. I mean, if I showed you a box that said 1, 1, 2, 3, 5, 8 you would not be absolutely certain that the next item would be 13. There could be some more complicated rule that just looked Fibonacci-ish but then diverged. You might guess with 90% probability but not 100% probability, or something like that.

Ashley: This has stopped feeling to me like math.

Blaine: There is a large branch of math, to say nothing of computer science, that deals in probabilities and statistical prediction. We are going to be describing absolutely lawful and deterministic ways of assigning probabilities after seeing 1, 3, 4, 7, 11, 18.

Ashley: Okay, but if you’re later going to tell me that this lawful probabilistic prediction rule underlies a generally intelligent reasoner, I’m already skeptical. No matter how large a computer it’s run on, I find it hard to imagine that some simple set of rules for assigning probabilities is going to encompass truly and generally intelligent answers about sequence prediction, like Terence Tao would give after looking at the sequence for a while. We just have no idea how Terence Tao works, so we can’t duplicate his abilities in a formal rule, no matter how much computing power that rule gets… you’re smiling evilly again. I’ll be quite interested if that evil smile turns out to be justified.

Blaine: Indeed.

Ashley: I also find it hard to imagine that this deterministic mathematical rule for assigning probabilities would notice if a box was outputting an encoded version of “To be or not to be” from Shakespeare by mapping A to Z onto 1 to 26, which I would notice eventually though not immediately upon seeing 20, 15, 2, 5, 15, 18… And you’re still smiling evilly.

Blaine: Indeed. That is exactly what Solomonoff induction does. Furthermore, we have theorems establishing that Solomonoff induction can do it way better than you or Terence Tao.

Ashley: A theorem proves this. As in a necessary mathematical truth. Even though we have no idea how Terence Tao works empirically… and there’s evil smile number four. Okay. I am very skeptical, but willing to be convinced.

Blaine: So if you actually did have a hypercomputer, you could cheat, right? And Solomonoff induction is the most ridiculously cheating cheat in the history of cheating.

Ashley: Go on.

Blaine: We just run all possible computer programs to see which are the simplest computer programs that best predict the data seen so far, and use those programs to predict what comes next. This mixture contains, among other things, an exact copy of Terence Tao, thereby allowing us to prove theorems about their relative performance.

Ashley: Is this an actual reputable math thing? I mean really?

Blaine: I’ll deliver the formalization later, but you did ask me to first state the point of it all. The point of Solomonoff induction is that it gives us a gold-standard ideal for sequence prediction, and this gold-standard prediction only errs by a bounded amount, over infinite time, relative to the best computable sequence predictor. We can also see it as formalizing the intuitive idea that was expressed by William Ockham a few centuries earlier that simpler theories are more likely to be correct, and as telling us that ‘simplicity’ should be measured in algorithmic complexity, which is the size of a computer program required to output a hypothesis’s predictions.

Ashley: I think I would have to read more on this subject to actually follow that. What I’m hearing is that Solomonoff induction is a reputable idea that is important because it gives us a kind of ideal for sequence prediction. This ideal also has something to do with Occam’s Razor, and stakes a claim that the simplest theory is the one that can be represented by the shortest computer program. You identify this with “doing good epistemology”.

Blaine: Yes, those are legitimate takeaways. Another way of looking at it is that Solomonoff induction is an ideal but uncomputable answer to the question “What should our priors be?”, which is left open by understanding Bayesian updating.

Ashley: Can you say how Solomonoff induction answers the question of, say, the prior probability that Canada is planning to invade the United States? I once saw a crackpot website that tried to invoke Bayesian probability about it, but only after setting the prior at 10% or something like that, I don’t recall exactly. Does Solomonoff induction let me tell him that he’s making a math error, instead of just calling him silly in an informal fashion?

Blaine: If you’re expecting to sit down with Leibniz and say, “Gentlemen, let us calculate” then you’re setting your expectations too high. Solomonoff gives us an idea of how we should compute that quantity given unlimited computing power. It doesn’t give us a firm recipe for how we can best approximate that ideal in real life using bounded computing power, or human brains. That’s like expecting to play perfect chess after you read Shannon’s 1950 paper. But knowing the ideal, we can extract some intuitive advice that might help our online crackpot if only he’d listen.

Ashley: But according to you, Solomonoff induction does say in principle what is the prior probability that Canada will invade the United States.

Blaine: Yes, up to a choice of universal Turing machine.

Ashley (looking highly skeptical): So I plug a universal Turing machine into the formalism, and in principle, I get out a uniquely determined probability that Canada invades the USA.

Blaine: Exactly!

Ashley: Uh huh. Well, go on.

Blaine: So, first, we have to transform this into a sequence prediction problem.

Ashley: Like a sequence of years in which Canada has and hasn’t invaded the US, mostly zero except around 1812--

Blaine: No! To get a good prediction about Canada we need much more data than that, and I don’t mean a graph of Canadian GDP either. Imagine a sequence that contains all the sensory data you have ever received over your lifetime. Not just the hospital room that you saw when you opened your eyes right after your birth, but the darkness your brain received as input while you were still in your mother’s womb. Every word you’ve ever heard. Every letter you’ve ever seen on a computer screen, not as ASCII letters but as the raw pattern of neural impulses that gets sent down from your retina.

Ashley: That seems like a lot of data and some of it is redundant, like there’ll be lots of similar pixels for blue sky--

Blaine: That data is what you got as an agent. If we want to translate the question of the prediction problem Ashley faces into theoretical terms, we should give the sequence predictor all the data that you had available, including all those repeating blue pixels of the sky. Who knows? Maybe there was a Canadian warplane somewhere in there, and you didn’t notice.

Ashley: But it’s impossible for my brain to remember all that data. If we neglect for the moment how the retina actually works and suppose that I’m seeing the same 1920x1080 @60Hz feed the robot would, that’s far more data than my brain can realistically learn per second.

Blaine: So then Solomonoff induction can do better than you can, using its unlimited computing power and memory. That’s fine.

Ashley: But what if you can do better by forgetting more?

Blaine: If you have limited computing power, that makes sense. With unlimited computing power, that really shouldn’t happen and that indeed is one of the lessons of Solomonoff induction. An unbounded Bayesian never expects to do worse by updating on another item of evidence—for one thing, you can always just do the same policy you would have used if you hadn’t seen that evidence. That kind of lesson is one of the lessons that might not be intuitively obvious, but which you can feel more deeply by walking through the math of probability theory. With unlimited computing power, nothing goes wrong as a result of trying to process 4 gigabits per second; every extra bit just produces a better expected future prediction.

Ashley: Okay, so we start with literally all the data I have available. That’s 4 gigabits per second if we imagine 1920 by 1080 frames of 32-bit pixels repeating 60 times per second. Though I remember hearing 100 megabits per second would be a better estimate of what the retina sends out, and that it’s pared down to 1 megabit per second very quickly by further processing.

Blaine: Right. We start with all of that data, going back to when you were born. Or maybe when your brain formed in the womb, though it shouldn’t make much difference.

Ashley: I note that there are some things I know that don’t come from my sensory inputs at all. Chimpanzees learn to be afraid of skulls and snakes much faster than they learn to be afraid of other arbitrary shapes. I was probably better at learning to walk in Earth gravity than I would have been at navigating in zero G. Those are heuristics I’m born with, based on how my brain was wired, which ultimately stems from my DNA specifying the way that proteins should fold to form neurons—not from any photons that entered my eyes later.

Blaine: So, for purposes of following along with the argument, let’s say that your DNA is analogous to the code of a computer program that makes predictions. What you’re observing here is that humans have 750 megabytes of DNA, and even if most of that is junk and not all of what’s left is specifying brain behavior, it still leaves a pretty large computer program that could have a lot of prior information programmed into it. Let’s say that your brain, or rather, your infant pre-brain wiring algorithm, was effectively a 7.5 megabyte program—if it’s actually 75 megabytes, that makes little difference to the argument. By exposing that 7.5 megabyte program to all the information coming in from your eyes, ears, nose, proprioceptive sensors telling you where your limbs were, and so on, your brain updated itself into forming the modern Ashley, whose hundred trillion synapses might be encoded by, say, one petabyte of information.

Ashley: The thought does occur to me that some environmental phenomena have effects on me that can’t be interpreted as “sensory information” in any simple way, like the direct effect that alcohol has on my neurons, and how that feels to me from the inside. But it would be perverse to claim that this prevents you from trying to summarize all the information that the Ashley-agent receives into a single sequence, so I won’t press the point.

(Eliezer, whispering: More on this topic later.)

Ashley: Oh, and for completeness’s sake, wouldn’t there also be further information embedded in the laws of physics themselves? Like, the way my brain executes implicitly says something about the laws of physics in the universe I’m in.

Blaine: Metaphorically speaking, our laws of physics would play the role of a particular choice of Universal Turing Machine, which has some effect on which computations count as “simple” inside the Solomonoff formula. But normally, the UTM should be very simple compared to the amount of data in the sequence we’re trying to predict, just like the laws of physics are very simple compared to a human brain. In terms of algorithmic complexity, the laws of physics are very simple compared to watching a 1920x1080 @60Hz visual field for a day.

Ashley: Part of my mind feels like the laws of physics are quite complicated compared to going outside and watching a sunset. Like, I realize that’s false, but I’m not sure how to say out loud exactly why it’s false…

Blaine: Because the algorithmic complexity of a system isn’t measured by how long a human has to go to college to understand it, it’s measured by the size of the computer program required to generate it. The language of physics is differential equations, and it turns out that this is something difficult to beat into some human brains, but differential equations are simple to program into a simple Turing Machine.

Ashley: Right, like, the laws of physics actually have much fewer details to them than, say, human nature. At least on the Standard Model of Physics. I mean, in principle there could be another decillion undiscovered particle families out there.

Blaine: The concept of “algorithmic complexity” isn’t about seeing something with lots of gears and details, it’s about the size of computer program required to compress all those details. The Mandelbrot set looks very complicated visually, you can keep zooming in using more and more detail, but there’s a very simple rule that generates it, so we say the algorithmic complexity is very low.

Ashley: All the visual information I’ve seen is something that happens within the physical universe, so how can it be more complicated than the universe? I mean, I have a sense on some level that this shouldn’t be a problem, but I don’t know why it’s not a problem.

Blaine: That’s because particular parts of the universe can have much higher algorithmic complexity than the entire universe! Consider a library that contains all possible books. It’s very easy to write a computer program that generates all possible books. So any particular book in the library contains much more algorithmic information than the entire library; it contains the information required to say ‘look at this particular book here’. If pi is normal, then somewhere in its digits is a copy of Shakespeare’s Hamlet—but the number saying which particular digit of pi to start looking at, will be just about exactly as large as Hamlet itself. The copy of Shakespeare’s Hamlet that exists in the decimal expansion of pi, is more complex than pi itself. If you zoomed way in and restricted your vision to a particular part of the Mandelbrot set, what you saw might be much more algorithmically complex than the entire Mandelbrot set, because the specification has to say where in the Mandelbrot set you are. Similarly, the world Earth is much more algorithmically complex than the laws of physics. Likewise, the visual field you see over the course of a second can easily be far more algorithmically complex than the laws of physics.

Ashley: Okay, I think I get that. And similarly, even though the ways that proteins fold up are very complicated, in principle we could get all that info using just the simple fundamental laws of physics plus the relatively simple DNA code for the protein. There’s all sorts of obvious caveats about epigenetics and so on, but those caveats aren’t likely to change the numbers by a whole order of magnitude.

Blaine: Right!

Ashley: So the laws of physics are, like, a few kilobytes, and my brain has say 75 megabytes of innate wiring instructions. And then I get to see a lot more information than that over my lifetime, like a megabit per second after my initial visual system finishes preprocessing it, and then most of that is forgotten. Uh… what does that have to do with Solomonoff induction again?

Blaine: Solomonoff induction quickly catches up to any single computer program at sequence prediction, even if the original program is very large and contains a lot of prior information about the environment. If a program is 75 megabytes long, it can only predict 75 megabytes worth of data better than the Solomonoff inductor before the Solomonoff inductor catches up to it. That doesn’t mean that a Solomonoff inductor knows everything a baby does after the first second of exposure to a webcam feed, but it does mean that after the first second the Solomonoff inductor is already no more surprised than a baby by the vast majority of pixels in the next frame. Every time the Solomonoff inductor assigns half as much probability as the baby to the next pixel it sees, that’s one bit spent permanently out of the 75 megabytes of error that can happen before the Solomonoff inductor catches up to the baby. That your brain is written in the laws of physics also has some implicit correlation with the environment, but that’s like saying that a program is written in the same programming language as the environment. The language can contribute something to the power of the program, and the environment being written in the same programming language can be a kind of prior knowledge. But if Solomonoff induction starts from a standard Universal Turing Machine as its language, that doesn’t contribute any more bits of lifetime error than the complexity of that programming language in the UTM.

Ashley: Let me jump back a couple of steps and return to the notion of my brain wiring itself up in response to environmental information. I’d expect an important part of that process was my brain learning to control the environment, not just passively observing it. Like, it mattered to my brain’s wiring algorithm that my brain saw the room shift in a certain way when it sent out signals telling my eyes to move.

Blaine: Indeed. But talking about the sequential control problem is more complicated math. AIXI is the ideal agent that uses Solomonoff induction as its epistemology and expected reward as its decision theory. That introduces extra complexity, so it makes sense to talk about just Solomonoff induction first. We can talk about AIXI later. So imagine for the moment that we were just looking at your sensory data, and trying to predict what would come next in that.

Ashley: Wouldn’t it make more sense to look at the brain’s inputs and outputs, if we wanted to predict the next input? Not just look at the series of previous inputs?

Blaine: It’d make the problem easier for a Solomonoff inductor to solve, sure; but it also makes the problem more complicated. Let’s talk instead about what would happen if you took the complete sensory record of your life, gave it to an ideally smart agent, and asked the agent to predict what you would see next. Maybe the agent could do an even better job of prediction if we also told it about your brain’s outputs, but I don’t think that subtracting the outputs would leave it helpless to see patterns in the inputs.

Ashley: It sounds like a pretty hard problem to me, maybe even an unsolvable one. I’m thinking of the distinction in computer science between needing to learn from non-chosen data, versus learning when you can choose particular queries. Learning can be much faster in the second case.

Blaine: In terms of what can be predicted in principle given the data, what facts are actually reflected in it that Solomonoff induction might uncover, we shouldn’t imagine a human trying to analyze the data, we should imagine an entire advanced civilization pondering it for years. If you look at it from that angle, then the alien civilization isn’t going to be balked by the fact that it’s looking at the answers to the queries that Ashley’s brain chose, instead of the answers to the queries it chose itself. Like, if the Ashley had already read Shakespeare’s Hamlet—if the image of those pages had already crossed the sensory stream—and then the Ashley saw a mysterious box outputting 20, 15, 2, 5, 15, 18, I think somebody eavesdropping on that sensory data would be equally able to guess that this was encoding ‘tobeor’ and guess that the next thing the Ashley saw might be the box outputting 14. You wouldn’t even need an entire alien civilization of superintelligent cryptographers to guess that. And it definitely wouldn’t be a killer problem that Ashley was controlling the eyeball’s saccades, even if you could learn even faster by controlling the eyeball yourself. So far as the computer-science distinction goes, Ashley’s eyeball is being controlled to make intelligent queries and seek out useful information, it’s just Ashley controlling the eyeball instead of you—that eyeball is not a query-oracle answering random questions.

Ashley: Okay, I think this example here is helping my understanding of what we’re doing here. In the case above, the next item in the Ashley-sequence wouldn’t actually be 14. It would be this huge 1920 x 1080 visual field that showed the box flashing a little picture of ’14′.

Blaine: Sure. Otherwise it would be a rigged demo, as you say.

Ashley: I think I’m confused about the idea of predicting the visual field. It seems to me that what with all the dust specks in my visual field, and maybe my deciding to tilt my head using motor instructions that won’t appear in the sequence, there’s no way to exactly predict the 66-megabit integer representing the next visual frame. So it must be doing something other than the equivalent of guessing “14” in a simpler sequence, but I’m not sure what.

Blaine: Indeed, there’d be some element of thermodynamic and quantum randomness preventing that exact prediction even in principle. So instead of predicting one particular next frame, we put a probability distribution on it.

Ashley: A probability distribution over possible 66-megabit frames? Like, a table with 2 to the 66 million entries, summing to 1?

Blaine: Sure. 2^(32 x 1920 x 1080) isn’t a large number when you have unlimited computing power. As Martin Gardner once observed, “Most finite numbers are very much larger.” Like I said, Solomonoff induction is an epistemic ideal that requires an unreasonably large amount of computing power.

Ashley: I don’t deny that big computations can sometimes help us understand little ones. But at the point when we’re talking about probability distributions that large, I have some trouble holding onto what the probability distribution is supposed to mean.

Blaine: Really? Just imagine a probability distribution over N possibilities, then let N go to 2 to the 66 million. If we were talking about a letter ranging from A to Z, then putting 100 times as much probability mass on (X, Y, Z) as on the rest of the alphabet, would say that, although you didn’t know exactly what letter would happen, you expected it would be toward the end of the alphabet. You would have used 26 probabilities, summing to 1, to precisely state that prediction. In Solomonoff induction, since we have unlimited computing power, we express our uncertainty about a 1920 x 1080 video frame the same way. All the various pixel fields you could see if your eye jumped to a plausible place, saw a plausible number of dust specks, and saw the box flash something that visually encoded ’14′, would have high probability. Pixel fields where the box vanished and was replaced with a glow-in-the-dark unicorn would have very low, though not zero, probability.

Ashley: Can we really get away with viewing things that way?

Blaine: If we could not make identifications like these in principle, there would be no principled way in which we could say that you had ever expected to see something happen—no way to say that one visual field your eyes saw, had higher probability than any other sensory experience. We couldn’t justify science; we couldn’t say that, having performed Galileo’s experiment by rolling an inclined cylinder down a plane, Galileo’s theory was thereby to some degree supported by having assigned a high relative probability to the only actual observations our eyes ever report.

Ashley: I feel a little unsure of that jump, but I suppose I can go along with that for now. Then the question of “What probability does Solomonoff induction assign to Canada invading?” is to be identified, in principle, with the question “Given my past life experiences and all the visual information that’s entered my eyes, what is the relative probability of seeing visual information that encodes Google News with the headline ‘CANADA INVADES USA’ at some point during the next 300 million seconds?”

Blaine: Right!

Ashley: And Solomonoff induction has an in-principle way of assigning this a relatively low probability, which that online crackpot could do well to learn from as a matter of principle, even if he couldn’t begin to carry out the exact calculations that involve assigning probabilities to exponentially vast tables.

Blaine: Precisely!

Ashley: Fairness requires that I congratulate you on having come further in formalizing ‘do good epistemology’ as a sequence prediction problem than I previously thought you might. I mean, you haven’t satisfied me yet, but I wasn’t expecting you to get even this far.

Blaine: Next, we consider how to represent a hypothesis inside this formalism.

Ashley: Hmm. You said something earlier about updating on a probabilistic mixture of computer programs, which leads me to suspect that in this formalism, a hypothesis or way the world can be is a computer program that outputs a sequence of integers.

Blaine: There’s indeed a version of Solomonoff induction that works like that. But I prefer the version where a hypothesis assigns probabilities to sequences. Like, if the hypothesis is that the world is a fair coin, then we shouldn’t try to make that hypothesis predict “heads—tails—tails—tails—heads” but should let it just assign a ¹⁄₃₂ prior probability to the sequence HTTTH.

Ashley: I can see that for coins, but I feel a bit iffier on what this means as a statement about the real world.

Blaine: A single hypothesis inside the Solomonoff mixture would be a computer program that took in a series of video frames, and assigned a probability to each possible next video frame. Or for greater simplicity and elegance, imagine a program that took in a sequence of bits, ones and zeroes, and output a rational number for the probability of the next bit being ‘1’. We can readily go back and forth between a program like that, and a probability distribution over sequences. Like, if you can answer all of the questions, “What’s the probability that the coin comes up heads on the first flip?”, “What’s the probability of the coin coming up heads on the second flip, if it came up heads on the first flip?”, and “What’s the probability that the coin comes up heads on the second flip, if it came up tails on the first flip?” then we can turn that into a probability distribution over sequences of two coinflips. Analogously, if we have a program that outputs the probability of the next bit, conditioned on a finite number of previous bits taken as input, that program corresponds to a probability distribution over infinite sequences of bits.

$$\displaystyle \mathbb{P}_{prog}(bits_{1 \dots N}) = \prod_{i=1}^{N} InterpretProb(prog(bits_{1 \dots i-1}), bits_i)$$

$$\displaystyle InterpretProb(prog(x), y) = \left\{ \begin{array}{ll} InterpretFrac(prog(x)) & \text{if } y = 1 \\ 1-InterpretFrac(prog(x)) & \text{if } y = 0 \\ 0 & \text{if $prog(x)$ does not halt} \end{array} \right\} $$

Ashley: I think I followed along with that in theory, though it’s not a type of math I’m used to (yet). So then in what sense is a program that assigns probabilities to sequences, a way the world could be—a hypothesis about the world?

Blaine: Well, I mean, for one thing, we can see the infant Ashley as a program with 75 megabytes of information about how to wire up its brain in response to sense data, that sees a bunch of sense data, and then experiences some degree of relative surprise. Like in the baby-looking-paradigm experiments where you show a baby an object disappearing behind a screen, and the baby looks longer at those cases, and so we suspect that babies have a concept of object permanence.

Ashley: That sounds like a program that’s a way Ashley could be, not a program that’s a way the world could be.

Blaine: Those indeed are dual perspectives on the meaning of Solmonoff induction. Maybe we can shed some light on this by considering a simpler induction rule, Laplace’s Rule of Succession, invented by the Reverend Thomas Bayes in the 1750s, and named after Pierre-Simon Laplace, the inventor of Bayesian reasoning.

Ashley: Pardon me?

Blaine: Suppose you have a biased coin with an unknown bias, and every possible bias between 0 and 1 is equally probable.

Ashley: Okay. Though in the real world, it’s quite likely that an unknown frequency is exactly 0, 1, or ¹⁄₂. If you assign equal probability density to every part of the real number field between 0 and 1, the probability of 1 is 0. Indeed, the probability of all rational numbers put together is zero.

Blaine: The original problem considered by Thomas Bayes was about an ideal billiard ball bouncing back and forth on an ideal billiard table many times and eventually slowing to a halt; and then bouncing other billiards to see if they halted to the left or the right of the first billiard. You can see why, in first considering the simplest form of this problem without any complications, we might consider every position of the first billiard to be equally probable.

Ashley: Sure. Though I note with pointless pedantry that if the billiard was really an ideal rolling sphere and the walls were perfectly reflective, it’d never halt in the first place.

Blaine: Suppose we’re told that, after rolling the original billiard ball and then 5 more billiard balls, one billiard ball was to the right of the original, an R. The other four were to the left of the original, or Ls. Again, that’s 1 R and 4 Ls. Given only this data, what is the probability that the next billiard ball rolled will be on the left of the original, another L?

Ashley: Five sevenths.

Blaine: Ah, you’ve heard this problem before?

Ashley: No, but it’s obvious.

Blaine: Uh… really?

Ashley: Combinatorics. Consider just the orderings of the balls, instead of their exact positions. Designate the original ball with the symbol |, the next five balls as LLLLR, and the next ball to be rolled as +. Given that the current ordering of these six balls is LLLL|R and that all positions and spacings of the underlying balls are equally likely, after rolling the +, there will be seven equally likely orderings +LLLL|R, L+LLL|R, LL+LL|R, and so on up to LLLL|L+R and LLLL|R+. In five of those seven orderings, the + is on the left of the |. In general, if we see M of L and N of R, the probability of the next item being an L is (M + 1) / (M + N + 2).

Blaise: Gosh… Well, the much more complicated proof originally devised by Thomas Bayes starts by considering every position of the original ball to be equally likely a priori, the additional balls as providing evidence about that position, and then integrating over the posterior probabilities of the original ball’s possible positions to arrive at the probability that the next ball lands on the left or right.

Ashley: Heh. And is all that extra work useful if you also happen to know a little combinatorics?

Blaise: Well, it tells me exactly how my beliefs about the original ball change with each new piece of evidence—the new posterior probability function on the ball’s position. Suppose I instead asked you something along the lines of, “Given 4 L and 1 R, where do you think the original ball + is most likely to be on the number line? How likely is it to be within 0.1 distance of there?”

Ashley: That’s fair; I don’t see a combinatoric answer for the later part. You’d have to actually integrate over the density function $f^M(1-f)^N \ \mathrm{d}f$.

Blaise: Anyway, let’s just take at face value that Laplace’s Rule of Succession says that, after observing M 1s and N 0s, the probability of getting a 1 next is (M + 1) / (M + N + 2).

Ashley: But of course.

Blaise: We can consider Laplace’s Rule as a short Python program that takes in a sequence of 1s and 0s, and spits out the probability that the next bit in the sequence will be 1. We can also consider it as a probability distribution over infinite sequences, like this:

0 : ¹⁄₂
1 : ¹⁄₂
00 : ¹⁄₂ * ²⁄₃ = ¹⁄₃
01 : ¹⁄₂ * ¹⁄₃ = ¹⁄₆
000 : ¹⁄₂ * ²⁄₃ * ³⁄₄ = ¹⁄₄
001 : ¹⁄₂ * ²⁄₃ * ¹⁄₄ = ¹⁄₁₂
010 : ¹⁄₂ * ¹⁄₃ * ¹⁄₂ = ¹⁄₁₂

Blaise: …and so on. Now, we can view this as a rule someone might espouse for predicting coinflips, but also view it as corresponding to a particular class of possible worlds containing randomness. I mean, Laplace’s Rule isn’t the only rule you could use. Suppose I had a barrel containing ten white balls and ten green balls. If you already knew this about the barrel, then after seeing M white balls and N green balls, you’d predict the next ball being white with probability (10 - M) / (20 - M—N). If you use Laplace’s Rule, that’s like believing the world was like a billiards table with an original ball rolling to a stop at a random point and new balls ending up on the left or right. If you use (10 - M) / (20 - M—N), that’s like the hypothesis that there’s ten green balls and ten white balls in a barrel. There isn’t really a sharp border between rules we can use to predict the world, and rules for how the world behaves -

Ashley: Well, that sounds just plain wrong. The map is not the territory, don’cha know? If Solomonoff induction can’t tell the difference between maps and territories, maybe it doesn’t contain all epistemological goodness after all.

Blaise: Maybe it’d be better to say that there’s a dualism between good ways of computing predictions and being in actual worlds where that kind of predicting works well? Like, you could also see Laplace’s Rule as implementing the rules for a world with randomness where the original billiard ball ends up in a random place, so that the first thing you see is equally likely to be 1 or 0. Then to ask what probably happens on round 2, we tell the world what happened on round 1 so that it can update what the background random events were.

Ashley: Mmmaybe.

Blaise: If you go with the version where Solomonoff induction is over programs that just spit out a determined string of ones and zeroes, we could see those programs as corresponding to particular environments—ways the world could be that would produce our sensory input, the sequence. We could jump ahead and consider the more sophisticated decision-problem that appears in AIXI: an environment is a program that takes your motor outputs as its input, and then returns your sensory inputs as its output. Then we can see a program that produces Bayesian-updated predictions as corresponding to a hypothetical probabilistic environment that implies those updates, although they’ll be conjugate systems rather than mirror images.

Ashley: Did you say something earlier about the deterministic and probabilistic versions of Solomonoff induction giving the same answers? Like, is it a distinction without a difference whether we ask about simple programs that reproduce the observed data versus simple programs that assign high probability to the data? I can’t see why that should be true, especially since Turing machines don’t include a randomness source.

Blaise: I’m told the answers are the same but I confess I can’t quite see why, unless there’s some added assumption I’m missing. So let’s talk about programs that assign probabilities for now, because I think that case is clearer. The next key idea is to prefer simple programs that assign high probability to our observations so far.

Ashley: It seems like an obvious step, especially considering that you were already talking about “simple programs” and Occam’s Razor a while back. Solomonoff induction is part of the Bayesian program of inference, right?

Blaise: Indeed. Very much so.

Ashley: Okay, so let’s talk about the program, or hypothesis, for “This barrel has an unknown frequency of white and green balls”, versus the hypothesis “This barrel has 10 white and 10 green balls”, versus the hypothesis, “This barrel always puts out a green ball after a white ball and vice versa.” Let’s say we see a green ball, then a white ball, the sequence GW. The first hypothesis assigns this probability ¹⁄₂ * ¹⁄₃ = ¹⁄₆, the second hypothesis assigns this probability ¹⁰⁄₂₀ * ⁹⁄₁₉ or roughly ¹⁄₄, and the third hypothesis assigns probability ¹⁄₂ * 1. Now it seems to me that there’s some important sense in which, even though Laplace’s Rule assigned a lower probability to the data, it’s significantly simpler than the second and third hypotheses and is the wiser answer. Does Solomonoff induction agree?

Blaise: I think you might be taking into account some prior knowledge that isn’t in the sequence itself, there. Like, things that alternate either 101010… or 010101… are objectively simple in the sense that a short computer program simulates them or assigns probabilities to them. It’s just unlikely to be true about an actual barrel of white and green balls. If 10 is literally the first sense data that you ever see, when you are a fresh new intelligence with only two bits to rub together, then “The universe consists of alternating bits” is no less reasonable than “The universe produces bits with an unknown random frequency anywhere between 0 and 1.”

Ashley: Conceded. But as I was going to say, we have three hypotheses that assigned ¹⁄₆, ~1/4, and ¹⁄₂ to the observed data; but to know the posterior probabilities of these hypotheses we need to actually say how relatively likely they were a priori, so we can multiply by the odds ratio. Like, if the prior odds were 3:2:1, the posterior odds would be 3:2:1 * (2/12 : ³⁄₁₂ : ⁶⁄₁₂) = 3:2:1 * 2:3:6 = 6:6:6 = 1:1:1. Now, how would Solomonoff induction assign prior probabilities to those computer programs? Because I remember you saying, way back when, that you thought Solomonoff was the answer to “How should Bayesians assign priors?”

Blaise: Well, how would you do it?

Ashley: I mean… yes, the simpler rules should be favored, but it seems to me that there’s some deep questions as to the exact relative ‘simplicity’ of the rules (M + 1) / (M + N + 2), or the rule (10 - M) / (20 - M—N), or the rule “alternate the bits”…

Blaise: Suppose I ask you to just make up some simple rule.

Ashley: Okay, if I just say the rule I think you’re looking for, the rule would be, “The complexity of a computer program is the number of bits needed to specify it to some arbitrary but reasonable choice of compiler or Universal Turing Machine, and the prior probability is ¹⁄₂ to the power of the number of bits. Since, e.g., there’s 32 possible 5-bit programs, so each such program has probability ¹⁄₃₂. So if it takes 16 bits to specify Laplace’s Rule of Succession, which seems a tad optimistic, then the prior probability would be ¹⁄₆₅₅₃₆, which seems a tad pessimistic.

Blaise: Now just apply that rule to the infinity of possible computer programs that assign probabilities to the observed data, update their posterior probabilities based on the probability they’ve assigned to the evidence so far, sum over all of them to get your next prediction, and we’re done. And yes, that requires a hypercomputer that can solve the halting problem, but we’re talking ideals here. Let $\mathcal P$ be the set of all programs and $s_1s_2\ldots s_n$ also written $s_{\preceq n}$ be the sense data so far, then

$\displaystyle \mathbb{Sol}(s_{\preceq n}) := \sum_{\mathrm{prog} \in \mathcal{P}} 2^{-\mathrm{length}(\mathrm{prog})} \cdot {\prod_{j=1}^n \mathop{InterpretProb}(\mathrm{prog}(s_{\preceq j-1}), s_j)}$

$\displaystyle \mathbb{P}(s_{n+1}=1\mid s_{\preceq n}) = \frac{\mathbb{Sol}(s_1s_2\ldots s_n 1)}{\mathbb{Sol}(s_1s_2\ldots s_n 1) + \mathbb{Sol}(s_1s_2\ldots s_n 0)}.$

Ashley: Uh.

Blaine: Yes?

Ashley: Um…

Blaine: What is it?

Ashley: You invoked a countably infinite set, so I’m trying to figure out if my predicted probability for the next bit must necessarily converge to a limit as I consider increasingly large finite subsets in any order.

Blaine (sighs): Of course you are.

Ashley: I think you might have left out some important caveats. Like, if I take the rule literally, then the program “0” has probability ¹⁄₂, the program “1″ has probability ¹⁄₂, the program “01” has probability ¹⁄₄ and now the total probability is 1.25 which is too much. So I can’t actually normalize it because the series sums to infinity. Now, this just means we need to, say, decide that the probability of a program having length 1 is ¹⁄₂, the probability of it having length 2 is ¹⁄₄, and so on out to infinity, but it’s an added postulate.

Blaine: The conventional method is to require a prefix-free code. If “0111” is a valid program then “01110″ cannot be a valid program. With that constraint, assigning “1/2 to the power of the length of the code”, to all valid codes, will sum to less than 1; and we can normalize their relative probabilities to get the actual prior.

Ashley: Okay. And you’re sure that it doesn’t matter in what order we consider more and more programs as we approach the limit, because… no, I see it. Every program has positive probability mass, with the total set summing to 1, and Bayesian updating doesn’t change that. So as I consider more and more programs, in any order, there’s only so many large contributions that can be made from the mix—only so often that the final probability can change. Like, let’s say there are at most 99 programs with probability 1% that assign probability 0 to the next bit being a 1; that’s only 99 times the final answer can go down by as much as 0.01, as the limit is approached.

Blaine: This idea generalizes, and is important. List all possible computer programs, in any order you like. Use any definition of simplicity that you like, so long as for any given amount of simplicity, there are only a finite number of computer programs that simple. As you go on carving off chunks of prior probability mass and assigning them to programs, it must be the case that as programs get more and complicated, their prior probability approaches zero! - though it’s still positive for every finite program, because of Cromwell’s Rule. You can’t have more than 99 programs assigned 1% prior probability and still obey Cromwell’s Rule, which means there must be some most complex program that is assigned 1% probability, which means every more complicated program must have less than 1% probability out to the end of the infinite list.

Ashley: Huh. I don’t think I’ve ever heard that justification for Occam’s Razor before. I think I like it. I mean, I’ve heard a lot of appeals to the empirical simplicity of the world, and so on, but this is the first time I’ve seen a logical proof that, in the limit, more complicated hypotheses must be less likely than simple ones.

Blaine: Behold the awesomeness that is Solomonoff Induction!

Ashley: Uh, but you didn’t actually use the notion of computational simplicity to get that conclusion, you just required that the supply of probability mass is finite and the supply of potential complications is infinite. Any way of counting discrete complications would imply that conclusion, even if it went by surface wheels and gears.

Blaine: Well, maybe. But it so happens that Yudkowsky did invent or reinvent that argument after pondering Solomonoff induction, and if it predates him (or Solomonoff) then Yudkowsky doesn’t know the source. Concrete inspiration for simplified arguments is also a credit to a theory, especially if the simplified argument didn’t exist before that.

Ashley: Fair enough. My next question is about the choice of Universal Turing Machine—the choice of compiler for our program codes. There’s an infinite number of possibilities there, and in principle, the right choice of compiler can make our probability for the next thing we’ll see be anything we like. At least I’d expect this to be the case, based on how the “problem of induction” usually goes. So with the right choice of Universal Turing Machine, our online crackpot can still make it be the case that Solomonoff induction predicts Canada invading the USA.

Blaine: One way of looking at the problem of good epistemology, I’d say, is that the job of a good epistemology is not to make it impossible to err. You can still blow off your foot if you really insist on pointing the shotgun at your foot and pulling the trigger. The job of good epistemology is to make it more obvious when you’re about to blow your own foot off with a shotgun. On this dimension, Solomonoff Induction excels. If you claim that we ought to pick an enormously complicated compiler to encode our hypotheses, in order to make the ‘simplest hypothesis that fits the evidence’ be one that predicts Canada invading the USA, then it should be obvious to everyone except you that you are in the process of screwing up.

Ashley: Ah, but of course they’ll say that their code is just the simple and natural choice of Universal Turing Machine, because they’ll exhibit a meta-UTM which outputs that UTM given only a short code. And if you say the meta-UTM is complicated--

Blaine: Flon’s Law says, “There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code.” You can’t make it impossible for people to screw up, but you can make it more obvious. And Solomonoff induction would make it even more obvious than might at first be obvious, because--

Ashley: Your Honor, I move to have the previous sentence taken out and shot.

Blaine: Let’s say that the whole of your sensory information is the string 10101010… Consider the stupid hypothesis, “This program has a 99% probability of producing a 1 on every turn”, which you jumped to after seeing the first bit. What would you need to claim your priors were like—what Universal Turing Machine would you need to endorse—in order to maintain blind faith in that hypothesis in the face of ever-mounting evidence?

Ashley: You’d need a Universal Turing Machine blind-utm that assigned a very high probability to the blind program “def ProbNextElementIsOne(previous_sequence): return 0.99”. Like, if blind-utm sees the code 0, it executes the blind program “return 0.99″. And to defend yourself against charges that your UTM blind-utm was not itself simple, you’d need a meta-UTM, blind-meta which, when it sees the code 10, executes blind-utm. And to really wrap it up, you’d need to take a fixed point through all towers of meta and use diagonalization to create the UTM blind-diag that, when it sees the program code 0, executes “return 0.99”, and when it sees the program code 10, executes blind-diag. I guess I can see some sense in which, even if that doesn’t resolve Hume’s problem of induction, anyone actually advocating that would be committing blatant shenanigans on a commonsense level, arguably more blatant than it would have been if we hadn’t made them present the UTM.

Blaine: Actually, the shenanigans have to be much worse than that in order to fool Solomonoff induction. Like, Solomonoff induction using your blind-diag isn’t fooled for a minute, even taking blind-diag entirely on its own terms.

Ashley: Really?

Blaise: Assuming 60 sequence items per second? Yes, absolutely, Solomonoff induction shrugs off the delusion in the first minute, unless there’s further and even more blatant shenanigans. We did require that your blind-diag be a Universal Turing Machine, meaning that it can reproduce every computable probability distribution over sequences, given some particular code to compile. Let’s say there’s a 200-bit code laplace for Laplace’s Rule of Succession, “lambda sequence: return (sequence.count(‘1’) + 1) / (len(sequence) + 2)”, so that its prior probability relative to the 1-bit code for blind is 2^-200. Let’s say that the sense data is around ⁵⁰⁄₅₀ 1s and 0s. Every time we see a 1, blind gains a factor of 2 over laplace (99% vs. 50% probability), and every time we see a 0, blind loses a factor of 50 over laplace (1% vs. 50% probability). On average, every 2 bits of the sequence, blind is losing a factor of 25 or, say, a bit more than 4 bits, i.e., on average blind is losing two bits of probability per element of the sequence observed. So it’s only going to take 100 bits, or a little less than two seconds, for laplace to win out over blind.

Ashley: I see. I was focusing on a UTM that assigned lots of prior probability to blind, but what I really needed was a compiler that, while still being universal and encoding every possibility somewhere, still assigned a really tiny probability to laplace, faircoin that encodes “return 0.5”, and every other hypothesis that does better, round by round, than blind. So what I really need to carry off the delusion is obstinate-diag that is universal, assigns high probability to blind, requires billions of bits to specify laplace, and also requires billions of bits to specify any UTM that can execute laplace as a shorter code than billions of bits. Because otherwise we will say, “Ah, but given the evidence, this other UTM would have done better.” I agree that those are even more blatant shenanigans than I thought.

Blaise: Yes. And even then, even if your UTM takes two billion bits to specify faircoin, Solomonoff induction will lose its faith in blind after seeing a billion bits. Which will happen before the first year is out, if we’re getting 60 bits per second. And if you turn around and say, “Oh, well, I didn’t mean that was my UTM, I really meant this was my UTM, this thing over here where it takes a trillion bits to encode faircoin”, then that’s probability-theory-violating shenanigans where you’re changing your priors as you go.

Ashley: That’s actually a very interesting point—that what’s needed for Bayesian to maintain a delusion in the face of mounting evidence is not so much a blindly high prior for the delusory hypothesis, as a blind skepticism of all its alternatives. But what if their UTM requires a googol bits to specify faircoin? What if blind and blind-diag, or programs pretty much isomorphic to them, are the only programs that can be specified in less than a googol bits?

Blaise: Then your desire to shoot your own foot off has been made very, very visible to anyone who understands Solomonoff induction. We’re not going to get absolutely objective prior probabilities as a matter of logical deduction, not without principles that are unknown to me and beyond the scope of Solomonoff induction. But we can make the stupidity really blatant and force you to construct a downright embarrassing Universal Turing Machine.

Ashley: I guess I can see that. I mean, I guess that if you’re presenting a ludicrously complicated Universal Turing Machine that just refuses to encode the program that would predict Canada not invading, that’s more visibly silly than a verbal appeal that says, “But you must just have faith that Canada will invade.” I guess part of me is still hoping for a more objective sense of “complicated”.

Blaine: We could say that reasonable UTMs should contain a small number of wheels and gears in a material instantiation under our universe’s laws of physics, which might in some ultimate sense provide a prior over priors. Like, the human brain evolved from DNA-based specifications, and the things you can construct out of relatively small numbers of physical objects are ‘simple’ under the ‘prior’ implicitly searched by natural selection.

Ashley: Ah, but what if I think it’s likely that our physical universe or the search space of DNA won’t give us a good idea of what’s complicated?

Blaine: For your alternative notion of what’s complicated to go on being believed even as other hypotheses are racking up better experimental predictions, you need to assign a ludicrously low probability that our universe’s space of physical systems buildable using a small number of objects, could possibly provide better predictions of that universe than your complicated alternative notion of prior probability. We don’t need to appeal that it’s a priori more likely than not that “a universe can be predicted well by low-object-number machines built using that universe’s physics.” Instead, we appeal that it would violate Cromwell’s Rule, and be exceedingly special pleading, to assign the possibility of a physically learnable universe a probability of less than $2^{-1,000,000}$. It then takes only a megabit of exposure to notice that the universe seems to be regular.

Ashley: In other words, so long as you don’t start with an absolute and blind prejudice against the universe being predictable by simple machines encoded in our universe’s physics—so long as, on this planet of seven billion people, you don’t assign probabilities less than $2^{-1,000,000}$ to the other person being right about what is a good Universal Turing Machine—then the pure logic of Bayesian updating will rapidly force you to the conclusion that induction works. Hm. I don’t know that good pragmatic answers to the problem of induction were ever in short supply. Still, on the margins, it’s a more forceful pragmatic answer than the last one I remember hearing.

Blaise: Yay! Now isn’t Solomonoff induction wonderful?

Ashley: Maybe? You didn’t really use the principle of computational simplicity to derive that lesson. You just used that some inductive principle ought to have a prior probability of more than $2^{-1,000,000}$.

Blaise: …

Ashley: Can you give me an example of a problem where the computational definition of simplicity matters and can’t be factored back out of an argument?

Blaise: As it happens, yes I can. I can give you three examples of how it matters.

Ashley: Vun… two… three! Three examples! Ah-ah-ah!

Blaise: Must you do that every—oh, never mind. Example one is that galaxies are not so improbable that no one could ever believe in them, example two is that the limits of possibility include Terrence Tao, and example three is that diffraction is a simpler explanation of rainbows than divine intervention.

Ashley: These statements are all so obvious that no further explanation of any of them is required.

Blaine: On the contrary! And I’ll start with example one. Back when the Andromeda galaxy was a hazy mist seen through a telescope, and someone first suggested that maybe that hazy mist was an incredibly large number of distant stars—that many “nebulae” were actually distant galaxies, and our own Milky Way was only one of them—there was a time when Occam’s Razor was invoked against that hypothesis.

Ashley: What? Why?

Blaine: They invoked Occam’s Razor against the galactic hypothesis, because if that were the case, then there would be a much huger number of stars in the universe, and the stars would be entities, and Occam’s Razor said “Entities are not to be multiplied beyond necessity.”

Ashley: That’s not how Occam’s Razor works. The “entities” of a theory are its types, not its objects. If you say that the hazy mists are distant galaxies of stars, then you’ve reduced the number of laws because you’re just postulating a previously seen type, namely stars organized into galaxies, instead of a new type of hazy astronomical mist.

Blaine: Okay, but imagine that it’s the nineteenth century and somebody replies to you, “Well, I disagree! William of Ockham said not to multiply entities, this galactic hypothesis obviously creates a huge number of entities, and that’s the way I see it!”

Ashley: I think I’d give them your spiel about there being no human epistemology that can stop you from shooting off your own foot.

Blaine: I don’t think you’d be justified in giving them that lecture. I’ll parenthesize at this point that you ought to be very careful when you say “I can’t stop you from shooting off your own foot”, lest it become a Fully General Scornful Rejoinder. Like, if you say that to someone, you’d better be able to explain exactly why Occam’s Razor counts types as entities but not objects. In fact, you’d better explain that to someone before you go advising them not to shoot off their own foot. And once you’ve told them what you think is foolish and why, you might as well stop there. Except in really weird cases of people presenting us with enormously complicated and jury-rigged Universal Turing Machines, and then we say the shotgun thing.

Ashley: That’s fair. So, I’m not sure what I’d have answered before starting this conversation, which is much to your credit, friend Blaine. But now that I’ve had this conversation, it’s obvious that it’s new types and not new objects that use up the probability mass we need to distribute over all hypotheses. Like, I need to distribute my probability mass over “Hypothesis 1: there are stars” and “Hypothesis 2: there are stars plus huge distant hazy mists”. I don’t need to distribute my probability mass over all the actual stars in the galaxy!

Blaine: In terms of Solomonoff induction, we penalize a program’s lines of code rather than its runtime or RAM used, because we need to distribute our probability mass over possible alternatives each time we add a line of code. There’s no corresponding choice between mutually exclusive alternatives when a program uses more runtime or RAM.

(Eliezer, whispering: Unless we need a leverage prior to consider the hypothesis of being a particular agent inside all that RAM or runtime.)

Ashley: Or to put it another way, any fully detailed model of the universe would require some particular arrangement of stars, and the more stars there are, the more possible arrangements there are. But when we look through the telescope and see a hazy mist, we get to sum over all arrangements of stars that would produce that hazy mist. If some galactic hypothesis required a hundred billion stars to all be in particular exact places without further explanation or cause, then that would indeed be a grave improbability.

Blaine: Precisely. And if you needed all the hundred billion stars to be in particular exact places, that’s just the kind of hypothesis that would take a huge computer program to specify.

Ashley: But does it really require learning Solomonoff induction to understand that point? Maybe the bad argument against galaxies was just a motivated error somebody made in the nineteenth century, because they didn’t want to live in a big universe for emotional reasons.

Blaine: The same debate is playing out today over no-collapse versions of quantum mechanics, also somewhat unfortunately known as “many-worlds interpretations”. Now, regardless of what anyone thinks of all the other parts of that debate, there’s a particular sub-argument where somebody says, “It’s simpler to have a collapse interpretation because all those extra quantum “worlds” are extra entities that are unnecessary under Occam’s Razor since we can’t see them.” And Solomonoff induction tells us that this invocation of Occam’s Razor is flatly misguided because Occam’s Razor does not work like that. Basically, they’re trying to cut down the RAM and runtime of the universe, at the expensive of adding an extra line of code, namely the code for the collapse postulate that prunes off parts of the wavefunction that are in undetectably weak causal contact with us.

Ashley: Hmm. Now that you put it that way, it’s not so obvious to me that it makes sense to have no prejudice against sufficiently enormous universes. I mean, the universe we see around us is exponentially vast but not superexponentially vast—the visible atoms are $10^{80}$ in number or so, not $10^{10^{80}}$ or “bigger than Graham’s Number”. Maybe there’s some fundamental limit on how much gets computed.

Blaine: You, um, know that on the Standard Model, the universe doesn’t just cut out and stop existing at the point where our telescopes stop seeing it? There isn’t a giant void surrounding a little bubble of matter centered perfectly on Earth? It calls for a literally infinite amount of matter? I mean, I guess if you don’t like living in a universe with more than $10^{80}$ entities, a universe where too much gets computed, you could try to specify extra laws of physics that create an abrupt spatial boundary with no further matter beyond them, somewhere out past where our telescopes can see -

Ashley: All right, point taken.

(Eliezer, whispering: Though I personally suspect that the spatial multiverse and the quantum multiverse are the same multiverse, and that what lies beyond the reach of our telescopes is not entangled with us—meaning that the universe is as finitely large as the superposition of all possible quantum branches, rather than being literally infinite in space.)

Blaine: I mean, there is in fact an alternative formalism to Solomonoff induction, namely Levin search, which says that program complexities are further penalized by the logarithm of their runtime. In other words, it would make ‘explanations’ or ‘universes’ that require a long time to run, be inherently less probable. Some people like Levin search more than Solomonoff induction because it’s more computable. I dislike Levin search because (a) it has no fundamental epistemic justification and (b) it assigns probability zero to quantum mechanics.

Ashley: Can you unpack that last part?

Blaine: If, as is currently suspected, there’s no way to simulate quantum computers using classical computers without an exponential slowdown, then even in principle, this universe requires exponentially vast amounts of classical computing power to simulate. Let’s say that with sufficiently advanced technology, you can build a quantum computer with a million qubits. On Levin’s definition of complexity, for the universe to be like that is as improbable a priori as any particular set of laws of physics that must specify on the order of one million equations. Can you imagine how improbable it would be to see a list of one hundred thousand differential equations, without any justification or evidence attached, and be told that they were the laws of physics? That’s the kind of penalty that Levin search or Schmidhuber’s Speed Prior would attach to any laws of physics that could run a quantum computation of a million qubits, or, heck, any physics that claimed that a protein was being folded in a way that ultimately went through considering millions of quarks interacting. If you’re not absolutely certain a priori that the universe isn’t like that, you don’t believe in Schmidhuber’s Speed Prior. Even with a collapse postulate, the amount of computation that goes on before a collapse would be prohibited by the Speed Prior.

Ashley: Okay, yeah. If you’re phrasing it that way—that the Speed Prior assigns probability nearly zero to quantum mechanics, so we shouldn’t believe in the Speed Prior—then I can’t easily see a way to extract out the same point without making reference to ideas like penalizing algorithmic complexity but not penalizing runtime. I mean, maybe I could extract the lesson back out but it’s easier to say, or more obvious, by pointing to the idea that Occam’s Razor should penalize algorithmic complexity but not runtime.

Blaine: And that isn’t just implied by Solomonoff induction, it’s pretty much the whole idea of Solomonoff induction, right?

Ashley: Maaaybe.

Blaine: For example two, that Solomonoff induction outperforms even Terence Tao, we want to have a theorem that says Solomonoff induction catches up to every computable way of reasoning in the limit. Since we iterated through all possible computer programs, we know that somewhere in there is a simulated copy of Terence Tao in a simulated room, and if this requires a petabyte to specify, then we shouldn’t have to make more than a quadrillion bits of error relative to Terence Tao before zeroing in on the Terence Tao hypothesis. I mean, in practice, I’d expect far less than a quadrillion bits of error before the system was behaving like it was vastly smarter than Terence Tao. It’d take a lot less than a quadrillion bits to give you some specification of a universe with simple physics that gave rise to a civilization of vastly greater than intergalactic extent. Like, Graham’s Number is a very simple number so it’s easy to specify a universe that runs for that long before it returns an answer. It’s not obvious how you’d extract Solomonoff predictions from that civilization and incentivize them to make good ones, but I’d be surprised if there were no Turing machine of fewer than one thousand states which did that somehow.

Ashley: …

Blaine: And for all I know there might be even better ways than that of getting exceptionally good predictions, somewhere in the list of the first decillion computer programs. That is, somewhere in the first 100 bits.

Ashley: So your basic argument is, “Never mind Terence Tao, Solomonoff induction dominates God.”

Blaine: Solomonoff induction isn’t the epistemic prediction capability of a superintelligence. It’s the epistemic prediction capability of something that eats superintelligences like potato chips.

Ashley: Is there any point to contemplating an epistemology so powerful that it will never begin to fit inside the universe?

Blaine: Maybe? I mean a lot of times, you just find people failing to respect the notion of ordinary superintelligence, doing the equivalent of supposing that a superintelligence behaves like a bad Hollywood genius and misses obvious-seeming moves. And a lot of times you find them insisting that “there’s a limit to how much information you can get from the data” or something along those lines. “That Alien Message” is intended to convey the counterpoint, that smarter entities can extract more info than is immediately apparent on the surface of things. Similarly, thinking about Solomonoff induction might also cause someone to realize that if, say, you simulated zillions of possible simple universes, you could look at which agents were seeing exact data like the data you got, and figure out where you were inside that range of possibilities, so long as there was literally any correlation to use. And if you say that an agent can’t extract that data, you’re making a claim about which shortcuts to Solomonoff induction are and aren’t computable. In fact, you’re probably pointing at some particular shortcut and claiming nobody can ever figure that out using a reasonable amount of computing power even though the info is there in principle. Contemplating Solomonoff induction might help people realize that, yes, the data is there in principle. Like, until I ask you to imagine a civilization running for Graham’s Number of years inside a Graham-sized memory space, you might not imagine them trying all the methods of analysis that you personally can imagine being possible.

Ashley: If somebody is making that mistake in the first place, I’m not sure you can beat it out of them by telling them the definition of Solomonoff induction.

Blaine: Maybe not. But to brute-force somebody into imagining that sufficiently advanced agents have Level 1 protagonist intelligence, that they are epistemically efficient rather than missing factual questions that are visible even to us, you might need to ask them to imagine an agent that can see literally anything seeable in the computational limit just so that their mental simulation of the ideal answer isn’t running up against stupidity assertions. Like, I think there’s a lot of people who could benefit from looking over the evidence they already personally have, and asking what a Solomonoff inductor could deduce from it, so that they wouldn’t be running up against stupidity assertions about themselves. It’s the same trick as asking yourself what God, Richard Feynman, or a “perfect rationalist” would believe in your shoes. You just have to pick a real or imaginary person that you respect enough for your model of that person to lack the same stupidity assertions that you believe about yourself.

Ashley: Well, let’s once again try to factor out the part about Solomonoff induction in particular. If we’re trying to imagine something epistemically smarter than ourselves, is there anything we get from imagining a complexity-weighted prior over programs in particular? That we don’t get from, say, trying to imagine the reasoning of one particular Graham-Number-sized civilization?

Blaine: We get the surety that even anything we imagine Terence Tao himself as being able to figure out, is something that is allowed to be known after some bounded number of errors versus Terence Tao, because Terence Tao is inside the list of all computer programs and gets promoted further each time the dominant paradigm makes a prediction error relative to him. We can’t get that dominance property without invoking “all possible ways of computing” or something like it—we can’t incorporate the power of all reasonable processes, unless we have a set such that all the reasonable processes are in it. The enumeration of all possible computer programs is one such set.

Ashley: Hm.

Blaine: Example three, diffraction is a simpler explanation of rainbows than divine intervention. I don’t think I need to belabor this point very much, even though in one way it might be the most central one. It sounds like “Jehovah placed rainbows in the sky as a sign that the Great Flood would never come again” is a ‘simple’ explanation, you can explain it to a child in nothing flat. Just the diagram of diffraction through a raindrop, to say nothing of the Principle of Least Action underlying diffraction, is something that humans don’t usually learn until undergraduate physics, and it sounds more alien and less intuitive than Jehovah. In what sense is this intuitive sense of simplicity wrong? What gold standard are we comparing it to, that could be a better sense of simplicity than just ‘how hard is it for me to understand’? The answer is Solmonoff induction and the rule which says that simplicity is measured by the size of the computer program, not by how hard things are for human beings to understand. Diffraction is a small computer program; any programmer who understands diffraction can simulate it without too much trouble. Jehovah would be a much huger program—a complete mind that implements anger, vengeance, belief, memory, consequentialism, etcetera. Solomonoff induction is what tells us to retrain our intuitions so that differential equations feel like less burdensome explanations than heroic mythology.

Ashley: Now hold on just a second, if that’s actually how Solomonoff induction works then it’s not working very well. I mean, Abraham Lincoln was a great big complicated mechanism from an algorithmic standpoint—he had a hundred trillion synapses in his brain—but that doesn’t mean I should look at the historical role supposedly filled by Abraham Lincoln, and look for simple mechanical rules that would account for the things Lincoln is said to have done. If you’ve already seen humans and you’ve already learned to model human minds, it shouldn’t cost a vast amount to say there’s one more human, like Lincoln, or one more entity that is cognitively humanoid, like the Old Testament jealous-god version of Jehovah. It may be wrong but it shouldn’t be vastly improbable a priori. If you’ve already been forced to acknowledge the existence of some humanlike minds, why not others? Shouldn’t you get to reuse the complexity that you postulated to explain humans, in postulating Jehovah? In fact, shouldn’t that be what Solomonoff induction does? If you have a computer program that can model and predict humans, it should only be a slight modification of that program—only slightly longer in length and added code—to predict the modified-human entity that is Jehovah.

Blaine: Hm. That’s fair. I may have to retreat from that example somewhat. In fact, that’s yet another point to the credit of Solomonoff induction! The ability of programs to reuse code, incorporates our intuitive sense that if you’ve already postulated one kind of thing, it shouldn’t cost as much to postulate a similar kind of thing elsewhere!

Ashley: Uh huh.

Blaine: Well, but even if I was wrong that Solomonoff induction should make Jehovah seem very improbable, it’s still Solomonoff induction that says that the alternative hypothesis of ‘diffraction’ shouldn’t itself be seen as burdensome—even though diffraction might require a longer time to explain to a human, it’s still at heart a simple program.

Ashley: Hmm. I’m trying to think if there’s some notion of ‘simplicity’ that I can abstract away from ‘simple program’ as the nice property that diffraction has as an explanation for rainbows, but I guess anything I try to say is going to come down to some way of counting the wheels and gears inside the explanation, and justify the complexity penalty on probability by the increased space of possible configurations each time we add a new gear. And I can’t make it be about surface details because that will make whole humans seem way too improbable. If I have to use simply specified systems and I can’t use surface details or runtime, that’s probably going to end up basically equivalent to Solomonoff induction. So in that case we might as well use Solomonoff induction, which is probably simpler than whatever I’ll think up and will give us the same advice. Okay, you’ve mostly convinced me.

Blaine: Mostly? What’s left?

Ashley: Well, several things. Most of all, I think of how the ‘language of thought’ or ‘language of epistemology’ seems to be different in some sense from the ‘language of computer programs’. Like, when I think about the laws of Newtonian gravity, or when I think about my Mom, it’s not just one more line of code tacked onto a big black-box computer program. It’s more like I’m crafting an explanation with modular parts—if it contains a part that looks like Newtonian mechanics, I step back and reason that it might contain other parts with differential equations. If it has a line of code for a Mom, it might have a line of code for a Dad. I’m worried that if I understood how humans think like that, maybe I’d look at Solomonoff induction and see how it doesn’t incorporate some further key insight that’s needed to do good epistemology.

Blaine: Solomonoff induction literally incorporates a copy of you thinking about whatever you’re thinking right now.

Ashley: Okay, great, but that’s inside the system. If Solomonoff learns to promote computer programs containing good epistemology, but is not itself good epistemology, then it’s not the best possible answer to “How do you compute epistemology?” Like, natural selection produced humans but population genetics is not an answer to “How does intelligence work?” because the intelligence is in the inner content rather than the outer system. In that sense, it seems like a reasonable worry that Solomonoff induction might incorporate only some principles of good epistemology rather than all the principles, even if the internal content rather than the outer system might bootstrap the rest of the way.

Blaine: Hm. If you put it that way…

(long pause)

Blaine: …then, I guess I have to agree. I mean, Solomonoff induction doesn’t explicitly say anything about, say, the distinction between analytic propositions and empirical propositions, and knowing that is part of good epistemology on my view. So if you want to say that Solomonoff induction is something that bootstraps to good epistemology rather than being all of good epistemology by itself, I guess I have no choice but to agree. I do think the outer system already contains a lot of good epistemology and inspires a lot of good advice all on its own. Especially if you give it credit for formally reproducing principles that are “common sense”, because correctly formalizing common sense is no small feat.

Ashley: Got a list of the good advice you think is derivable?

Blaine: Um. Not really, but off the top of my head:

The best explanation is the one with the best mixture of simplicity and matching the evidence.
“Simplicity” and “matching the evidence” can both be measured in bits, so they’re commensurable.
The simplicity of a hypothesis is the number of bits required to formally specify it, for example as a computer program.
When a hypothesis assigns twice as much probability to the exact observations seen so far as some other hypothesis, that’s one bit’s worth of relatively better matching the evidence.
You should actually be making your predictions using all the explanations, not just the single best one, but explanations that poorly match the evidence will drop down to tiny contributions very quickly.
Good explanations lets you compress lots of data into compact reasons which strongly predict seeing just that data and no other data.
Logic can’t dictate prior probabilities absolutely, but if you assign probability less than $2^{-1,000,000}$ to the prior that mechanisms constructed using a small number of objects from your universe might be able to well predict that universe, you’re being unreasonable.
So long as you don’t assign infinitesimal prior probability to hypotheses that let you do induction, they will very rapidly overtake hypotheses that don’t.
It is a logical truth, not a contingent one, that more complex hypotheses must in the limit be less probable than simple ones.
Epistemic rationality is a precise art with no user-controlled degrees of freedom in how much probability you ideally ought to assign to a belief. If you think you can tweak the probability depending on what you want the answer to be, you’re doing something wrong.
Things that you’ve seen in one place might reappear somewhere else.
Once you’ve learned a new language for your explanations, like differential equations, you can use it to describe other things, because your best hypotheses will now already encode that language.
We can learn meta-reasoning procedures as well as object-level facts by looking at which meta-reasoning rules are simple and have done well on the evidence so far.
So far, we seem to have no a priori reason to believe that universes which are more expensive to compute are less probable.
People were wrong about galaxies being a priori improbable because that’s not how Occam’s Razor works. Today, other people are equally wrong about other parts of a continuous wavefunction counting as extra entities.
If something seems “weird” to you but would be a consequence of simple rules that fit the evidence so far, well, there’s no term in these explicit laws of epistemology which add an extra penalty term for weirdness.
Your epistemology shouldn’t have extra rules in it that aren’t needed to do Solomonoff induction or something like it, including rules like “science is not allowed to examine this particular part of reality”--

Ashley: This list isn’t finite, is it.

Blaine: Well, there’s a lot of outstanding debate about epistemology where you can view that debate through the lens of Solomonoff induction and see what Solomonoff suggests.

Ashley: But if you don’t mind my stopping to look at your last item, #17 above—again, it’s attempts to add completeness clauses to Solomonoff induction that make me the most nervous. I guess you could say that a good rule of epistemology ought to be one that’s promoted by Solomonoff induction—that it should arise, in some sense, from the simple ways of reasoning that are good at predicting observations. But that doesn’t mean a good rule of epistemology ought to explicitly be in Solomonoff induction or it’s out.

Blaine: Can you think of good epistemology that doesn’t seem to be contained in Solomonoff induction? Besides the example I already gave of distinguishing logical propositions from empirical ones.

Ashley: I’ve been trying to. First, it seems to me that when I reason about laws of physics and how those laws of physics might give rise to higher levels of organization like molecules, cells, human beings, the Earth, and so on, I’m not constructing in my mind a great big chunk of code that reproduces my observations. I feel like this difference might be important and it might have something to do with ‘good epistemology’.

Blaine: I guess it could be? I think if you’re saying that there might be this unknown other thing and therefore Solomonoff induction is terrible, then that would be the nirvana fallacy. Solomonoff induction is the best formalized epistemology we have right now--

Ashley: I’m not saying that Solomonoff induction is terrible. I’m trying to look in the direction of things that might point to some future formalism that’s better than Solomonoff induction. Here’s another thing: I feel like I didn’t have to learn how to model the human beings around me from scratch based on environmental observations. I got a jump-start on modeling other humans by observing myself, and by recruiting my brain areas to run in a sandbox mode that models other people’s brain areas—empathy, in a word. I guess I feel like Solomonoff induction doesn’t incorporate that idea. Like, maybe inside the mixture there are programs which do that, but there’s no explicit support in the outer formalism.

Blaine: This doesn’t feel to me like much of a disadvantage of Solomonoff induction--

Ashley: I’m not saying it would be a disadvantage if we actually had a hypercomputer to run Solomonoff induction. I’m saying it might point in the direction of “good epistemology” that isn’t explicitly included in Solomonoff induction. I mean, now that I think about it, a generalization of what I just said is that Solomonoff induction assumes I’m separated from the environment by a hard, Cartesian wall that occasionally hands me observations. Shouldn’t a more realistic view of the universe be about a simple program that contains me somewhere inside it, rather than a simple program that hands observations to some other program?

Blaine: Hm. Maybe. How would you formalize that? It seems to open up a big can of worms--

Ashley: But that’s what my actual epistemology actually says. My world-model is not about a big computer program that provides inputs to my soul, it’s about an enormous mathematically simple physical universe that instantiates Ashley as one piece of it. And I think it’s good and important to have epistemology that works that way. It wasn’t obvious that we needed to think about a simple universe that embeds us. Descartes did think in terms of an impervious soul that had the universe projecting sensory information onto its screen, and we had to get away from that kind of epistemology.

Blaine: You understand that Solomonoff induction makes only a bounded number of errors relative to any computer program which does reason the way you prefer, right? If thinking of yourself as a contiguous piece of the universe lets you make better experimental predictions, programs which reason that way will rapidly be promoted.

Ashley: It’s still unnerving to see a formalism that seems, in its own structure, to harken back to the Cartesian days of a separate soul watching a separate universe projecting sensory information on a screen. Who knows, maybe that would somehow come back to bite you?

Blaine: Well, it wouldn’t bite you in the form of repeatedly making wrong experimental predictions.

Ashley: But it might bite you in the form of having no way to represent the observation of, “I drank this ‘wine’ liquid and then my emotions changed; could my emotions themselves be instantiated in stuff that can interact with some component of this liquid? Can alcohol touch neurons and influence them, meaning that I’m not a separate soul?” If we interrogated the Solomonoff inductor, would it be able to understand that reasoning? Which brings up that dangling question from before about modeling the effect that my actions and choices have on the environment, and whether, say, an agent that used Solomonoff induction would be able to correctly predict “If I drop an anvil on my head, my sequence of sensory observations will end.”

Eliezer: And that’s my cue to step in! For more about the issues Ashley raised with agents being a contiguous part of the universe, see naturalistic reflection once that section exists. Meanwhile, we’ll consider next the question of actions and choices, as we encounter the agent that uses Solomonoff induction for beliefs and expected reward maximization for selecting actions—the perfect rolling sphere of advanced agent theory, AIXI. Look forward to Ashley and Blaine’s dialogue continuing there, once that section is up, which it currently isn’t.

Donald Hobson 17 Jan 2021 12:56 UTC
Suppose you have a program P that outputs a probability distribution over the next possible bit. With a fairly small constant amount of extra code, you can make an algorithm P’ that takes in an arbitrary bitstream and decompresses them according to P. In other words, there are data compression algorithms that can use the probabilities to compress the environmental sequence. Now consider the set of programs that look like P’(b) for some bitstring b. The number of extra bits needed to encode b is the same as the number of bits P would loose by not being able to predict every step perfectly. In other words the set of all P’(b) deterministic programs combine to give basically the same probability distribution as P. (up to constant fiddle factors that depend on choice of TM)

In short, you can compress the random noise in the environment, and put it in your predictive program. This gives a deterministic program. And this costs only a constant more bits than admitting your uncertainty.
Alexei Andreev 15 Jan 2016 21:06 UTC
Looks like there is a typo with the fraction. sense 1…n . 1?
- Eliezer Yudkowsky 16 Jan 2016 0:16 UTC
  I was trying to say “append 1 to previous sequence”. I guess it needs explanation.
- Nate Soares 3 Mar 2016 4:47 UTC
  I tweaked this to make it a bit clearer while also cleaning up the latex (I think it’s standard to use mathrm on multi-letter variable names, to disambiguate them from the multiplication of many single variables, etc. etc.), but I still think that parts of the equation (such as InterpretProb) need types and explanations. (I think you also want to explain how it handles non-halting programs, which requires renormalization in the Solomonoff induction case, unless you want to talk about the semimeasure instead.)
Zack M. Davis 15 May 2016 22:03 UTC
I imagine Ashley Watson leaning against the fourth wall as she delivers this line with an air of knowing satisfaction at how well she’s fulfilling the purpose of her hypothetical existence.
Eric Bruylant 4 Aug 2016 12:57 UTC
Link to a description of the library of babel?
Jacob Thoennes 4 Apr 2018 23:17 UTC
“diffraction” should be “refraction” throughout