Solomonoff induction: Intro Dialogue (Math 2)

(A di­alogue be­tween Ash­ley, a com­puter sci­en­tist who’s never heard of Solomonoff’s the­ory of in­duc­tive in­fer­ence, and Blaine, who thinks it is the best thing since sliced bread.)

Ash­ley: Good evening, Msr. Blaine.

Blaine: Good evening, Msr. Ash­ley.

Ash­ley: I’ve heard there’s this thing called “Solomonoff’s the­ory of in­duc­tive in­fer­ence”.

Blaine: The ru­mors have spread, then.

Ash­ley: Yeah, so, what the heck is that about?

Blaine: In­vented in the 1960s by the math­e­mat­i­cian Ray Solomonoff, the key idea in Solomonoff in­duc­tion is to do se­quence pre­dic­tion by us­ing Bayesian up­dat­ing on a prior com­posed of a mix­ture of all com­putable prob­a­bil­ity dis­tri­bu­tions--

Ashley: Wait. Back up a lot. Before you try to explain what Solomonoff induction is, I'd like you to try to tell me what it does, or why people study it in the first place. I find that helps me organize my listening. Right now I don't even know why I should be interested in this.

Blaine: Um, okay. Let me think for a sec­ond…

Ash­ley: Also, while I can imag­ine things that “se­quence pre­dic­tion” might mean, I haven’t yet en­coun­tered it in a tech­ni­cal con­text, so you’d bet­ter go a bit fur­ther back and start more at the be­gin­ning. I do know what “com­putable” means and what a “prob­a­bil­ity dis­tri­bu­tion” is, and I re­mem­ber the for­mula for Bayes’s Rule al­though it’s been a while.

Blaine: Okay. So… one way of fram­ing the usual rea­son why peo­ple study this gen­eral field in the first place, is that some­times, by study­ing cer­tain ideal­ized math­e­mat­i­cal ques­tions, we can gain valuable in­tu­itions about episte­mol­ogy. That’s, uh, the field that stud­ies how to rea­son about fac­tual ques­tions, how to build a map of re­al­ity that re­flects the ter­ri­tory--

Ash­ley: I have some idea what ‘episte­mol­ogy’ is, yes. But I think you might need to start even fur­ther back, maybe with some sort of con­crete ex­am­ple or some­thing.

Blaine: Okay. Um. So one anecdote that I sometimes use to frame the value of computer science to the study of epistemology is Edgar Allan Poe's argument in 1833 that chess was uncomputable.

Ash­ley: That doesn’t sound like a thing that ac­tu­ally hap­pened.

Blaine: I know, but it totally did happen and not in a metaphorical sense either! Edgar Allan Poe wrote an essay explaining why no automaton would ever be able to play chess, and he specifically mentioned "Mr. Babbage's computing engine" as an example. You see, in the nineteenth century, there was for a time this sensation known as the Mechanical Turk—supposedly a machine, an automaton, that could play chess. At the grandmaster level, no less. Now today, when we're accustomed to the idea that it takes a reasonably powerful computer to do that, we can know immediately that the Mechanical Turk must have been a fraud and that there must have been a concealed operator inside—a midget, as it turned out. Today we know that this sort of thing is hard to build into a machine. But in the 19th century, even that much wasn't known. So when Edgar Allan Poe, who besides being an author was also an accomplished magician, set out to write an essay about the Mechanical Turk, he spent the second half of the essay dissecting what was known about the Turk's appearance to (correctly) figure out where the human operator was hiding. But Poe spent the first half of the essay arguing that no automaton—nothing like Mr. Babbage's computing engine—could possibly play chess, which was how he knew a priori that the Turk had a concealed human operator.

Ash­ley: And what was Poe’s ar­gu­ment?

Blaine: Poe ob­served that in an alge­braical prob­lem, each step fol­lowed from the pre­vi­ous step of ne­ces­sity, which was why the steps in solv­ing an alge­braical prob­lem could be rep­re­sented by the de­ter­minis­tic mo­tions of gears in some­thing like Mr. Bab­bage’s com­put­ing en­g­ine. But in a chess prob­lem, Poe said, there are many pos­si­ble chess moves, and no move fol­lows with ne­ces­sity from the po­si­tion of the board; and even if you did se­lect one move, the op­po­nent’s move would not fol­low with ne­ces­sity, so you couldn’t rep­re­sent it with the de­ter­mined mo­tion of au­to­matic gears. There­fore, Poe said, what­ever was op­er­at­ing the Me­chan­i­cal Turk must have the na­ture of Carte­sian mind, rather than the na­ture of de­ter­minis­tic mat­ter, and this was know­able a pri­ori. And then he started figur­ing out where the re­quired op­er­a­tor was hid­ing.

Ash­ley: That’s some amaz­ingly im­pres­sive rea­son­ing for be­ing com­pletely wrong.

Blaine: I know! Isn’t it great?

Ash­ley: I mean, that sounds like Poe cor­rectly iden­ti­fied the hard part of play­ing com­puter chess, the branch­ing fac­tor of moves and coun­ter­moves, which is the rea­son why no sim­ple ma­chine could do it. And he just didn’t re­al­ize that a de­ter­minis­tic ma­chine could de­ter­minis­ti­cally check many pos­si­ble moves in or­der to figure out the game tree. So close, and yet so far.

Blaine: More than a century later, in 1950, Claude Shannon published the first paper ever written on computer chess. And in passing, Shannon gave the formula for playing perfect chess if you had unlimited computing power, the algorithm you'd use to extrapolate the entire game tree. We could say that Shannon gave a short program that would solve chess if you ran it on a hypercomputer, where a hypercomputer is an ideal computer that can run any finite computation immediately. And then Shannon passed on to talking about the problem of locally guessing how good a board position was, so that you could play chess using only a small search. I say all this to make a point about the value of knowing how to solve problems using hypercomputers, even though hypercomputers don't exist. Yes, there's often a huge gap between the unbounded solution and the practical solution. It wasn't until 1997, forty-seven years after Shannon's paper giving the unbounded solution, that Deep Blue actually defeated the human world chess champion--

Ash­ley: And that wasn’t just a ques­tion of faster com­put­ing hard­ware run­ning Shan­non’s ideal search al­gorithm. There were a lot of new in­sights along the way, most no­tably the alpha-beta prun­ing al­gorithm and a lot of im­prove­ments in po­si­tional eval­u­a­tion.

Blaine: Right! But I think some peo­ple over­re­act to that forty-seven year gap, and act like it’s worth­less to have an un­bounded un­der­stand­ing of a com­puter pro­gram, just be­cause you might still be forty-seven years away from a prac­ti­cal solu­tion. But if you don’t even have a solu­tion that would run on a hy­per­com­puter, you’re Poe in 1833, not Shan­non in 1950. The rea­son I tell the anec­dote about Poe is to illus­trate that Poe was con­fused about com­puter chess in a way that Shan­non was not. When we don’t know how to solve a prob­lem even given in­finite com­put­ing power, the very work we are try­ing to do is in some sense murky to us. When we can state code that would solve the prob­lem given a hy­per­com­puter, we have be­come less con­fused. Once we have the un­bounded solu­tion we un­der­stand, in some ba­sic sense, the kind of work we are try­ing to perform, and then we can try to figure out how to do it effi­ciently.

Ash­ley: Which may well re­quire new in­sights into the struc­ture of the prob­lem, or even a con­cep­tual rev­olu­tion in how we imag­ine the work we’re try­ing to do.

Blaine: Yes, but the point is that you can’t even get started on that if you’re ar­gu­ing about how play­ing chess has the na­ture of Carte­sian mind rather than mat­ter. At that point you’re not 50 years away from win­ning the chess cham­pi­onship, you’re 150 years away, be­cause it took an ex­tra 100 years to move hu­man­ity’s un­der­stand­ing to the point where Claude Shan­non could triv­ially see how to play perfect chess us­ing a large-enough com­puter. I’m not try­ing to ex­alt the un­bounded solu­tion by den­i­grat­ing the work re­quired to get a bounded solu­tion. I’m not say­ing that when we have an un­bounded solu­tion we’re prac­ti­cally there and the rest is a mat­ter of mere lowly effi­ciency. I’m try­ing to com­pare hav­ing the un­bounded solu­tion to the hor­rific con­fu­sion of not un­der­stand­ing what we’re try­ing to do.

Ash­ley: Okay. I think I un­der­stand why, on your view, it’s im­por­tant to know how to solve prob­lems us­ing in­finitely fast com­put­ers, or hy­per­com­put­ers as you call them. When we can say how to an­swer a ques­tion us­ing in­finite com­put­ing power, that means we crisply un­der­stand the ques­tion it­self, in some sense; while if we can’t figure out how to solve a prob­lem us­ing un­bounded com­put­ing power, that means we’re con­fused about the prob­lem, in some sense. I mean, any­one who’s ever tried to teach the more doomed sort of un­der­grad­u­ate to write code knows what it means to be con­fused about what it takes to com­pute some­thing.

Blaine: Right.

Ash­ley: So what does this have to do with “Solomonoff in­duc­tion”?

Blaine: Ah! Well, sup­pose I asked you how to do episte­mol­ogy us­ing in­finite com­put­ing power?

Ash­ley: My good fel­low, I would at once re­ply, “Beep. Whirr. Prob­lem ‘do episte­mol­ogy’ not crisply speci­fied.” At this stage of af­fairs, I do not think this re­ply in­di­cates any fun­da­men­tal con­fu­sion on my part; rather I think it is you who must be clearer.

Blaine: Given un­bounded com­put­ing power, how would you rea­son in or­der to con­struct an ac­cu­rate map of re­al­ity?

Ash­ley: That still strikes me as rather un­der­speci­fied.

Blaine: Per­haps. But even there I would sug­gest that it’s a mark of in­tel­lec­tual progress to be able to take vague and un­der­speci­fied ideas like ‘do good episte­mol­ogy’ and turn them into crisply speci­fied prob­lems. Imag­ine that I went up to my friend Ce­cil, and said, “How would you do good episte­mol­ogy given un­limited com­put­ing power and a short Python pro­gram?” and Ce­cil at once came back with an an­swer—a good and rea­son­able an­swer, once it was ex­plained. Ce­cil would prob­a­bly know some­thing quite in­ter­est­ing that you do not presently know.

Ash­ley: I con­fess to be­ing rather skep­ti­cal of this hy­po­thet­i­cal. But if that ac­tu­ally hap­pened—if I agreed, to my own satis­fac­tion, that some­one had stated a short Python pro­gram that would ‘do good episte­mol­ogy’ if run on an un­bound­edly fast com­puter—then I agree that I’d prob­a­bly have learned some­thing quite in­ter­est­ing about episte­mol­ogy.

Blaine: What Ce­cil knows about, in this hy­po­thet­i­cal, is Solomonoff in­duc­tion. In the same way that Claude Shan­non an­swered “Given in­finite com­put­ing power, how would you play perfect chess?”, Ray Solomonoff an­swered “Given in­finite com­put­ing power, how would you perfectly find the best hy­poth­e­sis that fits the facts?”

Ash­ley: Sud­denly, I find my­self strongly sus­pi­cious of what­ever you are about to say to me.

Blaine: That’s un­der­stand­able.

Ash­ley: In par­tic­u­lar, I’ll ask at once whether “Solomonoff in­duc­tion” as­sumes that our hy­pothe­ses are be­ing given to us on a silver plat­ter along with the ex­act data we’re sup­posed to ex­plain, or whether the al­gorithm is or­ga­niz­ing its own data from a big messy situ­a­tion and in­vent­ing good hy­pothe­ses from scratch.

Blaine: Great ques­tion! It’s the sec­ond one.

Ash­ley: Really? Okay, now I have to ask whether Solomonoff in­duc­tion is a rec­og­nized con­cept in good stand­ing in the field of aca­demic com­puter sci­ence, be­cause that does not sound like some­thing mod­ern-day com­puter sci­ence knows how to do.

Blaine: I wouldn’t say it’s a widely known con­cept, but it’s one that’s in good aca­demic stand­ing. The method isn’t used in mod­ern ma­chine learn­ing be­cause it re­quires an in­finitely fast com­puter and isn’t eas­ily ap­prox­i­mated the way that chess is.

Ash­ley: This re­ally sounds very sus­pi­cious. Last time I checked, we hadn’t be­gun to for­mal­ize the cre­ation of good new hy­pothe­ses from scratch. I’ve heard about claims to have ‘au­to­mated’ the work that, say, New­ton did in in­vent­ing clas­si­cal me­chan­ics, and I’ve found them all to be in­cred­ibly du­bi­ous. Which is to say, they were rigged de­mos and lies.

Blaine: I know, but--

Ash­ley: And then I’m even more sus­pi­cious of a claim that some­one’s al­gorithm would solve this prob­lem if only they had in­finite com­put­ing power. Hav­ing some re­searcher claim that their Good-Old-Fash­ioned-AI se­man­tic net­work would be in­tel­li­gent if run on a com­puter so large that, con­ve­niently, no­body can ever test their the­ory, is not go­ing to per­suade me.

Blaine: Do I re­ally strike you as that much of a char­latan? What have I ever done to you, that you would ex­pect me to try pul­ling a scam like that?

Ash­ley: That’s fair. I shouldn’t ac­cuse you of plan­ning that scam when I haven’t seen you say it. But I’m pretty sure the prob­lem of “com­ing up with good new hy­pothe­ses in a world full of messy data” is AI-com­plete. And even Men­tif-

Blaine: Do not say the name, or he will ap­pear!

Ash­ley: Sorry. Even the leg­endary first and great­est of all AI crack­pots, He-Who-Googles-His-Name, could as­sert that his al­gorithms would be all-pow­er­ful on a com­puter large enough to make his claim un­falsifi­able. So what?

Blaine: That’s a very sen­si­ble re­ply and this, again, is ex­actly the kind of men­tal state that re­flects a prob­lem that is con­fus­ing rather than just hard to im­ple­ment. It’s the sort of con­fu­sion Poe might feel in 1833, or close to it. In other words, it’s just the sort of con­cep­tual is­sue we would have solved at the point where we could state a short pro­gram that could run on a hy­per­com­puter. Which Ray Solomonoff did in 1964.

Ash­ley: Okay, let’s hear about this sup­posed gen­eral solu­tion to episte­mol­ogy.

Blaine: First, try to solve the fol­low­ing puz­zle. 1, 3, 4, 7, 11, 18, 29...?

Ash­ley: Let me look at those for a mo­ment… 47.

Blaine: Con­grat­u­la­tions on en­gag­ing in, as we snooty types would call it, ‘se­quence pre­dic­tion’.

Ash­ley: I’m fol­low­ing you so far.

Blaine: The smarter you are, the more eas­ily you can find the hid­den pat­terns in se­quences and pre­dict them suc­cess­fully. You had to no­tice the re­sem­blance to the Fibonacci rule to guess the next num­ber. Some­one who didn’t already know about Fibonacci, or who was worse at math­e­mat­i­cal think­ing, would have taken longer to un­der­stand the se­quence or maybe never learned to pre­dict it at all.
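
(A minimal Python sketch of the pattern Ashley presumably spotted, namely that each term is the sum of the previous two; this is just the sequence rule itself, not anything Solomonoff-specific, and the function name is made up for illustration.)

```python
def extend(seq, n_more=1):
    """Extend a sequence using the Fibonacci-style rule a[k] = a[k-1] + a[k-2]."""
    seq = list(seq)
    for _ in range(n_more):
        seq.append(seq[-1] + seq[-2])
    return seq

print(extend([1, 3, 4, 7, 11, 18, 29]))  # [1, 3, 4, 7, 11, 18, 29, 47]
```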

Ash­ley: Still with you.

Blaine: It’s not a se­quence of num­bers per se… but can you see how the ques­tion, “The sun has risen on the last mil­lion days. What is the prob­a­bil­ity that it rises to­mor­row?” could be viewed as a kind of se­quence pre­dic­tion prob­lem?

Ash­ley: Only if some pro­gram­mer neatly parses up the world into a se­ries of “Did the Sun rise on day X start­ing in 4.5 billion BCE, 0 means no and 1 means yes? 1, 1, 1, 1, 1...” and so on. Which is ex­actly the sort of shenani­gan that I see as cheat­ing. In the real world, you go out­side and see a brilli­ant ball of gold touch­ing the hori­zon, not a gi­ant “1″.

Blaine: Sup­pose I have a robot run­ning around with a we­b­cam show­ing it a 1920x1080 pixel field that re­freshes 60 times a sec­ond with 32-bit col­ors. I could view that as a gi­ant se­quence and ask the robot to pre­dict what it will see hap­pen when it rolls out to watch a sun­rise the next day.

Ash­ley: I can’t help but no­tice that the ‘se­quence’ of we­b­cam frames is ab­solutely enor­mous, like, the se­quence is made up of 66-megabit ‘num­bers’ ap­pear­ing 3600 times per minute… oh, right, com­put­ers much big­ger than the uni­verse. And now you’re smil­ing evilly, so I guess that’s the point. I also no­tice that the se­quence is no longer de­ter­minis­ti­cally pre­dictable, that it is no longer a purely math­e­mat­i­cal ob­ject, and that the se­quence of we­b­cam frames ob­served will de­pend on the robot’s choices. This makes me feel a bit shaky about the anal­ogy to pre­dict­ing the math­e­mat­i­cal se­quence 1, 1, 2, 3, 5.

Blaine: I’ll try to ad­dress those points in or­der. First, Solomonoff in­duc­tion is about as­sign­ing prob­a­bil­ities to the next item in the se­quence. I mean, if I showed you a box that said 1, 1, 2, 3, 5, 8 you would not be ab­solutely cer­tain that the next item would be 13. There could be some more com­pli­cated rule that just looked Fibonacci-ish but then di­verged. You might guess with 90% prob­a­bil­ity but not 100% prob­a­bil­ity, or some­thing like that.

Ash­ley: This has stopped feel­ing to me like math.

Blaine: There is a large branch of math, to say noth­ing of com­puter sci­ence, that deals in prob­a­bil­ities and statis­ti­cal pre­dic­tion. We are go­ing to be de­scribing ab­solutely lawful and de­ter­minis­tic ways of as­sign­ing prob­a­bil­ities af­ter see­ing 1, 3, 4, 7, 11, 18.

Ash­ley: Okay, but if you’re later go­ing to tell me that this lawful prob­a­bil­is­tic pre­dic­tion rule un­der­lies a gen­er­ally in­tel­li­gent rea­soner, I’m already skep­ti­cal. No mat­ter how large a com­puter it’s run on, I find it hard to imag­ine that some sim­ple set of rules for as­sign­ing prob­a­bil­ities is go­ing to en­com­pass truly and gen­er­ally in­tel­li­gent an­swers about se­quence pre­dic­tion, like Ter­ence Tao would give af­ter look­ing at the se­quence for a while. We just have no idea how Ter­ence Tao works, so we can’t du­pli­cate his abil­ities in a for­mal rule, no mat­ter how much com­put­ing power that rule gets… you’re smil­ing evilly again. I’ll be quite in­ter­ested if that evil smile turns out to be jus­tified.

Blaine: In­deed.

Ash­ley: I also find it hard to imag­ine that this de­ter­minis­tic math­e­mat­i­cal rule for as­sign­ing prob­a­bil­ities would no­tice if a box was out­putting an en­coded ver­sion of “To be or not to be” from Shake­speare by map­ping A to Z onto 1 to 26, which I would no­tice even­tu­ally though not im­me­di­ately upon see­ing 20, 15, 2, 5, 15, 18… And you’re still smil­ing evilly.

Blaine: In­deed. That is ex­actly what Solomonoff in­duc­tion does. Fur­ther­more, we have the­o­rems es­tab­lish­ing that Solomonoff in­duc­tion can do it way bet­ter than you or Ter­ence Tao.

Ash­ley: A the­o­rem proves this. As in a nec­es­sary math­e­mat­i­cal truth. Even though we have no idea how Ter­ence Tao works em­piri­cally… and there’s evil smile num­ber four. Okay. I am very skep­ti­cal, but will­ing to be con­vinced.

Blaine: So if you ac­tu­ally did have a hy­per­com­puter, you could cheat, right? And Solomonoff in­duc­tion is the most ridicu­lously cheat­ing cheat in the his­tory of cheat­ing.

Ash­ley: Go on.

Blaine: We just run all pos­si­ble com­puter pro­grams to see which are the sim­plest com­puter pro­grams that best pre­dict the data seen so far, and use those pro­grams to pre­dict what comes next. This mix­ture con­tains, among other things, an ex­act copy of Ter­ence Tao, thereby al­low­ing us to prove the­o­rems about their rel­a­tive perfor­mance.

Ash­ley: Is this an ac­tual rep­utable math thing? I mean re­ally?

Blaine: I'll deliver the formalization later, but you did ask me to first state the point of it all. The point of Solomonoff induction is that it gives us a gold-standard ideal for sequence prediction, and this gold-standard prediction only errs by a bounded amount, over infinite time, relative to any computable sequence predictor (with the bound depending on that predictor's complexity). We can also see it as formalizing the intuitive idea that was expressed by William Ockham a few centuries earlier that simpler theories are more likely to be correct, and as telling us that 'simplicity' should be measured in algorithmic complexity, which is the size of a computer program required to output a hypothesis's predictions.

Ash­ley: I think I would have to read more on this sub­ject to ac­tu­ally fol­low that. What I’m hear­ing is that Solomonoff in­duc­tion is a rep­utable idea that is im­por­tant be­cause it gives us a kind of ideal for se­quence pre­dic­tion. This ideal also has some­thing to do with Oc­cam’s Ra­zor, and stakes a claim that the sim­plest the­ory is the one that can be rep­re­sented by the short­est com­puter pro­gram. You iden­tify this with “do­ing good episte­mol­ogy”.

Blaine: Yes, those are le­gi­t­i­mate take­aways. Another way of look­ing at it is that Solomonoff in­duc­tion is an ideal but un­com­putable an­swer to the ques­tion “What should our pri­ors be?”, which is left open by un­der­stand­ing Bayesian up­dat­ing.

Ash­ley: Can you say how Solomonoff in­duc­tion an­swers the ques­tion of, say, the prior prob­a­bil­ity that Canada is plan­ning to in­vade the United States? I once saw a crack­pot web­site that tried to in­voke Bayesian prob­a­bil­ity about it, but only af­ter set­ting the prior at 10% or some­thing like that, I don’t re­call ex­actly. Does Solomonoff in­duc­tion let me tell him that he’s mak­ing a math er­ror, in­stead of just call­ing him silly in an in­for­mal fash­ion?

Blaine: If you’re ex­pect­ing to sit down with Leib­niz and say, “Gentle­men, let us calcu­late” then you’re set­ting your ex­pec­ta­tions too high. Solomonoff gives us an idea of how we should com­pute that quan­tity given un­limited com­put­ing power. It doesn’t give us a firm recipe for how we can best ap­prox­i­mate that ideal in real life us­ing bounded com­put­ing power, or hu­man brains. That’s like ex­pect­ing to play perfect chess af­ter you read Shan­non’s 1950 pa­per. But know­ing the ideal, we can ex­tract some in­tu­itive ad­vice that might help our on­line crack­pot if only he’d listen.

Ash­ley: But ac­cord­ing to you, Solomonoff in­duc­tion does say in prin­ci­ple what is the prior prob­a­bil­ity that Canada will in­vade the United States.

Blaine: Yes, up to a choice of uni­ver­sal Tur­ing ma­chine.

Ash­ley (look­ing highly skep­ti­cal): So I plug a uni­ver­sal Tur­ing ma­chine into the for­mal­ism, and in prin­ci­ple, I get out a uniquely de­ter­mined prob­a­bil­ity that Canada in­vades the USA.

Blaine: Ex­actly!

Ash­ley: Uh huh. Well, go on.

Blaine: So, first, we have to trans­form this into a se­quence pre­dic­tion prob­lem.

Ash­ley: Like a se­quence of years in which Canada has and hasn’t in­vaded the US, mostly zero ex­cept around 1812--

Blaine: No! To get a good pre­dic­tion about Canada we need much more data than that, and I don’t mean a graph of Cana­dian GDP ei­ther. Imag­ine a se­quence that con­tains all the sen­sory data you have ever re­ceived over your life­time. Not just the hos­pi­tal room that you saw when you opened your eyes right af­ter your birth, but the dark­ness your brain re­ceived as in­put while you were still in your mother’s womb. Every word you’ve ever heard. Every let­ter you’ve ever seen on a com­puter screen, not as ASCII let­ters but as the raw pat­tern of neu­ral im­pulses that gets sent down from your retina.

Ash­ley: That seems like a lot of data and some of it is re­dun­dant, like there’ll be lots of similar pix­els for blue sky--

Blaine: That data is what you got as an agent. If we want to translate the prediction problem Ashley faces into theoretical terms, we should give the sequence predictor all the data that you had available, including all those repeating blue pixels of the sky. Who knows? Maybe there was a Canadian warplane somewhere in there, and you didn't notice.

Ash­ley: But it’s im­pos­si­ble for my brain to re­mem­ber all that data. If we ne­glect for the mo­ment how the retina ac­tu­ally works and sup­pose that I’m see­ing the same 1920x1080 @60Hz feed the robot would, that’s far more data than my brain can re­al­is­ti­cally learn per sec­ond.

Blaine: So then Solomonoff in­duc­tion can do bet­ter than you can, us­ing its un­limited com­put­ing power and mem­ory. That’s fine.

Ash­ley: But what if you can do bet­ter by for­get­ting more?

Blaine: If you have limited com­put­ing power, that makes sense. With un­limited com­put­ing power, that re­ally shouldn’t hap­pen and that in­deed is one of the les­sons of Solomonoff in­duc­tion. An un­bounded Bayesian never ex­pects to do worse by up­dat­ing on an­other item of ev­i­dence—for one thing, you can always just do the same policy you would have used if you hadn’t seen that ev­i­dence. That kind of les­son is one of the les­sons that might not be in­tu­itively ob­vi­ous, but which you can feel more deeply by walk­ing through the math of prob­a­bil­ity the­ory. With un­limited com­put­ing power, noth­ing goes wrong as a re­sult of try­ing to pro­cess 4 gi­gabits per sec­ond; ev­ery ex­tra bit just pro­duces a bet­ter ex­pected fu­ture pre­dic­tion.

Ash­ley: Okay, so we start with liter­ally all the data I have available. That’s 4 gi­gabits per sec­ond if we imag­ine 1920 by 1080 frames of 32-bit pix­els re­peat­ing 60 times per sec­ond. Though I re­mem­ber hear­ing 100 megabits per sec­ond would be a bet­ter es­ti­mate of what the retina sends out, and that it’s pared down to 1 megabit per sec­ond very quickly by fur­ther pro­cess­ing.

Blaine: Right. We start with all of that data, go­ing back to when you were born. Or maybe when your brain formed in the womb, though it shouldn’t make much differ­ence.

Ash­ley: I note that there are some things I know that don’t come from my sen­sory in­puts at all. Chim­panzees learn to be afraid of skulls and snakes much faster than they learn to be afraid of other ar­bi­trary shapes. I was prob­a­bly bet­ter at learn­ing to walk in Earth grav­ity than I would have been at nav­i­gat­ing in zero G. Those are heuris­tics I’m born with, based on how my brain was wired, which ul­ti­mately stems from my DNA spec­i­fy­ing the way that pro­teins should fold to form neu­rons—not from any pho­tons that en­tered my eyes later.

Blaine: So, for pur­poses of fol­low­ing along with the ar­gu­ment, let’s say that your DNA is analo­gous to the code of a com­puter pro­gram that makes pre­dic­tions. What you’re ob­serv­ing here is that hu­mans have 750 megabytes of DNA, and even if most of that is junk and not all of what’s left is spec­i­fy­ing brain be­hav­ior, it still leaves a pretty large com­puter pro­gram that could have a lot of prior in­for­ma­tion pro­grammed into it. Let’s say that your brain, or rather, your in­fant pre-brain wiring al­gorithm, was effec­tively a 7.5 megabyte pro­gram—if it’s ac­tu­ally 75 megabytes, that makes lit­tle differ­ence to the ar­gu­ment. By ex­pos­ing that 7.5 megabyte pro­gram to all the in­for­ma­tion com­ing in from your eyes, ears, nose, pro­pri­o­cep­tive sen­sors tel­ling you where your limbs were, and so on, your brain up­dated it­self into form­ing the mod­ern Ash­ley, whose hun­dred trillion synapses might be en­coded by, say, one petabyte of in­for­ma­tion.

Ash­ley: The thought does oc­cur to me that some en­vi­ron­men­tal phe­nom­ena have effects on me that can’t be in­ter­preted as “sen­sory in­for­ma­tion” in any sim­ple way, like the di­rect effect that al­co­hol has on my neu­rons, and how that feels to me from the in­side. But it would be per­verse to claim that this pre­vents you from try­ing to sum­ma­rize all the in­for­ma­tion that the Ash­ley-agent re­ceives into a sin­gle se­quence, so I won’t press the point.

(Eliezer, whisper­ing: More on this topic later.)

Ash­ley: Oh, and for com­plete­ness’s sake, wouldn’t there also be fur­ther in­for­ma­tion em­bed­ded in the laws of physics them­selves? Like, the way my brain ex­e­cutes im­plic­itly says some­thing about the laws of physics in the uni­verse I’m in.

Blaine: Me­taphor­i­cally speak­ing, our laws of physics would play the role of a par­tic­u­lar choice of Univer­sal Tur­ing Ma­chine, which has some effect on which com­pu­ta­tions count as “sim­ple” in­side the Solomonoff for­mula. But nor­mally, the UTM should be very sim­ple com­pared to the amount of data in the se­quence we’re try­ing to pre­dict, just like the laws of physics are very sim­ple com­pared to a hu­man brain. In terms of al­gorith­mic com­plex­ity, the laws of physics are very sim­ple com­pared to watch­ing a 1920x1080 @60Hz vi­sual field for a day.

Ash­ley: Part of my mind feels like the laws of physics are quite com­pli­cated com­pared to go­ing out­side and watch­ing a sun­set. Like, I re­al­ize that’s false, but I’m not sure how to say out loud ex­actly why it’s false…

Blaine: Be­cause the al­gorith­mic com­plex­ity of a sys­tem isn’t mea­sured by how long a hu­man has to go to col­lege to un­der­stand it, it’s mea­sured by the size of the com­puter pro­gram re­quired to gen­er­ate it. The lan­guage of physics is differ­en­tial equa­tions, and it turns out that this is some­thing difficult to beat into some hu­man brains, but differ­en­tial equa­tions are sim­ple to pro­gram into a sim­ple Tur­ing Ma­chine.

Ash­ley: Right, like, the laws of physics ac­tu­ally have much fewer de­tails to them than, say, hu­man na­ture. At least on the Stan­dard Model of Physics. I mean, in prin­ci­ple there could be an­other decillion undis­cov­ered par­ti­cle fam­i­lies out there.

Blaine: The con­cept of “al­gorith­mic com­plex­ity” isn’t about see­ing some­thing with lots of gears and de­tails, it’s about the size of com­puter pro­gram re­quired to com­press all those de­tails. The Man­delbrot set looks very com­pli­cated vi­su­ally, you can keep zoom­ing in us­ing more and more de­tail, but there’s a very sim­ple rule that gen­er­ates it, so we say the al­gorith­mic com­plex­ity is very low.

Ash­ley: All the vi­sual in­for­ma­tion I’ve seen is some­thing that hap­pens within the phys­i­cal uni­verse, so how can it be more com­pli­cated than the uni­verse? I mean, I have a sense on some level that this shouldn’t be a prob­lem, but I don’t know why it’s not a prob­lem.

Blaine: That’s be­cause par­tic­u­lar parts of the uni­verse can have much higher al­gorith­mic com­plex­ity than the en­tire uni­verse! Con­sider a library that con­tains all pos­si­ble books. It’s very easy to write a com­puter pro­gram that gen­er­ates all pos­si­ble books. So any par­tic­u­lar book in the library con­tains much more al­gorith­mic in­for­ma­tion than the en­tire library; it con­tains the in­for­ma­tion re­quired to say ‘look at this par­tic­u­lar book here’. If pi is nor­mal, then some­where in its digits is a copy of Shake­speare’s Ham­let—but the num­ber say­ing which par­tic­u­lar digit of pi to start look­ing at, will be just about ex­actly as large as Ham­let it­self. The copy of Shake­speare’s Ham­let that ex­ists in the dec­i­mal ex­pan­sion of pi, is more com­plex than pi it­self. If you zoomed way in and re­stricted your vi­sion to a par­tic­u­lar part of the Man­delbrot set, what you saw might be much more al­gorith­mi­cally com­plex than the en­tire Man­delbrot set, be­cause the speci­fi­ca­tion has to say where in the Man­delbrot set you are. Similarly, the world Earth is much more al­gorith­mi­cally com­plex than the laws of physics. Like­wise, the vi­sual field you see over the course of a sec­ond can eas­ily be far more al­gorith­mi­cally com­plex than the laws of physics.

Ash­ley: Okay, I think I get that. And similarly, even though the ways that pro­teins fold up are very com­pli­cated, in prin­ci­ple we could get all that info us­ing just the sim­ple fun­da­men­tal laws of physics plus the rel­a­tively sim­ple DNA code for the pro­tein. There’s all sorts of ob­vi­ous caveats about epi­ge­net­ics and so on, but those caveats aren’t likely to change the num­bers by a whole or­der of mag­ni­tude.

Blaine: Right!

Ash­ley: So the laws of physics are, like, a few kilo­bytes, and my brain has say 75 megabytes of in­nate wiring in­struc­tions. And then I get to see a lot more in­for­ma­tion than that over my life­time, like a megabit per sec­ond af­ter my ini­tial vi­sual sys­tem finishes pre­pro­cess­ing it, and then most of that is for­got­ten. Uh… what does that have to do with Solomonoff in­duc­tion again?

Blaine: Solomonoff in­duc­tion quickly catches up to any sin­gle com­puter pro­gram at se­quence pre­dic­tion, even if the origi­nal pro­gram is very large and con­tains a lot of prior in­for­ma­tion about the en­vi­ron­ment. If a pro­gram is 75 megabytes long, it can only pre­dict 75 megabytes worth of data bet­ter than the Solomonoff in­duc­tor be­fore the Solomonoff in­duc­tor catches up to it. That doesn’t mean that a Solomonoff in­duc­tor knows ev­ery­thing a baby does af­ter the first sec­ond of ex­po­sure to a we­b­cam feed, but it does mean that af­ter the first sec­ond the Solomonoff in­duc­tor is already no more sur­prised than a baby by the vast ma­jor­ity of pix­els in the next frame. Every time the Solomonoff in­duc­tor as­signs half as much prob­a­bil­ity as the baby to the next pixel it sees, that’s one bit spent per­ma­nently out of the 75 megabytes of er­ror that can hap­pen be­fore the Solomonoff in­duc­tor catches up to the baby. That your brain is writ­ten in the laws of physics also has some im­plicit cor­re­la­tion with the en­vi­ron­ment, but that’s like say­ing that a pro­gram is writ­ten in the same pro­gram­ming lan­guage as the en­vi­ron­ment. The lan­guage can con­tribute some­thing to the power of the pro­gram, and the en­vi­ron­ment be­ing writ­ten in the same pro­gram­ming lan­guage can be a kind of prior knowl­edge. But if Solomonoff in­duc­tion starts from a stan­dard Univer­sal Tur­ing Ma­chine as its lan­guage, that doesn’t con­tribute any more bits of life­time er­ror than the com­plex­ity of that pro­gram­ming lan­guage in the UTM.
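
(In symbols, a hedged sketch of the standard dominance argument behind that claim, ignoring normalization of the prior; the notation anticipates the formal definitions given later in the dialogue. If the mixture gives a program \(prog\) prior weight \(2^{-\mathrm{length}(prog)}\), then for any data sequence:)

$$\displaystyle \mathbb{P}_{mix}(s_{1 \dots n}) \;\geq\; 2^{-\mathrm{length}(prog)} \cdot \mathbb{P}_{prog}(s_{1 \dots n}) \quad \Longrightarrow \quad -\log_2 \mathbb{P}_{mix}(s_{1 \dots n}) \;\leq\; \mathrm{length}(prog) - \log_2 \mathbb{P}_{prog}(s_{1 \dots n})$$

(So the mixture's total log-loss can exceed that of \(prog\) by at most \(\mathrm{length}(prog)\) bits over the whole infinite sequence, which is the "75 megabytes of error" bound in the example above.)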

Ash­ley: Let me jump back a cou­ple of steps and re­turn to the no­tion of my brain wiring it­self up in re­sponse to en­vi­ron­men­tal in­for­ma­tion. I’d ex­pect an im­por­tant part of that pro­cess was my brain learn­ing to con­trol the en­vi­ron­ment, not just pas­sively ob­serv­ing it. Like, it mat­tered to my brain’s wiring al­gorithm that my brain saw the room shift in a cer­tain way when it sent out sig­nals tel­ling my eyes to move.

Blaine: In­deed. But talk­ing about the se­quen­tial con­trol prob­lem is more com­pli­cated math. AIXI is the ideal agent that uses Solomonoff in­duc­tion as its episte­mol­ogy and ex­pected re­ward as its de­ci­sion the­ory. That in­tro­duces ex­tra com­plex­ity, so it makes sense to talk about just Solomonoff in­duc­tion first. We can talk about AIXI later. So imag­ine for the mo­ment that we were just look­ing at your sen­sory data, and try­ing to pre­dict what would come next in that.

Ash­ley: Wouldn’t it make more sense to look at the brain’s in­puts and out­puts, if we wanted to pre­dict the next in­put? Not just look at the se­ries of pre­vi­ous in­puts?

Blaine: It’d make the prob­lem eas­ier for a Solomonoff in­duc­tor to solve, sure; but it also makes the prob­lem more com­pli­cated. Let’s talk in­stead about what would hap­pen if you took the com­plete sen­sory record of your life, gave it to an ideally smart agent, and asked the agent to pre­dict what you would see next. Maybe the agent could do an even bet­ter job of pre­dic­tion if we also told it about your brain’s out­puts, but I don’t think that sub­tract­ing the out­puts would leave it hel­pless to see pat­terns in the in­puts.

Ash­ley: It sounds like a pretty hard prob­lem to me, maybe even an un­solv­able one. I’m think­ing of the dis­tinc­tion in com­puter sci­ence be­tween need­ing to learn from non-cho­sen data, ver­sus learn­ing when you can choose par­tic­u­lar queries. Learn­ing can be much faster in the sec­ond case.

Blaine: In terms of what can be pre­dicted in prin­ci­ple given the data, what facts are ac­tu­ally re­flected in it that Solomonoff in­duc­tion might un­cover, we shouldn’t imag­ine a hu­man try­ing to an­a­lyze the data, we should imag­ine an en­tire ad­vanced civ­i­liza­tion pon­der­ing it for years. If you look at it from that an­gle, then the alien civ­i­liza­tion isn’t go­ing to be balked by the fact that it’s look­ing at the an­swers to the queries that Ash­ley’s brain chose, in­stead of the an­swers to the queries it chose it­self. Like, if the Ash­ley had already read Shake­speare’s Ham­let—if the image of those pages had already crossed the sen­sory stream—and then the Ash­ley saw a mys­te­ri­ous box out­putting 20, 15, 2, 5, 15, 18, I think some­body eaves­drop­ping on that sen­sory data would be equally able to guess that this was en­cod­ing ‘to­beor’ and guess that the next thing the Ash­ley saw might be the box out­putting 14. You wouldn’t even need an en­tire alien civ­i­liza­tion of su­per­in­tel­li­gent cryp­tog­ra­phers to guess that. And it definitely wouldn’t be a kil­ler prob­lem that Ash­ley was con­trol­ling the eye­ball’s sac­cades, even if you could learn even faster by con­trol­ling the eye­ball your­self. So far as the com­puter-sci­ence dis­tinc­tion goes, Ash­ley’s eye­ball is be­ing con­trol­led to make in­tel­li­gent queries and seek out use­ful in­for­ma­tion, it’s just Ash­ley con­trol­ling the eye­ball in­stead of you—that eye­ball is not a query-or­a­cle an­swer­ing ran­dom ques­tions.

Ash­ley: Okay, I think this ex­am­ple here is helping my un­der­stand­ing of what we’re do­ing here. In the case above, the next item in the Ash­ley-se­quence wouldn’t ac­tu­ally be 14. It would be this huge 1920 x 1080 vi­sual field that showed the box flash­ing a lit­tle pic­ture of ’14′.

Blaine: Sure. Other­wise it would be a rigged demo, as you say.

Ash­ley: I think I’m con­fused about the idea of pre­dict­ing the vi­sual field. It seems to me that what with all the dust specks in my vi­sual field, and maybe my de­cid­ing to tilt my head us­ing mo­tor in­struc­tions that won’t ap­pear in the se­quence, there’s no way to ex­actly pre­dict the 66-megabit in­te­ger rep­re­sent­ing the next vi­sual frame. So it must be do­ing some­thing other than the equiv­a­lent of guess­ing “14” in a sim­pler se­quence, but I’m not sure what.

Blaine: In­deed, there’d be some el­e­ment of ther­mo­dy­namic and quan­tum ran­dom­ness pre­vent­ing that ex­act pre­dic­tion even in prin­ci­ple. So in­stead of pre­dict­ing one par­tic­u­lar next frame, we put a prob­a­bil­ity dis­tri­bu­tion on it.

Ash­ley: A prob­a­bil­ity dis­tri­bu­tion over pos­si­ble 66-megabit frames? Like, a table with 2 to the 66 mil­lion en­tries, sum­ming to 1?

Blaine: Sure. 2^(32 x 1920 x 1080) isn’t a large num­ber when you have un­limited com­put­ing power. As Martin Gard­ner once ob­served, “Most finite num­bers are very much larger.” Like I said, Solomonoff in­duc­tion is an epistemic ideal that re­quires an un­rea­son­ably large amount of com­put­ing power.

Ash­ley: I don’t deny that big com­pu­ta­tions can some­times help us un­der­stand lit­tle ones. But at the point when we’re talk­ing about prob­a­bil­ity dis­tri­bu­tions that large, I have some trou­ble hold­ing onto what the prob­a­bil­ity dis­tri­bu­tion is sup­posed to mean.

Blaine: Really? Just imag­ine a prob­a­bil­ity dis­tri­bu­tion over N pos­si­bil­ities, then let N go to 2 to the 66 mil­lion. If we were talk­ing about a let­ter rang­ing from A to Z, then putting 100 times as much prob­a­bil­ity mass on (X, Y, Z) as on the rest of the alpha­bet, would say that, al­though you didn’t know ex­actly what let­ter would hap­pen, you ex­pected it would be to­ward the end of the alpha­bet. You would have used 26 prob­a­bil­ities, sum­ming to 1, to pre­cisely state that pre­dic­tion. In Solomonoff in­duc­tion, since we have un­limited com­put­ing power, we ex­press our un­cer­tainty about a 1920 x 1080 video frame the same way. All the var­i­ous pixel fields you could see if your eye jumped to a plau­si­ble place, saw a plau­si­ble num­ber of dust specks, and saw the box flash some­thing that vi­su­ally en­coded ’14′, would have high prob­a­bil­ity. Pixel fields where the box van­ished and was re­placed with a glow-in-the-dark uni­corn would have very low, though not zero, prob­a­bil­ity.
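
(The A-to-Z example as a few lines of Python, just to make the "table of probabilities summing to 1" concrete; this reads the 100:1 weighting as per-letter, which is one plausible reading of the dialogue, and the variable names are made up.)

```python
from fractions import Fraction

letters = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
weights = {c: (100 if c in "XYZ" else 1) for c in letters}
total = sum(weights.values())
dist = {c: Fraction(w, total) for c, w in weights.items()}   # 26 probabilities summing to 1

print(sum(dist.values()))      # 1
print(dist["X"] / dist["A"])   # 100: X is a hundred times as probable as A
```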

Ash­ley: Can we re­ally get away with view­ing things that way?

Blaine: If we could not make identifications like these in principle, there would be no principled way in which we could say that you had ever expected to see something happen—no way to say that one visual field your eyes saw, had higher probability than any other sensory experience. We couldn't justify science; we couldn't say that, having performed Galileo's experiment by rolling a cylinder down an inclined plane, Galileo's theory was thereby to some degree supported by having assigned a high relative probability to the only actual observations our eyes ever report.

Ash­ley: I feel a lit­tle un­sure of that jump, but I sup­pose I can go along with that for now. Then the ques­tion of “What prob­a­bil­ity does Solomonoff in­duc­tion as­sign to Canada in­vad­ing?” is to be iden­ti­fied, in prin­ci­ple, with the ques­tion “Given my past life ex­pe­riences and all the vi­sual in­for­ma­tion that’s en­tered my eyes, what is the rel­a­tive prob­a­bil­ity of see­ing vi­sual in­for­ma­tion that en­codes Google News with the head­line ‘CANADA INVADES USA’ at some point dur­ing the next 300 mil­lion sec­onds?”

Blaine: Right!

Ash­ley: And Solomonoff in­duc­tion has an in-prin­ci­ple way of as­sign­ing this a rel­a­tively low prob­a­bil­ity, which that on­line crack­pot could do well to learn from as a mat­ter of prin­ci­ple, even if he couldn’t be­gin to carry out the ex­act calcu­la­tions that in­volve as­sign­ing prob­a­bil­ities to ex­po­nen­tially vast ta­bles.

Blaine: Pre­cisely!

Ash­ley: Fair­ness re­quires that I con­grat­u­late you on hav­ing come fur­ther in for­mal­iz­ing ‘do good episte­mol­ogy’ as a se­quence pre­dic­tion prob­lem than I pre­vi­ously thought you might. I mean, you haven’t satis­fied me yet, but I wasn’t ex­pect­ing you to get even this far.

Blaine: Next, we con­sider how to rep­re­sent a hy­poth­e­sis in­side this for­mal­ism.

Ash­ley: Hmm. You said some­thing ear­lier about up­dat­ing on a prob­a­bil­is­tic mix­ture of com­puter pro­grams, which leads me to sus­pect that in this for­mal­ism, a hy­poth­e­sis or way the world can be is a com­puter pro­gram that out­puts a se­quence of in­te­gers.

Blaine: There's indeed a version of Solomonoff induction that works like that. But I prefer the version where a hypothesis assigns probabilities to sequences. Like, if the hypothesis is that the world is a fair coin, then we shouldn't try to make that hypothesis predict "heads—tails—tails—tails—heads" but should let it just assign a 1/32 prior probability to the sequence HTTTH.

Ash­ley: I can see that for coins, but I feel a bit iffier on what this means as a state­ment about the real world.

Blaine: A sin­gle hy­poth­e­sis in­side the Solomonoff mix­ture would be a com­puter pro­gram that took in a se­ries of video frames, and as­signed a prob­a­bil­ity to each pos­si­ble next video frame. Or for greater sim­plic­ity and el­e­gance, imag­ine a pro­gram that took in a se­quence of bits, ones and ze­roes, and out­put a ra­tio­nal num­ber for the prob­a­bil­ity of the next bit be­ing ‘1’. We can read­ily go back and forth be­tween a pro­gram like that, and a prob­a­bil­ity dis­tri­bu­tion over se­quences. Like, if you can an­swer all of the ques­tions, “What’s the prob­a­bil­ity that the coin comes up heads on the first flip?”, “What’s the prob­a­bil­ity of the coin com­ing up heads on the sec­ond flip, if it came up heads on the first flip?”, and “What’s the prob­a­bil­ity that the coin comes up heads on the sec­ond flip, if it came up tails on the first flip?” then we can turn that into a prob­a­bil­ity dis­tri­bu­tion over se­quences of two coin­flips. Analo­gously, if we have a pro­gram that out­puts the prob­a­bil­ity of the next bit, con­di­tioned on a finite num­ber of pre­vi­ous bits taken as in­put, that pro­gram cor­re­sponds to a prob­a­bil­ity dis­tri­bu­tion over in­finite se­quences of bits.

$$\displaystyle \mathbb{P}_{prog}(bits_{1 \dots N}) = \prod_{i=1}^{N} InterpretProb(prog(bits_{1 \dots i-1}), bits_i)$$

$$\displaystyle InterpretProb(prog(x), y) = \left\{ \begin{array}{ll} InterpretFrac(prog(x)) & \text{if } y = 1 \\ 1-InterpretFrac(prog(x)) & \text{if } y = 0 \\ 0 & \text{if $prog(x)$ does not halt} \end{array} \right\} $$
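
(A minimal Python rendering of that product formula, using the fair-coin hypothesis from earlier as the example program; the function names are made up for illustration, and we simply assume the program always halts.)

```python
from fractions import Fraction

def prob_of_sequence(next_prob_of_one, bits):
    """P_prog(bits): multiply the program's next-bit probabilities together,
    as in the product formula above (assuming the program always halts)."""
    p = Fraction(1)
    for i, b in enumerate(bits):
        q = next_prob_of_one(bits[:i])        # program's P(next bit = 1 | bits seen so far)
        p *= q if b == 1 else 1 - q
    return p

fair_coin = lambda prefix: Fraction(1, 2)     # "the world is a fair coin" hypothesis
print(prob_of_sequence(fair_coin, [1, 0, 0, 0, 1]))   # 1/32, like HTTTH earlier, with heads = 1
```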

Ash­ley: I think I fol­lowed along with that in the­ory, though it’s not a type of math I’m used to (yet). So then in what sense is a pro­gram that as­signs prob­a­bil­ities to se­quences, a way the world could be—a hy­poth­e­sis about the world?

Blaine: Well, I mean, for one thing, we can see the in­fant Ash­ley as a pro­gram with 75 megabytes of in­for­ma­tion about how to wire up its brain in re­sponse to sense data, that sees a bunch of sense data, and then ex­pe­riences some de­gree of rel­a­tive sur­prise. Like in the baby-look­ing-paradigm ex­per­i­ments where you show a baby an ob­ject dis­ap­pear­ing be­hind a screen, and the baby looks longer at those cases, and so we sus­pect that ba­bies have a con­cept of ob­ject per­ma­nence.

Ash­ley: That sounds like a pro­gram that’s a way Ash­ley could be, not a pro­gram that’s a way the world could be.

Blaine: Those indeed are dual perspectives on the meaning of Solomonoff induction. Maybe we can shed some light on this by considering a simpler induction rule, Laplace's Rule of Succession, invented by the Reverend Thomas Bayes in the 1750s, and named after Pierre-Simon Laplace, the inventor of Bayesian reasoning.

Ash­ley: Par­don me?

Blaine: Sup­pose you have a bi­ased coin with an un­known bias, and ev­ery pos­si­ble bias be­tween 0 and 1 is equally prob­a­ble.

Ashley: Okay. Though in the real world, it's quite likely that an unknown frequency is exactly 0, 1, or 1/2. If you assign equal probability density to every part of the real interval between 0 and 1, the probability of exactly 1 is 0. Indeed, the probability of all rational numbers put together is zero.

Blaine: The origi­nal prob­lem con­sid­ered by Thomas Bayes was about an ideal billiard ball bounc­ing back and forth on an ideal billiard table many times and even­tu­ally slow­ing to a halt; and then bounc­ing other billiards to see if they halted to the left or the right of the first billiard. You can see why, in first con­sid­er­ing the sim­plest form of this prob­lem with­out any com­pli­ca­tions, we might con­sider ev­ery po­si­tion of the first billiard to be equally prob­a­ble.

Ash­ley: Sure. Though I note with pointless pedantry that if the billiard was re­ally an ideal rol­ling sphere and the walls were perfectly re­flec­tive, it’d never halt in the first place.

Blaine: Sup­pose we’re told that, af­ter rol­ling the origi­nal billiard ball and then 5 more billiard balls, one billiard ball was to the right of the origi­nal, an R. The other four were to the left of the origi­nal, or Ls. Again, that’s 1 R and 4 Ls. Given only this data, what is the prob­a­bil­ity that the next billiard ball rol­led will be on the left of the origi­nal, an­other L?

Ash­ley: Five sev­enths.

Blaine: Ah, you’ve heard this prob­lem be­fore?

Ash­ley: No, but it’s ob­vi­ous.

Blaine: Uh… re­ally?

Ashley: Combinatorics. Consider just the orderings of the balls, instead of their exact positions. Designate the original ball with the symbol |, the next five balls as LLLLR, and the next ball to be rolled as +. Given that the current ordering of these six balls is LLLL|R and that all positions and spacings of the underlying balls are equally likely, after rolling the +, there will be seven equally likely orderings +LLLL|R, L+LLL|R, LL+LL|R, and so on up to LLLL|+R and LLLL|R+. In five of those seven orderings, the + is on the left of the |. In general, if we see M of L and N of R, the probability of the next item being an L is (M + 1) / (M + N + 2).
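
(A quick Monte Carlo check of that answer under the billiard-table model as described: the original ball's resting position is uniform, and we keep only the trials that reproduce the observed 4 Ls and 1 R. The function names and trial count are arbitrary.)

```python
import random

def rule_of_succession(M, N):
    """Laplace's Rule: P(next is L) after observing M Ls and N Rs."""
    return (M + 1) / (M + N + 2)

def simulate(trials=1_000_000):
    matches = next_left = 0
    for _ in range(trials):
        p = random.random()                         # original ball's resting position
        lefts = sum(random.random() < p for _ in range(5))
        if lefts != 4:                              # keep only trials matching 4 L, 1 R
            continue
        matches += 1
        next_left += random.random() < p            # does the sixth ball land on the left?
    return next_left / matches

print(rule_of_succession(4, 1))   # 0.714... = 5/7
print(simulate())                 # approximately 0.714
```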

Blaine: Gosh… Well, the much more complicated proof originally devised by Thomas Bayes starts by considering every position of the original ball to be equally likely a priori, the additional balls as providing evidence about that position, and then integrating over the posterior probabilities of the original ball's possible positions to arrive at the probability that the next ball lands on the left or right.

Ash­ley: Heh. And is all that ex­tra work use­ful if you also hap­pen to know a lit­tle com­bi­na­torics?

Blaine: Well, it tells me exactly how my beliefs about the original ball change with each new piece of evidence—the new posterior probability function on the ball's position. Suppose I instead asked you something along the lines of, "Given 4 L and 1 R, where do you think the original ball | is most likely to be on the number line? How likely is it to be within 0.1 distance of there?"

Ashley: That's fair; I don't see a combinatoric answer for the latter part. You'd have to actually integrate over the density function \(f^M(1-f)^N \ \mathrm{d}f\).

Blaine: Anyway, let's just take at face value that Laplace's Rule of Succession says that, after observing M 1s and N 0s, the probability of getting a 1 next is (M + 1) / (M + N + 2).

Ash­ley: But of course.

Blaine: We can consider Laplace's Rule as a short Python program (a sketch of such a program appears below) that takes in a sequence of 1s and 0s, and spits out the probability that the next bit in the sequence will be 1. We can also consider it as a probability distribution over infinite sequences, like this:

  • 0 : 1/2

  • 1 : 1/2

  • 00 : 1/2 * 2/3 = 1/3

  • 01 : 1/2 * 1/3 = 1/6

  • 000 : 1/2 * 2/3 * 3/4 = 1/4

  • 001 : 1/2 * 2/3 * 1/4 = 1/12

  • 010 : 1/2 * 1/3 * 1/2 = 1/12

Blaine: …and so on. Now, we can view this as a rule someone might espouse for predicting coinflips, but also view it as corresponding to a particular class of possible worlds containing randomness. I mean, Laplace's Rule isn't the only rule you could use. Suppose I had a barrel containing ten white balls and ten green balls. If you already knew this about the barrel, then after seeing M white balls and N green balls, you'd predict the next ball being white with probability (10 - M) / (20 - M - N). If you use Laplace's Rule, that's like believing the world was like a billiards table with an original ball rolling to a stop at a random point and new balls ending up on the left or right. If you use (10 - M) / (20 - M - N), that's like the hypothesis that there's ten green balls and ten white balls in a barrel. There isn't really a sharp border between rules we can use to predict the world, and rules for how the world behaves -
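
(The promised sketch: both rules written as next-bit predictors, reproducing the sequence probabilities listed above. The encoding of 1 as a left/white ball and 0 as a right/green ball is purely an illustrative assumption, as are the function names.)

```python
from fractions import Fraction

def laplace_next(prefix):
    """Laplace's Rule: P(next = 1) = (M + 1) / (M + N + 2) after M ones and N zeros."""
    M = sum(prefix)
    N = len(prefix) - M
    return Fraction(M + 1, M + N + 2)

def barrel_next(prefix):
    """Known barrel: 10 ones and 10 zeros, drawn without replacement."""
    M = sum(prefix)
    N = len(prefix) - M
    return Fraction(10 - M, 20 - M - N)

def prob_of_sequence(next_prob_of_one, bits):
    p = Fraction(1)
    for i, b in enumerate(bits):
        q = next_prob_of_one(bits[:i])
        p *= q if b == 1 else 1 - q
    return p

for s in ["0", "1", "00", "01", "000", "001", "010"]:
    bits = [int(c) for c in s]
    print(s, prob_of_sequence(laplace_next, bits))
# prints 1/2, 1/2, 1/3, 1/6, 1/4, 1/12, 1/12, matching the list above

print(prob_of_sequence(barrel_next, [0, 1]))   # 10/20 * 10/19 = 5/19 for green-then-white
```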

Ash­ley: Well, that sounds just plain wrong. The map is not the ter­ri­tory, don’cha know? If Solomonoff in­duc­tion can’t tell the differ­ence be­tween maps and ter­ri­to­ries, maybe it doesn’t con­tain all episte­molog­i­cal good­ness af­ter all.

Blaine: Maybe it'd be better to say that there's a dualism between good ways of computing predictions and being in actual worlds where that kind of predicting works well? Like, you could also see Laplace's Rule as implementing the rules for a world with randomness where the original billiard ball ends up in a random place, so that the first thing you see is equally likely to be 1 or 0. Then to ask what probably happens on round 2, we tell the world what happened on round 1 so that it can update what the background random events were.

Ash­ley: Mm­maybe.

Blaine: If you go with the version where Solomonoff induction is over programs that just spit out a determined string of ones and zeroes, we could see those programs as corresponding to particular environments—ways the world could be that would produce our sensory input, the sequence. We could jump ahead and consider the more sophisticated decision-problem that appears in AIXI: an environment is a program that takes your motor outputs as its input, and then returns your sensory inputs as its output. Then we can see a program that produces Bayesian-updated predictions as corresponding to a hypothetical probabilistic environment that implies those updates, although they'll be conjugate systems rather than mirror images.

Ash­ley: Did you say some­thing ear­lier about the de­ter­minis­tic and prob­a­bil­is­tic ver­sions of Solomonoff in­duc­tion giv­ing the same an­swers? Like, is it a dis­tinc­tion with­out a differ­ence whether we ask about sim­ple pro­grams that re­pro­duce the ob­served data ver­sus sim­ple pro­grams that as­sign high prob­a­bil­ity to the data? I can’t see why that should be true, es­pe­cially since Tur­ing ma­chines don’t in­clude a ran­dom­ness source.

Blaine: I'm told the answers are the same but I confess I can't quite see why, unless there's some added assumption I'm missing. So let's talk about programs that assign probabilities for now, because I think that case is clearer. The next key idea is to prefer simple programs that assign high probability to our observations so far.

Ash­ley: It seems like an ob­vi­ous step, es­pe­cially con­sid­er­ing that you were already talk­ing about “sim­ple pro­grams” and Oc­cam’s Ra­zor a while back. Solomonoff in­duc­tion is part of the Bayesian pro­gram of in­fer­ence, right?

Blaine: Indeed. Very much so.

Ashley: Okay, so let's talk about the program, or hypothesis, for "This barrel has an unknown frequency of white and green balls", versus the hypothesis "This barrel has 10 white and 10 green balls", versus the hypothesis, "This barrel always puts out a green ball after a white ball and vice versa." Let's say we see a green ball, then a white ball, the sequence GW. The first hypothesis assigns this probability 1/2 * 1/3 = 1/6, the second hypothesis assigns this probability 10/20 * 10/19, or roughly 1/4, and the third hypothesis assigns probability 1/2 * 1. Now it seems to me that there's some important sense in which, even though Laplace's Rule assigned a lower probability to the data, it's significantly simpler than the second and third hypotheses and is the wiser answer. Does Solomonoff induction agree?

Blaine: I think you might be taking into account some prior knowledge that isn't in the sequence itself, there. Like, things that alternate either 101010… or 010101… are objectively simple in the sense that a short computer program simulates them or assigns probabilities to them. It's just unlikely to be true about an actual barrel of white and green balls. If "10" is literally the first sense data that you ever see, when you are a fresh new intelligence with only two bits to rub together, then "The universe consists of alternating bits" is no less reasonable than "The universe produces bits with an unknown random frequency anywhere between 0 and 1."

Ashley: Conceded. But as I was going to say, we have three hypotheses that assigned 1/6, ~1/4, and 1/2 to the observed data; but to know the posterior probabilities of these hypotheses we need to actually say how relatively likely they were a priori, so we can multiply by the odds ratio. Like, if the prior odds were 3:2:1, the posterior odds would be 3:2:1 * (2/12 : 3/12 : 6/12) = 3:2:1 * 2:3:6 = 6:6:6 = 1:1:1. Now, how would Solomonoff induction assign prior probabilities to those computer programs? Because I remember you saying, way back when, that you thought Solomonoff was the answer to "How should Bayesians assign priors?"
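
(The same arithmetic as a few lines of Python, taking Ashley's 3:2:1 prior odds as given; the exact likelihood for the barrel hypothesis is 10/20 * 10/19, which the dialogue rounds to about 1/4, and the hypothesis labels are made up.)

```python
from fractions import Fraction

likelihood = {                                         # P(green-then-white | hypothesis)
    "Laplace's Rule":     Fraction(1, 2) * Fraction(1, 3),
    "10 white, 10 green": Fraction(10, 20) * Fraction(10, 19),
    "alternating colors": Fraction(1, 2) * 1,
}
prior_odds = {"Laplace's Rule": 3, "10 white, 10 green": 2, "alternating colors": 1}

posterior = {h: prior_odds[h] * likelihood[h] for h in likelihood}
total = sum(posterior.values())
for h in posterior:
    print(h, float(posterior[h] / total))   # all close to 1/3, i.e. roughly 1:1:1 odds
```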

Blaine: Well, how would you do it?

Ashley: I mean… yes, the simpler rules should be favored, but it seems to me that there are some deep questions as to the exact relative 'simplicity' of the rules (M + 1) / (M + N + 2), or the rule (10 - M) / (20 - M - N), or the rule "alternate the bits"…

Blaine: Suppose I ask you to just make up some simple rule.

Ashley: Okay, if I just say the rule I think you're looking for, the rule would be, "The complexity of a computer program is the number of bits needed to specify it to some arbitrary but reasonable choice of compiler or Universal Turing Machine, and the prior probability is 1/2 to the power of the number of bits." Since, e.g., there are 32 possible 5-bit programs, each such program has probability 1/32. So if it takes 16 bits to specify Laplace's Rule of Succession, which seems a tad optimistic, then the prior probability would be 1/65536, which seems a tad pessimistic.

Blaine: Now just apply that rule to the infinity of possible computer programs that assign probabilities to the observed data, update their posterior probabilities based on the probability they’ve assigned to the evidence so far, sum over all of them to get your next prediction, and we’re done. And yes, that requires a hypercomputer that can solve the halting problem, but we’re talking ideals here. Let \(\mathcal P\) be the set of all programs, and let \(s_1 s_2 \ldots s_n\), also written \(s_{\preceq n}\), be the sense data so far; then

\(\displaystyle \mathbb{Sol}(s_{\preceq n}) := \sum_{\mathrm{prog} \in \mathcal{P}} 2^{-\mathrm{length}(\mathrm{prog})} \cdot {\prod_{j=1}^n \mathop{InterpretProb}(\mathrm{prog}(s_{\preceq j-1}), s_j)}\)

\(\displaystyle \mathbb{P}(s_{n+1}=1\mid s_{\preceq n}) = \frac{\mathbb{Sol}(s_1s_2\ldots s_n 1)}{\mathbb{Sol}(s_1s_2\ldots s_n 1) + \mathbb{Sol}(s_1s_2\ldots s_n 0)}.\)
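
(A finite toy version of these two formulas, in Python. The real mixture runs over all programs and needs a hypercomputer; here a hand-picked set of three predictors, with made-up code lengths, stands in for the enumeration.)

```python
# Each "program" is a pair (length_in_bits, predictor), where predictor(prefix)
# returns the probability it assigns to the next bit being '1'.  The lengths are
# made up; in real Solomonoff induction they'd be code lengths on some UTM.

def laplace(prefix):    # Laplace's Rule of Succession
    return (prefix.count('1') + 1) / (len(prefix) + 2)

def blind(prefix):      # "99% probability of a 1 on every turn"
    return 0.99

def alternate(prefix):  # predicts the opposite of the previous bit
    return 0.5 if not prefix else (0.0 if prefix[-1] == '1' else 1.0)

PROGRAMS = [(16, laplace), (10, blind), (12, alternate)]

def sol(s):
    """Sum over programs of 2^-length times the probability the program gave to s."""
    total = 0.0
    for length, predict in PROGRAMS:
        likelihood = 1.0
        for j, bit in enumerate(s):
            p_one = predict(s[:j])
            likelihood *= p_one if bit == '1' else 1.0 - p_one
        total += 2.0 ** -length * likelihood
    return total

def prob_next_is_one(s):
    return sol(s + '1') / (sol(s + '1') + sol(s + '0'))

print(prob_next_is_one('10101010'))  # close to 1: the alternation predictor dominates the toy mixture
```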

Ash­ley: Uh.

Blaine: Yes?

Ash­ley: Um…

Blaine: What is it?

Ash­ley: You in­voked a countably in­finite set, so I’m try­ing to figure out if my pre­dicted prob­a­bil­ity for the next bit must nec­es­sar­ily con­verge to a limit as I con­sider in­creas­ingly large finite sub­sets in any or­der.

Blaine (sighs): Of course you are.

Ashley: I think you might have left out some important caveats. Like, if I take the rule literally, then the program “0” has probability 1/2, the program “1” has probability 1/2, the program “01” has probability 1/4, and now the total probability is already 1.25, which is too much. And I can’t just normalize, because there are \(2^L\) possible programs of length \(L\), each getting probability \(2^{-L}\), so every length contributes 1 and the series sums to infinity. Now, this just means we need to, say, decide that the probability of a program having length 1 is 1/2, the probability of it having length 2 is 1/4, and so on out to infinity, but it’s an added postulate.

Blaine: The conventional method is to require a prefix-free code. If “0111” is a valid program then “01110” cannot be a valid program. With that constraint, assigning “1/2 to the power of the length of the code” to all valid codes will sum to at most 1, and we can normalize their relative probabilities to get the actual prior.
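
(A small check, in Python, of the constraint Blaine just described: for a prefix-free set of codes the weights 2^-length sum to at most 1, which is the Kraft inequality, so the relative probabilities can be normalized. The codes below are only illustrative.)

```python
def is_prefix_free(codes):
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

bad  = ["0", "1", "01"]           # Ashley's example: "0" is a prefix of "01"
good = ["0", "10", "110", "111"]  # prefix-free

for codes in (bad, good):
    print(is_prefix_free(codes), sum(2.0 ** -len(c) for c in codes))
# False 1.25   <- not prefix-free, and the weights already exceed 1
# True  1.0    <- weights sum to at most 1, so they can serve as relative prior probabilities
```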

Ash­ley: Okay. And you’re sure that it doesn’t mat­ter in what or­der we con­sider more and more pro­grams as we ap­proach the limit, be­cause… no, I see it. Every pro­gram has pos­i­tive prob­a­bil­ity mass, with the to­tal set sum­ming to 1, and Bayesian up­dat­ing doesn’t change that. So as I con­sider more and more pro­grams, in any or­der, there’s only so many large con­tri­bu­tions that can be made from the mix—only so of­ten that the fi­nal prob­a­bil­ity can change. Like, let’s say there are at most 99 pro­grams with prob­a­bil­ity 1% that as­sign prob­a­bil­ity 0 to the next bit be­ing a 1; that’s only 99 times the fi­nal an­swer can go down by as much as 0.01, as the limit is ap­proached.

Blaine: This idea generalizes, and is important. List all possible computer programs, in any order you like. Use any definition of simplicity that you like, so long as for any given amount of simplicity, there are only a finite number of computer programs that simple. As you go on carving off chunks of prior probability mass and assigning them to programs, it must be the case that as programs get more and more complicated, their prior probability approaches zero! Though it’s still positive for every finite program, because of Cromwell’s Rule. You can’t have more than 99 programs assigned 1% prior probability and still obey Cromwell’s Rule, which means there must be some most complex program that is assigned as much as 1% probability, which means every more complicated program must have less than 1% probability out to the end of the infinite list.

Ash­ley: Huh. I don’t think I’ve ever heard that jus­tifi­ca­tion for Oc­cam’s Ra­zor be­fore. I think I like it. I mean, I’ve heard a lot of ap­peals to the em­piri­cal sim­plic­ity of the world, and so on, but this is the first time I’ve seen a log­i­cal proof that, in the limit, more com­pli­cated hy­pothe­ses must be less likely than sim­ple ones.

Blaine: Be­hold the awe­some­ness that is Solomonoff In­duc­tion!

Ash­ley: Uh, but you didn’t ac­tu­ally use the no­tion of com­pu­ta­tional sim­plic­ity to get that con­clu­sion, you just re­quired that the sup­ply of prob­a­bil­ity mass is finite and the sup­ply of po­ten­tial com­pli­ca­tions is in­finite. Any way of count­ing dis­crete com­pli­ca­tions would im­ply that con­clu­sion, even if it went by sur­face wheels and gears.

Blaine: Well, maybe. But it so hap­pens that Yud­kowsky did in­vent or rein­vent that ar­gu­ment af­ter pon­der­ing Solomonoff in­duc­tion, and if it pre­dates him (or Solomonoff) then Yud­kowsky doesn’t know the source. Con­crete in­spira­tion for sim­plified ar­gu­ments is also a credit to a the­ory, es­pe­cially if the sim­plified ar­gu­ment didn’t ex­ist be­fore that.

Ash­ley: Fair enough. My next ques­tion is about the choice of Univer­sal Tur­ing Ma­chine—the choice of com­piler for our pro­gram codes. There’s an in­finite num­ber of pos­si­bil­ities there, and in prin­ci­ple, the right choice of com­piler can make our prob­a­bil­ity for the next thing we’ll see be any­thing we like. At least I’d ex­pect this to be the case, based on how the “prob­lem of in­duc­tion” usu­ally goes. So with the right choice of Univer­sal Tur­ing Ma­chine, our on­line crack­pot can still make it be the case that Solomonoff in­duc­tion pre­dicts Canada in­vad­ing the USA.

Blaine: One way of look­ing at the prob­lem of good episte­mol­ogy, I’d say, is that the job of a good episte­mol­ogy is not to make it im­pos­si­ble to err. You can still blow off your foot if you re­ally in­sist on point­ing the shot­gun at your foot and pul­ling the trig­ger. The job of good episte­mol­ogy is to make it more ob­vi­ous when you’re about to blow your own foot off with a shot­gun. On this di­men­sion, Solomonoff In­duc­tion ex­cels. If you claim that we ought to pick an enor­mously com­pli­cated com­piler to en­code our hy­pothe­ses, in or­der to make the ‘sim­plest hy­poth­e­sis that fits the ev­i­dence’ be one that pre­dicts Canada in­vad­ing the USA, then it should be ob­vi­ous to ev­ery­one ex­cept you that you are in the pro­cess of screw­ing up.

Ash­ley: Ah, but of course they’ll say that their code is just the sim­ple and nat­u­ral choice of Univer­sal Tur­ing Ma­chine, be­cause they’ll ex­hibit a meta-UTM which out­puts that UTM given only a short code. And if you say the meta-UTM is com­pli­cated--

Blaine: Flon’s Law says, “There is not now, nor has there ever been, nor will there ever be, any pro­gram­ming lan­guage in which it is the least bit difficult to write bad code.” You can’t make it im­pos­si­ble for peo­ple to screw up, but you can make it more ob­vi­ous. And Solomonoff in­duc­tion would make it even more ob­vi­ous than might at first be ob­vi­ous, be­cause--

Ash­ley: Your Honor, I move to have the pre­vi­ous sen­tence taken out and shot.

Blaine: Let’s say that the whole of your sen­sory in­for­ma­tion is the string 10101010… Con­sider the stupid hy­poth­e­sis, “This pro­gram has a 99% prob­a­bil­ity of pro­duc­ing a 1 on ev­ery turn”, which you jumped to af­ter see­ing the first bit. What would you need to claim your pri­ors were like—what Univer­sal Tur­ing Ma­chine would you need to en­dorse—in or­der to main­tain blind faith in that hy­poth­e­sis in the face of ever-mount­ing ev­i­dence?

Ash­ley: You’d need a Univer­sal Tur­ing Ma­chine blind-utm that as­signed a very high prob­a­bil­ity to the blind pro­gram “def ProbNex­tEle­men­tIsOne(pre­vi­ous_se­quence): re­turn 0.99”. Like, if blind-utm sees the code 0, it ex­e­cutes the blind pro­gram “re­turn 0.99″. And to defend your­self against charges that your UTM blind-utm was not it­self sim­ple, you’d need a meta-UTM, blind-meta which, when it sees the code 10, ex­e­cutes blind-utm. And to re­ally wrap it up, you’d need to take a fixed point through all tow­ers of meta and use di­ag­o­nal­iza­tion to cre­ate the UTM blind-diag that, when it sees the pro­gram code 0, ex­e­cutes “re­turn 0.99”, and when it sees the pro­gram code 10, ex­e­cutes blind-diag. I guess I can see some sense in which, even if that doesn’t re­solve Hume’s prob­lem of in­duc­tion, any­one ac­tu­ally ad­vo­cat­ing that would be com­mit­ting blatant shenani­gans on a com­mon­sense level, ar­guably more blatant than it would have been if we hadn’t made them pre­sent the UTM.

Blaine: Ac­tu­ally, the shenani­gans have to be much worse than that in or­der to fool Solomonoff in­duc­tion. Like, Solomonoff in­duc­tion us­ing your blind-diag isn’t fooled for a minute, even tak­ing blind-diag en­tirely on its own terms.

Ash­ley: Really?

Blaine: Assuming 60 sequence items per second? Yes, absolutely, Solomonoff induction shrugs off the delusion in the first minute, unless there’s further and even more blatant shenanigans. We did require that your blind-diag be a Universal Turing Machine, meaning that it can reproduce every computable probability distribution over sequences, given some particular code to compile. Let’s say there’s a 200-bit code laplace for Laplace’s Rule of Succession, “lambda sequence: (sequence.count('1') + 1) / (len(sequence) + 2)”, so that its prior probability relative to the 1-bit code for blind is \(2^{-200}\). Let’s say that the sense data is around 50-50 1s and 0s. Every time we see a 1, blind gains a factor of 2 over laplace (99% vs. 50% probability), and every time we see a 0, blind loses a factor of 50 over laplace (1% vs. 50% probability). On average, every 2 bits of the sequence, blind is losing a factor of 25 or, say, a bit more than 4 bits, i.e., on average blind is losing about two bits of probability per element of the sequence observed. So it’s only going to take about 100 bits, or a little less than two seconds, for laplace to win out over blind.
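
(A rough simulation, in Python, of the bookkeeping Blaine just described: blind starts with a 200-bit head start in log-odds, the incoming data is a 50-50 mix of 1s and 0s, here simply alternating, and we count how many observations it takes for laplace to overtake it. The 200-bit figure and the alternating stand-in data are just the dialogue’s illustrative assumptions.)

```python
import math

def laplace(prefix):
    return (prefix.count('1') + 1) / (len(prefix) + 2)

log_odds = 200.0   # blind's head start over laplace, in bits, from the 200-bit code-length gap
seq, n = '', 0
while log_odds > 0:                       # run until laplace has overtaken blind
    bit = '10'[n % 2]                     # a deterministic 50-50 stand-in for the sense data
    p_blind = 0.99 if bit == '1' else 0.01
    p_lap = laplace(seq) if bit == '1' else 1.0 - laplace(seq)
    log_odds += math.log2(p_blind) - math.log2(p_lap)
    seq += bit
    n += 1

print(n)   # roughly 90 observations -- under two seconds at 60 sequence items per second
```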

Ash­ley: I see. I was fo­cus­ing on a UTM that as­signed lots of prior prob­a­bil­ity to blind, but what I re­ally needed was a com­piler that, while still be­ing uni­ver­sal and en­cod­ing ev­ery pos­si­bil­ity some­where, still as­signed a re­ally tiny prob­a­bil­ity to laplace, fair­coin that en­codes “re­turn 0.5”, and ev­ery other hy­poth­e­sis that does bet­ter, round by round, than blind. So what I re­ally need to carry off the delu­sion is ob­sti­nate-diag that is uni­ver­sal, as­signs high prob­a­bil­ity to blind, re­quires billions of bits to spec­ify laplace, and also re­quires billions of bits to spec­ify any UTM that can ex­e­cute laplace as a shorter code than billions of bits. Be­cause oth­er­wise we will say, “Ah, but given the ev­i­dence, this other UTM would have done bet­ter.” I agree that those are even more blatant shenani­gans than I thought.

Blaine: Yes. And even then, even if your UTM takes two billion bits to specify faircoin, Solomonoff induction will lose its faith in blind after seeing a billion bits. Which will happen before the first year is out, if we’re getting 60 bits per second. And if you turn around and say, “Oh, well, I didn’t mean that was my UTM, I really meant this was my UTM, this thing over here where it takes a trillion bits to encode faircoin”, then that’s probability-theory-violating shenanigans where you’re changing your priors as you go.

Ashley: That’s actually a very interesting point—that what’s needed for a Bayesian to maintain a delusion in the face of mounting evidence is not so much a blindly high prior for the delusory hypothesis, as a blind skepticism of all its alternatives. But what if their UTM requires a googol bits to specify faircoin? What if blind and blind-diag, or programs pretty much isomorphic to them, are the only programs that can be specified in less than a googol bits?

Blaine: Then your desire to shoot your own foot off has been made very, very visible to anyone who understands Solomonoff induction. We’re not going to get absolutely objective prior probabilities as a matter of logical deduction, not without principles that are unknown to me and beyond the scope of Solomonoff induction. But we can make the stupidity really blatant and force you to construct a downright embarrassing Universal Turing Machine.

Ash­ley: I guess I can see that. I mean, I guess that if you’re pre­sent­ing a lu­dicrously com­pli­cated Univer­sal Tur­ing Ma­chine that just re­fuses to en­code the pro­gram that would pre­dict Canada not in­vad­ing, that’s more visi­bly silly than a ver­bal ap­peal that says, “But you must just have faith that Canada will in­vade.” I guess part of me is still hop­ing for a more ob­jec­tive sense of “com­pli­cated”.

Blaine: We could say that rea­son­able UTMs should con­tain a small num­ber of wheels and gears in a ma­te­rial in­stan­ti­a­tion un­der our uni­verse’s laws of physics, which might in some ul­ti­mate sense provide a prior over pri­ors. Like, the hu­man brain evolved from DNA-based speci­fi­ca­tions, and the things you can con­struct out of rel­a­tively small num­bers of phys­i­cal ob­jects are ‘sim­ple’ un­der the ‘prior’ im­plic­itly searched by nat­u­ral se­lec­tion.

Ash­ley: Ah, but what if I think it’s likely that our phys­i­cal uni­verse or the search space of DNA won’t give us a good idea of what’s com­pli­cated?

Blaine: For your alternative notion of what’s complicated to go on being believed even as other hypotheses are racking up better experimental predictions, you need to assign a ludicrously low probability that our universe’s space of physical systems buildable using a small number of objects could possibly provide better predictions of that universe than your complicated alternative notion of prior probability. We don’t need to claim that it’s a priori more likely than not that “a universe can be predicted well by low-object-number machines built using that universe’s physics.” Instead, we point out that it would violate Cromwell’s Rule, and be exceedingly special pleading, to assign the possibility of a physically learnable universe a probability of less than \(2^{-1,000,000}\). It then takes only a megabit of exposure to notice that the universe seems to be regular.

Ash­ley: In other words, so long as you don’t start with an ab­solute and blind prej­u­dice against the uni­verse be­ing pre­dictable by sim­ple ma­chines en­coded in our uni­verse’s physics—so long as, on this planet of seven billion peo­ple, you don’t as­sign prob­a­bil­ities less than \(2^{-1,000,000}\) to the other per­son be­ing right about what is a good Univer­sal Tur­ing Ma­chine—then the pure logic of Bayesian up­dat­ing will rapidly force you to the con­clu­sion that in­duc­tion works. Hm. I don’t know that good prag­matic an­swers to the prob­lem of in­duc­tion were ever in short sup­ply. Still, on the mar­gins, it’s a more force­ful prag­matic an­swer than the last one I re­mem­ber hear­ing.
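
(A one-line illustration of the “megabit of exposure” figure: if the hypothesis that the universe is learnable by simple machines starts at prior odds of \(2^{-1,000,000}\) but assigns, say, twice as much probability as its rival to each observed bit, it gains one bit of log-odds per observation and pulls even after about a million observations. The 2-to-1 per-bit advantage is just an illustrative assumption.)

```python
prior_log_odds = -1_000_000         # log2 odds for "the universe is learnable by simple machines"
bits_gained_per_observation = 1.0   # e.g. it predicts each bit at 1.0 vs. a rival's 0.5
print(-prior_log_odds / bits_gained_per_observation)  # 1000000.0 observations -- about a megabit
```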

Blaine: Yay! Now isn’t Solomonoff induction wonderful?

Ash­ley: Maybe? You didn’t re­ally use the prin­ci­ple of com­pu­ta­tional sim­plic­ity to de­rive that les­son. You just used that some in­duc­tive prin­ci­ple ought to have a prior prob­a­bil­ity of more than \(2^{-1,000,000}\).

Blaine: …

Ash­ley: Can you give me an ex­am­ple of a prob­lem where the com­pu­ta­tional defi­ni­tion of sim­plic­ity mat­ters and can’t be fac­tored back out of an ar­gu­ment?

Blaine: As it happens, yes I can. I can give you three examples of how it matters.

Ash­ley: Vun… two… three! Three ex­am­ples! Ah-ah-ah!

Blaine: Must you do that every—oh, never mind. Example one is that galaxies are not so improbable that no one could ever believe in them, example two is that the limits of possibility include Terence Tao, and example three is that diffraction is a simpler explanation of rainbows than divine intervention.

Ash­ley: Th­ese state­ments are all so ob­vi­ous that no fur­ther ex­pla­na­tion of any of them is re­quired.

Blaine: On the con­trary! And I’ll start with ex­am­ple one. Back when the An­dromeda galaxy was a hazy mist seen through a telescope, and some­one first sug­gested that maybe that hazy mist was an in­cred­ibly large num­ber of dis­tant stars—that many “neb­u­lae” were ac­tu­ally dis­tant galax­ies, and our own Milky Way was only one of them—there was a time when Oc­cam’s Ra­zor was in­voked against that hy­poth­e­sis.

Ash­ley: What? Why?

Blaine: They in­voked Oc­cam’s Ra­zor against the galac­tic hy­poth­e­sis, be­cause if that were the case, then there would be a much huger num­ber of stars in the uni­verse, and the stars would be en­tities, and Oc­cam’s Ra­zor said “En­tities are not to be mul­ti­plied be­yond ne­ces­sity.”

Ash­ley: That’s not how Oc­cam’s Ra­zor works. The “en­tities” of a the­ory are its types, not its ob­jects. If you say that the hazy mists are dis­tant galax­ies of stars, then you’ve re­duced the num­ber of laws be­cause you’re just pos­tu­lat­ing a pre­vi­ously seen type, namely stars or­ga­nized into galax­ies, in­stead of a new type of hazy as­tro­nom­i­cal mist.

Blaine: Okay, but imag­ine that it’s the nine­teenth cen­tury and some­body replies to you, “Well, I dis­agree! William of Ock­ham said not to mul­ti­ply en­tities, this galac­tic hy­poth­e­sis ob­vi­ously cre­ates a huge num­ber of en­tities, and that’s the way I see it!”

Ash­ley: I think I’d give them your spiel about there be­ing no hu­man episte­mol­ogy that can stop you from shoot­ing off your own foot.

Blaine: I don’t think you’d be jus­tified in giv­ing them that lec­ture. I’ll paren­the­size at this point that you ought to be very care­ful when you say “I can’t stop you from shoot­ing off your own foot”, lest it be­come a Fully Gen­eral Scorn­ful Re­join­der. Like, if you say that to some­one, you’d bet­ter be able to ex­plain ex­actly why Oc­cam’s Ra­zor counts types as en­tities but not ob­jects. In fact, you’d bet­ter ex­plain that to some­one be­fore you go ad­vis­ing them not to shoot off their own foot. And once you’ve told them what you think is fool­ish and why, you might as well stop there. Ex­cept in re­ally weird cases of peo­ple pre­sent­ing us with enor­mously com­pli­cated and jury-rigged Univer­sal Tur­ing Machines, and then we say the shot­gun thing.

Ash­ley: That’s fair. So, I’m not sure what I’d have an­swered be­fore start­ing this con­ver­sa­tion, which is much to your credit, friend Blaine. But now that I’ve had this con­ver­sa­tion, it’s ob­vi­ous that it’s new types and not new ob­jects that use up the prob­a­bil­ity mass we need to dis­tribute over all hy­pothe­ses. Like, I need to dis­tribute my prob­a­bil­ity mass over “Hy­poth­e­sis 1: there are stars” and “Hy­poth­e­sis 2: there are stars plus huge dis­tant hazy mists”. I don’t need to dis­tribute my prob­a­bil­ity mass over all the ac­tual stars in the galaxy!

Blaine: In terms of Solomonoff in­duc­tion, we pe­nal­ize a pro­gram’s lines of code rather than its run­time or RAM used, be­cause we need to dis­tribute our prob­a­bil­ity mass over pos­si­ble al­ter­na­tives each time we add a line of code. There’s no cor­re­spond­ing choice be­tween mu­tu­ally ex­clu­sive al­ter­na­tives when a pro­gram uses more run­time or RAM.

*(Eliezer, whisper­ing: Un­less we need a lev­er­age prior to con­sider the hy­poth­e­sis of be­ing a par­tic­u­lar agent in­side all that RAM or run­time.)*

Ash­ley: Or to put it an­other way, any fully de­tailed model of the uni­verse would re­quire some par­tic­u­lar ar­range­ment of stars, and the more stars there are, the more pos­si­ble ar­range­ments there are. But when we look through the telescope and see a hazy mist, we get to sum over all ar­range­ments of stars that would pro­duce that hazy mist. If some galac­tic hy­poth­e­sis re­quired a hun­dred billion stars to all be in par­tic­u­lar ex­act places with­out fur­ther ex­pla­na­tion or cause, then that would in­deed be a grave im­prob­a­bil­ity.

Blaine: Pre­cisely. And if you needed all the hun­dred billion stars to be in par­tic­u­lar ex­act places, that’s just the kind of hy­poth­e­sis that would take a huge com­puter pro­gram to spec­ify.

Ash­ley: But does it re­ally re­quire learn­ing Solomonoff in­duc­tion to un­der­stand that point? Maybe the bad ar­gu­ment against galax­ies was just a mo­ti­vated er­ror some­body made in the nine­teenth cen­tury, be­cause they didn’t want to live in a big uni­verse for emo­tional rea­sons.

Blaine: The same debate is playing out today over no-collapse versions of quantum mechanics, also somewhat unfortunately known as “many-worlds interpretations”. Now, regardless of what anyone thinks of all the other parts of that debate, there’s a particular sub-argument where somebody says, “It’s simpler to have a collapse interpretation because all those extra quantum “worlds” are extra entities that are unnecessary under Occam’s Razor since we can’t see them.” And Solomonoff induction tells us that this invocation of Occam’s Razor is flatly misguided because Occam’s Razor does not work like that. Basically, they’re trying to cut down the RAM and runtime of the universe, at the expense of adding an extra line of code, namely the code for the collapse postulate that prunes off parts of the wavefunction that are in undetectably weak causal contact with us.

Ash­ley: Hmm. Now that you put it that way, it’s not so ob­vi­ous to me that it makes sense to have no prej­u­dice against suffi­ciently enor­mous uni­verses. I mean, the uni­verse we see around us is ex­po­nen­tially vast but not su­per­ex­po­nen­tially vast—the visi­ble atoms are \(10^{80}\) in num­ber or so, not \(10^{10^{80}}\) or “big­ger than Gra­ham’s Num­ber”. Maybe there’s some fun­da­men­tal limit on how much gets com­puted.

Blaine: You, um, know that on the Stan­dard Model, the uni­verse doesn’t just cut out and stop ex­ist­ing at the point where our telescopes stop see­ing it? There isn’t a gi­ant void sur­round­ing a lit­tle bub­ble of mat­ter cen­tered perfectly on Earth? It calls for a liter­ally in­finite amount of mat­ter? I mean, I guess if you don’t like liv­ing in a uni­verse with more than \(10^{80}\) en­tities, a uni­verse where too much gets com­puted, you could try to spec­ify ex­tra laws of physics that cre­ate an abrupt spa­tial bound­ary with no fur­ther mat­ter be­yond them, some­where out past where our telescopes can see -

Ash­ley: All right, point taken.

(Eliezer, whisper­ing: Though I per­son­ally sus­pect that the spa­tial mul­ti­verse and the quan­tum mul­ti­verse are the same mul­ti­verse, and that what lies be­yond the reach of our telescopes is not en­tan­gled with us—mean­ing that the uni­verse is as finitely large as the su­per­po­si­tion of all pos­si­ble quan­tum branches, rather than be­ing liter­ally in­finite in space.)

Blaine: I mean, there is in fact an al­ter­na­tive for­mal­ism to Solomonoff in­duc­tion, namely Levin search, which says that pro­gram com­plex­ities are fur­ther pe­nal­ized by the log­a­r­ithm of their run­time. In other words, it would make ‘ex­pla­na­tions’ or ‘uni­verses’ that re­quire a long time to run, be in­her­ently less prob­a­ble. Some peo­ple like Levin search more than Solomonoff in­duc­tion be­cause it’s more com­putable. I dis­like Levin search be­cause (a) it has no fun­da­men­tal epistemic jus­tifi­ca­tion and (b) it as­signs prob­a­bil­ity zero to quan­tum me­chan­ics.

Ash­ley: Can you un­pack that last part?

Blaine: If, as is currently suspected, there’s no way to simulate quantum computers using classical computers without an exponential slowdown, then even in principle, this universe requires exponentially vast amounts of classical computing power to simulate. Let’s say that with sufficiently advanced technology, you can build a quantum computer with a million qubits. Simulating that classically is believed to require on the order of \(2^{1,000,000}\) steps, so the logarithmic runtime penalty comes to about a million bits: on Levin’s definition of complexity, for the universe to be like that is as improbable a priori as any particular set of laws of physics that takes an extra million bits, say a hundred thousand extra equations, to specify. Can you imagine how improbable it would be to see a list of one hundred thousand differential equations, without any justification or evidence attached, and be told that they were the laws of physics? That’s the kind of penalty that Levin search or Schmidhuber’s Speed Prior would attach to any laws of physics that could run a quantum computation of a million qubits, or, heck, any physics that claimed that a protein was being folded in a way that ultimately went through considering millions of quarks interacting. If you’re not absolutely certain a priori that the universe isn’t like that, you don’t believe in Schmidhuber’s Speed Prior. Even with a collapse postulate, the amount of computation that goes on before a collapse would be prohibited by the Speed Prior.
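
(A sketch, in Python, of the penalty Blaine is describing, in the spirit of Levin’s Kt complexity: effective complexity is roughly program length plus the log of runtime. The specific numbers below are illustrative assumptions, apart from the million-qubit example from the dialogue.)

```python
def kt_bits(length_bits, log2_runtime_steps):
    # Levin-style complexity: code length plus log2 of the running time, both in bits.
    return length_bits + log2_runtime_steps

cheap_physics   = kt_bits(length_bits=1_000, log2_runtime_steps=60)         # fast to simulate classically
quantum_physics = kt_bits(length_bits=1_000, log2_runtime_steps=1_000_000)  # ~2**1_000_000 steps to simulate a million qubits

print(cheap_physics, quantum_physics)
# 1060 1001000 -- the quantum laws are penalized as if they were a million bits longer,
# which is the sense in which a Speed-Prior-style measure gives them probability ~2**-1_000_000.
```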

Ash­ley: Okay, yeah. If you’re phras­ing it that way—that the Speed Prior as­signs prob­a­bil­ity nearly zero to quan­tum me­chan­ics, so we shouldn’t be­lieve in the Speed Prior—then I can’t eas­ily see a way to ex­tract out the same point with­out mak­ing refer­ence to ideas like pe­nal­iz­ing al­gorith­mic com­plex­ity but not pe­nal­iz­ing run­time. I mean, maybe I could ex­tract the les­son back out but it’s eas­ier to say, or more ob­vi­ous, by point­ing to the idea that Oc­cam’s Ra­zor should pe­nal­ize al­gorith­mic com­plex­ity but not run­time.

Blaine: And that isn’t just im­plied by Solomonoff in­duc­tion, it’s pretty much the whole idea of Solomonoff in­duc­tion, right?

Ash­ley: Maaaybe.

Blaine: For example two, that Solomonoff induction outperforms even Terence Tao, we want to have a theorem that says Solomonoff induction catches up to every computable way of reasoning in the limit. Since we iterated through all possible computer programs, we know that somewhere in there is a simulated copy of Terence Tao in a simulated room, and if this requires a petabyte to specify, then we shouldn’t have to make more than eight quadrillion bits of error (a petabyte is eight quadrillion bits) relative to Terence Tao before zeroing in on the Terence Tao hypothesis. I mean, in practice, I’d expect far less than a quadrillion bits of error before the system was behaving like it was vastly smarter than Terence Tao. It’d take a lot less than a quadrillion bits to give you some specification of a universe with simple physics that gave rise to a civilization of vastly greater than intergalactic extent. Like, Graham’s Number is a very simple number so it’s easy to specify a universe that runs for that long before it returns an answer. It’s not obvious how you’d extract Solomonoff predictions from that civilization and incentivize them to make good ones, but I’d be surprised if there were no Turing machine of fewer than one thousand states which did that somehow.
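
(A quick way to see the bounded-error claim, using the mixture formula from earlier: the sum defining \(\mathbb{Sol}\) always includes the term contributed by any particular program \(q\), such as a petabyte-long simulation of Terence Tao, so)

\(\displaystyle \mathbb{Sol}(s_{\preceq n}) \;\ge\; 2^{-\mathrm{length}(q)} \prod_{j=1}^n \mathop{InterpretProb}(q(s_{\preceq j-1}), s_j),\)

(and taking negative logs,)

\(\displaystyle -\log_2 \mathbb{Sol}(s_{\preceq n}) \;\le\; \mathrm{length}(q) \;-\; \sum_{j=1}^n \log_2 \mathop{InterpretProb}(q(s_{\preceq j-1}), s_j).\)

(That is, the mixture’s cumulative log-loss can exceed \(q\)’s by at most \(\mathrm{length}(q)\) bits, which is where the “bounded number of errors relative to any program” guarantee comes from.)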

Ash­ley: …

Blaine: And for all I know there might be even better ways than that of getting exceptionally good predictions, somewhere in the list of the first decillion computer programs. That is, somewhere in the first 110 bits or so.

Ash­ley: So your ba­sic ar­gu­ment is, “Never mind Ter­ence Tao, Solomonoff in­duc­tion dom­i­nates God.”

Blaine: Solomonoff in­duc­tion isn’t the epistemic pre­dic­tion ca­pa­bil­ity of a su­per­in­tel­li­gence. It’s the epistemic pre­dic­tion ca­pa­bil­ity of some­thing that eats su­per­in­tel­li­gences like potato chips.

Ash­ley: Is there any point to con­tem­plat­ing an episte­mol­ogy so pow­er­ful that it will never be­gin to fit in­side the uni­verse?

Blaine: Maybe? I mean a lot of times, you just find peo­ple failing to re­spect the no­tion of or­di­nary su­per­in­tel­li­gence, do­ing the equiv­a­lent of sup­pos­ing that a su­per­in­tel­li­gence be­haves like a bad Hol­ly­wood ge­nius and misses ob­vi­ous-seem­ing moves. And a lot of times you find them in­sist­ing that “there’s a limit to how much in­for­ma­tion you can get from the data” or some­thing along those lines. “That Alien Mes­sage” is in­tended to con­vey the coun­ter­point, that smarter en­tities can ex­tract more info than is im­me­di­ately ap­par­ent on the sur­face of things. Similarly, think­ing about Solomonoff in­duc­tion might also cause some­one to re­al­ize that if, say, you simu­lated zillions of pos­si­ble sim­ple uni­verses, you could look at which agents were see­ing ex­act data like the data you got, and figure out where you were in­side that range of pos­si­bil­ities, so long as there was liter­ally any cor­re­la­tion to use. And if you say that an agent can’t ex­tract that data, you’re mak­ing a claim about which short­cuts to Solomonoff in­duc­tion are and aren’t com­putable. In fact, you’re prob­a­bly point­ing at some par­tic­u­lar short­cut and claiming no­body can ever figure that out us­ing a rea­son­able amount of com­put­ing power even though the info is there in prin­ci­ple. Con­tem­plat­ing Solomonoff in­duc­tion might help peo­ple re­al­ize that, yes, the data is there in prin­ci­ple. Like, un­til I ask you to imag­ine a civ­i­liza­tion run­ning for Gra­ham’s Num­ber of years in­side a Gra­ham-sized mem­ory space, you might not imag­ine them try­ing all the meth­ods of anal­y­sis that you per­son­ally can imag­ine be­ing pos­si­ble.

Ash­ley: If some­body is mak­ing that mis­take in the first place, I’m not sure you can beat it out of them by tel­ling them the defi­ni­tion of Solomonoff in­duc­tion.

Blaine: Maybe not. But to brute-force some­body into imag­in­ing that suffi­ciently ad­vanced agents have Level 1 pro­tag­o­nist in­tel­li­gence, that they are epistem­i­cally effi­cient rather than miss­ing fac­tual ques­tions that are visi­ble even to us, you might need to ask them to imag­ine an agent that can see liter­ally any­thing see­able in the com­pu­ta­tional limit just so that their men­tal simu­la­tion of the ideal an­swer isn’t run­ning up against stu­pidity as­ser­tions. Like, I think there’s a lot of peo­ple who could benefit from look­ing over the ev­i­dence they already per­son­ally have, and ask­ing what a Solomonoff in­duc­tor could de­duce from it, so that they wouldn’t be run­ning up against stu­pidity as­ser­tions about them­selves. It’s the same trick as ask­ing your­self what God, Richard Feyn­man, or a “perfect ra­tio­nal­ist” would be­lieve in your shoes. You just have to pick a real or imag­i­nary per­son that you re­spect enough for your model of that per­son to lack the same stu­pidity as­ser­tions that you be­lieve about your­self.

Ash­ley: Well, let’s once again try to fac­tor out the part about Solomonoff in­duc­tion in par­tic­u­lar. If we’re try­ing to imag­ine some­thing epistem­i­cally smarter than our­selves, is there any­thing we get from imag­in­ing a com­plex­ity-weighted prior over pro­grams in par­tic­u­lar? That we don’t get from, say, try­ing to imag­ine the rea­son­ing of one par­tic­u­lar Gra­ham-Num­ber-sized civ­i­liza­tion?

Blaine: We get the surety that even any­thing we imag­ine Ter­ence Tao him­self as be­ing able to figure out, is some­thing that is al­lowed to be known af­ter some bounded num­ber of er­rors ver­sus Ter­ence Tao, be­cause Ter­ence Tao is in­side the list of all com­puter pro­grams and gets pro­moted fur­ther each time the dom­i­nant paradigm makes a pre­dic­tion er­ror rel­a­tive to him. We can’t get that dom­i­nance prop­erty with­out in­vok­ing “all pos­si­ble ways of com­put­ing” or some­thing like it—we can’t in­cor­po­rate the power of all rea­son­able pro­cesses, un­less we have a set such that all the rea­son­able pro­cesses are in it. The enu­mer­a­tion of all pos­si­ble com­puter pro­grams is one such set.

Ash­ley: Hm.

Blaine: Example three, diffraction is a simpler explanation of rainbows than divine intervention. I don’t think I need to belabor this point very much, even though in one way it might be the most central one. It sounds like “Jehovah placed rainbows in the sky as a sign that the Great Flood would never come again” is a ‘simple’ explanation, you can explain it to a child in nothing flat. Just the diagram of diffraction through a raindrop, to say nothing of the Principle of Least Action underlying diffraction, is something that humans don’t usually learn until undergraduate physics, and it sounds more alien and less intuitive than Jehovah. In what sense is this intuitive sense of simplicity wrong? What gold standard are we comparing it to, that could be a better sense of simplicity than just ‘how hard is it for me to understand’? The answer is Solomonoff induction and the rule which says that simplicity is measured by the size of the computer program, not by how hard things are for human beings to understand. Diffraction is a small computer program; any programmer who understands diffraction can simulate it without too much trouble. Jehovah would be a much huger program—a complete mind that implements anger, vengeance, belief, memory, consequentialism, etcetera. Solomonoff induction is what tells us to retrain our intuitions so that differential equations feel like less burdensome explanations than heroic mythology.

Ash­ley: Now hold on just a sec­ond, if that’s ac­tu­ally how Solomonoff in­duc­tion works then it’s not work­ing very well. I mean, Abra­ham Lin­coln was a great big com­pli­cated mechanism from an al­gorith­mic stand­point—he had a hun­dred trillion synapses in his brain—but that doesn’t mean I should look at the his­tor­i­cal role sup­pos­edly filled by Abra­ham Lin­coln, and look for sim­ple me­chan­i­cal rules that would ac­count for the things Lin­coln is said to have done. If you’ve already seen hu­mans and you’ve already learned to model hu­man minds, it shouldn’t cost a vast amount to say there’s one more hu­man, like Lin­coln, or one more en­tity that is cog­ni­tively hu­manoid, like the Old Tes­ta­ment jeal­ous-god ver­sion of Je­ho­vah. It may be wrong but it shouldn’t be vastly im­prob­a­ble a pri­ori. If you’ve already been forced to ac­knowl­edge the ex­is­tence of some hu­man­like minds, why not oth­ers? Shouldn’t you get to reuse the com­plex­ity that you pos­tu­lated to ex­plain hu­mans, in pos­tu­lat­ing Je­ho­vah? In fact, shouldn’t that be what Solomonoff in­duc­tion does? If you have a com­puter pro­gram that can model and pre­dict hu­mans, it should only be a slight mod­ifi­ca­tion of that pro­gram—only slightly longer in length and added code—to pre­dict the mod­ified-hu­man en­tity that is Je­ho­vah.

Blaine: Hm. That’s fair. I may have to re­treat from that ex­am­ple some­what. In fact, that’s yet an­other point to the credit of Solomonoff in­duc­tion! The abil­ity of pro­grams to reuse code, in­cor­po­rates our in­tu­itive sense that if you’ve already pos­tu­lated one kind of thing, it shouldn’t cost as much to pos­tu­late a similar kind of thing el­se­where!

Ash­ley: Uh huh.

Blaine: Well, but even if I was wrong that Solomonoff in­duc­tion should make Je­ho­vah seem very im­prob­a­ble, it’s still Solomonoff in­duc­tion that says that the al­ter­na­tive hy­poth­e­sis of ‘diffrac­tion’ shouldn’t it­self be seen as bur­den­some—even though diffrac­tion might re­quire a longer time to ex­plain to a hu­man, it’s still at heart a sim­ple pro­gram.

Ash­ley: Hmm. I’m try­ing to think if there’s some no­tion of ‘sim­plic­ity’ that I can ab­stract away from ‘sim­ple pro­gram’ as the nice prop­erty that diffrac­tion has as an ex­pla­na­tion for rain­bows, but I guess any­thing I try to say is go­ing to come down to some way of count­ing the wheels and gears in­side the ex­pla­na­tion, and jus­tify the com­plex­ity penalty on prob­a­bil­ity by the in­creased space of pos­si­ble con­figu­ra­tions each time we add a new gear. And I can’t make it be about sur­face de­tails be­cause that will make whole hu­mans seem way too im­prob­a­ble. If I have to use sim­ply speci­fied sys­tems and I can’t use sur­face de­tails or run­time, that’s prob­a­bly go­ing to end up ba­si­cally equiv­a­lent to Solomonoff in­duc­tion. So in that case we might as well use Solomonoff in­duc­tion, which is prob­a­bly sim­pler than what­ever I’ll think up and will give us the same ad­vice. Okay, you’ve mostly con­vinced me.

Blaine: Mostly? What’s left?

Ash­ley: Well, sev­eral things. Most of all, I think of how the ‘lan­guage of thought’ or ‘lan­guage of episte­mol­ogy’ seems to be differ­ent in some sense from the ‘lan­guage of com­puter pro­grams’. Like, when I think about the laws of New­to­nian grav­ity, or when I think about my Mom, it’s not just one more line of code tacked onto a big black-box com­puter pro­gram. It’s more like I’m craft­ing an ex­pla­na­tion with mod­u­lar parts—if it con­tains a part that looks like New­to­nian me­chan­ics, I step back and rea­son that it might con­tain other parts with differ­en­tial equa­tions. If it has a line of code for a Mom, it might have a line of code for a Dad. I’m wor­ried that if I un­der­stood how hu­mans think like that, maybe I’d look at Solomonoff in­duc­tion and see how it doesn’t in­cor­po­rate some fur­ther key in­sight that’s needed to do good episte­mol­ogy.

Blaine: Solomonoff in­duc­tion liter­ally in­cor­po­rates a copy of you think­ing about what­ever you’re think­ing right now.

Ash­ley: Okay, great, but that’s in­side the sys­tem. If Solomonoff learns to pro­mote com­puter pro­grams con­tain­ing good episte­mol­ogy, but is not it­self good episte­mol­ogy, then it’s not the best pos­si­ble an­swer to “How do you com­pute episte­mol­ogy?” Like, nat­u­ral se­lec­tion pro­duced hu­mans but pop­u­la­tion ge­net­ics is not an an­swer to “How does in­tel­li­gence work?” be­cause the in­tel­li­gence is in the in­ner con­tent rather than the outer sys­tem. In that sense, it seems like a rea­son­able worry that Solomonoff in­duc­tion might in­cor­po­rate only some prin­ci­ples of good episte­mol­ogy rather than all the prin­ci­ples, even if the in­ter­nal con­tent rather than the outer sys­tem might boot­strap the rest of the way.

Blaine: Hm. If you put it that way…

(long pause)

Blaine: …then, I guess I have to agree. I mean, Solomonoff in­duc­tion doesn’t ex­plic­itly say any­thing about, say, the dis­tinc­tion be­tween an­a­lytic propo­si­tions and em­piri­cal propo­si­tions, and know­ing that is part of good episte­mol­ogy on my view. So if you want to say that Solomonoff in­duc­tion is some­thing that boot­straps to good episte­mol­ogy rather than be­ing all of good episte­mol­ogy by it­self, I guess I have no choice but to agree. I do think the outer sys­tem already con­tains a lot of good episte­mol­ogy and in­spires a lot of good ad­vice all on its own. Espe­cially if you give it credit for for­mally re­pro­duc­ing prin­ci­ples that are “com­mon sense”, be­cause cor­rectly for­mal­iz­ing com­mon sense is no small feat.

Ash­ley: Got a list of the good ad­vice you think is deriv­able?

Blaine: Um. Not re­ally, but off the top of my head:

  1. The best ex­pla­na­tion is the one with the best mix­ture of sim­plic­ity and match­ing the ev­i­dence.

  2. “Sim­plic­ity” and “match­ing the ev­i­dence” can both be mea­sured in bits, so they’re com­men­su­rable.

  3. The sim­plic­ity of a hy­poth­e­sis is the num­ber of bits re­quired to for­mally spec­ify it, for ex­am­ple as a com­puter pro­gram.

  4. When a hy­poth­e­sis as­signs twice as much prob­a­bil­ity to the ex­act ob­ser­va­tions seen so far as some other hy­poth­e­sis, that’s one bit’s worth of rel­a­tively bet­ter match­ing the ev­i­dence.

  5. You should ac­tu­ally be mak­ing your pre­dic­tions us­ing all the ex­pla­na­tions, not just the sin­gle best one, but ex­pla­na­tions that poorly match the ev­i­dence will drop down to tiny con­tri­bu­tions very quickly.

  6. Good explanations let you compress lots of data into compact reasons which strongly predict seeing just that data and no other data.

  7. Logic can’t dictate prior probabilities absolutely, but if you assign probability less than \(2^{-1,000,000}\) to the proposition that mechanisms constructed using a small number of objects from your universe might be able to predict that universe well, you’re being unreasonable.

  8. So long as you don’t as­sign in­finites­i­mal prior prob­a­bil­ity to hy­pothe­ses that let you do in­duc­tion, they will very rapidly over­take hy­pothe­ses that don’t.

  9. It is a logical truth, not a contingent one, that more complex hypotheses must in the limit be less probable than simple ones.

  10. Epistemic rationality is a precise art with no user-controlled degrees of freedom in how much probability you ideally ought to assign to a belief. If you think you can tweak the probability depending on what you want the answer to be, you’re doing something wrong.

  11. Things that you’ve seen in one place might reappear somewhere else.

  12. Once you’ve learned a new language for your explanations, like differential equations, you can use it to describe other things, because your best hypotheses will now already encode that language.

  13. We can learn meta-reasoning procedures as well as object-level facts by looking at which meta-reasoning rules are simple and have done well on the evidence so far.

  14. So far, we seem to have no a priori reason to believe that universes which are more expensive to compute are less probable.

  15. People were wrong about galaxies being a priori improbable because that’s not how Occam’s Razor works. Today, other people are equally wrong about other parts of a continuous wavefunction counting as extra entities.

  16. If something seems “weird” to you but would be a consequence of simple rules that fit the evidence so far, well, there’s no term in these explicit laws of epistemology which adds an extra penalty term for weirdness.

  17. Your epistemology shouldn’t have extra rules in it that aren’t needed to do Solomonoff induction or something like it, including rules like “science is not allowed to examine this particular part of reality”--

Ash­ley: This list isn’t finite, is it.

Blaine: Well, there’s a lot of out­stand­ing de­bate about episte­mol­ogy where you can view that de­bate through the lens of Solomonoff in­duc­tion and see what Solomonoff sug­gests.

Ash­ley: But if you don’t mind my stop­ping to look at your last item, #17 above—again, it’s at­tempts to add com­plete­ness clauses to Solomonoff in­duc­tion that make me the most ner­vous. I guess you could say that a good rule of episte­mol­ogy ought to be one that’s pro­moted by Solomonoff in­duc­tion—that it should arise, in some sense, from the sim­ple ways of rea­son­ing that are good at pre­dict­ing ob­ser­va­tions. But that doesn’t mean a good rule of episte­mol­ogy ought to ex­plic­itly be in Solomonoff in­duc­tion or it’s out.

Blaine: Can you think of good episte­mol­ogy that doesn’t seem to be con­tained in Solomonoff in­duc­tion? Be­sides the ex­am­ple I already gave of dis­t­in­guish­ing log­i­cal propo­si­tions from em­piri­cal ones.

Ash­ley: I’ve been try­ing to. First, it seems to me that when I rea­son about laws of physics and how those laws of physics might give rise to higher lev­els of or­ga­ni­za­tion like molecules, cells, hu­man be­ings, the Earth, and so on, I’m not con­struct­ing in my mind a great big chunk of code that re­pro­duces my ob­ser­va­tions. I feel like this differ­ence might be im­por­tant and it might have some­thing to do with ‘good episte­mol­ogy’.

Blaine: I guess it could be? I think if you’re say­ing that there might be this un­known other thing and there­fore Solomonoff in­duc­tion is ter­rible, then that would be the nir­vana fal­lacy. Solomonoff in­duc­tion is the best for­mal­ized episte­mol­ogy we have right now--

Ash­ley: I’m not say­ing that Solomonoff in­duc­tion is ter­rible. I’m try­ing to look in the di­rec­tion of things that might point to some fu­ture for­mal­ism that’s bet­ter than Solomonoff in­duc­tion. Here’s an­other thing: I feel like I didn’t have to learn how to model the hu­man be­ings around me from scratch based on en­vi­ron­men­tal ob­ser­va­tions. I got a jump-start on mod­el­ing other hu­mans by ob­serv­ing my­self, and by re­cruit­ing my brain ar­eas to run in a sand­box mode that mod­els other peo­ple’s brain ar­eas—em­pa­thy, in a word. I guess I feel like Solomonoff in­duc­tion doesn’t in­cor­po­rate that idea. Like, maybe in­side the mix­ture there are pro­grams which do that, but there’s no ex­plicit sup­port in the outer for­mal­ism.

Blaine: This doesn’t feel to me like much of a dis­ad­van­tage of Solomonoff in­duc­tion--

Ash­ley: I’m not say­ing it would be a dis­ad­van­tage if we ac­tu­ally had a hy­per­com­puter to run Solomonoff in­duc­tion. I’m say­ing it might point in the di­rec­tion of “good episte­mol­ogy” that isn’t ex­plic­itly in­cluded in Solomonoff in­duc­tion. I mean, now that I think about it, a gen­er­al­iza­tion of what I just said is that Solomonoff in­duc­tion as­sumes I’m sep­a­rated from the en­vi­ron­ment by a hard, Carte­sian wall that oc­ca­sion­ally hands me ob­ser­va­tions. Shouldn’t a more re­al­is­tic view of the uni­verse be about a sim­ple pro­gram that con­tains me some­where in­side it, rather than a sim­ple pro­gram that hands ob­ser­va­tions to some other pro­gram?

Blaine: Hm. Maybe. How would you for­mal­ize that? It seems to open up a big can of worms--

Ash­ley: But that’s what my ac­tual episte­mol­ogy ac­tu­ally says. My world-model is not about a big com­puter pro­gram that pro­vides in­puts to my soul, it’s about an enor­mous math­e­mat­i­cally sim­ple phys­i­cal uni­verse that in­stan­ti­ates Ash­ley as one piece of it. And I think it’s good and im­por­tant to have episte­mol­ogy that works that way. It wasn’t ob­vi­ous that we needed to think about a sim­ple uni­verse that em­beds us. Descartes did think in terms of an im­per­vi­ous soul that had the uni­verse pro­ject­ing sen­sory in­for­ma­tion onto its screen, and we had to get away from that kind of episte­mol­ogy.

Blaine: You un­der­stand that Solomonoff in­duc­tion makes only a bounded num­ber of er­rors rel­a­tive to any com­puter pro­gram which does rea­son the way you pre­fer, right? If think­ing of your­self as a con­tigu­ous piece of the uni­verse lets you make bet­ter ex­per­i­men­tal pre­dic­tions, pro­grams which rea­son that way will rapidly be pro­moted.

Ash­ley: It’s still un­nerv­ing to see a for­mal­ism that seems, in its own struc­ture, to harken back to the Carte­sian days of a sep­a­rate soul watch­ing a sep­a­rate uni­verse pro­ject­ing sen­sory in­for­ma­tion on a screen. Who knows, maybe that would some­how come back to bite you?

Blaine: Well, it wouldn’t bite you in the form of re­peat­edly mak­ing wrong ex­per­i­men­tal pre­dic­tions.

Ash­ley: But it might bite you in the form of hav­ing no way to rep­re­sent the ob­ser­va­tion of, “I drank this ‘wine’ liquid and then my emo­tions changed; could my emo­tions them­selves be in­stan­ti­ated in stuff that can in­ter­act with some com­po­nent of this liquid? Can al­co­hol touch neu­rons and in­fluence them, mean­ing that I’m not a sep­a­rate soul?” If we in­ter­ro­gated the Solomonoff in­duc­tor, would it be able to un­der­stand that rea­son­ing? Which brings up that dan­gling ques­tion from be­fore about mod­el­ing the effect that my ac­tions and choices have on the en­vi­ron­ment, and whether, say, an agent that used Solomonoff in­duc­tion would be able to cor­rectly pre­dict “If I drop an anvil on my head, my se­quence of sen­sory ob­ser­va­tions will end.”

Eliezer: And that’s my cue to step in! For more about the is­sues Ash­ley raised with agents be­ing a con­tigu­ous part of the uni­verse, see nat­u­ral­is­tic re­flec­tion once that sec­tion ex­ists. Mean­while, we’ll con­sider next the ques­tion of ac­tions and choices, as we en­counter the agent that uses Solomonoff in­duc­tion for be­liefs and ex­pected re­ward max­i­miza­tion for se­lect­ing ac­tions—the perfect rol­ling sphere of ad­vanced agent the­ory, AIXI. Look for­ward to Ash­ley and Blaine’s di­alogue con­tin­u­ing there, once that sec­tion is up, which it cur­rently isn’t.
