# Modal combat

Modal combat is a way of studying the special case of the one-shot prisoner’s dilemma in which the players are programs which have read access to each other source code. There is a class of such agents, defined by expressions in modal logic, which can do general provability reasoning about themselves and each other, and furthermore there is a quick way to evaluate how two such agents will behave against each other.

While they are not allowed to run each other’s source code noteOtherwise we risk infinite regression as $$A$$ simulates $$B$$ that simulates $$A$$, we are going to generously give each of them access to a halting oracle so that they can make powerful reasoning about each other’s behavior.

Wait, why is this interesting? Are not your assumptions a bit overpowered? Well, they are. But thinking about this machines which are obviously way more powerful than our current capabilities is a useful, albeit unrealistic, proxy for superintelligence. We should furthermore remark that while we are going to stick with halting oracles through this article there exist bounded versions of some of the results.

## Optimality in modal combat

We are especially interested in figuring out how would one go about making ideal competitors in this setup, because that may shed light on decision theory issues.

For that, we need a good optimality criterion to allow us to compare different players. Let’s think about some conditions we would like our player to satisfy:

1. Unexploitability: We would like our player to never end in the sucker’s payoff (we cooperate and the opponent defects).

2. Robust cooperation: Whenever it is possible to achieve mutual cooperation, we would like to do so. A particular way of highly desirable robust cooperation is achieving cooperation with one self. We call this property of self-trust Löbian cooperation.

3. Robust exploitation: Whenever it is possible to exploit the opponent (tricking him into cooperating when we defect) we wish to do so.

The question now becomes: it is possible to design a program which performs optimally according to all those metrics? How close can we get?

## DefectBot and CliqueBot

The simplest program that achieves unexploitability is the program which defects against every opponent, commonly known as $$DefectBot$$. This is a rather boring result, which does not manage to even cooperate with itself.

Taking advantage of the access to source codes, we could build a program $$CliqueBot$$ which cooperates if and only if the opponent’s source code is equal to his own code (wait, how do you do that? With the magic of quining we can always treat our own code as data!). Unexploitable? Check. Achieves Löbian cooperation? Check. But can’t we do better? Innocuous changes to $$CliqueBot$$’s source code (such as adding comments) result in agents which fail to cooperate, which is rather lame.

Before we start making more complicated program designs we need to develop a good reasoning framework to figure out how the matches will end. Simple simulation will not do, because we have given them access to a halting oracle to make proofs about their opponent’s program.

Fortunately, there is one formal system which nicely handles reasoning about provability: cue to Provability logic.

in provability logic we have a modal operator $$\square$$ which Solovay’s theorem tells us can be read as “it is provable in $$PA$$ that”. Furthermore, this system of logic is fully decidable, though that comes at the expense of some expressive power (we, for example, do not have quantifiers in modal logic).

The next step is to express agents in modal combat as expressions of modal logic.

For that we need to extend provability logic with second-order formulas. A second-order formula $$\phi(x)$$ is a fixed point expression in the language of modal logic in which $$x$$ appears as a formula and $$\phi$$ can be used as a variable.

For example, we could have the formula $$\phi(x) \leftrightarrow \square x(\phi)$$ and the formula $$\psi(x)\leftrightarrow \neg\square x(\psi)$$. To evaluate a formula with its free variable instantiated we substitute their variables its fixed point expression its body as many times as required until the formula with its argument its expressed in terms of itself, and then we compute its fixed point in GL.

For example, $$\psi(\phi)\leftrightarrow \neg\square\phi(\psi)\leftrightarrow \neg\square\square\psi (\phi)$$, and the fixed point can be calculated to be $$GL\vdash \psi(\phi)\leftrightarrow \neg [\square \bot \vee \square\square\bot \wedge \neg\square \bot]$$, which is arithmetically false and thus we say that $$\phi(\psi)$$ evaluates to false.

It is possible to reference other formulas in the fixed point expression of a formula, though one precaution must be taken of lest we fall into circular dependencies. We are going to say that a formula is of rank $$0$$ if it is defined in terms of itself and no other formula. Then we say that a formula can only be defined in terms of itself and lower rank formulas, and define the rank of a formula as the biggest rank in its definition plus one. If the rank is well defined, then the formula is also well defined. To guarantee the existence of a fixed point it will also be convenient to restrict ourselves to fully modalized formulas.

A modal agent of rank $$k$$ is a one place fully modalized modal formula of the kind we have just defined of rank $$k$$, whose free variable represents the opponent’s source code. To simulate a match, we substitute the free variable by the opponents formula, and evaluate it according to the procedure described in the previous paragraph. If the fixed point is arithmetically true, then the agent cooperates, and if it is false it defects. The rank of the agent is the maximum number of nested boxes that occur in the sentence.

Thus we have that we can express $$DefectBot$$ as a modal agent with the modal formula $$DB(x)\leftrightarrow \bot$$.

## FairBot and PrudentBot

To try to beat CliqueBot’s performance, we will define a bot which takes advantage of the halting oracle to achieve robust cooperation with a greater class of agents.

FairBot(x): if Prv(x(FairBot))=C then output C else return D


$$FairBot$$ tries to find a proof in $$PA$$ that its opponent is going to cooperate with him, and if it succeeds then it cooperates as well.

We can express $$FairBot$$ as a modal agent represented by the modal formula $$FB(x) \leftrightarrow \square x(FB)$$.

So, does $$FairBot$$ cooperate with himself? Let’s find out!

The result of the match against itself is the same as the fixed point of $$FB(FB)\leftrightarrow \square FB(FB)$$. According to the methods for computing fixed points, the fixed point is thus $$\top$$, which is of course arithmetically truenoteThis can also be figured out using Löb’s theorem. Thus $$FairBot$$ satisfies the Löbian cooperation condition!

Furthermore, this holds no matter how $$FairBot$$ is implemented, so it is way more robust than $$CliqueBot$$!

What happens if $$FairBot$$ faces $$DefectBot$$? Then we have to find the fixed point of $$FB(DB)\leftrightarrow \square \bot$$, which is obviously $$\square \bot$$, and since this expression is arithmetically false notePA+1 disproves it!, then it is the case that FairBot does not cooperate with $$DefectBot$$.

$$FairBot$$ is an impressive formalism, and it achieves very robust cooperation. However, it fails to exploit other agents. For example, it fails to exploit the simple program $$CooperateBot$$ note$$CB(x)\leftrightarrow \top$$, which always cooperates.

An improvement can be made by considering the program represented by the rank $$1$$ modal formula $$PB(x)\leftrightarrow\square [x(PB)\wedge \neg\square\bot \rightarrow \neg x(DB)]$$, which we call $$PrudentBot$$. By its definition, $$PrudentBot$$ cooperates with another modal agent if and only if it can prove that the opponent will cooperate with him, and furthermore $$PA+1$$ proves that its opponent will not cooperate with $$DefectBot$$.

Thus $$PrudentBot$$ cooperates against itself and with $$FairBot$$, but defects against $$DefectBot$$ and $$CooperateBot$$. In fact, $$PrudentBot$$ dominates $$FairBot$$.

## Can we do better?

One question that comes to mind: is it possible to design a modal agent which achieves cooperation with every agent for which there is at least an input which will make it cooperate without being exploitable? Sadly no. Consider the bot TrollBot\$ defined by $$TB(x)\leftrightarrow \square x(DB)$$noteExercise for the reader: does $$TrollBot$$ cooperate with himself?. That is, $$TrollBot$$ cooperates with you if and only if you cooperate with $$DefectBot$$.

Then $$CooperateBot$$ achieves cooperation with $$TrollBot$$, but no bot cooperates with $$TrollBot$$ while defecting against $$DefectBot$$.

In general, there is no good optimality criteria developed. An open question is to formalize a good notion of optimality in the class of modal agents and design a bot which achieves it.

Parents:

• At this point, I think the evidence points away from there being any deeply useful form of optimality for agents in modal combat. The things you can do with provability logic are more restrictive, it turns out, than the things you can do with logical inductors, and for that reason most of my current thinking about decision theory is in that context.

Accordingly, I now think of modal combat as being a way of studying and illustrating a few basic concepts in decision theory for advanced agents, but not an essential part of MIRI’s current research direction.