# Square visualization of probabilities on two events

Say we have two events, \(A\) and \(B\), and a probability distribution \(\bP\) over whether or not they happen. We can represent \(\bP\) as a square:

So for example, the probability \(\bP(A,B)\) of both \(A\) and \(B\) occurring is the ratio of [the area of the dark red region] to [the area of the entire square]:

Visualizing probabilities in a square is neat because we can draw simple pictures that highlight interesting facts about our probability distribution.

Below are some pictures illustrating:

- independent events (What happens if the columns and the rows in our square *both* line up?)
- marginal probabilities (If we’re looking at a square of probabilities, where’s the probability \(\bP(A)\) of \(A\) or the probability \(\bP(\neg B)\)?)
- conditional probabilities (Can we find in the square the probability \(\bP(B \mid A)\) of \(B\) if we condition on seeing \(A\)? What about the conditional probability \(\bP(A \mid B)\)?)
- factoring a distribution (Can we always write \(\bP\) as a square? Why do the columns line up but not the rows?)
- the process of computing joint probabilities from factored probabilities

# Independent events

Here’s a picture of the joint distribution of two independent events \(A\) and \(B\):

Now the rows for \(\bP(B)\) and \(\bP(\neg B)\) line up across the two columns. This is because \(\bP(B \mid A) = \bP(B) = \bP(B \mid \neg A)\). When \(A\) and \(B\) are independent, updating on \(A\) or \(\neg A\) doesn’t change the probability of \(B\).
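The "rows line up" condition can be checked numerically. The sketch below uses a made-up joint distribution (the numbers and the helper name `prob_b_given` are assumptions for illustration, not taken from the pictures) and tests whether both conditionals equal the marginal \(\bP(B)\).

```python
import math

# Hypothetical joint distribution, keyed by (A, B) truth values.
# The numbers are made up and chosen so that A and B are independent:
# P(A) = 0.4 and P(B) = 0.3, and every block is the product of the two.
joint = {
    (True, True): 0.12,
    (True, False): 0.28,
    (False, True): 0.18,
    (False, False): 0.42,
}


def prob_b_given(joint, a):
    """P(B | A = a): the share of the A = a column where B also happens."""
    column = joint[(a, True)] + joint[(a, False)]
    return joint[(a, True)] / column


p_b = joint[(True, True)] + joint[(False, True)]
p_b_given_a = prob_b_given(joint, True)
p_b_given_not_a = prob_b_given(joint, False)

# The rows line up when both conditionals match the marginal P(B).
rows_line_up = math.isclose(p_b_given_a, p_b) and math.isclose(p_b_given_not_a, p_b)
print(p_b_given_a, p_b, p_b_given_not_a, rows_line_up)
```

Changing any one block (and renormalizing) would generally break the equality, which corresponds to the rows no longer lining up across the two columns.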

For more on this visualization of independent events, see the aptly named page “Two independent events: Square visualization”.

# Marginal probabilities

We can see the marginal probabilities of \(A\) and \(B\) by looking at some of the blocks in our square. For example, to find the probability \(\bP(\neg A)\) that \(A\) doesn’t occur, we just need to add up all the blocks where \(\neg A\) happens: \(\bP(\neg A) = \bP(\neg A, B) + \bP(\neg A, \neg B)\).

Here’s the probability \(\bP(A)\) of \(A\), and the probability \(\bP(\neg A)\) of \(\neg A\):

Here’s the probability \(\bP(\neg B)\) of \(\neg B\):

In these pictures we’re dividing by the area of the whole square. Since the probability of anything at all happening is 1, we could just leave it out, but it’ll be helpful for comparison while we think about conditionals next.
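As a rough sketch of the block-summing rule for marginals, the code below adds up the relevant blocks and divides by the area of the whole square. The block areas and the function names `marginal_a` and `marginal_b` are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical block areas for the four regions of the square
# (made-up numbers; keys are (A, B) truth values).
joint = {
    (True, True): 0.08,
    (True, False): 0.32,
    (False, True): 0.30,
    (False, False): 0.30,
}


def marginal_a(joint, a):
    """P(A = a): area of the two blocks where A = a, over the whole square."""
    whole_square = sum(joint.values())
    return (joint[(a, True)] + joint[(a, False)]) / whole_square


def marginal_b(joint, b):
    """P(B = b): area of the two blocks where B = b, over the whole square."""
    whole_square = sum(joint.values())
    return (joint[(True, b)] + joint[(False, b)]) / whole_square


p_not_a = marginal_a(joint, False)  # P(not A, B) + P(not A, not B)
p_not_b = marginal_b(joint, False)  # P(A, not B) + P(not A, not B)
print(p_not_a, p_not_b)
```

Because the four blocks sum to 1, the division changes nothing here; it is kept only to mirror the "ratio of areas" reading of the pictures.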

# Conditional probabilities

We can start with some probability \(\bP(B)\), and then *assume* that \(A\) is true to get a conditional probability \(\bP(B \mid A)\) of \(B\). Conditioning on \(A\) being true is like restricting our whole attention to just the possible worlds where \(A\) happens:

Then the conditional probability of \(B\) given \(A\) is the proportion of these \(A\) worlds where \(B\) also happens:

If instead we condition on \(\neg A\), we get:

So our square visualization gives a nice way to see, at a glance, the conditional probabilities of \(B\) given \(A\) or given \(\neg A\):
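The same reading can be written out as a small computation: restrict to one column, then take the proportion of that column where \(B\) happens. This is a minimal sketch with made-up block areas; `conditional_b_given_a` is a hypothetical helper name.

```python
import math

# Hypothetical block areas, keyed by (A, B) truth values (made-up numbers).
joint = {
    (True, True): 0.08,
    (True, False): 0.32,
    (False, True): 0.30,
    (False, False): 0.30,
}


def conditional_b_given_a(joint, b, a):
    """P(B = b | A = a): restrict attention to the A = a worlds, then
    take the proportion of that column where B = b."""
    a_worlds = joint[(a, True)] + joint[(a, False)]
    return joint[(a, b)] / a_worlds


p_b_given_a = conditional_b_given_a(joint, True, True)       # 0.08 / 0.40
p_b_given_not_a = conditional_b_given_a(joint, True, False)  # 0.30 / 0.60
print(p_b_given_a, p_b_given_not_a)
```

Note that the denominator is the area of a single column rather than the whole square, which is the difference from the marginal case.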

We don’t get such nice pictures for \(\bP(A \mid B)\):

# Factoring a distribution

Recall the square showing our joint distribution \(\bP\):

Notice that in the above square, the reddish blocks for \(\bP(A,B)\) and \(\bP(A,\neg B)\) are the same width and form a column; and likewise the blueish blocks for \(\bP(\neg A,B)\) and \(\bP(\neg A,\neg B)\). This is because we chose to factor our probability distribution starting with \(A\):

Let’s use the equivalence between events and binary random variables, so if we say \(\bP(B = \true \mid A = \false)\) we mean \(\bP(B \mid \neg A)\). For any choice of truth values \(t_A \in \{\true, \false\}\) and \(t_B \in \{\true, \false\}\), we have

\[\bP(A = t_A, B = t_B) = \bP(A = t_A) \; \bP(B = t_B \mid A = t_A).\]

The first factor \(\bP(A = t_A)\) tells us how wide to make the red column \((\bP(A = \true))\) relative to the blue column \((\bP(A = \false))\). Then the second factor \(\bP( B= t_B \mid A= t_A)\) tells us the proportions of dark \((B = \true)\) and light \((B = \false)\) within the column for \(A = t_A\).
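To make the geometry concrete, here is a sketch that turns the two factors into rectangles inside a unit square: the first factor sets each column width, and the second sets the split within the column. The input probabilities, the function name `square_layout`, and the drawing convention (the \(A\) column on the left, the \(B = \true\) block at the bottom) are all assumptions made for illustration.

```python
def square_layout(p_a, p_b_given_a, p_b_given_not_a):
    """Rectangles (x, y, width, height) inside a unit square, keyed by (A, B).

    Assumed drawing convention: the A column is on the left and the
    B = True block sits at the bottom of each column.
    """
    columns = (
        (True, p_a, p_b_given_a),
        (False, 1.0 - p_a, p_b_given_not_a),
    )
    layout = {}
    x = 0.0
    for a, width, p_b in columns:
        layout[(a, True)] = (x, 0.0, width, p_b)
        layout[(a, False)] = (x, p_b, width, 1.0 - p_b)
        x += width
    return layout


def area(rect):
    """Area of an (x, y, width, height) rectangle."""
    _, _, width, height = rect
    return width * height


# Hypothetical factored probabilities (made-up numbers).
layout = square_layout(p_a=0.4, p_b_given_a=0.2, p_b_given_not_a=0.5)
areas = {key: area(rect) for key, rect in layout.items()}
print(areas)
```

Each block's area is width times height, so it equals the product of the two factors. Blocks in the same column share a width, while the split heights differ between columns unless the two conditionals happen to be equal.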

We could just as well have factored by \(B\) first:

\[\bP(A = t_A, B = t_B) = \bP(B = t_B) \; \bP(A = t_A \mid B = t_B).\]

Then we’d draw a picture like this:

By the way, earlier when we factored by \(A\) first, we got simple pictures of the probabilities \(\bP(B \mid A)\) for \(B\) conditioned on \(A\). Now that we’re factoring by \(B\) first, we have simple pictures for the conditional probability \(\bP(A \mid B)\):

and for the conditional probability \(\bP(A \mid \neg B)\):
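A quick sketch of factoring by \(B\) first: starting from the four block areas, compute \(\bP(B)\) and the two conditionals on \(B\), then check that multiplying them back together recovers the original blocks. The block areas and the helper name `factor_by_b` are hypothetical.

```python
import math

# Hypothetical block areas, keyed by (A, B) truth values (made-up numbers).
joint = {
    (True, True): 0.08,
    (True, False): 0.32,
    (False, True): 0.30,
    (False, False): 0.30,
}


def factor_by_b(joint):
    """Return P(B) and the conditionals P(A | B) and P(A | not B)."""
    p_b = joint[(True, True)] + joint[(False, True)]
    p_not_b = joint[(True, False)] + joint[(False, False)]
    p_a_given_b = joint[(True, True)] / p_b
    p_a_given_not_b = joint[(True, False)] / p_not_b
    return p_b, p_a_given_b, p_a_given_not_b


p_b, p_a_given_b, p_a_given_not_b = factor_by_b(joint)

# Multiplying the factors back together recovers the same blocks.
rebuilt_a_and_b = p_b * p_a_given_b
rebuilt_a_and_not_b = (1 - p_b) * p_a_given_not_b
print(p_b, p_a_given_b, p_a_given_not_b)
```

The joint distribution is unchanged by the choice of factoring; only which conditionals are easy to read off differs.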

# Computing joint probabilities from factored probabilities

Let’s say we know the factored probabilities for \(A\) and \(B\), factoring by \(A\). That is, we know \(\bP(A = \true)\), and we also know \(\bP(B = \true \mid A = \true)\) and \(\bP(B = \true \mid A = \false)\). How can we recover the joint probability \(\bP(A = t_A, B = t_B)\) that \(A = t_A\) is the case and also \(B = t_B\) is the case?

Since

\[\bP(\neg B \mid A) = \frac{\bP(A, \neg B)}{\bP(A)},\]

we can multiply the prior \(\bP(A)\) by the conditional \(\bP(\neg B \mid A)\) to get the joint \(\bP(A, \neg B)\):

If we do this at the same time for all the possible truth values \(t_A\) and \(t_B\), we get back the full joint distribution:
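The multiplication step can be sketched as a short function that takes the three known quantities and returns all four joint probabilities. The function name `joint_from_factors` and the input numbers are assumptions for illustration.

```python
import math


def joint_from_factors(p_a, p_b_given_a, p_b_given_not_a):
    """Joint P(A = t_A, B = t_B) from P(A) and the conditionals on A.

    Each entry is P(A = t_A) * P(B = t_B | A = t_A).
    """
    p_a_of = {True: p_a, False: 1.0 - p_a}
    p_b_true_given = {True: p_b_given_a, False: p_b_given_not_a}
    joint = {}
    for t_a in (True, False):
        for t_b in (True, False):
            # P(B = False | A = t_A) is the complement of P(B = True | A = t_A).
            p_b = p_b_true_given[t_a] if t_b else 1.0 - p_b_true_given[t_a]
            joint[(t_a, t_b)] = p_a_of[t_a] * p_b
    return joint


# Hypothetical factored probabilities (made-up numbers).
joint = joint_from_factors(p_a=0.4, p_b_given_a=0.2, p_b_given_not_a=0.5)
print(joint)
```

Only three numbers are needed as input because the remaining prior and conditionals are complements of the ones given.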


