Probability distribution: Motivated definition

When discussing probabilities, people will often (informally) say things like “well, the probability $\mathbb P(sick)$ of the patient being sick is about 20%.” What does this $\mathbb P(sick)$ notation mean?

Intuitively, $\mathbb P(sick)$ is supposed to denote the probability that a particular person is sick (on a scale from 0 to 1). But how is $\mathbb P(sick)$ defined? Is there an objective probability of sickness? If not, where does the number come from?

At first you might be tempted to say $\mathbb P(sick)$ is defined by the surrounding population: If 1% of people are sick at any given time, then maybe $\mathbb P(sick)$ should be 1%. But what if this person is currently running a high fever and complaining about an upset stomach? Then we should probably assign a probability higher than 1%.

Next you might be tempted to say that the true probability of the person being sick is either 0 or 1 (because they’re either sick or they aren’t), but this observation doesn’t really help us manage our own uncertainty. It’s all well and good to say “either they’re sick or they aren’t,” but if you’re a doctor who has to choose which medication to prescribe (and different ones have different drawbacks), then you need some way of talking about how sick they seem to be (given what you’ve seen).

This leads us to the notion of subjective probability. Your probability that a person is sick is a fact about you. They are either sick or healthy, and as you observe more facts about them (such as “they’re running a fever”), your personal belief in their health vs sickness changes. This is the idea that is used to define notation like $\mathbb P(sick).$

Formally, $\mathbb P(sick)$ is defined to be the probability that $\mathbb P$ assigns to $sick,$ where $\mathbb P$ is a type of object known as a “probability distribution”, which is an object designed for keeping track of (and managing) uncertainty. Specifically, probability distributions are objects that distribute a finite amount of “stuff” across a large number of “states,” and $\mathbb P(sick)$ measures how much stuff $\mathbb P$ in particular puts on $sick$-type states. For example, the states could be cups with labels on them, and the stuff could be water, in which case $\mathbb P(sick)$ would be the proportion of all water in the $sick$-labeled cups.

The “stuff” and “states” may be arbitrary: you can build a probability distribution out of water in cups, clay in cubbyholes, abstract numbers represented in a computer, or weightings between neurons in your head. The stuff is called “probability mass,” the states are called “possibilities.”
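
As a minimal sketch of the cups-and-water picture, here is a two-cup distribution represented as "abstract numbers in a computer." The water amounts and the helper name `probability` are made up for illustration; only the idea (probability = proportion of all water in the matching cup) comes from the text.

```python
from fractions import Fraction

# Hypothetical amounts of water (say, in millilitres) in two labeled cups.
cups = {"sick": 20, "healthy": 80}

def probability(cups, label):
    """Proportion of all the water that sits in the cup with this label."""
    total = sum(cups.values())
    return Fraction(cups[label], total)

print(probability(cups, "sick"))     # 1/5
print(probability(cups, "healthy"))  # 4/5
```

Only the proportions matter here: doubling the water in every cup would leave both probabilities unchanged.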

To be even more concrete, imagine you build $\mathbb P$ out of cups and water, and that you give some of the cups suggestive labels like $sick$ and $healthy$. Then you can talk about the proportion of all probability-water that’s in the $sick$ cup vs the $healthy$ cup. This is a probability distribution, but it’s not a very useful one. In practice, we want to model more than one thing at a time. Let’s say that you’re a doctor at an immigration center who needs to assess a person’s health, age, and country of origin. Now the possibilities that you want to represent aren’t just $sick$ and $healthy$; they’re all combinations of health, age, and origin:

$$ \begin{align} sick, \text{age }1, \text{Afghanistan} \\ healthy, \text{age }1, \text{Afghanistan} \\ sick, \text{age }2, \text{Afghanistan} \\ \vdots \\ sick, \text{age }29, \text{Albania} \\ healthy, \text{age }29, \text{Albania} \\ sick, \text{age }30, \text{Albania} \\ \vdots \end{align} $$

and so on. If you build this probability distribution out of cups, you’re going to need a lot of cups. If there are 2 possible health states ($sick$ and $healthy$), 150 possible ages, and 196 possible countries, then the total number of cups you need in order to build this probability distribution is $2 \cdot 150 \cdot 196 = 58800,$ which is rather excessive. (There’s a reason we do probabilistic reasoning using transistors and/or neurons, as opposed to cups with water in them.)
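
The cup count can be checked by enumerating every combination of health, age, and country. This is only a sketch: the age range 1 to 150 and the placeholder country names are assumptions made so that the counts match the ones in the text.

```python
from itertools import product

health_states = ["sick", "healthy"]
ages = range(1, 151)                              # 150 ages (assumed to run 1..150)
countries = [f"country_{i}" for i in range(196)]  # 196 placeholder country names

# One cup per (health, age, country) combination.
possibilities = list(product(health_states, ages, countries))

print(len(possibilities))  # 58800
print(possibilities[0])    # ('sick', 1, 'country_0')
```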

In order to make this proliferation of possibilities manageable, the possibilities are usually arranged into columns, such as the “Health”, “Age”, and “Country” columns above. These columns are known as “variables” of the distribution. Then, $\mathbb P(sick)$ is an abbreviation for $\mathbb P(\text{Health}=sick),$ which counts the proportion of all probability mass (water) allocated to possibilities (cups) that have $sick$ in the Health column of their label.
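
To illustrate how $\mathbb P(\text{Health}=sick)$ is read off a distribution with several variables, the sketch below sums the water in every cup whose label has $sick$ in the Health column and divides by all the water. The handful of cups, their water amounts, and the names `COLUMNS` and `prob` are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from fractions import Fraction

COLUMNS = ("Health", "Age", "Country")

# Hypothetical cups: label (health, age, country) -> units of water.
water = {
    ("sick", 29, "Albania"): 1,
    ("healthy", 29, "Albania"): 4,
    ("sick", 30, "Albania"): 2,
    ("healthy", 30, "Albania"): 3,
    ("sick", 1, "Afghanistan"): 1,
    ("healthy", 1, "Afghanistan"): 9,
}

def prob(water, variable, value):
    """Proportion of all water in cups whose `variable` column equals `value`."""
    idx = COLUMNS.index(variable)
    total = sum(water.values())
    matching = sum(amount for label, amount in water.items() if label[idx] == value)
    return Fraction(matching, total)

print(prob(water, "Health", "sick"))      # 1/5  (4 of 20 units)
print(prob(water, "Country", "Albania"))  # 1/2  (10 of 20 units)
```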

What’s the point of doing all this setup? Once we’ve made a probability distribution, we can hook it up to the outside world such that, when the world interacts with the probability distribution, the probability mass is shifted around inside the cups. For example, if you have a rule which says “whenever a person shows me a passport from country X, I throw out all water except the water in cups with X in the Country column”, then, whenever you see a passport, the probability distribution will get more accurate.
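
The passport rule can be sketched in a few lines. The cups, water amounts, and function names below are hypothetical, the Age column is dropped to keep the example small, and the sketch assumes at least some water remains after the observation (otherwise the proportion is undefined).

```python
from fractions import Fraction

# Hypothetical cups: label (health, country) -> units of water.
water = {
    ("sick", "Albania"): 1,
    ("healthy", "Albania"): 3,
    ("sick", "Afghanistan"): 2,
    ("healthy", "Afghanistan"): 4,
}

def prob_sick(water):
    """Proportion of all remaining water that sits in sick-labeled cups."""
    total = sum(water.values())
    sick = sum(amount for (health, _), amount in water.items() if health == "sick")
    return Fraction(sick, total)

def observe_passport(water, country):
    """Throw out all water except the water in cups with `country` in the Country column."""
    return {
        label: (amount if label[1] == country else 0)
        for label, amount in water.items()
    }

print(prob_sick(water))                                # 3/10
print(prob_sick(observe_passport(water, "Albania")))   # 1/4
```

Because probabilities are defined as proportions of whatever water is left, no separate rescaling step is needed after emptying the non-matching cups.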

The natural question here is, what are the best ways to manipulate the probability mass in $\mathbb P$ (in response to observations), if the goal is to have $\mathbb P$ get more and more accurate over time? That’s exactly the sort of question that probability theory can be used to answer (and it has implications both for artificial intelligence, and for understanding human intelligence — after all, we ourselves are a physical system that manages uncertainty, and updates beliefs in response to observations).

At this point, there are two big objections to answer. First objection:

Whoa now, the number of cups in $\mathbb P$ got pretty big pretty quickly, and this was a simple example. In a realistic probability distribution $\mathbb P$ intended to represent the real world (which has way more than 3 variables worth tracking), the number of necessary possibilities would be ridiculous. Why do we define probabilities in terms of these huge impractical “probability distributions”?

This is an important question, which is answered by three points:

1. In practice, there are a number of tricks for exploiting regularities in the structure of the world in order to drastically reduce the number of cups you need to track. We won’t be covering those tricks in this guide, but you can check out Realistic probabilities and Arbital’s guide to Bayes nets if you’re interested in the topic.
2. Even so, full-fledged probabilistic reasoning is computationally infeasible on complex problems. In practice, physical reasoning systems (such as brains or artificial intelligence algorithms) use lots of approximations and shortcuts.
3. Nevertheless, reasoning according to a full probability distribution is the theoretical ideal for how to do good reasoning. You can’t do better than probabilistic reasoning (unless you’re born knowing the right answers to everything), and insofar as you don’t use probabilistic reasoning, you can be exploited. Even if complex probability distributions are too big to manage in practice, they give lots of hints about how to reason right or wrong that we can follow in our day-to-day lives.

Second objection:

You basically just said “given a bunch of cups and some water, we define the probability of a person being sick as the amount of water in some suggestively-labeled cups.” How does that have anything to do with whether or not the person is actually sick? Just because you put a $sick$ label on there doesn’t magically give the water meaning!

This is an important point. For $\mathbb P$ to be useful, we want to design a reasoning procedure such that the more we interact with a person, the more probability mass starts to reflect how healthy the person actually is. That is, we want the water to go into $sick$ cups if they’re sick, and $healthy$ cups if they’re healthy. If our reasoning procedure has that property, and we have $\mathbb P$ interact with the world for a while, then its probabilities will get pretty accurate — at which point $\mathbb P$ can be used to answer questions and/or make decisions. (This is the principle that makes brains and artificial intelligence algorithms tick.)

How do we design reasoning mechanisms that cause $\mathbb P$ to become more accurate the more it interacts with the world? That’s a big question, and the answer has many parts. One of the most important parts of the answer, though, is a law of probability theory which tells us the correct way to move the probability mass around in response to new observations (assuming the goal is to make $\mathbb P$ more accurate). For more on that law, see Bayes’ rule.