Probability distribution: Motivated definition

When dis­cussing prob­a­bil­ities, peo­ple will of­ten (in­for­mally) say things like “well, the prob­a­bil­ity \(\mathbb P(sick)\) of the pa­tient be­ing sick is about 20%.” What does this \(\mathbb P(sick)\) no­ta­tion mean?

In­tu­itively, \(\mathbb P(sick)\) is sup­posed to de­note the prob­a­bil­ity that a par­tic­u­lar per­son is sick (on a scale from 0 to 1). But how is \(\mathbb P(sick)\) defined? Is there an ob­jec­tive prob­a­bil­ity of sick­ness? If not, where does the num­ber come from?

At first you might be tempted to say \(\mathbb P(sick)\) is defined by the sur­round­ing pop­u­la­tion: If 1% of peo­ple are sick at any given time, then maybe \(\mathbb P(sick)\) should be 1%. But what if this per­son is cur­rently run­ning a high fever and com­plain­ing about an up­set stom­ach? Then we should prob­a­bly as­sign a prob­a­bil­ity higher than 1%.

Next you might be tempted to say that the true probability of the person being sick is either 0 or 1 (because they’re either sick or they aren’t), but this observation doesn’t really help us manage our own uncertainty. It’s all well and good to say “either they’re sick or they aren’t,” but if you’re a doctor who has to choose which medication to prescribe (and different ones have different drawbacks), then you need some way of talking about how sick they seem to be (given what you’ve seen).

This leads us to the notion of subjective probability. Your probability that a person is sick is a fact about you. They are either sick or healthy, and as you observe more facts about them (such as “they’re running a fever”), your personal belief in their health vs sickness changes. This is the idea that is used to define notation like \(\mathbb P(sick).\)

For­mally, \(\mathbb P(sick)\) is defined to be the prob­a­bil­ity that \(\mathbb P\) as­signs to \(sick,\) where \(\mathbb P\) is a type of ob­ject known as a “prob­a­bil­ity dis­tri­bu­tion”, which is an ob­ject de­signed for keep­ing track of (and man­ag­ing) un­cer­tainty. Speci­fi­cally, prob­a­bil­ity dis­tri­bu­tions are ob­jects that dis­tribute a finite amount of “stuff” across a large num­ber of “states,” and \(\mathbb P(sick)\) mea­sures how much stuff \(\mathbb P\) in par­tic­u­lar puts on \(sick\)-type states. For ex­am­ple, the states could be cups with la­bels on them, and the stuff could be wa­ter, in which case \(\mathbb P(sick)\) would be the pro­por­tion of all wa­ter in the \(sick\)-la­beled cups.

The “stuff” and “states” may be ar­bi­trary: you can build a prob­a­bil­ity dis­tri­bu­tion out of wa­ter in cups, clay in cub­by­holes, ab­stract num­bers rep­re­sented in a com­puter, or weight­ings be­tween neu­rons in your head. The stuff is called “prob­a­bil­ity mass,” the states are called “pos­si­bil­ities.”
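As a minimal sketch of the cups-and-water picture, a distribution can be stored as a mapping from cup labels to amounts of water. The name `cups`, the helper `probability`, and the amounts are invented for illustration; only the proportions matter.

```python
from fractions import Fraction

# Hypothetical amounts of "water" (probability mass) in each labeled cup.
cups = {"sick": Fraction(1), "healthy": Fraction(4)}

def probability(cups, label):
    """Proportion of all water that sits in the cup with the given label."""
    total = sum(cups.values())
    return cups[label] / total

p_sick = probability(cups, "sick")
print(p_sick)  # 1/5
```

Doubling the water in every cup would leave `p_sick` unchanged, which is why the total amount of stuff does not matter.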

To be even more con­crete, imag­ine you build \(\mathbb P\) out of cups and wa­ter, and that you give some of the cups sug­ges­tive la­bels like \(sick\) and \(healthy\). Then you can talk about the pro­por­tion of all prob­a­bil­ity-wa­ter that’s in the \(sick\) cup vs the \(healthy\) cup. This is a prob­a­bil­ity dis­tri­bu­tion, but it’s not a very use­ful one. In prac­tice, we want to model more than one thing at a time. Let’s say that you’re a doc­tor at an im­mi­gra­tion cen­ter who needs to as­sess a per­son’s health, age, and coun­try of ori­gin. Now the set of pos­si­bil­ities that you want to rep­re­sent aren’t just \(sick\) and \(healthy,\) they’re all com­bi­na­tions of health, age, and ori­gin:

$$ \begin{align} sick, \text{age }1, \text{Afghanistan} \\ healthy, \text{age }1, \text{Afghanistan} \\ sick, \text{age }2, \text{Afghanistan} \\ \vdots \\ sick, \text{age }29, \text{Albania} \\ healthy, \text{age }29, \text{Albania} \\ sick, \text{age }30, \text{Albania} \\ \vdots \end{align} $$

and so on. If you build this probability distribution out of cups, you’re going to need a lot of cups. If there are 2 possible health states (\(sick\) and \(healthy\)), 150 possible ages, and 196 possible countries, then the total number of cups you need in order to build this probability distribution is \(2 \cdot 150 \cdot 196 = 58800,\) which is rather excessive. (There’s a reason we do probabilistic reasoning using transistors and/or neurons, as opposed to cups with water in them.)
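The cup count can be checked by enumerating every combination of health state, age, and country. This is only a sketch: the countries are stand-in integers rather than real names, and the variable names are assumptions.

```python
from itertools import product

health_states = ["sick", "healthy"]
ages = range(1, 151)    # 150 possible ages, as in the text
countries = range(196)  # stand-ins for 196 countries (names omitted)

# One possibility (one cup) per combination of health, age, and country.
possibilities = list(product(health_states, ages, countries))
num_cups = len(possibilities)
print(num_cups)  # 58800
```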

In order to make this proliferation of possibilities manageable, the possibilities are usually arranged into columns, such as the “Health”, “Age”, and “Country” columns above. These columns are known as “variables” of the distribution. Then, \(\mathbb P(sick)\) is an abbreviation for \(\mathbb P(\text{Health}=sick),\) which is the proportion of all probability mass (water) allocated to possibilities (cups) that have \(sick\) in the Health column of their label.
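Here is one way the abbreviation could be computed, assuming each cup is labeled with a (Health, Age, Country) tuple. The four cups and their masses are made up, and `prob` is a hypothetical helper.

```python
# Hypothetical joint distribution over (Health, Age, Country); the masses are
# invented for illustration and need not sum to 1.
joint = {
    ("sick", 29, "Albania"): 1.0,
    ("healthy", 29, "Albania"): 5.0,
    ("sick", 30, "Albania"): 1.0,
    ("healthy", 30, "Afghanistan"): 3.0,
}

VARIABLES = ("Health", "Age", "Country")

def prob(joint, variable, value):
    """P(variable = value): share of all mass in cups whose label matches."""
    column = VARIABLES.index(variable)
    total = sum(joint.values())
    matching = sum(m for label, m in joint.items() if label[column] == value)
    return matching / total

p_sick = prob(joint, "Health", "sick")
print(p_sick)  # 0.2
```

The same helper answers questions about the other columns, for example `prob(joint, "Country", "Albania")`.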

What’s the point of do­ing all this setup? Once we’ve made a prob­a­bil­ity dis­tri­bu­tion, we can hook it up to the out­side world such that, when the world in­ter­acts with the prob­a­bil­ity dis­tri­bu­tion, the prob­a­bil­ity mass is shifted around in­side the cups. For ex­am­ple, if you have a rule which says “when­ever a per­son shows me a pass­port from coun­try X, I throw out all wa­ter ex­cept the wa­ter in cups with X in the Coun­try column”, then, when­ever you see a pass­port, the prob­a­bil­ity dis­tri­bu­tion will get more ac­cu­rate.
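The passport rule can be sketched as a filter over the cups. The joint distribution below is invented, and `observe_country` and `prob_sick` are hypothetical helper names. Because probabilities are proportions of the remaining water, no separate renormalization step is needed.

```python
def observe_country(joint, country):
    """Throw out all water except in cups whose Country column matches."""
    return {label: m for label, m in joint.items() if label[2] == country}

def prob_sick(joint):
    """Share of the remaining water that sits in sick-labeled cups."""
    total = sum(joint.values())
    return sum(m for label, m in joint.items() if label[0] == "sick") / total

# Hypothetical cups labeled (Health, Age, Country) with made-up masses.
joint = {
    ("sick", 29, "Albania"): 1.0,
    ("healthy", 29, "Albania"): 3.0,
    ("sick", 30, "Afghanistan"): 2.0,
    ("healthy", 30, "Afghanistan"): 4.0,
}

before = prob_sick(joint)                             # 3/10 = 0.3
after = prob_sick(observe_country(joint, "Albania"))  # 1/4 = 0.25
print(before, after)
```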

The nat­u­ral ques­tion here is, what are the best ways to ma­nipu­late the prob­a­bil­ity mass in \(\mathbb P\) (in re­sponse to ob­ser­va­tions), if the goal is to have \(\mathbb P\) get more and more ac­cu­rate over time? That’s ex­actly the sort of ques­tion that prob­a­bil­ity the­ory can be used to an­swer (and it has im­pli­ca­tions both for ar­tifi­cial in­tel­li­gence, and for un­der­stand­ing hu­man in­tel­li­gence — af­ter all, we our­selves are a phys­i­cal sys­tem that man­ages un­cer­tainty, and up­dates be­liefs in re­sponse to ob­ser­va­tions).

At this point, there are two big ob­jec­tions to an­swer. First ob­jec­tion:

Whoa now, the num­ber of cups in \(\mathbb P\) got pretty big pretty quickly, and this was a sim­ple ex­am­ple. In a re­al­is­tic prob­a­bil­ity dis­tri­bu­tion \(\mathbb P\) in­tended to rep­re­sent the real world (which has way more than 3 vari­ables worth track­ing), the num­ber of nec­es­sary pos­si­bil­ities would be ridicu­lous. Why do we define prob­a­bil­ities in terms of these huge im­prac­ti­cal “prob­a­bil­ity dis­tri­bu­tions”?

This is an im­por­tant ques­tion, which is an­swered by three points:

  1. In prac­tice, there are a num­ber of tricks for ex­ploit­ing reg­u­lar­i­ties in the struc­ture of the world in or­der to dras­ti­cally re­duce the num­ber of cups you need to track. We won’t be cov­er­ing those tricks in this guide, but you can check out Real­is­tic prob­a­bil­ities and Ar­bital’s guide to Bayes nets if you’re in­ter­ested in the topic.

  2. Even so, full-fledged prob­a­bil­is­tic rea­son­ing is com­pu­ta­tion­ally in­fea­si­ble on com­plex prob­lems. In prac­tice, phys­i­cal rea­son­ing sys­tems (such as brains or ar­tifi­cial in­tel­li­gence al­gorithms) use lots of ap­prox­i­ma­tions and short­cuts.

  3. Nevertheless, reasoning according to a full probability distribution is the theoretical ideal for how to do good reasoning. You can’t do better than probabilistic reasoning (unless you’re born knowing the right answers to everything), and insofar as you don’t use probabilistic reasoning, you can be exploited. Even if complex probability distributions are too big to manage in practice, they give lots of hints about how to reason right or wrong that we can follow in our day-to-day lives.

Se­cond ob­jec­tion:

You ba­si­cally just said “given a bunch of cups and some wa­ter, we define the prob­a­bil­ity of a per­son be­ing sick as the amount of wa­ter in some sug­ges­tively-la­beled cups.” How does that have any­thing to do with whether or not the per­son is ac­tu­ally sick? Just be­cause you put a \(sick\) la­bel on there doesn’t mag­i­cally give the wa­ter mean­ing!

This is an im­por­tant point. For \(\mathbb P\) to be use­ful, we want to de­sign a rea­son­ing pro­ce­dure such that the more we in­ter­act with a per­son, the more prob­a­bil­ity mass starts to re­flect how healthy the per­son ac­tu­ally is. That is, we want the wa­ter to go into \(sick\) cups if they’re sick, and \(healthy\) cups if they’re healthy. If our rea­son­ing pro­ce­dure has that prop­erty, and we have \(\mathbb P\) in­ter­act with the world for a while, then its prob­a­bil­ities will get pretty ac­cu­rate — at which point \(\mathbb P\) can be used to an­swer ques­tions and/​or make de­ci­sions. (This is the prin­ci­ple that makes brains and ar­tifi­cial in­tel­li­gence al­gorithms tick.)

How do we de­sign rea­son­ing mechanisms that cause \(\mathbb P\) to be­come more ac­cu­rate the more it in­ter­acts with the world? That’s a big ques­tion, and the an­swer has many parts. One of the most im­por­tant parts of the an­swer, though, is a law of prob­a­bil­ity the­ory which tells us the cor­rect way to move the prob­a­bil­ity mass around in re­sponse to new ob­ser­va­tions (as­sum­ing the goal is to make \(\mathbb P\) more ac­cu­rate). For more on that law, see Bayes’ rule.
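For a taste of what that law looks like in the cups picture, here is a rough sketch using the standard form of Bayes' rule: scale the water in each cup by how likely the observation is in that state, then renormalize. The 1% prior comes from the population example earlier, while the fever likelihoods are invented for illustration and `bayes_update` is a hypothetical helper.

```python
from fractions import Fraction

# Prior: the 1% base rate of sickness used earlier in the text.
prior = {"sick": Fraction(1, 100), "healthy": Fraction(99, 100)}

# Hypothetical likelihoods of observing a high fever in each state.
likelihood_fever = {"sick": Fraction(1, 2), "healthy": Fraction(1, 100)}

def bayes_update(prior, likelihood):
    """Scale each cup's water by the likelihood, then renormalize."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: mass / total for h, mass in unnormalized.items()}

posterior = bayes_update(prior, likelihood_fever)
print(posterior["sick"])  # 50/149, roughly 0.34
```

Under these made-up numbers, seeing a fever moves the probability of sickness from 1% to about 34%, which matches the intuition that a feverish patient deserves a probability higher than the base rate.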

