# Kolmogorov’s Axioms of Probability: Even Smarter Than You Have Been Told

## Introduction

I’d like to talk about the Kolmogorov Axioms of Probability as another example of revisionist history in mathematics (another example here). What is commonly quoted as the Kolmogorov Axioms of Probability is, in my opinion, a less insightful formulation than what is found in the 1956 English translation of Kolmogorov’s 1933 German monograph. I would like to discuss the axioms and the difference.

## The Fly in The Ointment

First I want to point out that these are Kolmogorov’s Axioms of Probability, not the only possible axioms of probability. They are not necessarily the laws of the universe; they are a human attempt to capture and describe such laws. That being said, Kolmogorov’s Axioms of Probability tend to kick other formulations to the curb.

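The Wikipedia entry https://en.wikipedia.org/wiki/Probability_axioms currently (as of September 19, 2020) condenses the laws into three axioms: non-negativity of probabilities, unit measure of the whole space, and countable additivity (sigma-additivity) over disjoint events.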

This is a useful formulation. But it differs from the English translation (Nathan Morrison (translator); A. N. Kolmogorov; Foundations of the Theory of Probability, Second English Edition, Dover, 2018) in a key point: it doesn’t address the difference between finite additivity (or set-systems) and countable additivity (or sigma-algebras) in the decisive manner of the original.

Kolmogorov included a very clever argument against assuming only finite additivity in his monograph.

## The Actual Point of the Axioms

In approaching probability the learner is faced with two huge problems:

1. What do probabilities mean?
2. What calculations on probabilities make sense (or are allowed or admissible)?

Good axiomatized probability theory arrived very late, in the 1930s, after seemingly more advanced fields such as quantum mechanics (into its second generation by 1927)! In my opinion, what slowed the development of probability was trying very hard to address the first question: “what do probabilities mean?” The honest answer is we know less about this first question than we would like. In the current theory low-probability events occur at low frequencies (under appropriate identical independent repetition of the experiment), and high-probability events occur at high frequencies (under the same kind of repetition). Probability zero events tend to be rare, but can occur. Probability one events tend to be common, but can fail to happen.

A somewhat simpler question: “What calculations on probabilities make sense (or are allowed or admissible)?” What even makes sense to do with probabilities as numbers? When does it make sense to add two probabilities (in the sense that the sum corresponds to the probability of some other event, and isn’t merely a new number)?

The axioms tell us what calculations are admissible. That is their job, and we can’t ask too much more of them.

### An Example

Suppose we have two probabilities of events:

1. The probability of tomorrow being sunny, written as P[sunny] and supposed equal to 1/2.
2. The probability of tomorrow not being garbage pickup day, written as P[!garbage] and supposed equal to 6/7.

As numbers we can add P[sunny] to P[!garbage] and get a new number 19/14 = 1/2 + 6/7. However this new number 19/14 is not a probability. Standard probabilities are always in the range zero to one, an axiom we will assume.

The axioms of probability save us from the above. They don’t include adding these two arbitrary probabilities, they allow adding probabilities of disjoint events (where one event happening implies the other can not happen).
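The arithmetic above can be checked with exact fractions. This is only a minimal sketch: the variable names are made up for illustration, and the complement event “not sunny” is introduced here as an example of a disjoint event, it is not part of the original example.

```python
from fractions import Fraction

p_sunny = Fraction(1, 2)        # P[sunny], as supposed in the example
p_not_garbage = Fraction(6, 7)  # P[!garbage], as supposed in the example

# Adding the probabilities of two arbitrary events just produces a number.
naive_sum = p_sunny + p_not_garbage
print(naive_sum, naive_sum > 1)  # 19/14 True

# Disjoint events are different: "sunny" and "not sunny" can not both happen,
# so their probabilities add to the probability of "sunny or not sunny".
p_not_sunny = 1 - p_sunny
p_sunny_or_not = p_sunny + p_not_sunny
print(p_sunny_or_not)  # 1
```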

## Kolmogorov’s Axioms

Kolmogorov’s Axioms as reported in Morrison (translator); A. N. Kolmogorov; Foundations of the Theory of Probability, Second English Edition, Dover, 2018 start as follows (symbols changed to simple Roman characters):

Let E be a collection of elements a, b, c, …, which we shall call elementary events, and F a set of subsets of E; the elements of the set F will be called random events.

• I. F is a field of sets.
• II. F contains the set E.
• III. To each set A in F is assigned a non-negative real number P(A). This number P(A) is called the probability of the event A.
• IV. P(E) equals 1.
• V. If A and B have no element in common, then
```
P(A union B) = P(A) + P(B)
```

This requires some background to understand. One of Kolmogorov’s many good ideas was to base the axioms of probability on already established set theory and measure theory (instead of trying to start from nothing).

Unwinding it a bit (and inserting meaning/interpretation to make it easier to remember, however we can drop off any claimed meaning!):

• E is the set of possible exact things that can happen in our probability model. If our model is about rolling a standard six-sided die then E = {1, 2, 3, 4, 5, 6}. If our model is about flipping a coin two times, then E = {HH, HT, TH, TT}. If our model is about a continuous variable such as height, E may be the real numbers or a subset of the real numbers (such as the non-negative real numbers).

E is just a set of things that can happen. Each event is a complete description of the world, including bits we may not have yet observed.

• F is the major bookkeeping machine that tells us which subsets of E we are allowed to ask for a probability of. A random event A in F is said to “happen” if some elementary event x in A has “happened.”

A “field of sets” is what we now call a set algebra. It means if A and B are in F, then A union B, A intersect B, A set difference B, and B set difference A are all also in F. This is saying the collections of events we can ask for probabilities of is not completely wild, it has some structure.

It is a major feature of the theory that F is not always the set of all subsets of E.

• P(E) = 1. This is the assumption of unit measure: the total probability is one. It reads as: some elementary event always happens (as E is exactly the set of all elementary events).

From this we can derive a lot. As every A in F is contained in E, we have P(A) + P(E set difference A) = P(E) = 1, and since probabilities are non-negative this gives P(A) <= 1 for any A in F.

• “If A and B have no element in common, then P(A union B) = P(A) + P(B)” (assuming A, B in F, which in turn implies A + B = A union B is also in F). This is our one tool to calculate. It allows us to use the declared probabilities on some sets to infer what the declared probabilities on other sets must be. For example, the fact that F is a set-algebra of subsets of E with E in F tells us the empty set must also be in F. The empty set and E are disjoint with union E, so P(emptyset) + P(E) = P(E) = 1, and therefore P(emptyset) = 0. We didn’t need this as an axiom, as it follows from the other axioms.

From this axiom we can already derive the powerful method of inclusion/exclusion.

For arbitrary A, B in F:

```
P(A union B) = P(A) + P(B) - P(A intersection B)
```

(Proof: Let I = A intersection B, A’ = A setdifference I, and B’ = B setdifference I. The sets I, A’, and B’ are disjoint. Our axioms then give us: P(A) = P(A’) + P(I), P(B) = P(B’) + P(I), and P(A union B) = P(A’ union B’ union I) = P(A’) + P(B’) + P(I) = (P(A) - P(I)) + (P(B) - P(I)) + P(I) = P(A) + P(B) - P(I).)
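On a finite space all of this can be verified by brute force. The following is a small sketch, assuming a fair six-sided die (each elementary event has mass 1/6); the helper names `powerset`, `is_field`, `check_axioms`, and `check_inclusion_exclusion` are made up for this illustration. It also shows a field F that is much smaller than the set of all subsets of E.

```python
from fractions import Fraction
from itertools import combinations

E = frozenset({1, 2, 3, 4, 5, 6})  # elementary events for one die roll

def powerset(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}

def is_field(F, E):
    """F contains E and is closed under union, intersection, and difference."""
    return E in F and all(
        (A | B) in F and (A & B) in F and (A - B) in F
        for A in F for B in F
    )

def P(A):
    """Fair-die probability (an assumption): each elementary event has mass 1/6."""
    return Fraction(len(A), len(E))

F_all = powerset(E)
# A much smaller field: we may only ask about "odd" versus "even".
odd, even = frozenset({1, 3, 5}), frozenset({2, 4, 6})
F_parity = {frozenset(), odd, even, E}

def check_axioms(F):
    """Axioms I through V for the pair (F, P), checked exhaustively."""
    additive = all(P(A | B) == P(A) + P(B)
                   for A in F for B in F if not (A & B))
    non_negative = all(P(A) >= 0 for A in F)
    return is_field(F, E) and non_negative and P(E) == 1 and additive

def check_inclusion_exclusion(F):
    return all(P(A | B) == P(A) + P(B) - P(A & B) for A in F for B in F)

print(check_axioms(F_all), check_axioms(F_parity))    # True True
print(check_inclusion_exclusion(F_all))               # True
print(is_field({frozenset(), frozenset({1}), E}, E))  # False: E - {1} is missing
```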

### A Few More Definitions

Using the above and two more definitions (definitions differ from axioms as definitions largely just introduce notation, whereas axioms introduce working assumptions) we have all the common rules for calculating with probabilities. This should help alleviate the common fear that one doesn’t know enough of the important calculations allowed in probability.

The two additional definitions are (assuming all the sets named are in the set algebra, and moving to more common notation):

• For events A, B in F with P[A] > 0, the conditional probability written P[B|A] (read “probability of B given A”) is defined as P[B|A] = P[A intersect B]/P[A].
• For events A, B in F we say A is independent of B if and only if P[A intersect B] = P[A] P[B].

From these axioms and definitions one can derive much of the rest of probability theory, including theorems such as Bayes’s Theorem.
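The two definitions are easy to try out on the two-coin-flip space from earlier. This is a sketch under the assumption of a fair coin (all four outcomes equally likely); the function names `cond` and `independent` and the three example events are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

E = {"HH", "HT", "TH", "TT"}  # two flips of a coin

def P(A):
    """Uniform probability on E (assumes a fair coin)."""
    return Fraction(len(A & E), len(E))

def cond(B, A):
    """P[B|A] = P[A intersect B] / P[A], defined only when P[A] > 0."""
    if P(A) == 0:
        raise ValueError("conditioning on a probability-zero event")
    return P(A & B) / P(A)

def independent(A, B):
    """A is independent of B iff P[A intersect B] = P[A] P[B]."""
    return P(A & B) == P(A) * P(B)

first_heads = {"HH", "HT"}
second_heads = {"HH", "TH"}
any_heads = {"HH", "HT", "TH"}

print(independent(first_heads, second_heads))  # True
print(independent(first_heads, any_heads))     # False
print(cond(first_heads, any_heads))            # 2/3

# Bayes's Theorem: P[A|B] = P[B|A] P[A] / P[B]
bayes = cond(any_heads, first_heads) * P(first_heads) / P(any_heads)
print(bayes == cond(first_heads, any_heads))   # True
```

Bayes’s Theorem here is nothing more than the definition of conditional probability applied twice, which is why it needs no new axiom.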

## An Application

This sort of probability theory is designed to work with finite probability spaces, such as flipping a few coins and to work with infinite probability spaces, such as drawing a real number uniformly from the interval [0, 1].

For the uniform [0, 1] example we choose:

• E = the set of real numbers in the interval [0, 1].
• F = the Borel subsets of [0, 1], that is the sigma-algebra generated by the open intervals (a, b) (0 <= a < b <= 1) of the standard topology on R.
• P = the Borel measure induced by assigning measure P([a, b]) = b – a for all 0 <= a <= b <= 1.

This may seem complicated, but delegates work to objects from measure theory (sigma-algebras, Borel sets, and Borel measures).

After all this: we get a theory of probability where, for the measurable subsets of [0, 1], the probability of the set is just the measure (an abstraction of length in this case). The measure of the set [0, 1/2] is 1/2, the measure of the set [1/2, 1] is 1/2, the measure of the set [1/2, 1/2] is zero, and the measure of the whole set E = [0, 1] is 1 as required.
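For closed intervals the probabilities above reduce to simple subtraction. The sketch below covers only intervals [a, b], not general Borel sets, and the function name `P_interval` is an assumption made for this illustration.

```python
from fractions import Fraction

def P_interval(a, b):
    """Probability of [a, b] under the uniform model on [0, 1]: b - a."""
    a, b = Fraction(a), Fraction(b)
    if not (0 <= a <= b <= 1):
        raise ValueError("need 0 <= a <= b <= 1")
    return b - a

half = Fraction(1, 2)
print(P_interval(0, half))     # 1/2
print(P_interval(half, 1))     # 1/2
print(P_interval(half, half))  # 0
print(P_interval(0, 1))        # 1

# Inclusion/exclusion: [0, 1/2] and [1/2, 1] overlap only in the point {1/2}.
total = P_interval(0, half) + P_interval(half, 1) - P_interval(half, half)
print(total == P_interval(0, 1))  # True
```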

### A Comment

And here we have an odd thing about modern measure theory and probability. The measures of the small (finite) sets are somewhat isolated from the measures of the large (infinite) sets.

Consider our example “drawing a real number uniformly from the interval [0, 1].”

By our above laws any point has probability zero of being drawn! Let’s see how that follows for the point 0.5. The set {0.5} is exactly the same set as the closed interval [0.5, 0.5]. The measure of this interval is zero, so the probability of the set {0.5} is zero. This is true for every single point in the interval. Each point has probability zero of being drawn, yet every time we generate a uniform random number in the interval some point is drawn.

Our axioms up to now in no way relate the measures of large (infinite) sets to the measures of large (infinite) collections of small (finite) sets. In this system we have P([1/2, 3/4]) = 1/4, but we know P({x}) = 0 for every point x in [1/2, 3/4]. What we know about the individual points seems disturbingly unrelated to what we know about non-finite intervals.

The usual and intended fix is: associate with the points a new quantity called a density and see if we can get a probability that obeys our axioms using integrals instead of sums. This involves adding more external mathematical theory and assumptions, so we will not continue on this path here.

## The Infinite

What we lose in moving from finite probability spaces to infinite is: knowing the probabilities of all of the individual points is no longer useful in working out the probabilities of the larger sets. Finite probability theory doesn’t even need sets; one could base such a theory entirely on the elementary events.

Notice the axioms discussed here differ from the ones quoted in the Wikipedia as they only talk about summing two sets at a time, and not countably infinite sums. We can, by induction, derive from the axioms we have stated that for any finite set of disjoint sets the probability of the union is the sum of the probabilities. But we can not get from our axioms to the countable sum axiom quoted by the Wikipedia.

This is deliberate.
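For reference, the induction from two sets to any finite number of sets can be written out as follows. This is a sketch of the standard argument in modern notation, not a quotation from the monograph. Assume A_1, …, A_n are pairwise disjoint sets in F. Because F is a field, the union of the first n-1 sets is in F, and it is disjoint from A_n, so Axiom V applies:

```latex
\begin{align*}
P(A_1 \cup \dots \cup A_n)
  &= P\bigl((A_1 \cup \dots \cup A_{n-1}) \cup A_n\bigr) \\
  &= P(A_1 \cup \dots \cup A_{n-1}) + P(A_n)
     && \text{(Axiom V)} \\
  &= P(A_1) + \dots + P(A_{n-1}) + P(A_n)
     && \text{(induction hypothesis)}
\end{align*}
```

Each step uses only finitely many sets, which is why this argument can never reach a countably infinite union.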

Kolmogorov was, as is traditional in good mathematics, picking axioms that both model the system to be discussed and are arguably much better than the alternative. This second part is key. He is not so much saying “accept axiom IV: the probability of something happening is one.” He is implicitly saying, “okay probability is already difficult, but it is going to be much worse for you if you don’t limit the theory down to P(E) = 1.”

Here is how Kolmogorov actually moved from being able to work with finite disjoint sums, to being able to work with countably infinite disjoint sums. This is from the start of chapter 2 “Infinite Probability Fields” and it is a masterstroke of subtle salesmanship.

In all future investigations, we shall assume that besides Axioms I – V another holds true:

```
VI: For a [countably infinite] decreasing sequence of events

A(1) contains A(2) contains ... contains A(n) contains ...   (1)

of F, for which

Intersection_{n} A(n) = emptyset                              (2)

the following equation holds:

lim P(A(n)) = 0   as n goes to infinity                       (3)
```

(Note in the above n is a free-index, so Intersection_{n} A(n) is an infinite intersection over all of the sets in the countably infinite sequence of sets, not the intersection of any finite prefix of them.)

At this point we have upgraded F from a set-algebra to a sigma-algebra. A sigma-algebra is a set-algebra with the additional guarantee that any countably infinite intersection of sets in the sigma-algebra is also in the sigma-algebra. Or more commonly, any countably infinite union of disjoint elements of the sigma-algebra is also in the sigma-algebra.

The new axiom VI is a continuity axiom: a sequence of nested sets that approaches the empty set must also approach probability zero, the probability of the empty set.
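A concrete instance may help. Take the uniform model on [0, 1] and the nested open intervals A(n) = (0, 1/n); this particular sequence is my own choice of illustration, not an example from the monograph. No point survives in every A(n), so the intersection is empty, and the probabilities 1/n go to zero just as Axiom VI demands. The helper names below are hypothetical.

```python
from fractions import Fraction
from math import ceil

def in_A(n, x):
    """Membership of x in A(n) = (0, 1/n), an open interval."""
    return 0 < x < Fraction(1, n)

def P_A(n):
    """Uniform probability of A(n): the length 1/n."""
    return Fraction(1, n)

def first_n_excluding(x):
    """An index n with x not in A(n), for x > 0: any n >= 1/x works."""
    return ceil(1 / x)

x = Fraction(1, 1000)
print(in_A(999, x), in_A(1000, x))        # True False
print(first_n_excluding(Fraction(3, 7)))  # 3

# The probabilities shrink toward zero along the nested sequence.
probs = [P_A(n) for n in (1, 10, 100, 1000)]
print(probs == sorted(probs, reverse=True), probs[-1])  # True 1/1000
```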

### The Sales Pitch

Kolmogorov’s sales pitch is amazing, and folding his six axioms into a shorter 3 cheats us of the opportunity to see it in isolation.

All examples of finite fields of probability, in the first chapter, satisfy, therefore, Axiom VI [Kolmogorov had just proven that any finite probability space that obeys Axioms I through V also obeys Axiom VI as a theorem. He is repeating this claim for emphasis and to imply that for finite probability fields since it is already true there is no harm (beyond redundancy) in adding it to your axiom list.] …

For infinite fields [of probability], on the other hand, the Axiom of Continuity, VI, proved to be independent of Axioms I – V. Since this new axiom is essential for infinite fields of probability only, it is almost impossible to elucidate its empirical meaning, as was done, for example, in the case of Axioms I – V in section 2 of the first chapter. For, in describing any observable random process we can obtain only finite fields of probability. Infinite fields of probability occur only as idealized models of real random processes. We limit ourselves, arbitrarily, to only those models which satisfy Axiom VI. This limitation has been found expedient in researches of the most diverse sort.

And that is the sell: “you have no idea what the infinite is, and I will run circles around you by assuming an easier case (only continuous distributions) that no finite empirical experiment can falsify.”

Kolmogorov then proves the “Generalized Addition Theorem” (what the Wikipedia called Axiom 3) as a theorem from the standing six axioms. The reader doesn’t get to pick if they believe in countable sums or not (the deal if the countable sum statement is an axiom). The reader gets to pick if they want to restrict themselves to continuous distributions, and if so they are forced to accept countable additivity.
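The shape of that argument can be sketched in modern notation. This is a paraphrase along standard lines, not a quotation. Assume A_1, A_2, … are pairwise disjoint sets in F whose union A is also in F, and let R_n be the “tail” left over after removing the first n sets:

```latex
\begin{align*}
R_n &= A \setminus (A_1 \cup \dots \cup A_n) = \bigcup_{m > n} A_m,
  \qquad R_1 \supseteq R_2 \supseteq \dots,
  \qquad \bigcap_{n} R_n = \emptyset \\
P(A) &= P(A_1) + \dots + P(A_n) + P(R_n)
  && \text{(finite additivity)} \\
\lim_{n \to \infty} P(R_n) &= 0
  && \text{(Axiom VI)} \\
P(A) &= \sum_{n=1}^{\infty} P(A_n)
\end{align*}
```

The intersection of the tails is empty because each point of A lies in exactly one A_m, and so drops out of R_n as soon as n >= m.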

Notice this “Generalized Addition Theorem” does not allow sums over arbitrary index sets, such as uncountable index sets. Allowing those would lead to problems such as being unable to define a uniform distribution over the real numbers in the interval [0, 1] (the arbitrary sum ability would let us argue that since each real has probability zero, the set of all reals also has probability zero!).

## The Counter Point

A counter-argument in favor of only asserting finite disjoint additivity would be: assuming finite disjoint additivity is assuming less. So the theory may be more broadly applicable, and less influenced by our choices.

A possible difference between a finitist system and the current countably infinite system would be the following. Kolmogorov’s system allows that there is a theory of uniform distributions of real numbers in the interval [0, 1]. However, also under Kolmogorov’s system there can not be a uniform distribution of the rational numbers in the interval [0, 1]! [The second part follows as the rational numbers are covered by a countable collection of probability zero intervals (exactly the intervals [q, q] for all rationals q in the interval of interest), therefore if our probability space were merely the rational numbers in the interval [0, 1] we would have the total measure of the space is zero, not the required one. There are reasonable probability measures where single points have non-zero mass. But no more than 1/epsilon points can have a mass of epsilon or more. So very few points have non-negligible mass and for all points to have non-zero mass we must see the mass distributed non-uniformly.]
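One such non-uniform measure on the rationals can be sketched directly. The construction below is an illustration I am supplying, not one from the monograph: enumerate the rationals in [0, 1] by increasing denominator and give the k-th one mass 2^-(k+1), so the masses sum to one. The names `rationals_in_unit_interval` and `mass` are hypothetical.

```python
from fractions import Fraction
from math import gcd

def rationals_in_unit_interval():
    """Enumerate each rational in [0, 1] exactly once, by increasing denominator."""
    yield Fraction(0)
    yield Fraction(1)
    d = 2
    while True:
        for n in range(1, d):
            if gcd(n, d) == 1:  # skip unreduced duplicates such as 2/4
                yield Fraction(n, d)
        d += 1

def mass(k):
    """Mass of the k-th enumerated rational (k = 0, 1, 2, ...): 2^-(k+1)."""
    return Fraction(1, 2 ** (k + 1))

gen = rationals_in_unit_interval()
points = [next(gen) for _ in range(20)]
masses = [mass(k) for k in range(20)]

# Mass left over for all of the infinitely many later points.
print(1 - sum(masses))  # 1/1048576

# No more than 1/epsilon points can have mass epsilon or more.
epsilon = Fraction(1, 10)
heavy = [q for q, m in zip(points, masses) if m >= epsilon]
print(len(heavy), len(heavy) <= 1 / epsilon)  # 3 True
```

Every rational gets non-zero mass, but the masses must fall off quickly: here only three points carry mass of 1/10 or more.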

## The Synthesis

A challenge in probability theory is how to handle changes in scale. If we are not careful, probabilities required on many small sets seem to contradict the probabilities required on some of the large sets. In contrast to finite probability models, probabilities over infinite sets are usually not defined as mere sums of individual points.

The current measure theory axioms are very much constrained by how they are going to deal with the challenges of change of scale.


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.