Bayes' theorem gives us a way to calculate conditional probabilities in the light of new or existing evidence. Its use cases are wide-ranging, from interpreting the results of a medical test to improving machine learning models.
Though the formula for Bayes' theorem is simple and easy to derive from the definition of conditional probability, the intuition behind it is not so obvious. I have been thinking of ways to describe the theorem beyond its definition, and was inspired to write this visual derivation, using areas of the sample and event spaces, after watching this excellent video by Grant Sanderson.
Before starting, though, let us get the definition out of the way:
Given two events A and B, where P(B) ≠ 0, Bayes' theorem says:
P(A|B) = P(B|A) x P(A) / P(B)
where:
P(A|B) is the conditional probability of event A occurring given that B has occurred,
P(B|A) is the conditional probability of event B occurring given that A has occurred,
P(A) is the probability of event A occurring on its own, and
P(B) is the probability of event B occurring on its own.
The mathematical derivation is quite simple and can be found here.
In this article, I will try to prove the above theorem visually. This builds upon my previous article on conditional probability, but you do not need to read it to follow the concepts here if you are already familiar with terms like ‘sample space’ and ‘events’.
As in the previous article, I am going to denote the sample space by rectangle U and events by other rectangles or shapes inside U.
I am also going to assume that the shapes have been drawn to scale, so that the area of a shape is proportional to the number of outcomes in the event it represents, and that Area(U) = P(U) = 1.
Now let us consider the following scenario :
A factory produces an item using three machines — A, B, and C — which account for 20%, 30%, and 50% of its output, respectively. Of the items produced by machine A, 5% are defective; similarly, 3% of machine B’s items and 1% of machine C’s are defective. If a randomly selected item is defective, what is the probability it was produced by machine A?
Let us first draw the visual representation of this scenario :
In the scenario described above, our sample space U consists of all the items produced in the given factory and is represented by the outer rectangular boundary in the figure. The rectangles A, B and C represent the events of an item being produced by machines A, B and C, and Da, Db and Dc are the events of an item being defective and produced by A, B and C respectively.
Using our area assumption, Area(U) = 1. I am also assuming that rectangle U has been drawn with height h = 1, and that rectangles A, B and C span the full height of U with widths wa, wb and wc. So then we have,
wa + wb + wc = 1 (1)
Also, as per our area assumption, the number of items produced by machine A is proportional to the area of rectangle A, that is,
No. of items produced by machine A ∝ Area(A)
Hence, by the definition of probability,
The probability that an item has been produced by machine A,
P(A) = No. of items produced by the machine A ÷ Total Number of items produced,
or
P(A) = Area(A)/Area(U). //cancelling the common proportionality term
putting Area(U) = 1,
P(A) = Area(A) = wa * h = wa (2)
similarly,
P(B) = wb ,
P(C) = wc.
And
P(Da) = Area(Da),
P(Db) = Area(Db) , and
P(Dc) = Area(Dc).
If we take the heights of these rectangles D to be ha, hb and hc respectively,
then,
P(Da) = wa * ha and so on.
Suppose we know that A has already occurred. The sample space then shrinks to rectangle A.
Now the probability of D given A has occurred will be given by
P(D|A) = Area(Da)/Area(A)
or
P(D|A) = (wa * ha) / (wa * h)
which gives
P(D|A) = ha/h (3)
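To make equations (1) to (3) concrete, here is a minimal sketch that plugs in the numbers from the factory scenario. The variable names are my own, and the heights of Da, Db and Dc are assumed to equal each machine's defect rate times h, which is what equation (3) implies.

```python
from fractions import Fraction

h = Fraction(1)  # height of U, assumed to be 1 as in the text

# Widths of rectangles A, B, C: each machine's share of total output.
widths = {"A": Fraction(20, 100), "B": Fraction(30, 100), "C": Fraction(50, 100)}

# Defect rates per machine, taken from the scenario.
defect_rates = {"A": Fraction(5, 100), "B": Fraction(3, 100), "C": Fraction(1, 100)}

# Assumption: height of each defect rectangle = defect rate * h.
heights = {m: defect_rates[m] * h for m in widths}

# Areas of the machine rectangles and of the defect rectangles inside them.
area_machine = {m: widths[m] * h for m in widths}
area_defect = {m: widths[m] * heights[m] for m in widths}

# Equation (1): the widths add up to 1.
assert sum(widths.values()) == 1

# Equation (3): P(D|A) = Area(Da) / Area(A) = ha / h.
p_d_given_a = area_defect["A"] / area_machine["A"]
print(p_d_given_a)  # 1/20
```

Using Fraction instead of floats keeps the arithmetic exact, so the ratio comes out as exactly 1/20, the 5% defect rate of machine A.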
Now we consider the second part of the scenario, which says that it is given that a randomly sampled item from the sample space U is defective. In this case, our sample space will now shrink to the shaded grey area in the above figure, consisting of rectangles Da, Db and Dc, as a defective item can only belong to this area.
This is how our new sample space will look now, in the light of new evidence.
If now we have to find out the probability P(A|D), that is, the probability of a given defective item coming from machine A, we can easily see that this can be written in terms of areas as
P(A|D) = Area(Da) / (Area(Da) + Area(Db) + Area(Dc))
If we divide both numerator and denominator by Area(U) (=1), we get
P(A|D) = (Area(Da)/Area(U)) / ((Area(Da)+Area(Db)+Area(Dc))/Area(U))
The term Area(Da)/Area(U) is nothing but P(Da), and the denominator term is P(D), since Da, Db and Dc do not overlap and together contain every defective item.
so we get
P(A|D) = P(Da)/P(D) (4)
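As a quick numerical check of equation (4), the sketch below computes the three defect areas from the scenario's numbers and takes their ratio. The variable names are illustrative only.

```python
from fractions import Fraction

# Area of each defect rectangle = width (output share) * height (defect rate).
area_da = Fraction(20, 100) * Fraction(5, 100)  # machine A
area_db = Fraction(30, 100) * Fraction(3, 100)  # machine B
area_dc = Fraction(50, 100) * Fraction(1, 100)  # machine C

# P(D): total area of the shrunken sample space (all defective items).
p_d = area_da + area_db + area_dc

# Equation (4): P(A|D) = P(Da) / P(D).
p_a_given_d = area_da / p_d

print(p_d)          # 3/125
print(p_a_given_d)  # 5/12
```

So about 41.7% of defective items come from machine A, even though it produces only 20% of the output.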
if we suppose that height of rectangle Da is ha , then
P(Da) = Area(Da) = wa * ha
multiplying and dividing the term by h, the height of rectangle U (or A),
P(Da) = (wa * h) * (ha/h), or
P(Da) = wa * (ha/h) // since h = 1, wa * h = wa
Now from equation (2) above, we know that wa = P(A),
and from equation (3), ha/h = P(D|A).
so P(Da) = P(D|A) x P(A). Substituting this into equation (4), we get
P(A|D) = P(D|A) x P(A) / P(D)
which is what the formula of the Bayes theorem gives us.
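The same formula works for every machine, not just A. Here is a small sketch of a helper that applies it to all three at once; the function name posterior and its dictionary arguments are hypothetical choices of mine, not part of any library.

```python
def posterior(priors, likelihoods):
    """Return P(machine | D) for each machine using Bayes' theorem.

    priors:      P(machine), each machine's share of output
    likelihoods: P(D | machine), each machine's defect rate
    """
    # P(D) is the sum of P(D | machine) * P(machine) over all machines.
    p_d = sum(likelihoods[m] * priors[m] for m in priors)
    return {m: likelihoods[m] * priors[m] / p_d for m in priors}


priors = {"A": 0.20, "B": 0.30, "C": 0.50}
likelihoods = {"A": 0.05, "B": 0.03, "C": 0.01}

post = posterior(priors, likelihoods)
for machine, p in post.items():
    print(f"P({machine}|D) = {p:.4f}")
```

The three posteriors add up to 1, as they should, because a defective item must have come from one of the three machines.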
Note that even though I have made some assumptions about sample space and areas, in real-world scenarios it is always possible to represent our experiments like this where the shapes or areas of random events represent their probabilities. I have actually taken this idea from the Monte Carlo method of estimating areas using probabilities and reversed it to represent probabilities using areas.
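To illustrate the link with the Monte Carlo idea, here is a rough sketch that throws random points into the unit square U and counts where they land. It assumes a layout where rectangles A, B and C sit side by side along the x-axis and each defect rectangle sits at the bottom of its machine's rectangle; the function name and the layout are my own choices for illustration.

```python
import random


def estimate_p_a_given_d(n_points=200_000, seed=0):
    """Estimate P(A|D) by sampling uniform points in the unit square."""
    rng = random.Random(seed)
    defective = 0
    defective_from_a = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        # The x coordinate decides which machine's rectangle the point is in.
        if x < 0.20:
            machine, height = "A", 0.05
        elif x < 0.50:
            machine, height = "B", 0.03
        else:
            machine, height = "C", 0.01
        # The point is inside a defect rectangle if it lies below its height.
        if y < height:
            defective += 1
            if machine == "A":
                defective_from_a += 1
    # Fraction of points in the defect region that fall inside Da.
    return defective_from_a / defective


estimate = estimate_p_a_given_d()
print(estimate)
```

With enough points, the estimate should land close to the exact value of 5/12 (about 0.417), though the precise figure depends on the seed and the number of samples.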
Your comments and suggestions are welcome. Subscribe if you would like to read more articles like this in the future. Thank you.