Statistics for Data Science: Central Limit Theorem

Prakhar S
3 min read · Oct 30, 2021


The Central Limit Theorem is one of the most fundamental theorems in statistics. It states that if we take many large random samples from a population, the means of those samples will approximately follow a normal distribution, regardless of the distribution of the population itself.
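We can see this in action with a quick simulation. Below is a minimal sketch using NumPy: the population is the outcomes of a fair die (a flat, decidedly non-normal distribution), and the sample size of 50 and the 10,000 repetitions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Population: outcomes of a fair six-sided die (a flat distribution, not normal).
population = np.arange(1, 7)

# Draw many random samples of size n and record each sample's mean.
n_repeats, n = 10_000, 50
sample_means = np.array([
    rng.choice(population, size=n).mean() for _ in range(n_repeats)
])

print("Mean of sample means:", sample_means.mean())  # close to 3.5, the population mean
print("Std of sample means :", sample_means.std())   # close to sigma / sqrt(n)
```

A histogram of `sample_means` would look like the familiar bell curve, even though the underlying die rolls are uniformly distributed.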

Before we expand any more on the theorem, let us understand some of the terminology used above.

Population: In statistics we often want to study the characteristics of a population. A population is the set of all elements in a group, or all possible outcomes of an experiment. For example, if we are studying the voting preferences of Americans, the population consists of all citizens of the USA. Or, if we are trying to find the average outcome of rolling a fair die, the population consists of the outcomes of rolling the die infinitely many times (here the population mean is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5).

More often than not, studying the whole population is hard, time-consuming and resource-intensive. So instead we rely on sampling.

Sampling: Sampling involves taking subsets of the population, which are scaled-down representations of the population as a whole, and combining the information obtained from these samples to draw conclusions about the entire population.

This is how most studies in inferential statistics are carried out: by taking representative samples of a decent size and using them to draw inferences about the population. Not all samples, though, are equally useful.

A good sample must be:

  1. Representative of the whole population.
  2. Big enough to draw conclusions from (a common rule of thumb is a sample size ≥ 30).
  3. Picked at random so as not to be biased towards any particular segment of the population.
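As a sketch of what a random sample buys us, here is a small example. The skewed "income" population below is entirely hypothetical, generated just for illustration; the point is that a single random sample of modest size already gives a reasonable estimate of the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical, heavily skewed population (e.g. household incomes); not normal at all.
population = rng.exponential(scale=50_000, size=1_000_000)

# A random sample of size >= 30, picked without bias toward any particular segment.
sample = rng.choice(population, size=100, replace=False)

print("Population mean:", round(population.mean()))
print("Sample mean    :", round(sample.mean()))  # a reasonable estimate of the population mean
```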

Normal Distribution: When a population is spread perfectly symmetrically around its mean value μ, with standard deviation σ, you get the following bell-shaped curve:

Normal Distribution Curve

This is called a Normal Distribution or a Gaussian Distribution with a mean μ and standard deviation σ.
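One handy property of the normal distribution is how much of the data falls within a few standard deviations of the mean. The snippet below checks this with SciPy; the values μ = 0 and σ = 1 are just illustrative, the proportions are the same for any normal distribution.

```python
from scipy.stats import norm

mu, sigma = 0, 1  # illustrative values for the mean and standard deviation

# Probability of landing within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, loc=mu, scale=sigma) - norm.cdf(mu - k * sigma, loc=mu, scale=sigma)
    print(f"P(within {k} std of the mean) = {p:.3f}")
# prints roughly 0.683, 0.954, 0.997 (the 68-95-99.7 rule)
```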

Coming back to the theorem: if we draw samples of size n each from a population with mean μ and standard deviation σ, then the mean of each sample is an estimate of the population mean. According to the Central Limit Theorem, these sample means are approximately normally distributed with mean μ and standard deviation σ/√n (equivalently, variance σ²/n).

Using the Central Limit Theorem, we can obtain a fairly good estimate of the population mean even if we know nothing about the distribution of the population itself. And as the sample size n increases, the standard deviation of the sample means, σ/√n, shrinks toward zero, so the sample means concentrate around μ. This means that for a large enough sample size, the sample mean is quite close to the population mean.
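We can check the σ/√n behaviour numerically. The sketch below reuses the fair-die population from before; the sample sizes and the 5,000 repetitions per size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
population = np.arange(1, 7)      # the fair die again
sigma = population.std()          # population standard deviation, about 1.708

# As n grows, the spread of the sample means shrinks like sigma / sqrt(n).
for n in (10, 100, 1000):
    means = np.array([rng.choice(population, size=n).mean() for _ in range(5_000)])
    print(f"n = {n:4d}  std of sample means = {means.std():.4f}  sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```

The two columns agree closely for each n, and both shrink as n grows, which is exactly why larger samples pin down the population mean more tightly.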

So why is the Central Limit Theorem so important? Because it tells us that the means of samples drawn from a population are approximately normally distributed, no matter what the underlying distribution of the population is. Using this normal distribution, we can test our ideas and hypotheses about the population, even though we know nothing about its distribution.

Thanks for reading. Please feel free to add any comments or ask any questions and I will try my best to answer them.
