How can we trust statistical inference?
Sample average converges to true mean
"More data β Better estimates"
Sample means follow normal distribution
"Averages become Gaussian"
These two theorems are the mathematical foundation of all statistical inference and machine learning. They explain why we can learn from data, why averaging reduces uncertainty, and why statistical methods work in practice.
As the sample size increases, the sample mean converges to the true population mean.
"Sample average approaches population mean as sample size grows"
Example: Fair coin (true probability = 0.5), small samples
High variance!
Results vary wildly (0.40 to 0.70)
Example: Fair coin (true probability = 0.5), large samples
Low variance!
All close to true value (0.5)
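The contrast above can be sketched in a few lines of Python. This is a minimal illustration with the standard library only; the sample sizes (10 vs. 10,000 flips) and the `spread` helper are illustrative choices, not values from the demo itself.

```python
import random

def estimate_heads(n_flips, rng):
    """Fraction of heads observed in n_flips tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_flips)) / n_flips

rng = random.Random(42)
small = [estimate_heads(10, rng) for _ in range(1000)]      # n = 10: noisy
large = [estimate_heads(10_000, rng) for _ in range(1000)]  # n = 10,000: tight

def spread(estimates):
    """Range of the estimates: max minus min."""
    return max(estimates) - min(estimates)

print(spread(small), spread(large))  # large-n spread is far smaller
```

Running this shows exactly the LLN effect: the n = 10 estimates range widely, while the n = 10,000 estimates all sit close to 0.5.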
Roll a fair die many times and watch the average converge to 3.5
(Interactive demo readouts: Total Rolls, Last Roll, Current Average, Error)
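The die demo can be reproduced offline. A short sketch, using only the standard library; the checkpoints at 100, 10,000, and 100,000 rolls are arbitrary choices to show the error shrinking.

```python
import random

rng = random.Random(0)
total = 0
running_error = {}
for roll_count in range(1, 100_001):
    total += rng.randint(1, 6)  # one fair-die roll
    if roll_count in (100, 10_000, 100_000):
        # Distance of the running average from the true mean 3.5
        running_error[roll_count] = abs(total / roll_count - 3.5)

print(running_error)  # the error typically shrinks as rolls accumulate
```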
The distribution of sample means approaches a normal distribution as sample size increases, regardless of the population's original distribution!
"Sample means become normally distributed with variance shrinking as 1/n"
Key Insight: Even if individual data points follow a weird distribution (uniform, exponential, bimodal), their averages will follow a bell curve!
Original Data
Any distribution
(uniform, exponential, etc.)
Take Averages
Average groups of
n samples each
Result
Normal distribution!
(bell curve)
Why "Central"? Because this theorem is CENTRAL to all of statistics! It explains why the normal distribution appears everywhere in nature and science.
Start with uniform distribution (dice), take averages, watch it become normal!
The Process: Watch how repeated sampling transforms ANY distribution into a normal distribution!
Step 1
Pick source distribution
Step 2
Sample n values
Step 3
Calculate mean
Step 4
Repeat many times
Normal Distribution of Means!
(Interactive demo readouts: Repetition, Current Sample, Current Mean, Collected Means)
Source distribution: can be ANY shape
Distribution of means: becomes normal!
Key Observation: No matter what the source distribution looks like (left), the distribution of sample means (right) always becomes bell-shaped (normal) as you repeat the process many times!
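The four steps above, in code: average groups of die rolls and print a crude text histogram of the collected means. The bin width (0.25) and counts are illustrative choices.

```python
import random
from collections import Counter

rng = random.Random(7)
# Step 1-4: sample 30 uniform die rolls, take the mean, repeat 20,000 times.
means = [sum(rng.randint(1, 6) for _ in range(30)) / 30
         for _ in range(20_000)]

# Crude text histogram in 0.25-wide bins; one '#' per 100 means.
bins = Counter(round(m * 4) / 4 for m in means)
for center in sorted(bins):
    print(f"{center:4.2f} {'#' * (bins[center] // 100)}")
```

The output piles up in a bell shape centered near the die's true mean of 3.5, even though each individual roll is uniform.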
The Central Limit Theorem doesn't just tell us that sample means are normally distributed; it also lets us quantify our uncertainty about estimates. Confidence intervals give a range where the true population parameter likely falls, at a chosen confidence level.
CLT Says
Means are Normal
Properties Known
Use Z-scores
Build Interval
With confidence!
CLT Foundation:
Sample means follow a normal distribution → we can calculate probabilities!
Convert to Z-score: Z = (X̄ − μ) / (σ/√n)
Result: Z ~ N(0, 1)
Transform to standard normal
Determine zα/2:
Example: 95% CI → z = 1.96
Based on confidence level
Final Formula: X̄ ± zα/2 · σ/√n
Range: [Lower, Upper]
Captures true μ with confidence
Without CLT:
We wouldn't know the distribution of X̄, so we couldn't calculate probabilities or build intervals!
With CLT:
We know X̄ ~ Normal → use Z-scores → quantify uncertainty precisely!
A Z-score (or standard score) measures how many standard deviations a value is from the mean. For sampling distributions:
Z = (X̄ − μ) / SE, where SE = standard error = σ/√n
| Confidence Level | α | α/2 | Critical Value zα/2 | Interpretation |
|---|---|---|---|---|
| 90% | 0.10 | 0.05 | 1.645 | 90% of intervals will contain μ |
| 95% | 0.05 | 0.025 | 1.96 | Most common! 95% capture rate |
| 99% | 0.01 | 0.005 | 2.576 | Very high confidence, wider interval |
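The critical values in the table come from the inverse CDF of the standard normal; Python's standard library can reproduce them directly (`statistics.NormalDist` requires Python 3.8+).

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
critical = {}
for level in (0.90, 0.95, 0.99):
    alpha = 1 - level
    # Two-sided critical value: the point with upper-tail area alpha/2
    critical[level] = std_normal.inv_cdf(1 - alpha / 2)
    print(f"{level:.0%}: z = {critical[level]:.3f}")
# prints 1.645, 1.960, 2.576 for 90%, 95%, 99%
```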
You trained a classifier and tested it on n = 100 samples. The observed accuracy is X̄ = 0.85 (85%), with standard deviation σ = 0.15.
Question: What is the confidence interval for the true accuracy at different confidence levels?
Sample Size
n = 100
Sample Mean
X̄ = 0.85
Std Deviation
σ = 0.15
Standard Error
SE = 0.015
SE = σ/√n = 0.15/√100 = 0.15/10 = 0.015
90% Confidence Interval
Calculation: 0.85 ± 1.645 × 0.015 = [0.8253, 0.8747]
Width: ±2.47%
95% Confidence Interval
Calculation: 0.85 ± 1.96 × 0.015 = [0.8206, 0.8794]
Width: ±2.94%
99% Confidence Interval
Calculation: 0.85 ± 2.576 × 0.015 = [0.8114, 0.8886]
Width: ±3.86%
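The worked example can be checked with a few lines of Python (again using `statistics.NormalDist`, Python 3.8+):

```python
from statistics import NormalDist

x_bar, sigma, n = 0.85, 0.15, 100
se = sigma / n ** 0.5  # standard error = 0.015

for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # critical value
    half_width = z * se
    lo, hi = x_bar - half_width, x_bar + half_width
    print(f"{level:.0%} CI: [{lo:.4f}, {hi:.4f}]  (±{half_width:.2%})")
```

The half-widths come out to roughly ±2.47%, ±2.94%, and ±3.86%, matching the three intervals above.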
Sample random numbers from a normal distribution (true mean = 50, σ = 10). As you add more samples, watch how the confidence interval shrinks and variance decreases, demonstrating the power of larger sample sizes!
(Interactive demo readouts: Samples (n), Sample Mean, Std Dev (s), CI Lower, CI Upper, CI Width)
CI bounds narrow as n increases
Variance estimate stabilizes with more data
Observe: As n increases: (1) CI width shrinks as 1/√n, (2) Sample variance stabilizes around σ² = 100, (3) Mean stays close to true value of 50. This is CLT + LLN in action!
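A quick numerical check of the 1/√n shrinkage described above, with illustrative sample sizes of 100 and 10,000 (stdlib only):

```python
import random
import statistics

rng = random.Random(3)

def ci_width(n, mu=50.0, sigma=10.0):
    """Width of a 95% CI for the mean, from n draws of Normal(mu, sigma)."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    se = statistics.stdev(xs) / n ** 0.5
    return 2 * 1.96 * se

w_100, w_10000 = ci_width(100), ci_width(10_000)
print(w_100, w_10000, w_100 / w_10000)  # ratio near sqrt(10000/100) = 10
```

Because width scales as 1/√n, multiplying the sample size by 100 divides the CI width by about 10.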
These two theorems form the mathematical foundation that makes machine learning possible. They transform uncertainty into predictability, and randomness into reliable patterns.
Step 1: Collect Many Samples
Increase sample size
Step 2: Calculate Average
Step 3: Approaches True Mean
Sample mean → True mean
Step 1: Start with ANY Distribution
Original data shape doesn't matter!
Step 2: Run Many Iterations, Calculate Mean Each Time
Iteration 1
Sample n points
Iteration 2
Sample n points
Iteration 3
Sample n points
Iteration m
Sample n points
Collected Means:
Each iteration: sample n → calculate mean
Step 3: Distribution of These Means is Normal!
Always becomes bell curve!
Collect Data
Sample from population
Make Estimates
LLN guarantees accuracy
Quantify Uncertainty
CLT gives confidence intervals
Train on
n=10,000
Accuracy
With CI
LLN: Large n → True accuracy
CLT: Confidence interval
n=100: CI = ±2% | n=10,000: CI = ±0.4%
Problem: Full dataset gradient is too slow!
Full Data
(slow)
Mini-batch
(fast!)
≈ True Gradient
LLN: Batch average → True gradient
CLT: Noise ~ N(0, σ²/n)
Larger batch → More accurate + Smoother convergence!
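This is not a real training loop, but the noise-vs-batch-size claim can be illustrated by treating per-example "gradients" as noisy draws around a true value and measuring how batch averages concentrate. All names and sizes here are hypothetical.

```python
import random
import statistics

rng = random.Random(5)
true_grad = 2.0
# Hypothetical per-example gradients: the true value plus unit noise.
per_example = [true_grad + rng.gauss(0, 1) for _ in range(100_000)]

def batch_noise(batch_size, trials=2000):
    """Std dev of the batch-averaged gradient across many random batches."""
    batch_means = [statistics.fmean(rng.sample(per_example, batch_size))
                   for _ in range(trials)]
    return statistics.stdev(batch_means)

noise_8, noise_512 = batch_noise(8), batch_noise(512)
print(noise_8, noise_512)  # CLT: noise shrinks roughly as 1/sqrt(batch size)
```

Going from batch size 8 to 512 (a 64× increase) cuts the gradient noise by roughly √64 = 8×, which is exactly the "larger batch → smoother convergence" effect.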
Ensemble
Avg
LLN: Many trees → Accurate
CLT: Variance ∝ 1/n
100 trees ≫ 10 trees, even if individual trees are weak!
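A toy sketch of the ensemble point: each "tree" is a noisy predictor of a target, and averaging many of them cuts the error. The noise model and tree counts are illustrative, not from any real ensemble library.

```python
import random
import statistics

rng = random.Random(11)
target = 10.0

def weak_prediction():
    """A single 'tree': right on average, but individually quite noisy."""
    return target + rng.gauss(0, 3)

def mean_abs_error(n_trees, trials=3000):
    """Average error of an n_trees ensemble, over many trials."""
    errors = [abs(statistics.fmean(weak_prediction() for _ in range(n_trees))
                  - target)
              for _ in range(trials)]
    return statistics.fmean(errors)

err_10, err_100 = mean_abs_error(10), mean_abs_error(100)
print(err_10, err_100)  # the 100-tree ensemble is markedly more accurate
```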
Is Algorithm B better than Algorithm A?
Baseline
New
Statistical
Test
LLN: Average → True performance
CLT: p-values & CI
Rule of thumb: need n ≥ 30 samples for valid statistical conclusions!
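A minimal sketch of such a comparison: a two-sample z-test on mean performance, which leans on the CLT for the sampling distribution of each mean. The score lists below are fabricated illustrative numbers, not real benchmark results.

```python
import statistics
from statistics import NormalDist

# Hypothetical accuracy scores from 30 evaluation runs of each algorithm.
a_scores = [0.80 + 0.001 * (i % 15) for i in range(30)]  # baseline A
b_scores = [0.83 + 0.001 * (i % 15) for i in range(30)]  # candidate B

diff = statistics.fmean(b_scores) - statistics.fmean(a_scores)
# Standard error of the difference of two independent means
se = (statistics.variance(a_scores) / len(a_scores) +
      statistics.variance(b_scores) / len(b_scores)) ** 0.5
z = diff / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
ci_95 = (diff - 1.96 * se, diff + 1.96 * se)  # 95% CI for the difference
print(diff, z, p_value, ci_95)
```

Here the observed gap is 0.03 with a tiny p-value, so B's advantage is statistically significant; with noisier scores or fewer runs, the same gap could easily be inconclusive.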
Without LLN and CLT, machine learning would be impossible. These theorems guarantee that: (1) Learning from finite samples works, and (2) We can quantify our uncertainty about what we learned. Every time you train a model, compute a confidence interval, or run an A/B test, you're relying on these fundamental results!
These theorems transform randomness into predictability.
They're the reason we can trust machine learning in the real world!