
Statistical Theorems for Machine Learning

Why Randomness Becomes Predictable

The Magic of Large Numbers

The Fundamental Question

How can we trust statistical inference?

  • 🎲 Individual coin flips are unpredictable
  • 📊 But the average of many flips becomes predictable
  • 🤖 This is why machine learning works!
📈
Law of Large Numbers

Sample average converges to true mean
"More data → Better estimates"

🔔
Central Limit Theorem

Sample means follow normal distribution
"Averages become Gaussian"

Why These Theorems Matter

These two theorems are the mathematical foundation of all statistical inference and machine learning. They explain why we can learn from data, why averaging reduces uncertainty, and why statistical methods work in practice.

Law of Large Numbers (LLN)

The Statement

As the sample size increases, the sample mean converges to the true population mean.

X̄n → μ as n → ∞

"Sample average approaches population mean as sample size grows"

The Intuition
❌ Small Sample (n = 10)

Example: Fair coin (true probability = 0.5)

  • Run 1: 7 heads, 3 tails → 0.70
  • Run 2: 4 heads, 6 tails → 0.40
  • Run 3: 6 heads, 4 tails → 0.60

High variance!
Results vary wildly (0.40 to 0.70)

✓ Large Sample (n = 10,000)

Example: Fair coin (true probability = 0.5)

  • Run 1: 4,987 heads → 0.4987
  • Run 2: 5,012 heads → 0.5012
  • Run 3: 4,998 heads → 0.4998

Low variance!
All close to true value (0.5)

Interactive Simulation: Dice Roll Averages

Roll a fair die many times and watch the average converge to 3.5
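A minimal stdlib-only sketch of this simulation (the function name and roll counts are illustrative):

```python
import random

def dice_average(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and return the average."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

# LLN in action: the error relative to the true mean 3.5 shrinks as n grows
for n in (10, 100, 10_000, 1_000_000):
    avg = dice_average(n)
    print(f"n = {n:>9,}  average = {avg:.4f}  error = {abs(avg - 3.5):.4f}")
```

With a fixed seed the run is reproducible; different seeds give different small-n averages but the same convergence toward 3.5.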


Why This Matters for ML
  • Training: More training data → better model estimates
  • Validation: Larger validation set → more reliable performance estimate
  • Monte Carlo methods: More samples → better approximation
  • Ensemble methods: Averaging many models reduces variance
  • Gradient descent: Mini-batch averages approximate true gradient

Central Limit Theorem (CLT)

The Statement

The distribution of sample means approaches a normal distribution as sample size increases, regardless of the population's original distribution!

X̄n ~ N(μ, σ²/n) as n → ∞

"Sample means become normally distributed with variance shrinking as 1/n"

The Magic: Any Distribution → Normal

Key Insight: Even if individual data points follow a weird distribution (uniform, exponential, bimodal), their averages will follow a bell curve!

Original Data

📊

Any distribution
(uniform, exponential, etc.)

Take Averages

➗

Average groups of
n samples each

Result

🔔

Normal distribution!
(bell curve)

Why "Central"? Because this theorem is CENTRAL to all of statistics! It explains why the normal distribution appears everywhere in nature and science.

Interactive Simulation: From Uniform to Normal

Start with uniform distribution (dice), take averages, watch it become normal!

Original Distribution
Distribution of Sample Means
Why This Matters for ML
  • Confidence intervals: We can compute error bars assuming normality
  • Hypothesis testing: T-tests, z-tests rely on CLT
  • Neural networks: Weight initialization often uses Gaussian
  • Batch gradient descent: Mini-batch gradients are approximately normal
  • Ensemble averaging: Combining predictions follows normal distribution
  • Error analysis: Prediction errors often approximately normal

Step-by-Step CLT Demo: Building the Normal Distribution

The Process: Watch how repeated sampling transforms ANY distribution into a normal distribution!

Step 1

📊

Pick source distribution

Step 2

🎲

Sample n values

Step 3

📊

Calculate mean

Step 4

🔄

Repeat many times

⬇️
🔔

Normal Distribution of Means!

Simulation panels: Source Distribution (can be ANY shape) → Current Sample (n values) → Distribution of Means (becomes Normal! 🔔)

Key Observation: No matter what the source distribution looks like (left), the distribution of sample means (right) always becomes bell-shaped (normal) as you repeat the process many times!
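The four-step loop above can be sketched with the die from earlier as the (uniform) source distribution; the choices n = 30 and m = 5,000 are illustrative:

```python
import random
import statistics

def sample_means(n=30, m=5000, seed=0):
    """Repeat m times: draw n die rolls and record their mean."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.randint(1, 6) for _ in range(n))
            for _ in range(m)]

means = sample_means()
# CLT prediction: means ~ N(mu, sigma^2/n) with mu = 3.5, sigma^2 = 35/12
print(f"mean of means: {statistics.fmean(means):.3f}   (CLT predicts 3.5)")
print(f"std of means:  {statistics.pstdev(means):.3f}   "
      f"(CLT predicts {((35 / 12) / 30) ** 0.5:.3f})")
```

A histogram of `means` would show the bell curve, even though individual rolls are uniform on {1, …, 6}.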

Confidence Intervals: Quantifying Uncertainty

Why Confidence Intervals Matter

The Central Limit Theorem doesn't just tell us that sample means are normally distributed; it also lets us quantify our uncertainty about estimates. Confidence intervals give us a range where the true population parameter likely falls, with a specific level of confidence.

How CLT Leads to Confidence Intervals
The Magic: CLT → Normal Distribution → Confidence Intervals
🔔

CLT Says

Means are Normal

→
📊

Properties Known

Use Z-scores

→
📏

Build Interval

With confidence!

CLT Foundation:

X̄ ~ N(μ, σ²/n)

Sample means follow a normal distribution → We can calculate probabilities!

Three-Step Process to Build Confidence Intervals
1️⃣
Standardize

Convert to Z-score:

Z = (X̄ − μ) / (σ/√n)

Result: Z ~ N(0, 1)

Transform to standard normal

2️⃣
Find Critical Value

Determine zα/2:

P(−zα/2 < Z < zα/2) = 1 − α

Example: 95% CI → z = 1.96

Based on confidence level

3️⃣
Build Interval

Final Formula:

X̄ ± zα/2 · σ/√n

Range: [Lower, Upper]

Captures true μ with confidence

Why This Works: The Power of CLT

Without CLT:

We wouldn't know the distribution of X̄, so we couldn't calculate probabilities or build intervals!

With CLT:

We know X̄ ~ Normal → Use Z-scores → Quantify uncertainty precisely!

Understanding Z-Scores & Critical Values
What is a Z-Score?

A Z-score (or standard score) measures how many standard deviations a value is from the mean. For sampling distributions:

Z = (X̄ − μ) / (σ/√n) = (X̄ − μ) / SE

where SE = Standard Error = σ/√n

Common Confidence Levels & Critical Values
Confidence Level | α    | α/2   | Critical Value zα/2 | Interpretation
90%              | 0.10 | 0.05  | 1.645               | 90% of intervals will contain μ
95%              | 0.05 | 0.025 | 1.96                | Most common! 95% capture rate
99%              | 0.01 | 0.005 | 2.576               | Very high confidence, wider interval
Rule of Thumb: For quick calculations, remember z ≈ 2 gives you approximately a 95% confidence interval!
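The critical values in the table come from the inverse CDF of the standard normal; Python's standard library can reproduce them (the helper name is illustrative):

```python
from statistics import NormalDist

def critical_value(confidence):
    """Two-sided critical value z such that P(-z < Z < z) = confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

# Reproduce the table above
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {critical_value(level):.3f}")
```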
Examples: Confidence Intervals at Different Levels
Scenario: Model Accuracy Estimation

You trained a classifier and tested it on n = 100 samples. The observed accuracy is X̄ = 0.85 (85%), with standard deviation σ = 0.15.

Question: What is the confidence interval for the true accuracy at different confidence levels?

Given Information

Sample Size

n = 100

Sample Mean

X̄ = 0.85

Std Deviation

σ = 0.15

Standard Error

SE = 0.015

SE = σ/√n = 0.15/√100 = 0.15/10 = 0.015

90% Confidence Interval

Calculation:

CI = 0.85 ± 1.645 × 0.015
CI = 0.85 ± 0.0247
CI = [0.8253, 0.8747]

90% Confidence Interval

[82.5%, 87.5%]

Width: ±2.47%

Interpretation: We are 90% confident that the true model accuracy lies between 82.5% and 87.5%.
95% Confidence Interval ⭐ (Most Common)

Calculation:

CI = 0.85 ± 1.96 × 0.015
CI = 0.85 ± 0.0294
CI = [0.8206, 0.8794]

95% Confidence Interval

[82.1%, 87.9%]

Width: ±2.94%

Interpretation: We are 95% confident that the true model accuracy lies between 82.1% and 87.9%.
This is the standard in ML research and industry!
99% Confidence Interval

Calculation:

CI = 0.85 ± 2.576 × 0.015
CI = 0.85 ± 0.0386
CI = [0.8114, 0.8886]

99% Confidence Interval

[81.1%, 88.9%]

Width: ±3.86%

Interpretation: We are 99% confident that the true model accuracy lies between 81.1% and 88.9%.
Higher confidence means wider interval!
Comparing All Three Confidence Levels
X̄ = 85% in all cases:
  • 90% CI: [82.5%, 87.5%]
  • 95% CI: [82.1%, 87.9%] ⭐
  • 99% CI: [81.1%, 88.9%]
Key Insight: Higher confidence level → Wider interval. There's a trade-off between confidence and precision!
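The three intervals above can be reproduced in a few lines (a sketch; the function name is illustrative). The printed intervals match the worked values:

```python
import math

def confidence_interval(mean, sigma, n, z):
    """Normal-approximation CI: mean ± z · σ/√n."""
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin

x_bar, sigma, n = 0.85, 0.15, 100  # the model-accuracy example above
for label, z in (("90%", 1.645), ("95%", 1.96), ("99%", 2.576)):
    lo, hi = confidence_interval(x_bar, sigma, n, z)
    print(f"{label} CI: [{lo:.4f}, {hi:.4f}]")
```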
Interactive Demo: Watch Confidence Intervals Shrink!
How It Works

Sample random numbers from a normal distribution (true mean = 50, σ = 10). As you add more samples, watch how the confidence interval shrinks and variance decreases, demonstrating the power of larger sample sizes!


Confidence Interval Convergence

CI bounds narrow as n increases

Sample Variance Convergence

Variance estimate stabilizes with more data

Observe: As n increases: (1) CI width shrinks as 1/√n, (2) Sample variance stabilizes around σ²=100, (3) Mean stays close to true value of 50. This is CLT + LLN in action!
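The 1/√n shrinkage can be seen without any sampling at all: with known σ, a 95% interval has width 2·z·σ/√n (a sketch using σ = 10 as in the demo):

```python
import math

def ci_width(n, sigma=10.0, z=1.96):
    """Width of a 95% CI for the mean when sigma is known: 2·z·σ/√n."""
    return 2 * z * sigma / math.sqrt(n)

# Quadrupling n halves the width: the 1/sqrt(n) law
for n in (25, 100, 400, 1600):
    print(f"n = {n:>4}: width = {ci_width(n):.2f}")
```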

How These Theorems Enable Machine Learning

The Big Picture

These two theorems form the mathematical foundation that makes machine learning possible. They transform uncertainty into predictability, and randomness into reliable patterns.

How the Theorems Work Together
📈
Law of Large Numbers

Step 1: Collect Many Samples

n ↑↑↑

Increase sample size

⬇️

Step 2: Calculate Average

X̄n = (1/n) Σ Xi
⬇️

Step 3: Approaches True Mean

X̄n → μ

Sample mean = True mean

💡 Key Idea: More data → Better estimate
Sample average converges to true mean
🔔
Central Limit Theorem

Step 1: Start with ANY Distribution

Uniform
Skewed
Bimodal

Original data shape doesn't matter!

⬇️ Repeat Experiments ⬇️

Step 2: Run Many Iterations, Calculate Mean Each Time

Iteration 1

Sample n points

→ X̄₁

Iteration 2

Sample n points

→ X̄₂

Iteration 3

Sample n points

→ X̄₃
...

Iteration m

Sample n points

→ X̄m

📊 Collected Means:

X̄₁ X̄₂ X̄₃ ... X̄m

Each iteration: sample n → calculate mean

⬇️

Step 3: Distribution of These Means is Normal!

🔔
X̄ ~ N(μ, σ²/n)

Always becomes bell curve!

💡 Key Idea: Averaging → Normal distribution
Variance shrinks as σ²/n (tighter confidence!)
🤖 ML Application: Run many experiments (m trials), each gives a mean accuracy. These m means follow Normal(μ, σ²/n), allowing us to estimate confidence intervals!
Combined Power: LLN + CLT = ML Magic ✨
📊

Collect Data

Sample from population

📈

Make Estimates

LLN guarantees accuracy

🎯

Quantify Uncertainty

CLT gives confidence intervals

Concrete ML Examples
1️⃣ Model Training & Validation
🤖

Train on
n=10,000

→
95%

Accuracy

→
95±0.4%

With CI

📈

LLN: Large n → True accuracy

🔔

CLT: Confidence interval

n=100: CI=±4% 😟 | n=10,000: CI=±0.4% ✓

2️⃣ Stochastic Gradient Descent (SGD)

⚠️ Problem: Full dataset gradient is too slow!

🗄️

Full Data
(slow)

⚡
📦

Mini-batch
(fast!)

→
📉

∇ ≈ True
Gradient

📈

LLN: Batch avg → True ∇

🔔

CLT: Noise ~ N(0, σ²/n)

💡 Larger batch → More accurate + Smoother convergence!
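A toy sketch of this claim, assuming a 1-D linear model on synthetic data (all names and sizes are illustrative): each mini-batch gradient is an average over samples, so by the LLN it tracks the full-data gradient, and by the CLT its noise shrinks as the batch grows.

```python
import random

rng = random.Random(0)

# Synthetic data for y ≈ w·x with true w = 3 plus a little noise
data = []
for _ in range(10_000):
    x = rng.uniform(-1, 1)
    data.append((x, 3 * x + rng.gauss(0, 0.1)))

def gradient(w, batch):
    """d/dw of the batch MSE (1/m)·Σ(w·x − y)²."""
    return 2 * sum((w * x - y) * x for x, y in batch) / len(batch)

full = gradient(0.0, data)  # "true" gradient over the whole dataset
for batch_size in (10, 100, 1000):
    mini = gradient(0.0, rng.sample(data, batch_size))
    print(f"batch {batch_size:>4}: grad = {mini:+.3f}  (full = {full:+.3f})")
```

Larger batches land closer to the full-data gradient on average, which is exactly the smoother convergence the slide describes.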

3️⃣ Random Forests & Ensembles
🌳
🌳
🌳
...
🌳
→ ➗ →
🎯

Ensemble
Avg

📈

LLN: Many trees → Accurate

🔔

CLT: Variance ∝ 1/n

💡 100 trees ≫ 10 trees, even if individual trees are weak!
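A sketch of the variance-reduction effect, with each "tree" replaced by an independent, unbiased but noisy estimator (a simplifying assumption; real trees are correlated, which limits how far the variance actually drops):

```python
import random
import statistics

rng = random.Random(0)

def weak_prediction(true_value=10.0, noise=2.0):
    """One noisy but unbiased base learner."""
    return true_value + rng.gauss(0, noise)

def ensemble_prediction(n_trees):
    """Average the predictions of n_trees base learners."""
    return statistics.fmean(weak_prediction() for _ in range(n_trees))

# Spread of the ensemble output over repeated runs shrinks ~ 1/sqrt(n_trees)
for n_trees in (1, 10, 100):
    preds = [ensemble_prediction(n_trees) for _ in range(2000)]
    print(f"{n_trees:>3} trees: std of prediction = {statistics.pstdev(preds):.3f}")
```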

4️⃣ A/B Testing & Experiment Design

Is Algorithm B better than Algorithm A? 🤔

A

Baseline

85%
vs
B

New

87%
→
📊

Statistical
Test

📈

LLN: Avg → True performance

🔔

CLT: p-values & CI

⚠️ Rule of thumb: collect at least n ≈ 30 samples per variant before relying on the normal approximation!
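One standard way to run this comparison is a two-proportion z-test, sketched below; the counts (850/1000 vs 870/1000) extend the 85% vs 87% example and are purely illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test of H0: p_a == p_b, using a pooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z_test(850, 1000, 870, 1000)
print(f"z = {z:.2f}, p-value = {p:.3f}")
```

With these numbers the p-value lands well above 0.05, so a 2-point gap at n = 1000 per variant is not statistically significant: exactly why sample-size planning matters.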

The Bottom Line

Without LLN and CLT, machine learning would be impossible. These theorems guarantee that: (1) Learning from finite samples works, and (2) We can quantify our uncertainty about what we learned. Every time you train a model, compute a confidence interval, or run an A/B test, you're relying on these fundamental results!

Key Takeaways

Core Concepts
  1. LLN: Sample mean → True mean as n → ∞
  2. CLT: Sample means → Normal distribution
  3. Variance shrinks: Uncertainty ∝ 1/√n
  4. Universal: Works for any distribution
  5. Foundation: Makes statistical inference possible
ML Applications
  • Training: empirical risk minimization
  • Validation: confidence intervals
  • SGD: mini-batch gradient estimation
  • Ensembles: variance reduction
  • A/B testing: statistical significance
  • Monte Carlo: approximation guarantees
The Mathematical Foundation of ML

These theorems transform randomness into predictability.
They're the reason we can trust machine learning in the real world!

Next: You now understand the statistical foundations! Ready to apply these concepts to build real AI systems. Back to Lecture 10 →