
Statistical Theorems for Machine Learning

Why Randomness Becomes Predictable

The Magic of Large Numbers

The Fundamental Question

How can we trust statistical inference?

  • 🎲 Individual coin flips are unpredictable
  • 📊 But the average of many flips becomes predictable
  • 🤖 This is why machine learning works!
📈
Law of Large Numbers

Sample average converges to true mean
"More data → Better estimates"

🔔
Central Limit Theorem

Sample means follow normal distribution
"Averages become Gaussian"

Why These Theorems Matter

These two theorems are the mathematical foundation of all statistical inference and machine learning. They explain why we can learn from data, why averaging reduces uncertainty, and why statistical methods work in practice.

Law of Large Numbers (LLN)

The Statement

As the sample size increases, the sample mean converges to the true population mean.

X̄n → μ as n → ∞

"Sample average approaches population mean as sample size grows"

The Intuition
❌ Small Sample (n = 10)

Example: Fair coin (true probability = 0.5)

  • Run 1: 7 heads, 3 tails → 0.70
  • Run 2: 4 heads, 6 tails → 0.40
  • Run 3: 6 heads, 4 tails → 0.60

High variance!
Results vary wildly (0.40 to 0.70)

✓ Large Sample (n = 10,000)

Example: Fair coin (true probability = 0.5)

  • Run 1: 4,987 heads → 0.4987
  • Run 2: 5,012 heads → 0.5012
  • Run 3: 4,998 heads → 0.4998

Low variance!
All close to true value (0.5)

Interactive Simulation: Dice Roll Averages

Roll a fair die many times and watch the average converge to 3.5
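A minimal stdlib-only sketch of this simulation (the function name and roll counts are illustrative):

```python
import random

def dice_average(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and return the average."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

# LLN in action: the error relative to the true mean 3.5 shrinks as n grows
for n in (10, 100, 10_000, 1_000_000):
    avg = dice_average(n)
    print(f"n = {n:>9,}  average = {avg:.4f}  error = {abs(avg - 3.5):.4f}")
```

With a fixed seed the run is reproducible; different seeds give different small-n averages but the same convergence toward 3.5.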


Why This Matters for ML
  • Training: More training data → better model estimates
  • Validation: Larger validation set → more reliable performance estimate
  • Monte Carlo methods: More samples → better approximation
  • Ensemble methods: Averaging many models reduces variance
  • Gradient descent: Mini-batch averages approximate true gradient

Central Limit Theorem (CLT)

The Statement

The distribution of sample means approaches a normal distribution as sample size increases, regardless of the population's original distribution!

X̄n ~ N(μ, σ²/n) as n → ∞

"Sample means become normally distributed with variance shrinking as 1/n"

The Magic: Any Distribution → Normal

Key Insight: Even if individual data points follow a weird distribution (uniform, exponential, bimodal), their averages will follow a bell curve!

Original Data

📊

Any distribution
(uniform, exponential, etc.)

Take Averages

➗

Average groups of
n samples each

Result

🔔

Normal distribution!
(bell curve)

Why "Central"? Because this theorem is CENTRAL to all of statistics! It explains why the normal distribution appears everywhere in nature and science.

Interactive Simulation: From Uniform to Normal

Start with uniform distribution (dice), take averages, watch it become normal!

Original Distribution
Distribution of Sample Means
Why This Matters for ML
  • Confidence intervals: We can compute error bars assuming normality
  • Hypothesis testing: T-tests, z-tests rely on CLT
  • Neural networks: Weight initialization often uses Gaussian
  • Batch gradient descent: Mini-batch gradients are approximately normal
  • Ensemble averaging: Combining predictions follows normal distribution
  • Error analysis: Prediction errors often approximately normal

Step-by-Step CLT Demo: Building the Normal Distribution

The Process: Watch how repeated sampling transforms ANY distribution into a normal distribution!

Step 1

📊

Pick source distribution

Step 2

🎲

Sample n values

Step 3

📊

Calculate mean

Step 4

🔄

Repeat many times

⬇️
🔔

Normal Distribution of Means!

Simulation panels: Source Distribution (can be ANY shape) → Current Sample (n values) → Distribution of Means (becomes Normal! 🔔)

Key Observation: No matter what the source distribution looks like (left), the distribution of sample means (right) always becomes bell-shaped (normal) as you repeat the process many times!
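The four-step loop above can be sketched with the die from earlier as the (uniform) source distribution; the choices n = 30 and m = 5,000 are illustrative:

```python
import random
import statistics

def sample_means(n=30, m=5000, seed=0):
    """Repeat m times: draw n die rolls and record their mean."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.randint(1, 6) for _ in range(n))
            for _ in range(m)]

means = sample_means()
# CLT prediction: means ~ N(mu, sigma^2/n) with mu = 3.5, sigma^2 = 35/12
print(f"mean of means: {statistics.fmean(means):.3f}   (CLT predicts 3.5)")
print(f"std of means:  {statistics.pstdev(means):.3f}   "
      f"(CLT predicts {((35 / 12) / 30) ** 0.5:.3f})")
```

A histogram of `means` would show the bell curve, even though individual rolls are uniform on {1, …, 6}.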

Confidence Intervals: Quantifying Uncertainty

Why Confidence Intervals Matter

The Central Limit Theorem doesn't just tell us that sample means are normally distributed; it also lets us quantify our uncertainty about estimates. Confidence intervals give us a range where the true population parameter likely falls, with a specific level of confidence.

How CLT Leads to Confidence Intervals
The Magic: CLT → Normal Distribution → Confidence Intervals
🔔

CLT Says

Means are Normal

→
📊

Properties Known

Use Z-scores

→
📏

Build Interval

With confidence!

CLT Foundation:

X̄ ~ N(μ, σ²/n)

Sample means follow a normal distribution → We can calculate probabilities!

Three-Step Process to Build Confidence Intervals
1️⃣
Standardize

Convert to Z-score:

Z = (X̄ − μ) / (σ/√n)

Result: Z ~ N(0, 1)

Transform to standard normal

2️⃣
Find Critical Value

Determine zα/2:

P(−zα/2 < Z < zα/2) = 1 − α

Example: 95% CI → z = 1.96

Based on confidence level

3️⃣
Build Interval

Final Formula:

X̄ ± zα/2 · σ/√n

Range: [Lower, Upper]

Captures true μ with confidence

Why This Works: The Power of CLT

Without CLT:

We wouldn't know the distribution of X̄, so we couldn't calculate probabilities or build intervals!

With CLT:

We know X̄ ~ Normal → Use Z-scores → Quantify uncertainty precisely!

Understanding Z-Scores & Critical Values
What is a Z-Score?

A Z-score (or standard score) measures how many standard deviations a value is from the mean. For sampling distributions:

Z = (X̄ − μ) / (σ/√n) = (X̄ − μ) / SE

where SE = Standard Error = σ/√n

Common Confidence Levels & Critical Values
Confidence Level | α    | α/2   | Critical Value zα/2 | Interpretation
90%              | 0.10 | 0.05  | 1.645               | 90% of intervals will contain μ
95%              | 0.05 | 0.025 | 1.96                | Most common! 95% capture rate
99%              | 0.01 | 0.005 | 2.576               | Very high confidence, wider interval
Rule of Thumb: For quick calculations, remember z ≈ 2 gives you approximately a 95% confidence interval!
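The critical values in the table come from the inverse CDF of the standard normal; Python's standard library can reproduce them (the helper name is illustrative):

```python
from statistics import NormalDist

def critical_value(confidence):
    """Two-sided critical value z such that P(-z < Z < z) = confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

# Reproduce the table above
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {critical_value(level):.3f}")
```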
Examples: Confidence Intervals at Different Levels
Scenario: Model Accuracy Estimation

You trained a classifier and tested it on n = 100 samples. The observed accuracy is X̄ = 0.85 (85%), with standard deviation σ = 0.15.

Question: What is the confidence interval for the true accuracy at different confidence levels?

Given Information

Sample Size

n = 100

Sample Mean

X̄ = 0.85

Std Deviation

σ = 0.15

Standard Error

SE = 0.015

SE = σ/√n = 0.15/√100 = 0.15/10 = 0.015

90% Confidence Interval

Calculation:

CI = 0.85 ± 1.645 × 0.015
CI = 0.85 ± 0.0247
CI = [0.8253, 0.8747]

90% Confidence Interval

[82.5%, 87.5%]

Width: ±2.47%

Interpretation: We are 90% confident that the true model accuracy lies between 82.5% and 87.5%.
95% Confidence Interval ⭐ (Most Common)

Calculation:

CI = 0.85 ± 1.96 × 0.015
CI = 0.85 ± 0.0294
CI = [0.8206, 0.8794]

95% Confidence Interval

[82.1%, 87.9%]

Width: ±2.94%

Interpretation: We are 95% confident that the true model accuracy lies between 82.1% and 87.9%.
This is the standard in ML research and industry!
99% Confidence Interval

Calculation:

CI = 0.85 ± 2.576 × 0.015
CI = 0.85 ± 0.0386
CI = [0.8114, 0.8886]

99% Confidence Interval

[81.1%, 88.9%]

Width: ±3.86%

Interpretation: We are 99% confident that the true model accuracy lies between 81.1% and 88.9%.
Higher confidence means wider interval!
Comparing All Three Confidence Levels
X̄ = 85% in all cases:
  • 90% CI: [82.5%, 87.5%]
  • 95% CI: [82.1%, 87.9%] ⭐
  • 99% CI: [81.1%, 88.9%]
Key Insight: Higher confidence level → Wider interval. There's a trade-off between confidence and precision!
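The three intervals above can be reproduced in a few lines (a sketch; the function name is illustrative). The printed intervals match the worked values:

```python
import math

def confidence_interval(mean, sigma, n, z):
    """Normal-approximation CI: mean ± z · σ/√n."""
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin

x_bar, sigma, n = 0.85, 0.15, 100  # the model-accuracy example above
for label, z in (("90%", 1.645), ("95%", 1.96), ("99%", 2.576)):
    lo, hi = confidence_interval(x_bar, sigma, n, z)
    print(f"{label} CI: [{lo:.4f}, {hi:.4f}]")
```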
Interactive Demo: Watch Confidence Intervals Shrink!
How It Works

Sample random numbers from a normal distribution (true mean = 50, σ = 10). As you add more samples, watch how the confidence interval shrinks and variance decreases, demonstrating the power of larger sample sizes!


Confidence Interval Convergence

CI bounds narrow as n increases

Sample Variance Convergence

Variance estimate stabilizes with more data

Observe: As n increases: (1) CI width shrinks as 1/√n, (2) Sample variance stabilizes around σ²=100, (3) Mean stays close to true value of 50. This is CLT + LLN in action!
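The 1/√n shrinkage can be seen without any sampling at all: with known σ, a 95% interval has width 2·z·σ/√n (a sketch using σ = 10 as in the demo):

```python
import math

def ci_width(n, sigma=10.0, z=1.96):
    """Width of a 95% CI for the mean when sigma is known: 2·z·σ/√n."""
    return 2 * z * sigma / math.sqrt(n)

# Quadrupling n halves the width: the 1/sqrt(n) law
for n in (25, 100, 400, 1600):
    print(f"n = {n:>4}: width = {ci_width(n):.2f}")
```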

How These Theorems Enable Machine Learning

The Big Picture

These two theorems form the mathematical foundation that makes machine learning possible. They transform uncertainty into predictability, and randomness into reliable patterns.

How the Theorems Work Together
📈
Law of Large Numbers

Step 1: Collect Many Samples

n ↑↑↑

Increase sample size

⬇️

Step 2: Calculate Average

X̄n = (1/n) Σ Xi
⬇️

Step 3: Approaches True Mean

X̄n → μ

Sample mean = True mean

💡 Key Idea: More data → Better estimate
Sample average converges to true mean
🔔
Central Limit Theorem

Step 1: Start with ANY Distribution

Uniform
Skewed
Bimodal

Original data shape doesn't matter!

⬇️ Repeat Experiments ⬇️

Step 2: Run Many Iterations, Calculate Mean Each Time

Iteration 1

Sample n points

→ X̄₁

Iteration 2

Sample n points

→ X̄₂

Iteration 3

Sample n points

→ X̄₃
...

Iteration m

Sample n points

→ X̄m

📊 Collected Means:

X̄₁ X̄₂ X̄₃ ... X̄m

Each iteration: sample n → calculate mean

⬇️

Step 3: Distribution of These Means is Normal!

🔔
X̄ ~ N(μ, σ²/n)

Always becomes bell curve!

💡 Key Idea: Averaging → Normal distribution
Variance shrinks as σ²/n (tighter confidence!)
🤖 ML Application: Run many experiments (m trials), each gives a mean accuracy. These m means follow Normal(μ, σ²/n), allowing us to estimate confidence intervals!
Combined Power: LLN + CLT = ML Magic ✨
📊

Collect Data

Sample from population

📈

Make Estimates

LLN guarantees accuracy

🎯

Quantify Uncertainty

CLT gives confidence intervals

Concrete ML Examples
1️⃣ Model Training & Validation
🤖

Train on
n=10,000

→
95%

Accuracy

→
95±0.4%

With CI

📈

LLN: Large n → True accuracy

🔔

CLT: Confidence interval

n=100: CI=±4% 😟 | n=10,000: CI=±0.4% ✓

2️⃣ Stochastic Gradient Descent (SGD)

⚠️ Problem: Full dataset gradient is too slow!

🗄️

Full Data
(slow)

⚡
📦

Mini-batch
(fast!)

→
📉

∇ ≈ True
Gradient

📈

LLN: Batch avg → True ∇

🔔

CLT: Noise ~ N(0, σ²/n)

💡 Larger batch → More accurate + Smoother convergence!
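A toy sketch of this claim, assuming a 1-D linear model on synthetic data (all names and sizes are illustrative): each mini-batch gradient is an average over samples, so by the LLN it tracks the full-data gradient, and by the CLT its noise shrinks as the batch grows.

```python
import random

rng = random.Random(0)

# Synthetic data for y ≈ w·x with true w = 3 plus a little noise
data = []
for _ in range(10_000):
    x = rng.uniform(-1, 1)
    data.append((x, 3 * x + rng.gauss(0, 0.1)))

def gradient(w, batch):
    """d/dw of the batch MSE (1/m)·Σ(w·x − y)²."""
    return 2 * sum((w * x - y) * x for x, y in batch) / len(batch)

full = gradient(0.0, data)  # "true" gradient over the whole dataset
for batch_size in (10, 100, 1000):
    mini = gradient(0.0, rng.sample(data, batch_size))
    print(f"batch {batch_size:>4}: grad = {mini:+.3f}  (full = {full:+.3f})")
```

Larger batches land closer to the full-data gradient on average, which is exactly the smoother convergence the slide describes.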

3️⃣ Random Forests & Ensembles
🌳
🌳
🌳
...
🌳
→ ➗ →
🎯

Ensemble
Avg

📈

LLN: Many trees → Accurate

🔔

CLT: Variance ∝ 1/n

💡 100 trees ≫ 10 trees, even if individual trees are weak!
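A sketch of the variance-reduction effect, with each "tree" replaced by an independent, unbiased but noisy estimator (a simplifying assumption; real trees are correlated, which limits how far the variance actually drops):

```python
import random
import statistics

rng = random.Random(0)

def weak_prediction(true_value=10.0, noise=2.0):
    """One noisy but unbiased base learner."""
    return true_value + rng.gauss(0, noise)

def ensemble_prediction(n_trees):
    """Average the predictions of n_trees base learners."""
    return statistics.fmean(weak_prediction() for _ in range(n_trees))

# Spread of the ensemble output over repeated runs shrinks ~ 1/sqrt(n_trees)
for n_trees in (1, 10, 100):
    preds = [ensemble_prediction(n_trees) for _ in range(2000)]
    print(f"{n_trees:>3} trees: std of prediction = {statistics.pstdev(preds):.3f}")
```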

4️⃣ A/B Testing & Experiment Design

Is Algorithm B better than Algorithm A? 🤔

A

Baseline

85%
vs
B

New

87%
→
📊

Statistical
Test

📈

LLN: Avg → True performance

🔔

CLT: p-values & CI

⚠️ Rule of thumb: collect at least n ≈ 30 samples per variant before relying on the normal approximation!
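One standard way to run this comparison is a two-proportion z-test, sketched below; the counts (850/1000 vs 870/1000) extend the 85% vs 87% example and are purely illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test of H0: p_a == p_b, using a pooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z_test(850, 1000, 870, 1000)
print(f"z = {z:.2f}, p-value = {p:.3f}")
```

With these numbers the p-value lands well above 0.05, so a 2-point gap at n = 1000 per variant is not statistically significant: exactly why sample-size planning matters.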

The Bottom Line

Without LLN and CLT, machine learning would be impossible. These theorems guarantee that: (1) Learning from finite samples works, and (2) We can quantify our uncertainty about what we learned. Every time you train a model, compute a confidence interval, or run an A/B test, you're relying on these fundamental results!

Key Takeaways

Core Concepts
  1. LLN: Sample mean → True mean as n → ∞
  2. CLT: Sample means → Normal distribution
  3. Variance shrinks: Uncertainty ∝ 1/√n
  4. Universal: Works for any distribution
  5. Foundation: Makes statistical inference possible
ML Applications
  • Training: empirical risk minimization
  • Validation: confidence intervals
  • SGD: mini-batch gradient estimation
  • Ensembles: variance reduction
  • A/B testing: statistical significance
  • Monte Carlo: approximation guarantees
The Mathematical Foundation of ML

These theorems transform randomness into predictability.
They're the reason we can trust machine learning in the real world!

Next: You now understand the statistical foundations! Ready to apply these concepts to build real AI systems. Back to Lecture 10 →