
Foundations of Probability Theory

The Mathematics of Uncertainty

Introduction: Formalizing Uncertainty

From Intuition to Mathematics

You've seen why AI needs probability (Topic 02) and how uncertainty changes the paradigm (Topic 03). Now we formalize these concepts mathematically. Probability theory provides the rigorous foundation for reasoning under uncertainty.

What You'll Learn

Core concepts with interactive visualizations:

  1. Types of data and random variables
  2. Sample spaces, events, and the three axioms of probability
  3. Probability distributions (PMF, PDF, CDF)
  4. Law of Large Numbers - why averages converge
  5. Central Limit Theorem - why the normal distribution is everywhere

Types of Data

Qualitative vs Quantitative Data
Qualitative (Categorical)
📝

Describes characteristics or categories

Non-numerical - Labels, Names, Categories

🌤️ Weather Examples:

☀️ Sunny ☁️ Cloudy 🌧️ Rainy 🌪️ Sandstorm

🏠 Real Estate Examples:

  • City: Riyadh, Jeddah, Dammam
  • Type: Villa, Apartment, Townhouse
  • Condition: New, Renovated, Old
Quantitative (Numerical)
🔢

Numeric measurements or counts

Numerical - Numbers, Quantities, Measurements

🌡️ Weather Examples:

  • Temperature: 28°C, 35°C, 18°C
  • Rainfall: 0mm, 5mm, 25mm
  • Wind speed: 15 km/h, 30 km/h

💰 Real Estate Examples:

  • Price: 1,500,000 SAR
  • Area: 250 m², 180 m²
  • Bedrooms: 3, 4, 5
Discrete vs Continuous Data
Discrete Data
⚫ ⚫ ⚫
D = {d₁, d₂, d₃, ...}
Countable - Separate, distinct values
X ∈ {0, 1, 2, 3, ...}

Visual: Dots on a number line

{1, 2, 3, 4, 5, 6}

Examples:

  • 🎲 Dice roll: {1, 2, 3, 4, 5, 6}
  • 🌧️ Rainy days: {0, 1, 2, ..., 31}
  • 🛏️ Bedrooms: {1, 2, 3, 4, 5, ...}
Continuous Data
📈
C ⊂ ℝ
Uncountable - Any value in a range
X ∈ [a, b] ⊂ ℝ

Visual: Solid line (infinite points)

[0, ∞) - continuous range

Examples:

  • 🌡️ Temperature: [−50°C, 50°C]
  • 💧 Rainfall: [0, ∞) mm
  • 💰 Price: [0, ∞) SAR

Population vs Sample - Foundation of Statistical Inference

Population (Π)

Definition: The entire set of all possible observations

Π = {x₁, x₂, ..., x_N}

Example (Weather):

Π = All daily temperatures in Riyadh for entire year (N = 365)

μ = (1/365) Σᵢ₌₁³⁶⁵ Tᵢ
Sample (S)

Definition: A subset selected for analysis

S = {s₁, s₂, ..., sₙ} ⊂ Π where n ≪ N

Example (Weather):

S = Temperatures on 30 randomly selected days (n = 30)

x̄ = (1/30) Σᵢ₌₁³⁰ sᵢ ≈ μ
Interactive: Population vs Sample Simulator

Population of 100 Riyadh daily temperatures. Take random samples to see how sample mean estimates population mean.

Population: All 100 Days (hover to see temperature)
Population Mean (μ)

Sample Mean (x̄)

Error |x̄ - μ|

Key Insight

Sample mean (x̄) estimates population mean (μ): Larger samples give better estimates. With n=30, sample mean is typically within 1-2°C of true population mean. This is the foundation of all statistical inference!
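A quick way to see x̄ estimating μ is to simulate it. In this sketch the 365 daily temperatures are synthetic stand-ins (Gaussian around 28°C), not real Riyadh data:

```python
import random

# Hypothetical population: 365 daily temperatures (°C), synthetic values
random.seed(42)
population = [28 + random.gauss(0, 8) for _ in range(365)]

mu = sum(population) / len(population)   # population mean (μ)

# Draw a random sample of n = 30 days without replacement
sample = random.sample(population, 30)
x_bar = sum(sample) / len(sample)        # sample mean (x̄)

print(f"Population mean μ = {mu:.2f}°C")
print(f"Sample mean x̄    = {x_bar:.2f}°C")
print(f"Error |x̄ - μ|    = {abs(x_bar - mu):.2f}°C")
```

Re-running with different seeds or larger n shows the error shrinking on average, exactly as the simulator above demonstrates.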

Sample Spaces & Events

Sample Space (Ω)
🎯

Set of ALL possible outcomes

Ω = {ω₁, ω₂, ω₃, ...}

🪙 Coin flip:

H T

Ω = {H, T}

🎲 Dice roll:

1 2 3 4 5 6

Ω = {1, 2, 3, 4, 5, 6}

🌤️ Weather:

☀️ Sunny ☁️ Cloudy 🌧️ Rainy

Ω = {sunny, cloudy, rainy}

Event (E)

A subset of the sample space

E ⊆ Ω

🎲 Even numbers:

1 2 ✓ 3 4 ✓ 5 6 ✓

E = {2, 4, 6} ⊆ Ω

🌡️ High temp (>30°C):

E = (30, ∞)

🌤️ Good weather:

☀️ Sunny ✓ ☁️ Cloudy ✓ 🌧️ Rainy

E = {sunny, cloudy}

The Three Axioms of Probability (Kolmogorov, 1933)

All of probability theory is built on just three simple axioms. Everything else (Bayes' rule, distributions, inference) follows from these foundations.

Axiom 1: Non-negativity
📊
0 ≤ P(E) ≤ 1

"Probabilities are always
between 0 and 1"

Axiom 2: Certainty
💯
P(Ω) = 1

"Something must happen
(sample space = certain)"

Axiom 3: Additivity
If E₁ ∩ E₂ = ∅:
P(E₁ ∪ E₂) = P(E₁) + P(E₂)

"Disjoint events
add up"

Important Derived Rules
Impossible Event
P(∅) = 0

Empty set has probability zero

Complement Rule
P(Eᶜ) = 1 − P(E)

Probability of "not E"

General Addition Rule
P(E₁ ∪ E₂) = P(E₁) + P(E₂) − P(E₁ ∩ E₂)

For any two events (even if they overlap)
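These rules are easy to sanity-check by direct counting on a fair die, where every outcome is equally likely, so P(E) = |E| / |Ω|. A minimal sketch:

```python
from fractions import Fraction

# Sample space for one fair die; equally likely outcomes
omega = {1, 2, 3, 4, 5, 6}

def P(E):
    """P(E) = |E| / |Ω| for equally likely outcomes."""
    return Fraction(len(E & omega), len(omega))

even = {2, 4, 6}
low = {1, 2, 3}

# General addition rule: P(E1 ∪ E2) = P(E1) + P(E2) - P(E1 ∩ E2)
lhs = P(even | low)
rhs = P(even) + P(low) - P(even & low)
print(lhs, rhs)   # both 5/6

# Complement rule: P(not E) = 1 - P(E)
print(P(omega - even) == 1 - P(even))   # True
```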

Random Variables - Mapping Outcomes to Numbers

What is a Random Variable?

A random variable is a function that assigns a number to each outcome in the sample space. It transforms outcomes (which might not be numbers) into numerical values we can work with mathematically.

X: Ω → ℝ

"X maps each outcome ω ∈ Ω to a real number"

Example: Mapping Outcomes to Numbers
Outcomes (Sample Space)

Flip 3 coins

HHH HHT HTH HTT THH THT TTH TTT

Ω = {HHH, HHT, ...}

Random Variable (X = # of Heads)

Count heads in each outcome

HHH → 3 HHT → 2 HTH → 2 HTT → 1 THH → 2 THT → 1 TTH → 1 TTT → 0

X ∈ {0, 1, 2, 3}
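The coin-flip mapping above can be enumerated directly; a short sketch that also tallies the resulting distribution of X:

```python
from itertools import product
from collections import Counter

# Sample space: all 2³ = 8 outcomes of flipping 3 coins
omega = [''.join(flips) for flips in product('HT', repeat=3)]

# Random variable X = number of heads in each outcome
X = {outcome: outcome.count('H') for outcome in omega}
print(X)   # {'HHH': 3, 'HHT': 2, ..., 'TTT': 0}

# Distribution of X: P(X = k) = (# outcomes with k heads) / 8
counts = Counter(X.values())
for k in sorted(counts):
    print(f"P(X = {k}) = {counts[k]}/8")
```

This prints P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, P(X = 3) = 1/8 — the distribution the interactive demo below converges to.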

Interactive: How Sampling Builds a Distribution

Sample from a random variable repeatedly. Watch the histogram converge to the true distribution!

Total Samples Collected: 0

True Distribution (theoretical)

Empirical Distribution (from samples)

Key Insight - Distributions Emerge from Data

Start with few samples (n=10): Empirical distribution is rough, doesn't match true distribution.
Increase samples (n=100, n=500, n=1000): Empirical distribution converges to true distribution!
This is how AI learns: Collecting more data reveals the underlying probability distribution. This is also the Law of Large Numbers in action!

Sample Mean & Variance - Measuring Center and Spread

Two Key Statistics

Any dataset can be summarized by two fundamental numbers: the mean (center/average) and the variance (spread/variability).

Sample Mean (x̄)
📍

Measures the CENTER

x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ

"Average of all values"

Example:
Data: {2, 4, 6, 8, 10}
x̄ = (2+4+6+8+10)/5 = 6
Sample Variance (s²)
📊

Measures the SPREAD

s² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²

"Average squared distance from mean"

Example:
Data: {2, 4, 6, 8, 10}, x̄=6
s² = [(2-6)²+(4-6)²+...]/5 = 8

Add temperature data points and see mean (center) and variance (spread) calculated in real-time!

Dataset (0 points)
No data yet. Add some points!
Sample Mean (x̄)

Sample Variance (s²)

Std Dev (s)

√(variance)
Key Insights
  • Mean (x̄): The balance point - where data centers
  • Variance (s²): How spread out the data is - low variance = clustered, high variance = scattered
  • Standard Deviation (s): Variance in same units as data - easier to interpret
  • For AI: Mean tells us "typical value", variance tells us "uncertainty/confidence"
Interactive Demo 2: Sensor Readings - Reducing Variance Through Sampling

Distance sensor gives conflicting readings due to noise. Take multiple samples to reduce variance and increase confidence!

📡 Sensor Scenario: Robot measuring distance to obstacle

🎯 True distance: 3.5 meters (unknown to robot)
⚠️ Sensor noise: ±0.8m error
Problem: Each reading is different! How confident can we be?

Readings
0
Mean Estimate
Variance (s²)
Confidence (±)
Key Insight - Variance Decreases with More Samples

With 1 reading: High variance, low confidence. Can't trust a single noisy measurement!
With 20 readings: Variance reduced, confidence increased. Average of many readings is more reliable!
Formula: Confidence interval = x̄ ± 1.96×(s/√n) - Gets narrower as n increases.
For AI: This is why robots take multiple sensor readings before making critical decisions!
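The variance-reduction effect is easy to simulate. Here `read_sensor` is a hypothetical stand-in for the robot's sensor, with the ±0.8 m noise modelled as Gaussian for this sketch:

```python
import math
import random

random.seed(0)
TRUE_DISTANCE = 3.5   # metres (unknown to the robot)
NOISE_SD = 0.8        # sensor noise, assumed Gaussian here

def read_sensor():
    """Hypothetical noisy distance sensor: true value plus Gaussian noise."""
    return TRUE_DISTANCE + random.gauss(0, NOISE_SD)

for n in (2, 5, 20, 100):
    readings = [read_sensor() for _ in range(n)]
    x_bar = sum(readings) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in readings) / n)
    half_width = 1.96 * s / math.sqrt(n)   # x̄ ± 1.96·s/√n, as in the formula above
    print(f"n = {n:3d}   mean = {x_bar:.2f} m   ± {half_width:.2f} m")
```

The printed ± value shrinks roughly like 1/√n: averaging 100 readings gives an interval several times narrower than averaging 5.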

Computing Mean & Variance in Python
Method 1: Plain Python (from scratch)
# Sample data
data = [22, 28, 19, 25, 31, 24, 27]

# Calculate mean
n = len(data)
mean = sum(data) / n
print(f"Sample Mean (x̄): {mean:.2f}")

# Calculate variance (divides by n, matching the s² formula above;
# use n - 1 for the unbiased estimator)
variance = sum((x - mean)**2 for x in data) / n
print(f"Sample Variance (s²): {variance:.2f}")

# Calculate standard deviation
std_dev = variance ** 0.5
print(f"Sample Std Dev (s): {std_dev:.2f}")

# Output:
# Sample Mean (x̄): 25.14
# Sample Variance (s²): 13.55
# Sample Std Dev (s): 3.68
Method 2: NumPy (efficient for large datasets)
import numpy as np

# Sample data
data = np.array([22, 28, 19, 25, 31, 24, 27])

# Calculate mean
mean = np.mean(data)
print(f"Mean: {mean:.2f}")

# Calculate variance
variance = np.var(data)  # Population variance
print(f"Variance: {variance:.2f}")

# Calculate standard deviation
std_dev = np.std(data)
print(f"Std Dev: {std_dev:.2f}")

# Alternative: sample variance (divides by n-1)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)
print(f"Sample Variance (unbiased): {sample_var:.2f}")
print(f"Sample Std Dev (unbiased): {sample_std:.2f}")
Method 3: Pandas (for dataframes)
import pandas as pd

# Create DataFrame with temperature data
df = pd.DataFrame({
    'temperature': [22, 28, 19, 25, 31, 24, 27],
    'city': ['Riyadh', 'Jeddah', 'Dammam', 'Riyadh', 'Jeddah', 'Riyadh', 'Dammam']
})

# Calculate mean
mean_temp = df['temperature'].mean()
print(f"Mean Temperature: {mean_temp:.2f}°C")

# Calculate variance and std dev
var_temp = df['temperature'].var()   # Sample variance (n-1)
std_temp = df['temperature'].std()   # Sample std dev (n-1)
print(f"Variance: {var_temp:.2f}")
print(f"Std Dev: {std_temp:.2f}°C")

# Get full statistics summary
print(df['temperature'].describe())

# Group by city and calculate stats
city_stats = df.groupby('city')['temperature'].agg(['mean', 'var', 'std'])
print(city_stats)
Which Method to Use?

Plain Python: Learning, small datasets, understanding the math

NumPy: Large numerical arrays, fast computation, scientific computing

Pandas: Tabular data, data analysis, grouping operations

Probability Distributions - Modeling Uncertainty

What is a Probability Distribution?

A probability distribution describes how probabilities are distributed over possible values of a random variable. It tells us "how likely is each outcome?"

Interactive: Distribution Visualizer & Probability Calculator
Calculate P(a ≤ X ≤ b) - Probability in Range

PDF: f(x) - Probability Density/Mass

CDF: F(x) = P(X ≤ x) - Cumulative

Try This
  • Select Normal distribution, adjust μ and σ to see shape change
  • Calculate P(-1 ≤ X ≤ 1) - see the shaded region update!
  • Select Binomial, change n and p to see discrete probabilities
  • Compare PDF (density) with CDF (cumulative) - CDF always goes 0→1
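The range probability in the calculator can be reproduced for the normal distribution with only the standard library, using the identity P(a ≤ X ≤ b) = F(b) − F(a):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) = P(X ≤ x) for X ~ N(mu, sigma²), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(-1 ≤ X ≤ 1) for the standard normal = F(1) - F(-1)
p = normal_cdf(1) - normal_cdf(-1)
print(f"P(-1 ≤ X ≤ 1) ≈ {p:.4f}")   # ≈ 0.6827, the familiar "68% within 1σ"
```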

Law of Large Numbers - Averages Converge

The Law
lim(n→∞) (1/n) Σᵢ₌₁ⁿ Xᵢ = E[X]

"Sample mean converges to expected value as n → ∞"

Interactive: Dice Rolling - Law of Large Numbers

Expected value of dice: E[X] = 3.5. Watch running average converge!

Total Rolls

0

Running Average

Expected Value

3.5

Key Insight

Early rolls are wild, but average stabilizes: With few rolls, average jumps around. With many rolls, it converges to 3.5. This is why AI systems improve with more training data - the Law of Large Numbers guarantees convergence!
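The convergence in the demo can be reproduced offline; a minimal simulation of the running average:

```python
import random

random.seed(1)
EXPECTED = 3.5   # E[X] for a fair die: (1+2+3+4+5+6)/6

for n in (10, 100, 10_000, 1_000_000):
    # Average of n independent fair-die rolls
    avg = sum(random.randint(1, 6) for _ in range(n)) / n
    print(f"n = {n:>9,}   average = {avg:.4f}   |error| = {abs(avg - EXPECTED):.4f}")
```

With 10 rolls the average can easily be off by 0.5 or more; by a million rolls the error is typically under 0.005.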

Central Limit Theorem - The Magic of the Normal Distribution

The Theorem
X̄ₙ ~ N(μ, σ²/n)   for large n

"Sample means follow a normal distribution, regardless of the population's original distribution!"

Interactive: Central Limit Theorem Visualizer

Watch the bell curve emerge from ANY population distribution!

n = 1

Population Distribution (weird shape)

Sample Means Distribution (becomes normal!)

The "Magic" Moment

Try this: Select "Exponential" or "Bimodal" (very non-normal!), then increase sample size to n=30. Watch the right chart transform into a perfect bell curve! This is why the normal distribution appears everywhere in nature and AI.
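A sketch of the same experiment: draw many samples of size n = 30 from an exponential population (strongly skewed, nothing like a bell curve) and look at the distribution of their means. The CLT predicts they cluster around μ = 1 with standard deviation σ/√n = 1/√30 ≈ 0.183:

```python
import random
import statistics

random.seed(7)

n = 30           # sample size
trials = 10_000  # number of sample means to collect

# Population: exponential with mean 1 and sd 1 (very skewed)
means = [sum(random.expovariate(1.0) for _ in range(n)) / n
         for _ in range(trials)]

print(f"mean of sample means ≈ {statistics.mean(means):.3f}  (CLT predicts 1.000)")
print(f"sd of sample means   ≈ {statistics.stdev(means):.3f}  (CLT predicts {1 / n ** 0.5:.3f})")
```

Plotting `means` as a histogram would show the bell curve emerging, just like the right-hand chart in the visualizer.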

Two Interpretations of Probability

What Does Probability Mean?

There are two fundamentally different ways to interpret what "probability" means. Both are valid and useful, but they lead to different approaches in statistics and AI.

📊
Frequentist
Probability = Long-run Frequency

Repeat Experiment ∞ Times

🪙 🪙 🪙 🪙 🪙 ...
Count outcomes
🪙

Coin Flip:

P(H) = 0.5 = "50% of flips are heads"

🌧️

Weather:

P(rain) = 0.3 = "30% of similar days had rain"

💡 Objective
Probability exists in the world

🧠
Bayesian
Probability = Degree of Belief

Update Belief with Evidence

🤔 Prior → 📊 Evidence → ✅ Posterior
Bayes' Rule
⚖️

Guilt:

P(guilty) = 0.7 = "70% confident they're guilty"

🌧️

Weather:

P(rain) = 0.3 = "I believe 30% chance"

💭 Subjective
Probability is in the mind

Detailed Comparison Table
Aspect          | Frequentist                           | Bayesian
----------------|---------------------------------------|--------------------------------------------
Meaning         | Long-run frequency in repeated trials | Degree of belief given evidence
Parameters      | Fixed (but unknown)                   | Random variables with distributions
Prior Beliefs   | Not used                              | Explicitly modeled as priors
Update Rule     | More data → better estimate           | Bayes' rule: posterior ∝ likelihood × prior
One-time Events | Problematic ("can't repeat")          | Natural (degree of belief)
Example Use     | Clinical trials, hypothesis testing   | Machine learning, Bayesian networks, AI
Interactive: Coin Flip Comparison

Is the coin fair? Flip it and watch how Frequentist and Bayesian approaches differ!

Total Flips: 0 Heads: 0 Tails: 0
Frequentist Approach
Observed Frequency
P(Heads) = # Heads / Total Flips
Flip coins to see frequency
Bayesian Approach
Belief (Posterior)
50%
P(Fair Coin | Data)
Prior: 50% (neutral belief)
What You're Seeing

Blue line (Frequentist): Simple frequency - heads/total. Jumps around early, stabilizes near 0.5.
Green line (Bayesian): Updates belief using Bayes' rule. Starts at 50% (prior), converges to data.
Notice: Both converge to the same value with enough flips! But Bayesian smoothly updates belief while Frequentist just counts.
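The two estimates can be sketched with a Beta-Binomial model, a standard Bayesian choice for coin flips. Note this simplified variant tracks the posterior over P(heads) itself, rather than P(fair coin | data) as in the demo:

```python
import random

random.seed(3)
TRUE_P = 0.5   # a fair coin (unknown to both statisticians)

# Bayesian: Beta(α, β) prior on P(heads); Beta(1, 1) is the flat "neutral" prior
alpha, beta = 1, 1
heads = flips = 0

for _ in range(1000):
    flips += 1
    if random.random() < TRUE_P:
        heads += 1
        alpha += 1   # Beta-Binomial update: count successes...
    else:
        beta += 1    # ...and failures

freq_estimate = heads / flips             # Frequentist: observed frequency
bayes_estimate = alpha / (alpha + beta)   # Bayesian: posterior mean of Beta(α, β)

print(f"Frequentist P(H) = {freq_estimate:.3f}")
print(f"Bayesian    P(H) = {bayes_estimate:.3f}")
```

With 1000 flips the two numbers nearly coincide, but early on the Bayesian estimate is pulled toward the 0.5 prior while the raw frequency jumps around — the same behavior as the blue and green lines above.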

Both Are Valid!

Frequentist: Good for repeatable experiments, objective inference.
Bayesian: Good for incorporating prior knowledge, one-time decisions, AI systems.
Modern AI is primarily Bayesian because it naturally handles belief updating and prior knowledge. But both interpretations are mathematically consistent and useful!

Key Takeaways

Probability Foundations
  1. Data Types: Qualitative vs Quantitative, Discrete vs Continuous
  2. Population & Sample: Inference from part to whole (x̄ → μ)
  3. Sample Space (Ω): All possible outcomes
  4. Events (E): Subsets we assign probabilities to
  5. Three Axioms: Foundation of probability theory
  6. Random Variables: Mapping outcomes to numbers
  7. Mean & Variance: Center and spread of data
  8. Two Interpretations: Frequentist (frequency) vs Bayesian (belief)
Why These Matter for AI
  • Distributions model uncertainty in data
  • LLN guarantees learning convergence
  • CLT explains why normal distribution works
  • Samples enable training on subsets of data
  • These are foundations of all machine learning
The Power of Probability

"Probability theory transforms uncertainty from a problem into a mathematical structure we can reason with, optimize, and learn from. These foundations enable everything from Bayesian networks to deep learning."

Next: Now that you understand the foundations, let's dive deep into Bayes' Rule and conditional probability! Continue to Topic 5 →