
Foundations of Probability Theory

The Mathematics of Uncertainty

Introduction: Formalizing Uncertainty

From Intuition to Mathematics

You've seen why AI needs probability (Topic 02) and how uncertainty changes the paradigm (Topic 03). Now we formalize these concepts mathematically. Probability theory provides the rigorous foundation for reasoning under uncertainty.

What You'll Learn

Core concepts with interactive visualizations:

  1. Types of data and random variables
  2. Sample spaces, events, and the three axioms of probability
  3. Probability distributions (PMF, PDF, CDF)
  4. Law of Large Numbers - why averages converge
  5. Central Limit Theorem - why the normal distribution is everywhere

Types of Data

Qualitative vs Quantitative Data
Qualitative (Categorical)
📝

Describes characteristics or categories

Non-numerical - Labels, Names, Categories

🌤️ Weather Examples:

☀️ Sunny ☁️ Cloudy 🌧️ Rainy 🌪️ Sandstorm

🏠 Real Estate Examples:

  • City: Riyadh, Jeddah, Dammam
  • Type: Villa, Apartment, Townhouse
  • Condition: New, Renovated, Old
Quantitative (Numerical)
🔢

Numeric measurements or counts

Numerical - Numbers, Quantities, Measurements

🌡️ Weather Examples:

  • Temperature: 28°C, 35°C, 18°C
  • Rainfall: 0mm, 5mm, 25mm
  • Wind speed: 15 km/h, 30 km/h

💰 Real Estate Examples:

  • Price: 1,500,000 SAR
  • Area: 250 m², 180 m²
  • Bedrooms: 3, 4, 5
Discrete vs Continuous Data
Discrete Data
⚫ ⚫ ⚫
D = {d₁, d₂, d₃, ...}
Countable - Separate, distinct values
X ∈ {0, 1, 2, 3, ...}

Visual: Dots on a number line

{1, 2, 3, 4, 5, 6}

Examples:

  • 🎲 Dice roll: {1, 2, 3, 4, 5, 6}
  • 🌧️ Rainy days: {0, 1, 2, ..., 31}
  • 🛏️ Bedrooms: {1, 2, 3, 4, 5, ...}
Continuous Data
📈
C ⊂ ℝ
Uncountable - Any value in a range
X ∈ [a, b] ⊂ ℝ

Visual: Solid line (infinite points)

[0, ∞) - continuous range

Examples:

  • 🌡️ Temperature: [−50°C, 50°C]
  • 💧 Rainfall: [0, ∞) mm
  • 💰 Price: [0, ∞) SAR

Population vs Sample - Foundation of Statistical Inference

Population (Π)

Definition: The entire set of all possible observations

Π = {x₁, x₂, ..., x_N}

Example (Weather):

Π = All daily temperatures in Riyadh for entire year (N = 365)

μ = (1/365) Σᵢ₌₁³⁶⁵ Tᵢ
Sample (S)

Definition: A subset selected for analysis

S = {s₁, s₂, ..., sₙ} ⊂ Π where n ≪ N

Example (Weather):

S = Temperatures on 30 randomly selected days (n = 30)

x̄ = (1/30) Σᵢ₌₁³⁰ sᵢ ≈ μ
Interactive: Population vs Sample Simulator

Population of 100 Riyadh daily temperatures. Take random samples to see how sample mean estimates population mean.

Population: All 100 Days (hover to see temperature)
Population Mean (μ)

Sample Mean (x̄)

Error |x̄ - μ|

Key Insight

Sample mean (x̄) estimates population mean (μ): Larger samples give better estimates. With n=30, sample mean is typically within 1-2°C of true population mean. This is the foundation of all statistical inference!
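A quick way to see x̄ estimating μ is to simulate it. In this sketch the 365 daily temperatures are synthetic stand-ins (Gaussian around 28°C), not real Riyadh data:

```python
import random

# Hypothetical population: 365 daily temperatures (°C), synthetic values
random.seed(42)
population = [28 + random.gauss(0, 8) for _ in range(365)]

mu = sum(population) / len(population)   # population mean (μ)

# Draw a random sample of n = 30 days without replacement
sample = random.sample(population, 30)
x_bar = sum(sample) / len(sample)        # sample mean (x̄)

print(f"Population mean μ = {mu:.2f}°C")
print(f"Sample mean x̄    = {x_bar:.2f}°C")
print(f"Error |x̄ - μ|    = {abs(x_bar - mu):.2f}°C")
```

Re-running with different seeds or larger n shows the error shrinking on average, exactly as the simulator above demonstrates.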

Sample Spaces & Events

Sample Space (Ω)
🎯

Set of ALL possible outcomes

Ω = {ω₁, ω₂, ω₃, ...}

🪙 Coin flip:

H T

Ω = {H, T}

🎲 Dice roll:

1 2 3 4 5 6

Ω = {1, 2, 3, 4, 5, 6}

🌤️ Weather:

☀️ Sunny ☁️ Cloudy 🌧️ Rainy

Ω = {sunny, cloudy, rainy}

Event (E)

A subset of the sample space

E ⊆ Ω

🎲 Even numbers:

1 2 ✓ 3 4 ✓ 5 6 ✓

E = {2, 4, 6} ⊆ Ω

🌡️ High temp (>30°C):

E = (30, ∞)

🌤️ Good weather:

☀️ Sunny ✓ ☁️ Cloudy ✓ 🌧️ Rainy

E = {sunny, cloudy}

The Three Axioms of Probability (Kolmogorov, 1933)

All of probability theory is built on just three simple axioms. Everything else (Bayes' rule, distributions, inference) follows from these foundations.

Axiom 1: Non-negativity
📊
0 ≤ P(E) ≤ 1

"Probabilities are always
between 0 and 1"

Axiom 2: Certainty
💯
P(Ω) = 1

"Something must happen
(sample space = certain)"

Axiom 3: Additivity
If E₁ ∩ E₂ = ∅:
P(E₁ ∪ E₂) = P(E₁) + P(E₂)

"Disjoint events
add up"

Important Derived Rules
Impossible Event
P(∅) = 0

Empty set has probability zero

Complement Rule
P(Eᶜ) = 1 − P(E)

Probability of "not E"

General Addition Rule
P(E₁ ∪ E₂) = P(E₁) + P(E₂) − P(E₁ ∩ E₂)

For any two events (even if they overlap)
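These rules are easy to sanity-check by direct counting on a fair die, where every outcome is equally likely, so P(E) = |E| / |Ω|. A minimal sketch:

```python
from fractions import Fraction

# Sample space for one fair die; equally likely outcomes
omega = {1, 2, 3, 4, 5, 6}

def P(E):
    """P(E) = |E| / |Ω| for equally likely outcomes."""
    return Fraction(len(E & omega), len(omega))

even = {2, 4, 6}
low = {1, 2, 3}

# General addition rule: P(E1 ∪ E2) = P(E1) + P(E2) - P(E1 ∩ E2)
lhs = P(even | low)
rhs = P(even) + P(low) - P(even & low)
print(lhs, rhs)   # both 5/6

# Complement rule: P(not E) = 1 - P(E)
print(P(omega - even) == 1 - P(even))   # True
```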

Random Variables - Mapping Outcomes to Numbers

What is a Random Variable?

A random variable is a function that assigns a number to each outcome in the sample space. It transforms outcomes (which might not be numbers) into numerical values we can work with mathematically.

X: Ω → ℝ

"X maps each outcome ω ∈ Ω to a real number"

Example: Mapping Outcomes to Numbers
Outcomes (Sample Space)

Flip 3 coins

HHH HHT HTH HTT THH THT TTH TTT

Ω = {HHH, HHT, ...}

Random Variable (X = # of Heads)

Count heads in each outcome

HHH → 3 HHT → 2 HTH → 2 HTT → 1 THH → 2 THT → 1 TTH → 1 TTT → 0

X ∈ {0, 1, 2, 3}
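The coin-flip mapping above can be enumerated directly; a short sketch that also tallies the resulting distribution of X:

```python
from itertools import product
from collections import Counter

# Sample space: all 2³ = 8 outcomes of flipping 3 coins
omega = [''.join(flips) for flips in product('HT', repeat=3)]

# Random variable X = number of heads in each outcome
X = {outcome: outcome.count('H') for outcome in omega}
print(X)   # {'HHH': 3, 'HHT': 2, ..., 'TTT': 0}

# Distribution of X: P(X = k) = (# outcomes with k heads) / 8
counts = Counter(X.values())
for k in sorted(counts):
    print(f"P(X = {k}) = {counts[k]}/8")
```

This prints P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, P(X = 3) = 1/8 — the distribution the interactive demo below converges to.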

Interactive: How Sampling Builds a Distribution

Sample from a random variable repeatedly. Watch the histogram converge to the true distribution!

Total Samples Collected: 0

True Distribution (theoretical)

Empirical Distribution (from samples)

Key Insight - Distributions Emerge from Data

Start with few samples (n=10): Empirical distribution is rough, doesn't match true distribution.
Increase samples (n=100, n=500, n=1000): Empirical distribution converges to true distribution!
This is how AI learns: Collecting more data reveals the underlying probability distribution. This is also the Law of Large Numbers in action!

Sample Mean & Variance - Measuring Center and Spread

Two Key Statistics

Any dataset can be summarized by two fundamental numbers: the mean (center/average) and the variance (spread/variability).

Sample Mean (x̄)
📍

Measures the CENTER

x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ

"Average of all values"

Example:
Data: {2, 4, 6, 8, 10}
x̄ = (2+4+6+8+10)/5 = 6
Sample Variance (s²)
📊

Measures the SPREAD

s² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²

"Average squared distance from mean"

Example:
Data: {2, 4, 6, 8, 10}, x̄=6
s² = [(2-6)²+(4-6)²+...]/5 = 8

Add temperature data points and see mean (center) and variance (spread) calculated in real-time!

Dataset (0 points)
No data yet. Add some points!
Sample Mean (x̄)

Sample Variance (s²)

Std Dev (s)

√(variance)
Key Insights
  • Mean (x̄): The balance point - where data centers
  • Variance (s²): How spread out the data is - low variance = clustered, high variance = scattered
  • Standard Deviation (s): Variance in same units as data - easier to interpret
  • For AI: Mean tells us "typical value", variance tells us "uncertainty/confidence"
Interactive Demo 2: Sensor Readings - Reducing Variance Through Sampling

Distance sensor gives conflicting readings due to noise. Take multiple samples to reduce variance and increase confidence!

📡 Sensor Scenario: Robot measuring distance to obstacle

🎯 True distance: 3.5 meters (unknown to robot)
⚠️ Sensor noise: ±0.8m error
Problem: Each reading is different! How confident can we be?

Readings
0
Mean Estimate
Variance (s²)
Confidence (±)
Key Insight - Variance Decreases with More Samples

With 1 reading: High variance, low confidence. Can't trust a single noisy measurement!
With 20 readings: Variance reduced, confidence increased. Average of many readings is more reliable!
Formula: Confidence interval = x̄ ± 1.96×(s/√n) - Gets narrower as n increases.
For AI: This is why robots take multiple sensor readings before making critical decisions!
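The variance-reduction effect is easy to simulate. Here `read_sensor` is a hypothetical stand-in for the robot's sensor, with the ±0.8 m noise modelled as Gaussian for this sketch:

```python
import math
import random

random.seed(0)
TRUE_DISTANCE = 3.5   # metres (unknown to the robot)
NOISE_SD = 0.8        # sensor noise, assumed Gaussian here

def read_sensor():
    """Hypothetical noisy distance sensor: true value plus Gaussian noise."""
    return TRUE_DISTANCE + random.gauss(0, NOISE_SD)

for n in (2, 5, 20, 100):
    readings = [read_sensor() for _ in range(n)]
    x_bar = sum(readings) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in readings) / n)
    half_width = 1.96 * s / math.sqrt(n)   # x̄ ± 1.96·s/√n, as in the formula above
    print(f"n = {n:3d}   mean = {x_bar:.2f} m   ± {half_width:.2f} m")
```

The printed ± value shrinks roughly like 1/√n: averaging 100 readings gives an interval several times narrower than averaging 5.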

Computing Mean & Variance in Python
Method 1: Plain Python (from scratch)
# Sample data
data = [22, 28, 19, 25, 31, 24, 27]

# Calculate mean
n = len(data)
mean = sum(data) / n
print(f"Sample Mean (x̄): {mean:.2f}")

# Calculate variance (divides by n, matching the s² formula above;
# use n - 1 for the unbiased estimator)
variance = sum((x - mean)**2 for x in data) / n
print(f"Sample Variance (s²): {variance:.2f}")

# Calculate standard deviation
std_dev = variance ** 0.5
print(f"Sample Std Dev (s): {std_dev:.2f}")

# Output:
# Sample Mean (x̄): 25.14
# Sample Variance (s²): 13.55
# Sample Std Dev (s): 3.68
Method 2: NumPy (efficient for large datasets)
import numpy as np

# Sample data
data = np.array([22, 28, 19, 25, 31, 24, 27])

# Calculate mean
mean = np.mean(data)
print(f"Mean: {mean:.2f}")

# Calculate variance
variance = np.var(data)  # Population variance
print(f"Variance: {variance:.2f}")

# Calculate standard deviation
std_dev = np.std(data)
print(f"Std Dev: {std_dev:.2f}")

# Alternative: sample variance (divides by n-1)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)
print(f"Sample Variance (unbiased): {sample_var:.2f}")
print(f"Sample Std Dev (unbiased): {sample_std:.2f}")
Method 3: Pandas (for dataframes)
import pandas as pd

# Create DataFrame with temperature data
df = pd.DataFrame({
    'temperature': [22, 28, 19, 25, 31, 24, 27],
    'city': ['Riyadh', 'Jeddah', 'Dammam', 'Riyadh', 'Jeddah', 'Riyadh', 'Dammam']
})

# Calculate mean
mean_temp = df['temperature'].mean()
print(f"Mean Temperature: {mean_temp:.2f}°C")

# Calculate variance and std dev
var_temp = df['temperature'].var()   # Sample variance (n-1)
std_temp = df['temperature'].std()   # Sample std dev (n-1)
print(f"Variance: {var_temp:.2f}")
print(f"Std Dev: {std_temp:.2f}°C")

# Get full statistics summary
print(df['temperature'].describe())

# Group by city and calculate stats
city_stats = df.groupby('city')['temperature'].agg(['mean', 'var', 'std'])
print(city_stats)
Which Method to Use?

Plain Python: Learning, small datasets, understanding the math

NumPy: Large numerical arrays, fast computation, scientific computing

Pandas: Tabular data, data analysis, grouping operations

Probability Distributions - Modeling Uncertainty

What is a Probability Distribution?

A probability distribution describes how probabilities are distributed over possible values of a random variable. It tells us "how likely is each outcome?"

Interactive: Distribution Visualizer & Probability Calculator
Calculate P(a ≤ X ≤ b) - Probability in Range

PDF: f(x) - Probability Density/Mass

CDF: F(x) = P(X ≤ x) - Cumulative

Try This
  • Select Normal distribution, adjust μ and σ to see shape change
  • Calculate P(-1 ≤ X ≤ 1) - see the shaded region update!
  • Select Binomial, change n and p to see discrete probabilities
  • Compare PDF (density) with CDF (cumulative) - CDF always goes 0→1
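The range probability in the calculator can be reproduced for the normal distribution with only the standard library, using the identity P(a ≤ X ≤ b) = F(b) − F(a):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) = P(X ≤ x) for X ~ N(mu, sigma²), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(-1 ≤ X ≤ 1) for the standard normal = F(1) - F(-1)
p = normal_cdf(1) - normal_cdf(-1)
print(f"P(-1 ≤ X ≤ 1) ≈ {p:.4f}")   # ≈ 0.6827, the familiar "68% within 1σ"
```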

Law of Large Numbers - Averages Converge

The Law
lim(n→∞) (1/n) Σᵢ₌₁ⁿ Xᵢ = E[X]

"Sample mean converges to expected value as n → ∞"

Interactive: Dice Rolling - Law of Large Numbers

Expected value of dice: E[X] = 3.5. Watch running average converge!

Total Rolls

0

Running Average

Expected Value

3.5

Key Insight

Early rolls are wild, but average stabilizes: With few rolls, average jumps around. With many rolls, it converges to 3.5. This is why AI systems improve with more training data - the Law of Large Numbers guarantees convergence!
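The convergence in the demo can be reproduced offline; a minimal simulation of the running average:

```python
import random

random.seed(1)
EXPECTED = 3.5   # E[X] for a fair die: (1+2+3+4+5+6)/6

for n in (10, 100, 10_000, 1_000_000):
    # Average of n independent fair-die rolls
    avg = sum(random.randint(1, 6) for _ in range(n)) / n
    print(f"n = {n:>9,}   average = {avg:.4f}   |error| = {abs(avg - EXPECTED):.4f}")
```

With 10 rolls the average can easily be off by 0.5 or more; by a million rolls the error is typically under 0.005.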

Central Limit Theorem - The Magic of the Normal Distribution

The Theorem
X̄ₙ ~ N(μ, σ²/n)   for large n

"Sample means follow a normal distribution, regardless of the population's original distribution!"

Interactive: Central Limit Theorem Visualizer

Watch the bell curve emerge from ANY population distribution!

n = 1

Population Distribution (weird shape)

Sample Means Distribution (becomes normal!)

The "Magic" Moment

Try this: Select "Exponential" or "Bimodal" (very non-normal!), then increase sample size to n=30. Watch the right chart transform into a perfect bell curve! This is why the normal distribution appears everywhere in nature and AI.
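A sketch of the same experiment: draw many samples of size n = 30 from an exponential population (strongly skewed, nothing like a bell curve) and look at the distribution of their means. The CLT predicts they cluster around μ = 1 with standard deviation σ/√n = 1/√30 ≈ 0.183:

```python
import random
import statistics

random.seed(7)

n = 30           # sample size
trials = 10_000  # number of sample means to collect

# Population: exponential with mean 1 and sd 1 (very skewed)
means = [sum(random.expovariate(1.0) for _ in range(n)) / n
         for _ in range(trials)]

print(f"mean of sample means ≈ {statistics.mean(means):.3f}  (CLT predicts 1.000)")
print(f"sd of sample means   ≈ {statistics.stdev(means):.3f}  (CLT predicts {1 / n ** 0.5:.3f})")
```

Plotting `means` as a histogram would show the bell curve emerging, just like the right-hand chart in the visualizer.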

Two Interpretations of Probability

What Does Probability Mean?

There are two fundamentally different ways to interpret what "probability" means. Both are valid and useful, but they lead to different approaches in statistics and AI.

📊
Frequentist
Probability = Long-run Frequency

Repeat Experiment ∞ Times

🪙 🪙 🪙 🪙 🪙 ...
Count outcomes
🪙

Coin Flip:

P(H) = 0.5 = "50% of flips are heads"

🌧️

Weather:

P(rain) = 0.3 = "30% of similar days had rain"

💡 Objective
Probability exists in the world

🧠
Bayesian
Probability = Degree of Belief

Update Belief with Evidence

🤔 Prior → 📊 Evidence → ✅ Posterior
Bayes' Rule
⚖️

Guilt:

P(guilty) = 0.7 = "70% confident they're guilty"

🌧️

Weather:

P(rain) = 0.3 = "I believe 30% chance"

💭 Subjective
Probability is in the mind

Detailed Comparison Table
Aspect          | Frequentist                           | Bayesian
----------------|---------------------------------------|--------------------------------------------
Meaning         | Long-run frequency in repeated trials | Degree of belief given evidence
Parameters      | Fixed (but unknown)                   | Random variables with distributions
Prior Beliefs   | Not used                              | Explicitly modeled as priors
Update Rule     | More data → better estimate           | Bayes' rule: posterior ∝ likelihood × prior
One-time Events | Problematic ("can't repeat")          | Natural (degree of belief)
Example Use     | Clinical trials, hypothesis testing   | Machine learning, Bayesian networks, AI
Interactive: Coin Flip Comparison

Is the coin fair? Flip it and watch how Frequentist and Bayesian approaches differ!

Total Flips: 0 Heads: 0 Tails: 0
Frequentist Approach
Observed Frequency
P(Heads) = # Heads / Total Flips
Flip coins to see frequency
Bayesian Approach
Belief (Posterior)
50%
P(Fair Coin | Data)
Prior: 50% (neutral belief)
What You're Seeing

Blue line (Frequentist): Simple frequency - heads/total. Jumps around early, stabilizes near 0.5.
Green line (Bayesian): Updates belief using Bayes' rule. Starts at 50% (prior), converges to data.
Notice: Both converge to the same value with enough flips! But Bayesian smoothly updates belief while Frequentist just counts.
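The two estimates can be sketched with a Beta-Binomial model, a standard Bayesian choice for coin flips. Note this simplified variant tracks the posterior over P(heads) itself, rather than P(fair coin | data) as in the demo:

```python
import random

random.seed(3)
TRUE_P = 0.5   # a fair coin (unknown to both statisticians)

# Bayesian: Beta(α, β) prior on P(heads); Beta(1, 1) is the flat "neutral" prior
alpha, beta = 1, 1
heads = flips = 0

for _ in range(1000):
    flips += 1
    if random.random() < TRUE_P:
        heads += 1
        alpha += 1   # Beta-Binomial update: count successes...
    else:
        beta += 1    # ...and failures

freq_estimate = heads / flips             # Frequentist: observed frequency
bayes_estimate = alpha / (alpha + beta)   # Bayesian: posterior mean of Beta(α, β)

print(f"Frequentist P(H) = {freq_estimate:.3f}")
print(f"Bayesian    P(H) = {bayes_estimate:.3f}")
```

With 1000 flips the two numbers nearly coincide, but early on the Bayesian estimate is pulled toward the 0.5 prior while the raw frequency jumps around — the same behavior as the blue and green lines above.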

Both Are Valid!

Frequentist: Good for repeatable experiments, objective inference.
Bayesian: Good for incorporating prior knowledge, one-time decisions, AI systems.
Modern AI is primarily Bayesian because it naturally handles belief updating and prior knowledge. But both interpretations are mathematically consistent and useful!

Key Takeaways

Probability Foundations
  1. Data Types: Qualitative vs Quantitative, Discrete vs Continuous
  2. Population & Sample: Inference from part to whole (x̄ → μ)
  3. Sample Space (Ω): All possible outcomes
  4. Events (E): Subsets we assign probabilities to
  5. Three Axioms: Foundation of probability theory
  6. Random Variables: Mapping outcomes to numbers
  7. Mean & Variance: Center and spread of data
  8. Two Interpretations: Frequentist (frequency) vs Bayesian (belief)
Why These Matter for AI
  • Distributions model uncertainty in data
  • LLN guarantees learning convergence
  • CLT explains why normal distribution works
  • Samples enable training on subsets of data
  • These are foundations of all machine learning
The Power of Probability

"Probability theory transforms uncertainty from a problem into a mathematical structure we can reason with, optimize, and learn from. These foundations enable everything from Bayesian networks to deep learning."

Next: Now that you understand the foundations, let's dive deep into Bayes' Rule and conditional probability! Continue to Topic 5 →