You've seen why AI needs probability (Topic 02) and how uncertainty changes the paradigm (Topic 03). Now we formalize these concepts mathematically. Probability theory provides the rigorous foundation for reasoning under uncertainty.
Core concepts with interactive visualizations:
Describes characteristics or categories
🌤️ Weather Examples:
🏠 Real Estate Examples:
Numeric measurements or counts
🌡️ Weather Examples:
💰 Real Estate Examples:
Visual: Dots on a number line
Examples:
Visual: Solid line (infinite points)
Examples:
Definition: The entire set of all possible observations
Example (Weather):
Π = All daily temperatures in Riyadh for the entire year (N = 365)
Definition: A subset selected for analysis
Example (Weather):
S = Temperatures on 30 randomly selected days (n = 30)
Population of 100 Riyadh daily temperatures. Take random samples to see how sample mean estimates population mean.
Sample mean (x̄) estimates population mean (μ): Larger samples give better estimates. With n=30, sample mean is typically within 1-2°C of true population mean. This is the foundation of all statistical inference!
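The sampling idea above can be sketched in a few lines of Python. The temperature values here are synthetic (drawn uniformly from 18-42°C, not real Riyadh data), purely to illustrate how a sample mean estimates a population mean:

```python
import random

random.seed(0)

# Hypothetical population: 365 daily temperatures (°C), synthetic values
population = [18 + 24 * random.random() for _ in range(365)]
pop_mean = sum(population) / len(population)

# Draw a random sample of n = 30 days without replacement
sample = random.sample(population, 30)
sample_mean = sum(sample) / len(sample)

print(f"Population mean (μ): {pop_mean:.2f}°C")
print(f"Sample mean (x̄):    {sample_mean:.2f}°C")
print(f"Estimation error:    {abs(sample_mean - pop_mean):.2f}°C")
```

Re-run with a larger sample (e.g., `random.sample(population, 100)`) and the estimation error will typically shrink.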
Set of ALL possible outcomes
🪙 Coin flip:
Ω = {H, T}
🎲 Dice roll:
Ω = {1, 2, 3, 4, 5, 6}
🌤️ Weather:
Ω = {sunny, cloudy, rainy}
A subset of the sample space
🎲 Even numbers:
E = {2, 4, 6} ⊆ Ω
🌡️ High temp (>30°C):
E = {ω ∈ Ω : ω > 30°C}
🌤️ Good weather:
E = {sunny, cloudy}
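Sample spaces and events map directly onto Python sets. A quick sketch using the dice example, with the classical rule P(E) = |E| / |Ω| for equally likely outcomes:

```python
# Sample space and an event as Python sets (finite, equally likely outcomes)
omega = {1, 2, 3, 4, 5, 6}   # Ω for a dice roll
even = {2, 4, 6}             # event E: "roll an even number"

assert even <= omega         # E is a subset of the sample space (E ⊆ Ω)

# Classical probability: P(E) = |E| / |Ω|
p_even = len(even) / len(omega)
print(f"P(even) = {p_even}")  # 0.5
```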
All of probability theory is built on just three simple axioms. Everything else (Bayes' rule, distributions, inference) follows from these foundations.
Axiom 1 (Non-negativity and bounds): 0 ≤ P(E) ≤ 1. "Probabilities are always between 0 and 1."
Axiom 2 (Normalization): P(Ω) = 1. "Something must happen (the sample space is certain)."
Axiom 3 (Additivity): P(A ∪ B) = P(A) + P(B) when A ∩ B = ∅. "Disjoint events add up."
P(∅) = 0: the empty set has probability zero
Complement rule: P(Eᶜ) = 1 - P(E), the probability of "not E"
Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B), for any two events (even if they overlap)
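These consequences can all be checked mechanically for the dice example. A small sketch using exact fractions (equally likely outcomes assumed):

```python
from fractions import Fraction

# Equally likely dice outcomes: P(E) = |E| / |Ω|
omega = {1, 2, 3, 4, 5, 6}

def p(event):
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # even
B = {4, 5, 6}   # greater than 3

assert p(set()) == 0                        # empty set has probability zero
assert p(omega) == 1                        # something must happen
assert p(omega - A) == 1 - p(A)             # complement rule
assert p(A | B) == p(A) + p(B) - p(A & B)   # inclusion-exclusion
print("All consequences verified for the dice example")
```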
A random variable is a function that assigns a number to each outcome in the sample space. It transforms outcomes (which might not be numbers) into numerical values we can work with mathematically.
X: Ω → ℝ — "X maps each outcome ω ∈ Ω to a real number"
Flip 3 coins
Ω = {HHH, HHT, ...}
Count heads in each outcome
X ∈ {0, 1, 2, 3}
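The 3-coin example can be enumerated exhaustively. This sketch builds the full sample space, applies the random variable X = "number of heads", and reads off the distribution:

```python
from itertools import product
from collections import Counter

# Sample space: all 8 outcomes of flipping 3 coins
omega = list(product("HT", repeat=3))   # ('H','H','H'), ('H','H','T'), ...

# Random variable X: count heads in each outcome
X = [outcome.count("H") for outcome in omega]

# Equally likely outcomes → P(X = x) = count / 8
counts = Counter(X)
for x in sorted(counts):
    print(f"P(X = {x}) = {counts[x]}/8 = {counts[x] / 8:.3f}")
```

This prints the familiar 1/8, 3/8, 3/8, 1/8 pattern for X ∈ {0, 1, 2, 3}.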
Sample from a random variable repeatedly. Watch the histogram converge to the true distribution!
True Distribution (theoretical)
Empirical Distribution (from samples)
Start with few samples (n=10): Empirical distribution is rough, doesn't match true distribution.
Increase samples (n=100, n=500, n=1000): Empirical distribution converges to true distribution!
This is how AI learns: Collecting more data reveals the underlying probability distribution.
This is also the Law of Large Numbers in action!
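The convergence you see in the demo can be reproduced with a fair die. This sketch measures the largest gap between the empirical frequency of any face and the true probability 1/6 as the sample size grows:

```python
import random

random.seed(42)

def empirical_max_error(n):
    """Roll a fair die n times; return the largest gap between
    any face's empirical frequency and the true probability 1/6."""
    rolls = [random.randint(1, 6) for _ in range(n)]
    return max(abs(rolls.count(face) / n - 1/6) for face in range(1, 7))

for n in (10, 100, 1000, 100_000):
    print(f"n = {n:>6}: max |empirical - true| = {empirical_max_error(n):.4f}")
```

The gap shrinks steadily as n grows, which is exactly the convergence the interactive demo visualizes.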
Any dataset can be summarized by two fundamental numbers: the mean (center/average) and the variance (spread/variability).
Measures the CENTER
"Average of all values"
Measures the SPREAD
"Average squared distance from mean"
Add temperature data points and see mean (center) and variance (spread) calculated in real-time!
Distance sensor gives conflicting readings due to noise. Take multiple samples to reduce variance and increase confidence!
📡 Sensor Scenario: Robot measuring distance to obstacle
🎯 True distance: 3.5 meters (unknown to robot)
⚠️ Sensor noise: ±0.8m error
❓ Problem: Each reading is different! How confident can we be?
With 1 reading: High variance, low confidence. Can't trust a single noisy measurement!
With 20 readings: Variance reduced, confidence increased. Average of many readings is more reliable!
Formula: 95% confidence interval = x̄ ± 1.96×(s/√n), which gets narrower as n increases.
For AI: This is why robots take multiple sensor readings before making critical decisions!
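The sensor scenario can be simulated directly. This sketch assumes Gaussian noise (the ±0.8 m is treated as a standard deviation here, which is an assumption) and shows the 95% confidence interval narrowing as readings accumulate:

```python
import random

random.seed(1)

TRUE_DISTANCE = 3.5   # meters (unknown to the robot)
NOISE = 0.8           # sensor noise std dev (assumed Gaussian)

def confidence_interval(n):
    """Take n noisy readings; return (mean, half-width of the 95% CI)."""
    readings = [random.gauss(TRUE_DISTANCE, NOISE) for _ in range(n)]
    mean = sum(readings) / n
    var = sum((r - mean) ** 2 for r in readings) / (n - 1)  # sample variance
    half_width = 1.96 * (var ** 0.5) / n ** 0.5             # 1.96 × s/√n
    return mean, half_width

for n in (2, 20, 200):
    mean, hw = confidence_interval(n)
    print(f"n = {n:>3}: x̄ = {mean:.2f} m, 95% CI = ±{hw:.2f} m")
```

With 200 readings the interval is roughly ten times narrower than with 2, matching the 1/√n scaling in the formula.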
```python
# Sample data
data = [22, 28, 19, 25, 31, 24, 27]

# Calculate mean
n = len(data)
mean = sum(data) / n
print(f"Mean (x̄): {mean:.2f}")

# Calculate variance (dividing by n: population variance)
variance = sum((x - mean)**2 for x in data) / n
print(f"Variance (σ²): {variance:.2f}")

# Calculate standard deviation
std_dev = variance ** 0.5
print(f"Std Dev (σ): {std_dev:.2f}")

# Output:
# Mean (x̄): 25.14
# Variance (σ²): 13.55
# Std Dev (σ): 3.68
```
```python
import numpy as np

# Sample data
data = np.array([22, 28, 19, 25, 31, 24, 27])

# Calculate mean
mean = np.mean(data)
print(f"Mean: {mean:.2f}")

# Calculate variance
variance = np.var(data)  # Population variance (divides by n)
print(f"Variance: {variance:.2f}")

# Calculate standard deviation
std_dev = np.std(data)
print(f"Std Dev: {std_dev:.2f}")

# Alternative: sample variance (divides by n-1)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)
print(f"Sample Variance (unbiased): {sample_var:.2f}")
print(f"Sample Std Dev (unbiased): {sample_std:.2f}")
```
```python
import pandas as pd

# Create DataFrame with temperature data
df = pd.DataFrame({
    'temperature': [22, 28, 19, 25, 31, 24, 27],
    'city': ['Riyadh', 'Jeddah', 'Dammam', 'Riyadh', 'Jeddah', 'Riyadh', 'Dammam']
})

# Calculate mean
mean_temp = df['temperature'].mean()
print(f"Mean Temperature: {mean_temp:.2f}°C")

# Calculate variance and std dev
var_temp = df['temperature'].var()  # Sample variance (n-1)
std_temp = df['temperature'].std()  # Sample std dev (n-1)
print(f"Variance: {var_temp:.2f}")
print(f"Std Dev: {std_temp:.2f}°C")

# Get full statistics summary
print(df['temperature'].describe())

# Group by city and calculate stats
city_stats = df.groupby('city')['temperature'].agg(['mean', 'var', 'std'])
print(city_stats)
```
Plain Python: Learning, small datasets, understanding the math
NumPy: Large numerical arrays, fast computation, scientific computing
Pandas: Tabular data, data analysis, grouping operations
A probability distribution describes how probabilities are distributed over possible values of a random variable. It tells us "how likely is each outcome?"
PDF/PMF: f(x) - probability density (continuous X) or probability mass (discrete X)
CDF: F(x) = P(X ≤ x) - Cumulative
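For a discrete random variable the CDF is just a running sum of the PMF. A minimal sketch with a fair six-sided die:

```python
# PMF and CDF of a fair six-sided die
pmf = {x: 1/6 for x in range(1, 7)}   # f(x) = P(X = x)

def cdf(x):
    """F(x) = P(X <= x): sum the PMF over all values up to x."""
    return sum(p for value, p in pmf.items() if value <= x)

print(f"F(3) = P(X <= 3) = {cdf(3):.3f}")   # 0.500
print(f"F(6) = {cdf(6):.3f}")               # 1.000 (certainty)

# Interval probabilities come straight from the CDF:
print(f"P(3 < X <= 5) = F(5) - F(3) = {cdf(5) - cdf(3):.3f}")
```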
"Sample mean converges to expected value as n → ∞"
Expected value of dice: E[X] = 3.5. Watch running average converge!
Early rolls are wild, but average stabilizes: With few rolls, average jumps around. With many rolls, it converges to 3.5. This is why AI systems improve with more training data - the Law of Large Numbers guarantees convergence!
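The dice demo is easy to reproduce: simulate runs of increasing length and watch the running average close in on E[X] = 3.5:

```python
import random

random.seed(7)

expected = 3.5   # E[X] = (1 + 2 + 3 + 4 + 5 + 6) / 6

for n in (10, 100, 10_000, 1_000_000):
    avg = sum(random.randint(1, 6) for _ in range(n)) / n
    print(f"n = {n:>9}: average = {avg:.4f} (gap from 3.5: {abs(avg - expected):.4f})")
```

Short runs bounce around; by a million rolls the average is typically within a few thousandths of 3.5.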
"Sample means follow a normal distribution, regardless of the population's original distribution (as long as it has finite variance)!"
Watch the bell curve emerge from ANY population distribution!
Population Distribution (weird shape)
Sample Means Distribution (becomes normal!)
Try this: Select "Exponential" or "Bimodal" (very non-normal!), then increase sample size to n=30. Watch the right chart transform into a perfect bell curve! This is why the normal distribution appears everywhere in nature and AI.
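The Central Limit Theorem can be checked numerically. This sketch draws many sample means (n = 30) from an exponential population, which is heavily skewed, and compares their mean and spread with the CLT predictions μ and σ/√n:

```python
import random

random.seed(3)

# Population: exponential with rate 1 (very non-normal), so μ = σ = 1
def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

# Draw 2000 sample means, each from n = 30 observations
means = [sample_mean(30) for _ in range(2000)]
grand_mean = sum(means) / len(means)
spread = (sum((m - grand_mean) ** 2 for m in means) / len(means)) ** 0.5

# CLT predictions: mean ≈ μ = 1, std ≈ σ/√n = 1/√30 ≈ 0.183
print(f"Mean of sample means: {grand_mean:.3f} (CLT predicts ≈ 1.000)")
print(f"Std of sample means:  {spread:.3f} (CLT predicts ≈ {1 / 30**0.5:.3f})")
```

Plot a histogram of `means` and the bell shape appears, even though the underlying population is exponential.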
There are two fundamentally different ways to interpret what "probability" means. Both are valid and useful, but they lead to different approaches in statistics and AI.
Repeat Experiment ∞ Times
Coin Flip:
P(H) = 0.5 = "50% of flips are heads"
Weather:
P(rain) = 0.3 = "30% of similar days had rain"
💡 Objective
Probability exists in the world
Update Belief with Evidence
Guilt:
P(guilty) = 0.7 = "70% confident they're guilty"
Weather:
P(rain) = 0.3 = "I believe 30% chance"
💭 Subjective
Probability is in the mind
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Meaning | Long-run frequency in repeated trials | Degree of belief given evidence |
| Parameters | Fixed (but unknown) | Random variables with distributions |
| Prior Beliefs | Not used | Explicitly modeled as priors |
| Update Rule | More data → better estimate | Bayes' rule: posterior ∝ likelihood × prior |
| One-time Events | Problematic ("can't repeat") | Natural (degree of belief) |
| Example Use | Clinical trials, hypothesis testing | Machine learning, Bayesian networks, AI |
Is the coin fair? Flip it and watch how Frequentist and Bayesian approaches differ!
Blue line (Frequentist): Simple frequency - heads/total. Jumps around early, stabilizes near 0.5.
Green line (Bayesian): Updates belief using Bayes' rule. Starts at 50% (prior), converges to data.
Notice: Both converge to the same value with enough flips! But Bayesian smoothly updates belief while Frequentist just counts.
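The two estimates in the demo can be computed side by side. This sketch uses a Beta(1, 1) (uniform) prior for the Bayesian estimate; Beta is the conjugate prior for coin flips, so updating is just counting:

```python
import random

random.seed(5)

TRUE_P = 0.5   # the coin is actually fair (unknown to both approaches)
heads = 0
a, b = 1, 1    # Bayesian: Beta(1, 1) = uniform prior over P(heads)

for flip in range(1, 501):
    if random.random() < TRUE_P:
        heads += 1
        a += 1   # Bayes' rule with a Beta prior: just count heads...
    else:
        b += 1   # ...and tails (conjugacy makes the update this simple)

    freq_estimate = heads / flip    # Frequentist: raw frequency
    bayes_estimate = a / (a + b)    # Bayesian: posterior mean

print("After 500 flips:")
print(f"  Frequentist estimate: {freq_estimate:.3f}")
print(f"  Bayesian estimate:    {bayes_estimate:.3f}")
```

Early on the Bayesian estimate is pulled toward the 50% prior; after hundreds of flips the two estimates are nearly identical, just as the demo shows.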
Frequentist: Good for repeatable experiments, objective inference.
Bayesian: Good for incorporating prior knowledge, one-time decisions, AI systems.
Many modern AI systems adopt the Bayesian view because it naturally handles belief updating and prior knowledge.
But both interpretations are mathematically consistent and useful!
"Probability theory transforms uncertainty from a problem into a mathematical structure we can reason with, optimize, and learn from. These foundations enable everything from Bayesian networks to deep learning."