Probability & Uncertainty Cheat Sheet

Quick Reference Guide for SE444 Lecture 10


1. Probability Axioms (Kolmogorov)

Axiom 1: Non-negativity
\[ 0 \leq P(A) \leq 1 \text{ for all events } A \]
Axiom 2: Certainty
\[ P(\Omega) = 1 \text{ (probability of sample space)} \] \[ P(\text{true}) = 1, \quad P(\text{false}) = 0 \]
Axiom 3: Additivity
\[ \text{If } A \cap B = \emptyset: \quad P(A \cup B) = P(A) + P(B) \] \[ \text{In general (inclusion-exclusion): } P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
Derived Rule: \( P(\neg A) = 1 - P(A) \)
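The axioms and the derived complement rule can be sanity-checked on a toy sample space; the sketch below uses a fair six-sided die (an assumed example, not from the lecture):

```python
from fractions import Fraction

# Uniform probability on a fair six-sided die (assumed toy example)
omega = frozenset({1, 2, 3, 4, 5, 6})
def P(event):
    return Fraction(len(event), len(omega))

A = frozenset({2, 4, 6})   # "roll is even"
B = frozenset({1, 2})      # "roll is at most 2"

assert 0 <= P(A) <= 1                              # Axiom 1: non-negativity
assert P(omega) == 1                               # Axiom 2: certainty
assert P(A | B) == P(A) + P(B) - P(A & B)          # additivity (| is set union here)
assert P(omega - A) == 1 - P(A)                    # derived: P(not A) = 1 - P(A)
```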

2. Conditional Probability

Definition
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{if } P(B) > 0 \]
Read as: "Probability of A given B"
Product Rule (Chain Rule)
\[ P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A) \]
General Chain Rule
\[ P(A_1, A_2, \ldots, A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1}) \]
Example: P(cavity ∧ toothache) = P(cavity | toothache) × P(toothache)
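As a quick numeric sketch of the product rule (the numbers below are illustrative assumptions, not from the lecture):

```python
# Product rule: P(cavity ∧ toothache) = P(cavity | toothache) * P(toothache)
# (illustrative numbers, not from the lecture)
p_toothache = 0.20
p_cavity_given_toothache = 0.60

p_cavity_and_toothache = p_cavity_given_toothache * p_toothache
print(round(p_cavity_and_toothache, 4))  # 0.12
```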

3. Bayes' Rule ⭐ (Most Important)

Bayes' Theorem (Standard Form)
\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]
Bayes' Theorem (Expanded Form)
\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid \neg A) \cdot P(\neg A)} \]
Bayesian Inference Form
\[ P(\text{hypothesis} \mid \text{evidence}) = \frac{P(\text{evidence} \mid \text{hypothesis}) \cdot P(\text{hypothesis})}{P(\text{evidence})} \] \[ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} \]
Key Insight: Bayes' rule allows us to invert conditional probabilities. If we know P(symptoms | disease), we can compute P(disease | symptoms) using prior knowledge P(disease).
Medical Diagnosis Example:
Given: P(positive test | disease) = 0.99, P(disease) = 0.001, P(positive test | no disease) = 0.05
Find: P(disease | positive test)

\[ P(D \mid +) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} = \frac{0.00099}{0.05094} \approx 0.019 \]
Interpretation: Only 1.9% chance of disease despite positive test! (due to low base rate)

4. Marginalization (Summing Out)

Marginalization Formula
\[ P(A) = \sum_{b \in B} P(A, b) = \sum_{b} P(A \mid b) \cdot P(b) \]
Sum over all possible values of B to eliminate it
Example: Given joint P(Weather, Traffic), compute P(Weather):
P(sunny) = P(sunny, light) + P(sunny, heavy)
P(rainy) = P(rainy, light) + P(rainy, heavy)

5. Independence & Conditional Independence

Independence
\[ A \perp B \iff P(A \cap B) = P(A) \cdot P(B) \] \[ \text{Equivalently: } P(A \mid B) = P(A) \]
Conditional Independence (⭐ Very Important)
\[ A \perp B \mid C \iff P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C) \] \[ \text{Equivalently: } P(A \mid B, C) = P(A \mid C) \]
"A and B are independent given C"
Why it matters: Independence reduces parameters exponentially!
• Without independence: n binary variables need \(2^n - 1\) parameters
• With independence: only n parameters
• Conditional independence: enables efficient inference in Bayesian networks
Example: Toothache and Catch are conditionally independent given Cavity:
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) × P(Catch | Cavity)
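A small numeric sketch (with assumed toy parameters) can verify both directions: build a joint distribution that factorizes through Cavity, then check that Toothache and Catch are independent given Cavity but not marginally:

```python
# Assumed toy parameters: the joint is constructed so that
# P(Toothache, Catch | Cavity) = P(Toothache | Cavity) * P(Catch | Cavity)
p_cavity = [0.2, 0.8]    # P(Cavity = yes), P(Cavity = no)
p_tooth = [0.9, 0.1]     # P(Toothache = 1 | Cavity = yes/no)
p_catch = [0.8, 0.05]    # P(Catch = 1 | Cavity = yes/no)

def joint(t, c, k):
    """P(Toothache = t, Catch = c, Cavity index k), for t, c in {0, 1}."""
    pt = p_tooth[k] if t else 1 - p_tooth[k]
    pc = p_catch[k] if c else 1 - p_catch[k]
    return p_cavity[k] * pt * pc

# Conditionally independent given Cavity = yes (index 0):
lhs = joint(1, 1, 0) / p_cavity[0]
assert abs(lhs - p_tooth[0] * p_catch[0]) < 1e-12

# ...but NOT marginally independent:
p_t1 = sum(joint(1, c, k) for c in (0, 1) for k in (0, 1))   # 0.26
p_c1 = sum(joint(t, 1, k) for t in (0, 1) for k in (0, 1))   # 0.20
p_t1c1 = sum(joint(1, 1, k) for k in (0, 1))                 # 0.148
assert abs(p_t1c1 - p_t1 * p_c1) > 0.01                      # 0.148 != 0.052
```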

6. Probability Distributions

Joint Distribution
\[ P(X_1, X_2, \ldots, X_n) \]
Complete probability model: specifies probability of every possible state
Marginal Distribution
\[ P(X) = \sum_{y} P(X, y) \]
Probability of subset of variables (others summed out)
Conditional Distribution
\[ P(X \mid Y) = \frac{P(X, Y)}{P(Y)} \]
Distribution over X given fixed value of Y
Example: Joint Probability Table
Weather | Traffic | P(W, T)
sunny | light | 0.3
sunny | heavy | 0.1
rainy | light | 0.2
rainy | heavy | 0.4
Marginal: P(sunny) = 0.3 + 0.1 = 0.4
Conditional: P(heavy | rainy) = 0.4 / (0.2 + 0.4) = 0.67

7. Law of Total Probability

Law of Total Probability
\[ P(A) = \sum_{i} P(A \mid B_i) \cdot P(B_i) \]
Where \(B_1, B_2, \ldots, B_n\) partition the sample space
Example: P(alarm) = P(alarm | burglary) × P(burglary) + P(alarm | no burglary) × P(no burglary)
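The alarm example in code, with assumed illustrative numbers:

```python
# Law of total probability with assumed illustrative numbers
p_burglary = 0.001
p_alarm_given_burglary = 0.95
p_alarm_given_no_burglary = 0.01

# P(alarm) = P(alarm | burglary) P(burglary) + P(alarm | no burglary) P(no burglary)
p_alarm = (p_alarm_given_burglary * p_burglary
           + p_alarm_given_no_burglary * (1 - p_burglary))
print(f"P(alarm) = {p_alarm:.5f}")  # P(alarm) = 0.01094
```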

8. Common Bayesian Inference Patterns

Pattern 1: Medical Diagnosis
\[ P(\text{disease} \mid \text{symptoms}) = \frac{P(\text{symptoms} \mid \text{disease}) \cdot P(\text{disease})}{P(\text{symptoms})} \]
Given: Sensitivity P(+ | disease), Specificity P(- | no disease), Base rate P(disease)
Find: P(disease | +)
Pattern 2: Naive Bayes Classifier
\[ P(\text{class} \mid x_1, \ldots, x_n) \propto P(\text{class}) \prod_{i=1}^{n} P(x_i \mid \text{class}) \]
Assumes features are conditionally independent given class
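A minimal sketch of the pattern with assumed toy parameters (two hypothetical binary features, e.g. "contains a link" and "contains a typo" for spam filtering):

```python
# Naive Bayes sketch with assumed toy parameters (binary features)
priors = {"spam": 0.3, "ham": 0.7}                     # P(class)
likelihoods = {"spam": [0.8, 0.6], "ham": [0.1, 0.2]}  # P(x_i = 1 | class)

def posterior(x):
    """Score each class with P(class) * prod_i P(x_i | class), then normalize."""
    scores = {}
    for c in priors:
        score = priors[c]
        for p_i, x_i in zip(likelihoods[c], x):
            score *= p_i if x_i else 1 - p_i
        scores[c] = score
    total = sum(scores.values())  # normalization constant (the "evidence")
    return {c: s / total for c, s in scores.items()}

print(posterior([1, 1]))  # "spam" gets the higher posterior (about 0.91)
```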
Pattern 3: Sequential Bayesian Update
\[ P(H \mid e_1, e_2) = \frac{P(e_2 \mid H, e_1) \cdot P(H \mid e_1)}{P(e_2 \mid e_1)} \]
Update posterior with new evidence (posterior becomes new prior)
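The update can be sketched numerically; the numbers below are assumed, and the evidence pieces are treated as conditionally independent given H, so P(e2 | H, e1) reduces to P(e2 | H):

```python
# Sequential Bayesian update (assumed numbers; e1, e2 conditionally independent given H)
p_h = 0.01                                     # prior P(H)
p_e1_given_h, p_e1_given_not_h = 0.9, 0.1      # likelihoods for evidence e1
p_e2_given_h, p_e2_given_not_h = 0.8, 0.2      # likelihoods for evidence e2

def update(prior, like_h, like_not_h):
    """One Bayes step: P(H | e) via the expanded-form denominator."""
    num = like_h * prior
    return num / (num + like_not_h * (1 - prior))

p_h_given_e1 = update(p_h, p_e1_given_h, p_e1_given_not_h)
p_h_given_e1_e2 = update(p_h_given_e1, p_e2_given_h, p_e2_given_not_h)  # posterior becomes prior
print(f"after e1: {p_h_given_e1:.4f}, after e1 and e2: {p_h_given_e1_e2:.4f}")
```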

9. Key Statistical Theorems

Law of Large Numbers (LLN)
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mathbb{E}[X] \]
For i.i.d. variables with finite mean, the sample average converges (almost surely) to the expected value as n increases
Central Limit Theorem (CLT)
\[ \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \]
The standardized sample mean of i.i.d. variables with finite variance converges in distribution to the standard normal
Why these matter:
• LLN: Justifies using sample statistics to estimate population parameters
• CLT: Explains why the normal distribution appears so often in practice (averages of many independent effects are approximately normal)
• Both are foundations of machine learning and statistical inference
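Both theorems are easy to see in a quick simulation; the sketch below uses a fair coin (Bernoulli with p = 0.5, so mu = 0.5 and sigma = 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

# LLN: the sample mean of many coin flips approaches E[X] = 0.5
flips = rng.integers(0, 2, size=100_000)
print(flips.mean())  # close to 0.5

# CLT: standardized sample means look standard normal
n = 50
means = rng.integers(0, 2, size=(10_000, n)).mean(axis=1)
z = (means - 0.5) / (0.5 / np.sqrt(n))  # standardize with mu = 0.5, sigma = 0.5
print(z.mean(), z.std())  # close to 0 and 1
```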

10. Python Implementation

Basic Probability Calculations
import numpy as np

# Joint probability table: rows = Weather (sunny, rainy), columns = Traffic (light, heavy)
joint = np.array([[0.3, 0.1], [0.2, 0.4]])

# Marginalization (sum out one variable along its axis)
p_weather = joint.sum(axis=1)  # [0.4, 0.6] = P(sunny), P(rainy)
p_traffic = joint.sum(axis=0)  # [0.5, 0.5] = P(light), P(heavy)

# Conditional probability: P(Traffic | Weather = rainy)
p_traffic_given_rainy = joint[1, :] / joint[1, :].sum()  # [1/3, 2/3]

# Bayes' rule: posterior = likelihood * prior / evidence
def bayes_rule(likelihood, prior, evidence):
    return (likelihood * prior) / evidence
Bayesian Inference Example
# Medical diagnosis
p_disease = 0.001  # prior
sensitivity = 0.99  # P(+ | disease)
specificity = 0.95  # P(- | no disease)

# P(+)
p_positive = sensitivity * p_disease + (1 - specificity) * (1 - p_disease)

# P(disease | +): only about 1.9% despite the highly sensitive test
posterior = (sensitivity * p_disease) / p_positive
print(f"P(disease | positive test) = {posterior:.4f}")

11. Logic vs Probability Comparison

Aspect | Formal Logic | Probability
Knowledge | Complete, certain | Incomplete, uncertain
Truth Values | True / False (binary) | 0 to 1 (degrees of belief)
Inference | Deduction (certain conclusions) | Induction (probable conclusions)
Contradictions | System fails | Handled gracefully
Real World | Struggles with noise, incompleteness | Natural fit for uncertain data
Examples | Theorem provers, expert systems | ML models, Bayesian networks
Key Philosophy: "Intelligence is not about absolute certainty, but about reasoning optimally under uncertainty." Logic and probability are complementary, not competing approaches.

12. Common Mistakes to Avoid

❌ Confusion of the Inverse:
P(A | B) ≠ P(B | A) in general
Example: P(spots | measles) ≠ P(measles | spots)
❌ Base Rate Neglect:
Ignoring P(disease) when computing P(disease | symptoms)
Result: Overestimating rare diseases
❌ Assuming Independence:
P(A, B) = P(A) × P(B) only if A and B are independent
Must verify independence, not assume it
❌ Unnormalized Probabilities:
Probabilities must sum to 1
Always normalize: P(A | B) = P(A, B) / P(B)
❌ Conditional Independence Confusion:
A ⊥ B doesn't imply A ⊥ B | C
A ⊥ B | C doesn't imply A ⊥ B
Must check each separately