Joint Distribution Factorization

How Bayesian Networks Compactly Represent Joint Probabilities

1. The Problem: Exponential Complexity

Full Joint Distribution Table

A full joint probability distribution specifies the probability of every possible combination of variable values.

P(X1, X2, ..., Xn)
The Problem

For n binary variables:

2ⁿ entries
  • 10 variables → 1,024 entries
  • 20 variables → 1,048,576 entries
  • 30 variables → 1,073,741,824 entries
  • Impossible to store or learn!
The Solution

Factorization using BNs:

O(n · 2ᵏ)
  • k = max parents per node
  • 30 variables, k=3 → ~240 parameters
  • Reduction: 1 billion → 240!
  • Makes learning and inference tractable

Key Insight: Bayesian Networks achieve this massive reduction by exploiting conditional independence — not all variables depend on all others. The graph structure tells us which dependencies matter.
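The size comparison above can be sketched in a few lines (a minimal helper, assuming all variables are binary and each node has at most k parents):

```python
# Compare full-joint-table size with a Bayesian network's parameter bound.
# Assumes binary variables; k = maximum number of parents per node.

def full_joint_entries(n: int) -> int:
    """Number of entries in a full joint table over n binary variables."""
    return 2 ** n

def bn_parameter_bound(n: int, k: int) -> int:
    """Upper bound on CPT entries: each of n nodes stores at most 2**k rows."""
    return n * 2 ** k

print(full_joint_entries(30))     # 1073741824  (~1 billion)
print(bn_parameter_bound(30, 3))  # 240
```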

2. Chain Rule of Probability

General Chain Rule

The chain rule lets us break down any joint distribution into a product of conditional probabilities:

P(X1, X2, ..., Xn) = P(X1) × P(X2|X1) × P(X3|X1,X2) × ... × P(Xn|X1,...,Xn-1)

"Each variable is conditioned on all previous variables"

Example: Weather System (4 Variables)

Variables: Season (S), Temperature (T), Rain (R), Grass Wet (G)

Full Chain Rule Expansion
P(S, T, R, G) = P(S) × P(T|S) × P(R|S,T) × P(G|S,T,R)
Problem: P(G|S,T,R) still requires a table with 2³ = 8 entries. We need to condition on everything that came before!

The Question: Does every variable really depend on all previous variables? Usually not! For example, does "Grass Wet" really depend on "Season" if we already know "Rain"? This is where Bayesian Networks help us simplify!
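The chain rule itself is easy to check numerically. The sketch below uses a made-up joint over two binary variables and verifies P(x₁, x₂) = P(x₁) × P(x₂|x₁) for every assignment:

```python
# Sketch: verify the chain rule on a toy joint over two binary variables.
# The joint table below is invented for illustration.

joint = {  # P(X1, X2)
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

def p_x1(x1):
    """Marginal P(X1 = x1), summing X2 out of the joint."""
    return sum(p for (a, _), p in joint.items() if a == x1)

def p_x2_given_x1(x2, x1):
    """Conditional P(X2 = x2 | X1 = x1) by definition."""
    return joint[(x1, x2)] / p_x1(x1)

# Chain rule: P(x1, x2) = P(x1) * P(x2 | x1) for every assignment.
for (x1, x2), p in joint.items():
    assert abs(p - p_x1(x1) * p_x2_given_x1(x2, x1)) < 1e-12
print("chain rule verified on all 4 assignments")
```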

3. Bayesian Network Factorization

The Key Simplification

A Bayesian Network exploits conditional independence: each variable depends only on its direct parents, not on all previous variables.

P(X₁, ..., Xₙ) = ∏ᵢ₌₁ⁿ P(Xᵢ | Parents(Xᵢ))

"Product of local conditional probabilities"

Chain Rule (General)

Each variable conditioned on all previous variables:

P(X₁)P(X₂|X₁)
P(X₃|X₁,X₂)
P(X₄|X₁,X₂,X₃)
...

Many dependencies!

BN Factorization

Each variable conditioned only on its parents:

P(X₁|Parents(X₁))
P(X₂|Parents(X₂))
P(X₃|Parents(X₃))
...

Only local dependencies!

Weather Example with BN Structure
flowchart TD
    S["Season<br/>P(S)"] --> T["Temperature<br/>P(T|S)"]
    S --> R["Rain<br/>P(R|S,T)"]
    T --> R
    R --> G["Grass Wet<br/>P(G|R)"]
BN Factorization for Weather Network
P(S, T, R, G) = P(S) × P(T|S) × P(R|S,T) × P(G|R)

Key Simplification:

  • P(G|R) instead of P(G|S,T,R)
  • Grass being wet depends directly on rain, not on season or temperature
  • If we know it's raining, season/temperature don't tell us anything new about grass wetness
  • Result: Fewer parameters, easier to learn and reason with!
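The weather factorization can be sketched as four small tables. All CPT numbers below are invented for illustration; the point is that P(G|R) needs only 2 rows instead of the 8 that P(G|S,T,R) would need:

```python
# Sketch of P(S,T,R,G) = P(S) x P(T|S) x P(R|S,T) x P(G|R).
# All variables binary; every CPT value below is made up.

p_S = {True: 0.4, False: 0.6}                      # P(S)
p_T = {True: 0.7, False: 0.2}                      # P(T=True | S)
p_R = {(True, True): 0.5, (True, False): 0.8,      # P(R=True | S, T)
       (False, True): 0.1, (False, False): 0.3}
p_G = {True: 0.9, False: 0.05}                     # P(G=True | R) -- 2 rows, not 8

def joint(s, t, r, g):
    """Joint probability as a product of the four local factors."""
    pt = p_T[s] if t else 1 - p_T[s]
    pr = p_R[(s, t)] if r else 1 - p_R[(s, t)]
    pg = p_G[r] if g else 1 - p_G[r]
    return p_S[s] * pt * pr * pg

print(joint(True, False, True, True))  # ~0.0864 = 0.4 * 0.3 * 0.8 * 0.9
```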

4. Classic Example: Burglary-Earthquake-Alarm Network

Scenario

You have a home alarm that can be triggered by a burglary or an earthquake. When the alarm goes off, your neighbors John and Mary might call you. This gives 5 binary variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)

Bayesian Network Structure
flowchart TD
    B["Burglary<br/>P(B)"] --> A["Alarm<br/>P(A|B,E)"]
    E["Earthquake<br/>P(E)"] --> A
    A --> J["JohnCalls<br/>P(J|A)"]
    A --> M["MaryCalls<br/>P(M|A)"]
Parent-Child Relationships
  • Burglary (B): No parents (root)
  • Earthquake (E): No parents (root)
  • Alarm (A): Parents: B, E
  • JohnCalls (J): Parent: A
  • MaryCalls (M): Parent: A
Conditional Independence
  • B and E are independent
  • J and M are independent given A
  • J doesn't depend on B or E given A
  • M doesn't depend on B or E given A
Factorization Formula
P(B, E, A, J, M) = P(B) × P(E) × P(A|B,E) × P(J|A) × P(M|A)

Step-by-Step: Calculating P(B=T, E=F, A=T, J=T, M=T)

Let's calculate the probability of a specific scenario: Burglary happens, no earthquake, alarm sounds, both John and Mary call.

Given CPT Values (Example)
Prior Probabilities:
  • P(B=T) = 0.001
  • P(E=T) = 0.002
Alarm CPT:
  • P(A=T|B=T,E=F) = 0.94
Call CPTs:
  • P(J=T|A=T) = 0.90
  • P(M=T|A=T) = 0.70
Calculation Steps
P(B=T, E=F, A=T, J=T, M=T)
= P(B=T) × P(E=F) × P(A=T|B=T,E=F) × P(J=T|A=T) × P(M=T|A=T)
= 0.001 × (1-0.002) × 0.94 × 0.90 × 0.70
= 0.001 × 0.998 × 0.94 × 0.90 × 0.70
≈ 0.000591

Interpretation: The probability of this exact scenario is about 0.059%, or roughly 1 in 1,700. This is a rare event because burglaries are rare to begin with (0.1%).
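The same calculation, using exactly the CPT values given above:

```python
# Calculate P(B=T, E=F, A=T, J=T, M=T) from the alarm network's CPTs.
# All numbers are the example CPT values stated in the text.

p_b = 0.001                 # P(B=T)
p_e = 0.002                 # P(E=T)
p_a_bT_eF = 0.94            # P(A=T | B=T, E=F)
p_j_aT = 0.90               # P(J=T | A=T)
p_m_aT = 0.70               # P(M=T | A=T)

# Factorization: P(B) * P(E=F) * P(A|B,E) * P(J|A) * P(M|A)
prob = p_b * (1 - p_e) * p_a_bT_eF * p_j_aT * p_m_aT
print(round(prob, 6))  # 0.000591
```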

Parameter Comparison
Full Joint Table
2⁵ − 1 = 31

parameters needed

Bayesian Network
10

parameters needed
(1 + 1 + 4 + 2 + 2)

~68% reduction!
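Both counts can be reproduced in one line each (binary variables, one free parameter per CPT row):

```python
# Parameter counts for the alarm network (all 5 variables binary).
full_joint = 2 ** 5 - 1          # 31: full table, minus 1 (probabilities sum to 1)
bn = 1 + 1 + 2 ** 2 + 2 + 2      # 10: P(B), P(E), P(A|B,E), P(J|A), P(M|A)
print(full_joint, bn)            # 31 10
```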

5. Why Does Factorization Work?

The Foundation: Conditional Independence

Bayesian Network factorization works because of the Markov assumption: Given its parents, a node is conditionally independent of all its non-descendants.

1. Local Structure

Each variable depends only on its immediate parents, not on the entire history of variables.

2. Independence

The graph structure encodes which variables are conditionally independent, eliminating unnecessary dependencies.

3. Compact Storage

Instead of exponential entries, we store linear number of small CPT tables.

Mathematical Guarantee

The BN factorization is mathematically equivalent to the full joint distribution:

∏ᵢ P(Xᵢ | Parents(Xᵢ)) = P(X₁, ..., Xₙ)

It's not an approximation — it's exact!
We've just reorganized the representation to exploit structure.
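This exactness is easy to check by brute force: if every CPT is normalized, the product of local factors sums to 1 over all assignments, so the factorization really is a probability distribution. The 3-node chain below (X₁ → X₂ → X₃) uses invented CPT numbers:

```python
# Sketch: the BN factorization defines an exact distribution.
# Chain X1 -> X2 -> X3 with made-up (but normalized) CPTs.
from itertools import product

p_x1 = {True: 0.2, False: 0.8}   # P(X1)
p_x2 = {True: 0.6, False: 0.3}   # P(X2=True | X1)
p_x3 = {True: 0.9, False: 0.4}   # P(X3=True | X2)

def joint(x1, x2, x3):
    """Product of the three local factors."""
    f2 = p_x2[x1] if x2 else 1 - p_x2[x1]
    f3 = p_x3[x2] if x3 else 1 - p_x3[x2]
    return p_x1[x1] * f2 * f3

# Summing over all 2**3 assignments gives 1 (up to floating point).
total = sum(joint(*xs) for xs in product([True, False], repeat=3))
print(round(total, 12))
```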

Key Takeaways
  • Graph structure determines factorization: Arrows show direct dependencies
  • Conditional independence is key: Not everything depends on everything
  • Massive parameter reduction: From exponential to linear (in many cases)
  • Enables efficient inference: We can exploit structure for faster computation
  • Mirrors causal relationships: Often reflects how the world actually works

Summary & Next Steps

What We Learned
  1. Full joint distributions are exponentially large (2ⁿ)
  2. Chain rule breaks joints into conditionals
  3. BNs factorize using only parent dependencies
  4. Graph structure = conditional independence
  5. Massive reduction in parameters
Coming Next
  • Topic 3: Conditional Independence in depth
  • Topic 4: d-Separation algorithm
  • Topic 5: How to construct BNs
  • Topics 6-8: Inference algorithms
The Big Picture

Bayesian Networks give us the best of both worlds:
The expressiveness of full joint distributions with the efficiency of compact, structured representation. This makes probabilistic AI practical for real-world applications!