Joint Distribution Factorization

How Bayesian Networks Compactly Represent Joint Probabilities

1. The Problem: Exponential Complexity

Full Joint Distribution Table

A full joint probability distribution specifies the probability of every possible combination of variable values.

P(X1, X2, ..., Xn)
The Problem

For n binary variables:

2ⁿ entries
  • 10 variables → 1,024 entries
  • 20 variables → 1,048,576 entries
  • 30 variables → 1,073,741,824 entries
  • Impossible to store or learn!
The Solution

Factorization using BNs:

O(n · 2ᵏ)
  • k = max parents per node
  • 30 variables, k=3 → ~240 parameters
  • Reduction: 1 billion → 240!
  • Makes learning and inference tractable

Key Insight: Bayesian Networks achieve this massive reduction by exploiting conditional independence — not all variables depend on all others. The graph structure tells us which dependencies matter.
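The size comparison above can be sketched in a few lines (a minimal helper, assuming all variables are binary and each node has at most k parents):

```python
# Compare full-joint-table size with a Bayesian network's parameter bound.
# Assumes binary variables; k = maximum number of parents per node.

def full_joint_entries(n: int) -> int:
    """Number of entries in a full joint table over n binary variables."""
    return 2 ** n

def bn_parameter_bound(n: int, k: int) -> int:
    """Upper bound on CPT entries: each of n nodes stores at most 2**k rows."""
    return n * 2 ** k

print(full_joint_entries(30))     # 1073741824  (~1 billion)
print(bn_parameter_bound(30, 3))  # 240
```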

2. Chain Rule of Probability

General Chain Rule

The chain rule lets us break down any joint distribution into a product of conditional probabilities:

P(X1, X2, ..., Xn) = P(X1) × P(X2|X1) × P(X3|X1,X2) × ... × P(Xn|X1,...,Xn-1)

"Each variable is conditioned on all previous variables"

Example: Weather System (4 Variables)

Variables: Season (S), Temperature (T), Rain (R), Grass Wet (G)

Full Chain Rule Expansion
P(S, T, R, G) = P(S) × P(T|S) × P(R|S,T) × P(G|S,T,R)
Problem: P(G|S,T,R) still requires a table with 2³ = 8 entries. We need to condition on everything that came before!

The Question: Does every variable really depend on all previous variables? Usually not! For example, does "Grass Wet" really depend on "Season" if we already know "Rain"? This is where Bayesian Networks help us simplify!
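The chain rule itself is easy to check numerically. The sketch below uses a made-up joint over two binary variables and verifies P(x₁, x₂) = P(x₁) × P(x₂|x₁) for every assignment:

```python
# Sketch: verify the chain rule on a toy joint over two binary variables.
# The joint table below is invented for illustration.

joint = {  # P(X1, X2)
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

def p_x1(x1):
    """Marginal P(X1 = x1), summing X2 out of the joint."""
    return sum(p for (a, _), p in joint.items() if a == x1)

def p_x2_given_x1(x2, x1):
    """Conditional P(X2 = x2 | X1 = x1) by definition."""
    return joint[(x1, x2)] / p_x1(x1)

# Chain rule: P(x1, x2) = P(x1) * P(x2 | x1) for every assignment.
for (x1, x2), p in joint.items():
    assert abs(p - p_x1(x1) * p_x2_given_x1(x2, x1)) < 1e-12
print("chain rule verified on all 4 assignments")
```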

3. Bayesian Network Factorization

The Key Simplification

A Bayesian Network exploits conditional independence: each variable depends only on its direct parents, not on all previous variables.

P(X₁, ..., Xₙ) = ∏ᵢ₌₁ⁿ P(Xᵢ | Parents(Xᵢ))

"Product of local conditional probabilities"

Chain Rule (General)

Each variable conditioned on all previous variables:

P(X₁)P(X₂|X₁)
P(X₃|X₁,X₂)
P(X₄|X₁,X₂,X₃)
...

Many dependencies!

BN Factorization

Each variable conditioned only on its parents:

P(X₁|Parents(X₁))
P(X₂|Parents(X₂))
P(X₃|Parents(X₃))
...

Only local dependencies!

Weather Example with BN Structure
flowchart TD
    S["Season<br/>P(S)"] --> T["Temperature<br/>P(T|S)"]
    S --> R["Rain<br/>P(R|S,T)"]
    T --> R
    R --> G["Grass Wet<br/>P(G|R)"]
BN Factorization for Weather Network
P(S, T, R, G) = P(S) × P(T|S) × P(R|S,T) × P(G|R)

Key Simplification:

  • P(G|R) instead of P(G|S,T,R)
  • Grass being wet depends directly on rain, not on season or temperature
  • If we know it's raining, season/temperature don't tell us anything new about grass wetness
  • Result: Fewer parameters, easier to learn and reason with!
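The weather factorization can be sketched as four small tables. All CPT numbers below are invented for illustration; the point is that P(G|R) needs only 2 rows instead of the 8 that P(G|S,T,R) would need:

```python
# Sketch of P(S,T,R,G) = P(S) x P(T|S) x P(R|S,T) x P(G|R).
# All variables binary; every CPT value below is made up.

p_S = {True: 0.4, False: 0.6}                      # P(S)
p_T = {True: 0.7, False: 0.2}                      # P(T=True | S)
p_R = {(True, True): 0.5, (True, False): 0.8,      # P(R=True | S, T)
       (False, True): 0.1, (False, False): 0.3}
p_G = {True: 0.9, False: 0.05}                     # P(G=True | R) -- 2 rows, not 8

def joint(s, t, r, g):
    """Joint probability as a product of the four local factors."""
    pt = p_T[s] if t else 1 - p_T[s]
    pr = p_R[(s, t)] if r else 1 - p_R[(s, t)]
    pg = p_G[r] if g else 1 - p_G[r]
    return p_S[s] * pt * pr * pg

print(joint(True, False, True, True))  # ~0.0864 = 0.4 * 0.3 * 0.8 * 0.9
```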

4. Classic Example: Burglary-Earthquake-Alarm Network

Scenario

You have a home alarm that can be triggered by a burglary or an earthquake. When the alarm goes off, your neighbors John and Mary might call you. This gives 5 binary variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)

Bayesian Network Structure
flowchart TD
    B["Burglary<br/>P(B)"] --> A["Alarm<br/>P(A|B,E)"]
    E["Earthquake<br/>P(E)"] --> A
    A --> J["JohnCalls<br/>P(J|A)"]
    A --> M["MaryCalls<br/>P(M|A)"]
Parent-Child Relationships
  • Burglary (B): No parents (root)
  • Earthquake (E): No parents (root)
  • Alarm (A): Parents: B, E
  • JohnCalls (J): Parent: A
  • MaryCalls (M): Parent: A
Conditional Independence
  • B and E are independent
  • J and M are independent given A
  • J doesn't depend on B or E given A
  • M doesn't depend on B or E given A
Factorization Formula
P(B, E, A, J, M) = P(B) × P(E) × P(A|B,E) × P(J|A) × P(M|A)

Step-by-Step: Calculating P(B=T, E=F, A=T, J=T, M=T)

Let's calculate the probability of a specific scenario: Burglary happens, no earthquake, alarm sounds, both John and Mary call.

Given CPT Values (Example)
Prior Probabilities:
  • P(B=T) = 0.001
  • P(E=T) = 0.002
Alarm CPT:
  • P(A=T|B=T,E=F) = 0.94
Call CPTs:
  • P(J=T|A=T) = 0.90
  • P(M=T|A=T) = 0.70
Calculation Steps
P(B=T, E=F, A=T, J=T, M=T)
= P(B=T) × P(E=F) × P(A=T|B=T,E=F) × P(J=T|A=T) × P(M=T|A=T)
= 0.001 × (1-0.002) × 0.94 × 0.90 × 0.70
= 0.001 × 0.998 × 0.94 × 0.90 × 0.70
≈ 0.000591

Interpretation: The probability of this exact scenario is about 0.059%, or roughly 1 in 1,700. This is a rare event because burglaries are rare to begin with (0.1%).
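The same calculation, using exactly the CPT values given above:

```python
# Calculate P(B=T, E=F, A=T, J=T, M=T) from the alarm network's CPTs.
# All numbers are the example CPT values stated in the text.

p_b = 0.001                 # P(B=T)
p_e = 0.002                 # P(E=T)
p_a_bT_eF = 0.94            # P(A=T | B=T, E=F)
p_j_aT = 0.90               # P(J=T | A=T)
p_m_aT = 0.70               # P(M=T | A=T)

# Factorization: P(B) * P(E=F) * P(A|B,E) * P(J|A) * P(M|A)
prob = p_b * (1 - p_e) * p_a_bT_eF * p_j_aT * p_m_aT
print(round(prob, 6))  # 0.000591
```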

Parameter Comparison
Full Joint Table
2⁵ − 1 = 31

parameters needed

Bayesian Network
10

parameters needed
(1 + 1 + 4 + 2 + 2)

~68% reduction!
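Both counts can be reproduced in one line each (binary variables, one free parameter per CPT row):

```python
# Parameter counts for the alarm network (all 5 variables binary).
full_joint = 2 ** 5 - 1          # 31: full table, minus 1 (probabilities sum to 1)
bn = 1 + 1 + 2 ** 2 + 2 + 2      # 10: P(B), P(E), P(A|B,E), P(J|A), P(M|A)
print(full_joint, bn)            # 31 10
```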

5. Why Does Factorization Work?

The Foundation: Conditional Independence

Bayesian Network factorization works because of the Markov assumption: Given its parents, a node is conditionally independent of all its non-descendants.

1. Local Structure

Each variable depends only on its immediate parents, not on the entire history of variables.

2. Independence

The graph structure encodes which variables are conditionally independent, eliminating unnecessary dependencies.

3. Compact Storage

Instead of exponential entries, we store linear number of small CPT tables.

Mathematical Guarantee

The BN factorization is mathematically equivalent to the full joint distribution:

∏ᵢ P(Xᵢ | Parents(Xᵢ)) = P(X₁, ..., Xₙ)

It's not an approximation — it's exact!
We've just reorganized the representation to exploit structure.
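This exactness is easy to check by brute force: if every CPT is normalized, the product of local factors sums to 1 over all assignments, so the factorization really is a probability distribution. The 3-node chain below (X₁ → X₂ → X₃) uses invented CPT numbers:

```python
# Sketch: the BN factorization defines an exact distribution.
# Chain X1 -> X2 -> X3 with made-up (but normalized) CPTs.
from itertools import product

p_x1 = {True: 0.2, False: 0.8}   # P(X1)
p_x2 = {True: 0.6, False: 0.3}   # P(X2=True | X1)
p_x3 = {True: 0.9, False: 0.4}   # P(X3=True | X2)

def joint(x1, x2, x3):
    """Product of the three local factors."""
    f2 = p_x2[x1] if x2 else 1 - p_x2[x1]
    f3 = p_x3[x2] if x3 else 1 - p_x3[x2]
    return p_x1[x1] * f2 * f3

# Summing over all 2**3 assignments gives 1 (up to floating point).
total = sum(joint(*xs) for xs in product([True, False], repeat=3))
print(round(total, 12))
```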

Key Takeaways
  • Graph structure determines factorization: Arrows show direct dependencies
  • Conditional independence is key: Not everything depends on everything
  • Massive parameter reduction: From exponential to linear (in many cases)
  • Enables efficient inference: We can exploit structure for faster computation
  • Mirrors causal relationships: Often reflects how the world actually works

Summary & Next Steps

What We Learned
  1. Full joint distributions are exponentially large (2ⁿ)
  2. Chain rule breaks joints into conditionals
  3. BNs factorize using only parent dependencies
  4. Graph structure = conditional independence
  5. Massive reduction in parameters
Coming Next
  • Topic 3: Conditional Independence in depth
  • Topic 4: d-Separation algorithm
  • Topic 5: How to construct BNs
  • Topics 6-8: Inference algorithms
The Big Picture

Bayesian Networks give us the best of both worlds:
The expressiveness of full joint distributions with the efficiency of compact, structured representation. This makes probabilistic AI practical for real-world applications!