Constructing Bayesian Networks

Introduction: The Art of Building Bayesian Networks

The Challenge

You understand Bayesian Networks theoretically, but how do you actually build one for a real problem?

This topic teaches you the systematic process of transforming domain knowledge into a working BN.

Inputs

Domain knowledge
Causal relationships
Expert opinions
Historical data

Process

Identify variables
Determine ordering
Add edges (dependencies)
Specify CPTs

Output

Complete DAG structure
Validated CPTs
Ready for inference
Interpretable model

What You'll Learn

Step 1: How to choose and order variables
Step 2: How to identify direct influences (parent relationships)
Step 3: How to specify and validate CPTs
Step 4: Common pitfalls to avoid and best practices
Step 5: Complete worked example from scratch

1. The 5-Step Construction Process

Overview

Building a Bayesian Network is a systematic process. Follow these five steps in order:

Step 1: Identify Variables

List all relevant random variables in your domain

What can we observe?
What do we want to infer?
What hidden causes exist?

Step 2: Choose Variable Ordering

Arrange variables in causal order

Causes before effects
Root causes first
Observations last

Step 3: Add Edges (Dependencies)

For each variable, add edges from its direct influences

Ask: "What directly affects this?"
Only add necessary edges
Check for cycles (must be DAG)

Step 4: Specify CPTs

Define conditional probabilities for each variable

From data (if available)
From expert knowledge
Must sum to 1.0 per row

Step 5: Validate & Test

Verify the network is correct

Check independencies make sense
Test with known scenarios
Refine if needed

Pro Tip: Following causal ordering in Step 2 makes Step 3 much easier!

2. Variable Ordering: The Key to Success

Why Ordering Matters

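The order in which you add variables strongly affects the resulting network structure. Any ordering yields a valid BN, but a causal ordering (causes → effects) usually produces the most natural and compact one, while a poor ordering forces extra edges to capture the same dependencies.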

BAD: Random Ordering

graph TD G[Grass Wet] A[Alarm] R[Rain] S[Sprinkler] G --> R G --> S A --> G S --> R style G fill:#dc3545,stroke:#0a2540,stroke-width:2px,color:#fff style A fill:#ffc107,stroke:#0a2540,stroke-width:2px,color:#000 style R fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style S fill:#00d4ff,stroke:#0a2540,stroke-width:2px,color:#fff

Problems:

Confusing direction (effects → causes)
Many unnecessary edges
Hard to interpret
Complex CPTs

GOOD: Causal Ordering

graph TD R[Rain] S[Sprinkler] G[Grass Wet] R --> G S --> G style R fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style S fill:#00d4ff,stroke:#0a2540,stroke-width:2px,color:#fff style G fill:#32D583,stroke:#0a2540,stroke-width:2px,color:#fff

Benefits:

Intuitive direction (causes → effects)
Minimal edges needed
Easy to interpret
Simple CPTs

Chain of Thought: How to Choose Ordering

Step 1: Identify root causes — variables that are NOT caused by anything else in your domain

Example: Season, Patient Genetics, Economic Policy

Step 2: Identify intermediate variables — variables that are caused by root causes

Example: Temperature (caused by Season), Disease (caused by Genetics)

Step 3: Identify effects/observations — variables that result from other variables

Example: Grass Wet (caused by Rain/Sprinkler), Symptoms (caused by Disease)

Step 4: Arrange in layers: Root Causes → Intermediate → Effects
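The layering described in these four steps can be sketched in code. This is a minimal illustration, assuming the causal links are already written down as a cause-to-effects map; the `causes` dictionary and the `causal_layers` helper are hypothetical names, and the standard-library `graphlib` module (Python 3.9+) does the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical cause -> effects map for the sprinkler example.
causes = {
    "Rain": ["GrassWet"],
    "Sprinkler": ["GrassWet"],
    "GrassWet": [],
}

def causal_layers(causes):
    """Group variables into layers so every cause precedes its effects."""
    # TopologicalSorter expects node -> predecessors, so invert the map.
    parents = {v: set() for v in causes}
    for cause, effects in causes.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(cause)
    sorter = TopologicalSorter(parents)
    sorter.prepare()  # raises graphlib.CycleError if the map has a cycle
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose causes are all placed
        layers.append(ready)
        sorter.done(*ready)
    return layers

layers = causal_layers(causes)
print(layers)  # [['Rain', 'Sprinkler'], ['GrassWet']]
```

Flattening the layers from first to last gives one valid causal ordering; variables within the same layer can be added in any order.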

Practical Example: Medical Diagnosis Ordering

Layer 1: Root Causes

Smoking

Genetics

Age

Layer 2: Diseases

Lung Cancer

Bronchitis

Layer 3: Symptoms

Cough

Fatigue

X-ray Result

3. Adding Edges: Identifying Direct Influences

The Core Question

For each variable X (in order), ask: "What DIRECTLY influences X?"
Only add edges from those direct influences (parents).

Chain of Thought: Adding Edges for Variable X

Step 1: Look at all variables added before X (in your ordering)

Step 2: For each previous variable Y, ask: "Does Y directly influence X?"

Direct influence means: X depends on Y even if we know all other parents of X

Step 3: If YES → Add edge Y → X
If NO → Skip (conditional independence)

Step 4: The parents of X are now its minimal set of direct influences
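The procedure above can be sketched as a short loop. In this illustration the judgment "does Y directly influence X?" is delegated to a callback; `build_parents`, `ask_expert`, and `EXPERT_YES` are hypothetical names standing in for the modeler's or expert's answers, not part of any library.

```python
def build_parents(ordering, directly_influences):
    """For each variable, keep only earlier variables judged to be direct influences."""
    parents = {}
    for i, x in enumerate(ordering):
        earlier = ordering[:i]  # Step 1: only variables added before X
        # Steps 2-3: keep Y only if the answer to the core question is "yes"
        parents[x] = [y for y in earlier if directly_influences(y, x)]
    return parents

# Hypothetical expert answers for a three-variable chain.
EXPERT_YES = {("Smoking", "LungCancer"), ("LungCancer", "Cough")}

def ask_expert(y, x):
    return (y, x) in EXPERT_YES

parents = build_parents(["Smoking", "LungCancer", "Cough"], ask_expert)
print(parents)
# {'Smoking': [], 'LungCancer': ['Smoking'], 'Cough': ['LungCancer']}
```

Because a variable can only receive parents that appear earlier in the ordering, the result is acyclic by construction.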

Example: Building a Medical Network Step-by-Step

Scenario: We want to diagnose lung disease. Variables in causal order: Smoking → LungCancer → Cough

Adding Variable 1: Smoking (Root Cause)

graph TD S[Smoking] style S fill:#635bff,stroke:#0a2540,stroke-width:3px,color:#fff

Analysis: No previous variables → No parents → Root node

Adding Variable 2: LungCancer

graph TD S[Smoking] L[Lung Cancer] S --> L style S fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style L fill:#dc3545,stroke:#0a2540,stroke-width:3px,color:#fff

Question: Does Smoking directly influence LungCancer?
Answer: YES ✓
Smoking is a known cause of lung cancer.

Result: Add edge Smoking → LungCancer

Adding Variable 3: Cough

graph TD S[Smoking] L[Lung Cancer] C[Cough] S --> L L --> C style S fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style L fill:#dc3545,stroke:#0a2540,stroke-width:2px,color:#fff style C fill:#ffc107,stroke:#0a2540,stroke-width:3px,color:#000

Question 1: Does Smoking directly influence Cough?
Answer: NO ✗
Given LungCancer status, smoking doesn't add info about cough

Question 2: Does LungCancer directly influence Cough?
Answer: YES ✓
Lung cancer causes coughing.

Result: Add edge LungCancer → Cough (NOT Smoking → Cough)

Key Principle: Conditional Independence

We didn't add Smoking → Cough because: Cough ⊥ Smoking | LungCancer

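Why? In this simplified model, the effect of smoking on cough is assumed to be mediated entirely through lung cancer. Once we know whether the patient has lung cancer, smoking status tells us nothing new about cough. This is the power of conditional independence: it lets us avoid unnecessary edges! (If smoking could also cause a cough on its own, a direct Smoking → Cough edge would be needed.)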
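The independence statement can be checked numerically for the chain Smoking → LungCancer → Cough. The sketch below builds the joint distribution from the chain factorization and confirms that, once LungCancer is fixed, changing Smoking leaves P(Cough) unchanged. The Cough probabilities are made-up values for illustration only.

```python
from math import isclose

# Illustrative CPTs for the chain Smoking -> LungCancer -> Cough.
# The Cough numbers are assumptions invented for this check.
P_S = {True: 0.30, False: 0.70}
P_L_given_S = {True: 0.10, False: 0.01}   # P(LungCancer=True | Smoking)
P_C_given_L = {True: 0.80, False: 0.10}   # P(Cough=True | LungCancer)

def bern(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true

def joint(s, l, c):
    """Chain-rule factorization implied by the graph."""
    return P_S[s] * bern(P_L_given_S[s], l) * bern(P_C_given_L[l], c)

def p_cough_given(s, l):
    """P(Cough=True | Smoking=s, LungCancer=l), computed from the joint."""
    return joint(s, l, True) / (joint(s, l, True) + joint(s, l, False))

for l in (True, False):
    with_smoker = p_cough_given(True, l)
    with_nonsmoker = p_cough_given(False, l)
    # Once LungCancer is fixed, Smoking no longer changes P(Cough).
    assert isclose(with_smoker, with_nonsmoker)
    print(f"LungCancer={l}: P(Cough | ...) = {with_smoker:.2f} for either Smoking value")
```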

4. Specifying Conditional Probability Tables

What are CPTs?

Each variable needs a Conditional Probability Table: P(Variable | Parents). These numbers can come from data, expert knowledge, or a combination of both.

From Data

When you have historical data:

Step 1: Count occurrences

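Example (illustrative counts): Of 1000 smokers, 30 got lung cancer

Step 2: Calculate frequencies

P(Cancer=Yes|Smoking=Yes) = 30/1000 = 0.03

Step 3: Normalize (ensure rows sum to 1)

P(Cancer=No|Smoking=Yes) = 970/1000 = 0.97, so the row sums to 0.03 + 0.97 = 1.00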
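The count, divide, and normalize steps can be sketched as follows. This assumes the data arrives as a list of dictionaries, one per patient; `estimate_cpt` is a hypothetical helper, and the records are synthetic, built only to match the 30-in-1000 example.

```python
from collections import Counter

def estimate_cpt(records, child, parent):
    """Relative-frequency estimate of P(child | parent) from dict records."""
    counts = Counter((r[parent], r[child]) for r in records)
    parent_values = sorted({r[parent] for r in records})
    child_values = sorted({r[child] for r in records})
    cpt = {}
    for pv in parent_values:
        row_total = sum(counts[(pv, cv)] for cv in child_values)
        # Dividing by the row total normalizes the row to sum to 1.
        cpt[pv] = {cv: counts[(pv, cv)] / row_total for cv in child_values}
    return cpt

# Synthetic records matching the example: 1000 smokers, 30 with cancer.
records = (
    [{"Smoking": "Yes", "Cancer": "Yes"}] * 30
    + [{"Smoking": "Yes", "Cancer": "No"}] * 970
)
cpt = estimate_cpt(records, child="Cancer", parent="Smoking")
print(cpt)  # {'Yes': {'No': 0.97, 'Yes': 0.03}}
```

With small samples, raw frequencies can come out as exactly 0 or 1; smoothing the counts is a common remedy and is not shown here.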

From Experts

When data is unavailable:

Step 1: Interview domain experts

Ask: "How often does X happen given Y?"

Step 2: Use ranges and sensitivity analysis

Example: "Between 60-80% likely"

Step 3: Validate with test cases
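When an expert gives a range rather than a single number, a quick sweep shows how much the range matters. The sketch below applies Bayes' rule to a single cause and a single symptom; the prior and the false-positive rate are hypothetical values chosen for illustration, and only the 60-80% range comes from the example above.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' rule for a binary hypothesis H and observed evidence E."""
    num = prior * p_e_given_h
    return num / (num + (1.0 - prior) * p_e_given_not_h)

# Hypothetical setup: the expert says P(Cough | Cancer) is "between 60-80%".
prior_cancer = 0.10        # assumed P(Cancer)
p_cough_no_cancer = 0.10   # assumed P(Cough | no Cancer)

sweep = {}
for p_cough_cancer in (0.60, 0.70, 0.80):
    sweep[p_cough_cancer] = posterior(prior_cancer, p_cough_cancer, p_cough_no_cancer)
    print(p_cough_cancer, round(sweep[p_cough_cancer], 3))
```

Under these assumed numbers the posterior P(Cancer | Cough) moves from 0.40 to about 0.47 across the expert's range. If a spread of that size would change a decision, the estimate deserves more elicitation effort; if not, the midpoint is probably good enough.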

Example: CPT Specification

Simple CPT: Smoking (Root Node)

Smoking	Probability
Yes	0.30 (30%)
No	0.70 (70%)
Sum:	1.00 ✓

Source: Population statistics (e.g., "30% of adults smoke")

Conditional CPT: Lung Cancer | Smoking

Smoking	Cancer = Yes	Cancer = No
Yes	0.10	0.90
No	0.01	0.99

Source: Medical studies (e.g., "10% of smokers develop lung cancer vs. 1% of non-smokers")

Important: Each row in a CPT must sum to 1.0. This ensures valid probability distributions.
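The row-sum rule is easy to check mechanically. The sketch below assumes a CPT stored as a dictionary keyed by a tuple of parent values; `validate_cpt` is a hypothetical helper rather than a function from any BN library.

```python
from itertools import product
from math import isclose

def validate_cpt(cpt, parent_domains):
    """Return a list of problems: missing parent combinations or bad rows."""
    problems = []
    for combo in product(*parent_domains):
        if combo not in cpt:
            problems.append(f"missing row for parents={combo}")
            continue
        row = cpt[combo]
        if any(p < 0 or p > 1 for p in row.values()):
            problems.append(f"probability outside [0, 1] for parents={combo}")
        if not isclose(sum(row.values()), 1.0, abs_tol=1e-9):
            problems.append(f"row for parents={combo} sums to {sum(row.values())}")
    return problems

# The Lung Cancer | Smoking table, keyed by a tuple of parent values.
lung_cancer_cpt = {
    ("Yes",): {"Yes": 0.10, "No": 0.90},
    ("No",): {"Yes": 0.01, "No": 0.99},
}
print(validate_cpt(lung_cancer_cpt, [["Yes", "No"]]))  # []
```

An empty list means every parent combination has a row and every row is a valid distribution.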

Pro Tips for CPT Specification

Start with extreme cases: What should the child's probability be when a parent is definitely true or definitely false?
Use noisy-OR for combining influences: Appropriate when multiple parents can each independently cause the effect
Validate with experts: Do the numbers "feel right" to domain experts?
Test edge cases: Check network behavior with known scenarios
Sensitivity analysis: How much do small CPT changes affect results?
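As an illustration of the noisy-OR tip, the sketch below generates a full CPT from one "strength" per cause plus an optional leak term. The function name `noisy_or_cpt` and the numeric strengths are hypothetical; the formula is the standard noisy-OR, in which the effect is absent only if every active cause independently fails to produce it.

```python
from itertools import product

def noisy_or_cpt(cause_strengths, leak=0.0):
    """P(effect=True | parents) for every on/off combination of the causes.

    cause_strengths[i] is the chance cause i alone produces the effect;
    leak is the chance the effect occurs with no listed cause active.
    """
    cpt = {}
    for combo in product([True, False], repeat=len(cause_strengths)):
        p_not = 1.0 - leak
        for active, strength in zip(combo, cause_strengths):
            if active:
                p_not *= 1.0 - strength  # each active cause fails independently
        cpt[combo] = 1.0 - p_not
    return cpt

# Hypothetical strengths for two causes of a symptom.
cough = noisy_or_cpt([0.8, 0.85], leak=0.1)
for combo, p in cough.items():
    print(combo, round(p, 3))
```

The appeal is the parameter count: n causes need n strengths plus one leak value, instead of one number for each of the 2^n parent combinations.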

5. Common Pitfalls & Best Practices

Learn from Common Mistakes

Even experienced modelers make these mistakes. Knowing them helps you avoid them!

Common Pitfalls

Pitfall 1: Too Many Variables

Problem: Including every possible variable makes the network unmanageable.

Solution: Focus on relevant variables for your specific inference task. Ask: "Does this variable affect what I'm trying to predict?"

Pitfall 2: Confusing Causation with Correlation

Problem: Adding edges between correlated variables that aren't causally related.

Solution: Ask "Does X cause Y, or do they just happen together?" If they share a common cause, model the common cause instead.

Pitfall 3: Creating Cycles

Problem: Adding edges that create loops (A → B → C → A).

Solution: BNs must be acyclic (DAG). If you need cycles, consider Dynamic Bayesian Networks or break the cycle by introducing time steps.
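A cycle check is simple to automate. This sketch assumes the structure is stored as a node-to-parents dictionary and relies on the standard-library `graphlib` module (Python 3.9+); `find_cycle` is a hypothetical wrapper name.

```python
from graphlib import CycleError, TopologicalSorter

def find_cycle(parents):
    """Return a list of nodes forming a cycle, or None if the graph is a DAG."""
    try:
        # static_order() walks the whole graph and raises on any cycle.
        list(TopologicalSorter(parents).static_order())
    except CycleError as err:
        return err.args[1]  # graphlib reports the offending cycle here
    return None

# node -> list of parents
dag = {"Smoking": [], "LungCancer": ["Smoking"], "Cough": ["LungCancer"]}
loop = {"A": ["C"], "B": ["A"], "C": ["B"]}  # A -> B -> C -> A

print(find_cycle(dag))   # None
print(find_cycle(loop))  # a list of the nodes on the cycle
```

Running a check like this after every added edge catches the problem at the moment it is introduced.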

Pitfall 4: Wrong Causal Direction

Problem: Putting arrows backwards (Effect → Cause instead of Cause → Effect).

Solution: Always ask "What causes what?" not "What is correlated?" Follow causal ordering: root causes first, effects last.

Pitfall 5: Incomplete CPTs

Problem: Missing entries or rows that don't sum to 1.0.

Solution: Every possible parent combination needs probabilities. Always verify rows sum to 1.0. Use software validation tools.

Best Practices

Best Practice 1: Start Simple

Begin with a minimal network containing only essential variables. Add complexity incrementally. Test at each step. It's easier to expand than to simplify!

Best Practice 2: Use Domain Experts

Collaborate with people who understand the domain deeply. They can identify causal relationships and validate your structure. Show them the network visually — they can spot errors you'd miss.

Best Practice 3: Follow Causal Ordering

Always order variables causally (causes → effects). This makes edge selection natural and produces compact, interpretable networks. This is the single most important best practice!

Best Practice 4: Test with Known Cases

Validate your network with known scenarios. Example: "If patient is a heavy smoker, what's P(lung cancer)?" If results don't match expert expectations, revise the network.

Best Practice 5: Document Your Assumptions

Write down why you added each edge and where CPT values came from. This helps with debugging, explaining to others, and maintaining the network over time.

Best Practice 6: Iterate and Refine

Your first network won't be perfect. Iterate: build → test → refine → repeat. As you learn more about the domain, update the network structure and parameters.

6. Complete Example: Building a Medical Diagnosis Network

Scenario

Goal: Build a BN to diagnose respiratory diseases based on patient history and symptoms.
Domain: Medical diagnosis for lung conditions (Lung Cancer, Bronchitis)

Step 1: Identify Variables

Variable	Type	Values	Why Include?
Smoking	Risk Factor	Yes / No	Major cause of lung diseases
Lung Cancer	Disease (Query)	Yes / No	What we want to diagnose
Bronchitis	Disease (Query)	Yes / No	Alternative diagnosis
Cough	Symptom (Evidence)	Yes / No	Observable symptom, helps diagnosis
Fatigue	Symptom (Evidence)	Yes / No	Observable symptom for lung cancer
X-ray Result	Test (Evidence)	Abnormal / Normal	Strong diagnostic indicator

Variables NOT included: Patient age, genetics, pollution exposure (simplifying assumption: their influence is treated as absorbed into the probabilities conditioned on smoking status)

Step 2: Choose Causal Ordering

Layer 1 (Root Causes): Smoking

Not caused by other variables in our model

Layer 2 (Diseases): Lung Cancer, Bronchitis

Caused by smoking

Layer 3 (Observations): Cough, Fatigue, X-ray

Caused by diseases (effects we observe)

Final Ordering:
Smoking → {Lung Cancer, Bronchitis} → {Cough, Fatigue, X-ray}

Step 3: Add Edges (Apply Chain of Thought)

Variable 1: Smoking (root node, no parents)

graph TD S[Smoking] style S fill:#635bff,stroke:#0a2540,stroke-width:3px,color:#fff

Decision: No previous variables → No edges to add

Variable 2: Lung Cancer

Question: Does Smoking directly influence Lung Cancer?

Answer: YES ✓ — Smoking is a direct cause of lung cancer.

graph TD S[Smoking] LC[Lung Cancer] S --> LC style S fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style LC fill:#dc3545,stroke:#0a2540,stroke-width:3px,color:#fff

Decision: Add edge Smoking → Lung Cancer

Variable 3: Bronchitis

Question 1: Does Smoking directly influence Bronchitis?

Answer: YES ✓ — Smoking causes bronchitis.

Question 2: Does Lung Cancer directly influence Bronchitis?

Answer: NO ✗ (neither disease causes the other). Both are caused by smoking, so they are correlated overall, but they are modeled as conditionally independent given Smoking.

graph TD S[Smoking] LC[Lung Cancer] B[Bronchitis] S --> LC S --> B style S fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style LC fill:#dc3545,stroke:#0a2540,stroke-width:2px,color:#fff style B fill:#ff6b6b,stroke:#0a2540,stroke-width:3px,color:#fff

Decision: Add edge Smoking → Bronchitis (NOT Lung Cancer → Bronchitis)

Variable 4: Cough

Question 1: Does Smoking directly influence Cough?

Answer: NO ✗ — Given both Lung Cancer and Bronchitis status, smoking doesn't add information about cough.

Note: Smoking affects cough through two paths: Smoking → LC → Cough and Smoking → Bronchitis → Cough. Only when we know both disease states are those paths blocked, making smoking conditionally independent of cough.

Question 2: Does Lung Cancer directly influence Cough?

Answer: YES ✓ — Lung cancer causes persistent coughing.

Question 3: Does Bronchitis directly influence Cough?

Answer: YES ✓ — Bronchitis also causes coughing.

graph TD S[Smoking] LC[Lung Cancer] B[Bronchitis] C[Cough] S --> LC S --> B LC --> C B --> C style S fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style LC fill:#dc3545,stroke:#0a2540,stroke-width:2px,color:#fff style B fill:#ff6b6b,stroke:#0a2540,stroke-width:2px,color:#fff style C fill:#ffc107,stroke:#0a2540,stroke-width:3px,color:#000

Decision: Add edges: Lung Cancer → Cough, Bronchitis → Cough

Variable 5: Fatigue

Question 1: Does Smoking directly influence Fatigue?

Answer: NO ✗ — Given Lung Cancer status, smoking doesn't add information.

Question 2: Does Lung Cancer directly influence Fatigue?

Answer: YES ✓ — Lung cancer causes systemic fatigue.

Question 3: Does Bronchitis directly influence Fatigue?

Answer: NO ✗ — Bronchitis typically doesn't cause systemic fatigue.

graph TD S[Smoking] LC[Lung Cancer] B[Bronchitis] C[Cough] F[Fatigue] S --> LC S --> B LC --> C B --> C LC --> F style S fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style LC fill:#dc3545,stroke:#0a2540,stroke-width:2px,color:#fff style B fill:#ff6b6b,stroke:#0a2540,stroke-width:2px,color:#fff style C fill:#ffc107,stroke:#0a2540,stroke-width:2px,color:#000 style F fill:#00d4ff,stroke:#0a2540,stroke-width:3px,color:#fff

Decision: Add edge: Lung Cancer → Fatigue (NOT Smoking → Fatigue, NOT Bronchitis → Fatigue)

Variable 6: X-ray Result

Question 1: Does Smoking directly influence X-ray Result?

Answer: NO ✗ — Given Lung Cancer status, smoking doesn't add information.

Question 2: Does Lung Cancer directly influence X-ray Result?

Answer: YES ✓ — Lung cancer causes abnormal X-ray results.

Question 3: Does Bronchitis directly influence X-ray Result?

Answer: NO ✗ — Bronchitis doesn't typically show abnormalities on X-rays.

Question 4: Does Cough directly influence X-ray Result?

Answer: NO ✗ — Cough is a symptom, not a physical abnormality visible on X-rays.

Question 5: Does Fatigue directly influence X-ray Result?

Answer: NO ✗ — Fatigue doesn't affect X-ray images.

graph TD S[Smoking] LC[Lung Cancer] B[Bronchitis] C[Cough] F[Fatigue] X[X-ray Abnormal] S --> LC S --> B LC --> C B --> C LC --> F LC --> X style S fill:#635bff,stroke:#0a2540,stroke-width:2px,color:#fff style LC fill:#dc3545,stroke:#0a2540,stroke-width:2px,color:#fff style B fill:#ff6b6b,stroke:#0a2540,stroke-width:2px,color:#fff style C fill:#ffc107,stroke:#0a2540,stroke-width:2px,color:#000 style F fill:#00d4ff,stroke:#0a2540,stroke-width:2px,color:#fff style X fill:#32D583,stroke:#0a2540,stroke-width:3px,color:#fff

Decision: Add edge: Lung Cancer → X-ray Result (only Lung Cancer directly affects X-ray)

Step 4: Specify CPTs

Now we need probabilities for each variable given its parents. In practice these would come from medical literature or patient data; the values below are illustrative:

CPT 1: Smoking (Prior)

Value	P(Smoking)
Yes	0.30
No	0.70

CPT 2: Lung Cancer | Smoking

Smoking	LC=Yes	LC=No
Yes	0.10	0.90
No	0.01	0.99

CPT 3: Bronchitis | Smoking

Smoking	B=Yes	B=No
Yes	0.60	0.40
No	0.30	0.70

CPT 4: Cough | LC, Bronchitis

LC	Bronchitis	Cough=Yes	Cough=No
Yes	Yes	0.98	0.02
Yes	No	0.80	0.20
No	Yes	0.85	0.15
No	No	0.10	0.90

CPT 5: Fatigue | LC

Lung Cancer	Fatigue=Yes	Fatigue=No
Yes	0.70	0.30
No	0.20	0.80

CPT 6: X-ray | LC

Lung Cancer	X-ray=Abnormal	X-ray=Normal
Yes	0.90	0.10
No	0.05	0.95

Step 5: Validate & Test

Test Case 1: Heavy Smoker with Symptoms

Evidence:

Smoking = Yes
Cough = Yes
Fatigue = Yes
X-ray = Abnormal

Expected: High P(Lung Cancer)

Result: P(LC|Evidence) ≈ 0.92 ✓

Test Case 2: Non-Smoker with Cough Only

Evidence:

Smoking = No
Cough = Yes
Fatigue = No
X-ray = Normal

Expected: Low P(Lung Cancer), Higher P(Bronchitis)

Result: P(LC|Evidence) ≈ 0.001, P(Bronchitis|Evidence) ≈ 0.78 ✓

Validation Passed! The network produces medically reasonable results. It's ready for deployment!
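The posterior queries in the two test cases can be computed by brute-force enumeration over the Step 4 tables. The sketch below is one minimal way to do it, assuming all six variables are binary and encoding Yes/Abnormal as `True`; the helper names (`bern`, `joint`, `query`) are hypothetical, and a real project would more likely use a BN library.

```python
from itertools import product

YES, NO = True, False
# P(variable=True | parents), copied from the Step 4 tables.
P_S = 0.30
P_LC = {YES: 0.10, NO: 0.01}                 # keyed by Smoking
P_B = {YES: 0.60, NO: 0.30}                  # keyed by Smoking
P_C = {(YES, YES): 0.98, (YES, NO): 0.80,
       (NO, YES): 0.85, (NO, NO): 0.10}      # keyed by (LC, Bronchitis)
P_F = {YES: 0.70, NO: 0.20}                  # keyed by LC
P_X = {YES: 0.90, NO: 0.05}                  # keyed by LC; True = Abnormal

NAMES = ("S", "LC", "B", "C", "F", "X")

def bern(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true

def joint(a):
    """Product of one CPT entry per node for a full assignment dict."""
    return (bern(P_S, a["S"])
            * bern(P_LC[a["S"]], a["LC"])
            * bern(P_B[a["S"]], a["B"])
            * bern(P_C[(a["LC"], a["B"])], a["C"])
            * bern(P_F[a["LC"]], a["F"])
            * bern(P_X[a["LC"]], a["X"]))

def query(target, evidence):
    """P(target=True | evidence) by summing the joint over all 2^6 assignments."""
    num = den = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        a = dict(zip(NAMES, values))
        if any(a[k] != v for k, v in evidence.items()):
            continue  # skip assignments that contradict the evidence
        p = joint(a)
        den += p
        if a[target]:
            num += p
    return num / den

case1 = {"S": True, "C": True, "F": True, "X": True}
case2 = {"S": False, "C": True, "F": False, "X": False}
print("Case 1 P(LC):", round(query("LC", case1), 3))
print("Case 2 P(LC):", round(query("LC", case2), 3))
print("Case 2 P(B): ", round(query("B", case2), 3))
```

Enumeration is exponential in the number of variables, so it only suits small networks like this one, but it makes a dependable reference when checking the output of a faster inference method.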

Complete Network Summary

graph TD S[Smoking: 30% prevalence] LC[Lung Cancer: 10% if smoker] B[Bronchitis: 60% if smoker] C[Cough: multiple causes] F[Fatigue: 70% if LC] X[X-ray Abnormal: 90% if LC] S --> LC S --> B LC --> C B --> C LC --> F LC --> X style S fill:#635bff,stroke:#0a2540,stroke-width:3px,color:#fff style LC fill:#dc3545,stroke:#0a2540,stroke-width:3px,color:#fff style B fill:#ff6b6b,stroke:#0a2540,stroke-width:3px,color:#fff style C fill:#ffc107,stroke:#0a2540,stroke-width:3px,color:#000 style F fill:#00d4ff,stroke:#0a2540,stroke-width:3px,color:#fff style X fill:#32D583,stroke:#0a2540,stroke-width:3px,color:#fff

6 variables (1 risk factor, 2 diseases, 3 observations)
6 edges (minimal necessary dependencies)
13 total parameters (1 + 2 + 2 + 4 + 2 + 2), far fewer than the 2⁶ − 1 = 63 needed for the full joint!
Can perform: Diagnosis (P(Disease|Symptoms)), Prediction (P(Symptoms|Disease)), Risk Assessment (P(Disease|Smoking))
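The parameter and edge counts in this summary can be reproduced from the structure alone. In this sketch each node contributes (k − 1) free numbers per CPT row, with one row per combination of parent values; the short variable names and the `free_parameters` helper are shorthand introduced here.

```python
from math import prod

# Number of values per variable and parents per variable, from the worked example.
cardinality = {"S": 2, "LC": 2, "B": 2, "C": 2, "F": 2, "X": 2}
parents = {"S": [], "LC": ["S"], "B": ["S"],
           "C": ["LC", "B"], "F": ["LC"], "X": ["LC"]}

def free_parameters(node):
    """(k - 1) free numbers per CPT row, one row per parent combination."""
    rows = prod(cardinality[p] for p in parents[node])  # empty product is 1
    return (cardinality[node] - 1) * rows

bn_params = sum(free_parameters(n) for n in parents)
joint_params = prod(cardinality.values()) - 1
edges = sum(len(ps) for ps in parents.values())
print(bn_params, joint_params, edges)  # 13 63 6
```

The saving grows quickly with network size, because the full joint doubles with every added binary variable while each CPT depends only on its own parents.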

Summary & Key Takeaways

What We Learned

The 5-step process: Variables → Ordering → Edges → CPTs → Validation
Causal ordering is crucial: Always order causes before effects
Direct influences only: Ask "Does Y directly affect X given other parents?"
CPTs from data or experts: Multiple sources for probabilities
Avoid common pitfalls: Too many variables, wrong directions, cycles
Iterate and validate: Test with known cases, refine as needed

Critical Success Factors

Domain expertise: Work with people who know the domain
Causal thinking: Think "what causes what?" not "what's correlated?"
Simplicity first: Start minimal, add complexity incrementally
Conditional independence: Leverage it to avoid unnecessary edges
Validation is key: Always test with known scenarios
Document everything: Record assumptions and data sources

The Art and Science of BN Construction

Building Bayesian Networks is both an art and a science.
The science is the systematic 5-step process.
The art is knowing which variables matter, understanding causal relationships, and making good modeling decisions.
Master both, and you'll build effective probabilistic models for any domain!
