Constructing Bayesian Networks

From Domain Knowledge to Working BN

Introduction: The Art of Building Bayesian Networks

The Challenge

You understand Bayesian Networks theoretically, but how do you actually build one for a real problem?

This topic teaches you the systematic process of transforming domain knowledge into a working BN.

Inputs
  • Domain knowledge
  • Causal relationships
  • Expert opinions
  • Historical data
Process
  • Identify variables
  • Determine ordering
  • Add edges (dependencies)
  • Specify CPTs
Output
  • Complete DAG structure
  • Validated CPTs
  • Ready for inference
  • Interpretable model
What You'll Learn
  • How to choose and order variables
  • How to identify direct influences (parent relationships)
  • How to specify and validate CPTs
  • Common pitfalls to avoid and best practices
  • A complete worked example from scratch

1. The 5-Step Construction Process

Overview

Building a Bayesian Network is a systematic process. Follow these five steps in order:

Step 1: Identify Variables

List all relevant random variables in your domain

  • What can we observe?
  • What do we want to infer?
  • What hidden causes exist?
Step 2: Choose Variable Ordering

Arrange variables in causal order

  • Causes before effects
  • Root causes first
  • Observations last
Step 3: Add Edges (Dependencies)

For each variable, add edges from its direct influences

  • Ask: "What directly affects this?"
  • Only add necessary edges
  • Check for cycles (must be DAG)
Step 4: Specify CPTs

Define conditional probabilities for each variable

  • From data (if available)
  • From expert knowledge
  • Must sum to 1.0 per row
Step 5: Validate & Test

Verify the network is correct

  • Check independencies make sense
  • Test with known scenarios
  • Refine if needed

Pro Tip: Following causal ordering in Step 2 makes Step 3 much easier!

2. Variable Ordering: The Key to Success

Why Ordering Matters

The order in which you add variables dramatically affects the network structure. Causal ordering (causes → effects) produces the most natural and compact BN.

BAD: Random Ordering
Diagram (edges): Grass Wet → Rain, Grass Wet → Sprinkler, Alarm → Grass Wet, Sprinkler → Rain
Problems:
  • Confusing direction (effects → causes)
  • Many unnecessary edges
  • Hard to interpret
  • Complex CPTs
GOOD: Causal Ordering
Diagram (edges): Rain → Grass Wet, Sprinkler → Grass Wet
Benefits:
  • Intuitive direction (causes → effects)
  • Minimal edges needed
  • Easy to interpret
  • Simple CPTs
Chain of Thought: How to Choose Ordering
Step 1: Identify root causes — variables that are NOT caused by anything else in your domain
Example: Season, Patient Genetics, Economic Policy
Step 2: Identify intermediate variables — variables that are caused by root causes
Example: Temperature (caused by Season), Disease (caused by Genetics)
Step 3: Identify effects/observations — variables that result from other variables
Example: Grass Wet (caused by Rain/Sprinkler), Symptoms (caused by Disease)
Step 4: Arrange in layers: Root Causes → Intermediate → Effects
Practical Example: Medical Diagnosis Ordering
Layer 1: Root Causes
Smoking
Genetics
Age
Layer 2: Diseases
Lung Cancer
Bronchitis
Layer 3: Symptoms
Cough
Fatigue
X-ray Result
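
A layered ordering like the one above can also be derived mechanically once tentative parent lists exist. The sketch below uses Python's standard `graphlib` module; the parent lists are assumptions made up for illustration (the text only gives the layers, not these edges).

```python
from graphlib import TopologicalSorter

# Hypothetical parent lists for the layered example: each key maps to the
# variables assumed to influence it directly (illustrative only).
parents = {
    "Smoking": [],
    "Genetics": [],
    "Age": [],
    "Lung Cancer": ["Smoking", "Genetics", "Age"],
    "Bronchitis": ["Smoking", "Age"],
    "Cough": ["Lung Cancer", "Bronchitis"],
    "Fatigue": ["Lung Cancer"],
    "X-ray Result": ["Lung Cancer"],
}

# TopologicalSorter takes {node: predecessors}, so every cause is emitted
# before any of its effects.
ordering = list(TopologicalSorter(parents).static_order())
position = {name: i for i, name in enumerate(ordering)}

print(ordering)
```

Any ordering it returns places root causes before diseases and diseases before symptoms, which is the property the construction process relies on.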

3. Adding Edges: Identifying Direct Influences

The Core Question

For each variable X (in order), ask: "What DIRECTLY influences X?"
Only add edges from those direct influences (parents).

Chain of Thought: Adding Edges for Variable X
Step 1: Look at all variables added before X (in your ordering)
Step 2: For each previous variable Y, ask: "Does Y directly influence X?"
Direct influence means: X depends on Y even if we know all other parents of X
Step 3: If YES → Add edge Y → X
If NO → Skip (conditional independence)
Step 4: The parents of X are now its minimal set of direct influences
Example: Building a Medical Network Step-by-Step

Scenario: We want to diagnose lung disease. Variables in causal order: Smoking → LungCancer → Cough

Adding Variable 1: Smoking (Root Cause)
Diagram: Smoking (single node, no edges)
Analysis: No previous variables → No parents → Root node
Adding Variable 2: LungCancer
Diagram (edges): Smoking → Lung Cancer
Question: Does Smoking directly influence LungCancer?
Answer: YES ✓
Smoking is a known cause of lung cancer.
Result: Add edge Smoking → LungCancer
Adding Variable 3: Cough
Diagram (edges): Smoking → Lung Cancer, Lung Cancer → Cough
Question 1: Does Smoking directly influence Cough?
Answer: NO ✗
Given LungCancer status, smoking doesn't add info about cough

Question 2: Does LungCancer directly influence Cough?
Answer: YES ✓
Lung cancer causes coughing.
Result: Add edge LungCancer → Cough (NOT Smoking → Cough)
Key Principle: Conditional Independence

We didn't add Smoking → Cough because: Cough ⊥ Smoking | LungCancer

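Why? In this simplified model, the effect of smoking on cough is assumed to be mediated entirely through lung cancer. Once we know whether the patient has lung cancer, smoking status tells us nothing new about cough. This is the power of conditional independence: it lets us avoid unnecessary edges! (If smoking also caused coughing by another route, a direct Smoking → Cough edge or an additional intermediate variable would be needed.)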
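
The independence statement follows directly from the chain structure. A short derivation, writing S, L, C for Smoking, LungCancer, and Cough (abbreviations introduced here for brevity):

```latex
\begin{aligned}
P(S, L, C) &= P(S)\,P(L \mid S)\,P(C \mid L) \\[4pt]
P(C \mid S, L) &= \frac{P(S, L, C)}{P(S, L)}
  = \frac{P(S)\,P(L \mid S)\,P(C \mid L)}{P(S)\,P(L \mid S)}
  = P(C \mid L)
\end{aligned}
```

Because P(C | S, L) does not depend on S, Cough ⊥ Smoking | LungCancer holds in any network with this chain structure, whatever numbers go into the CPTs.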

4. Specifying Conditional Probability Tables

What are CPTs?

Each variable needs a Conditional Probability Table: P(Variable | Parents). These numbers can come from data, expert knowledge, or a combination of both.

From Data

When you have historical data:

Step 1: Count occurrences
Example: In 1000 smokers, 30 got lung cancer
Step 2: Calculate frequencies
P(Cancer|Smoking=Yes) = 30/1000 = 0.03
Step 3: Normalize (ensure rows sum to 1)
Example: P(Cancer=No|Smoking=Yes) = 970/1000 = 0.97
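
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full learning procedure: `estimate_cpt` and the record layout are hypothetical names chosen here, and the records are synthetic data built to match the "30 of 1000 smokers" example.

```python
from collections import Counter

def estimate_cpt(records, parent, child):
    """Return {parent_value: {child_value: probability}} from relative frequencies."""
    pair_counts = Counter((r[parent], r[child]) for r in records)
    parent_counts = Counter(r[parent] for r in records)
    cpt = {}
    for (p_val, c_val), n in pair_counts.items():
        # Dividing by the parent count normalizes each row to sum to 1.
        cpt.setdefault(p_val, {})[c_val] = n / parent_counts[p_val]
    return cpt

# Synthetic records matching the example: 1000 smokers, 30 with cancer.
records = (
    [{"Smoking": "Yes", "Cancer": "Yes"}] * 30
    + [{"Smoking": "Yes", "Cancer": "No"}] * 970
)

cpt = estimate_cpt(records, "Smoking", "Cancer")
print(cpt["Yes"])
```

Parent/child combinations that never occur in the data get no entry at all with this approach, so real projects usually add some form of smoothing or fall back on expert estimates for rare cases.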
From Experts

When data is unavailable:

Step 1: Interview domain experts
Ask: "How often does X happen given Y?"
Step 2: Use ranges and sensitivity analysis
Example: "Between 60-80% likely"
Step 3: Validate with test cases
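
When an expert gives a range instead of a single number, one simple check is to sweep the range and see how much a quantity of interest moves. The sketch below does this for a two-node cause → effect model using Bayes' rule; the prior of 0.30 and the false-alarm rate of 0.10 are hypothetical values chosen only for illustration.

```python
def posterior(prior, p_effect_given_cause, p_effect_given_no_cause):
    """P(cause | effect observed) for a binary cause and effect (Bayes' rule)."""
    numerator = prior * p_effect_given_cause
    return numerator / (numerator + (1 - prior) * p_effect_given_no_cause)

prior = 0.30        # hypothetical P(cause)
false_alarm = 0.10  # hypothetical P(effect | no cause)

# Sweep the expert's "between 60-80% likely" range for P(effect | cause).
sweep = {p: posterior(prior, p, false_alarm) for p in (0.60, 0.70, 0.80)}
spread = max(sweep.values()) - min(sweep.values())

for p, post in sweep.items():
    print(f"P(effect|cause)={p:.2f} -> P(cause|effect)={post:.3f}")
print(f"spread across the range: {spread:.3f}")
```

If the spread is small relative to the decision being made, the expert's uncertainty about that entry matters little; if it is large, that entry deserves more elicitation effort or data.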
Example: CPT Specification
Simple CPT: Smoking (Root Node)
Smoking | Probability
Yes     | 0.30 (30%)
No      | 0.70 (70%)
Sum     | 1.00

Source: Population statistics (e.g., "30% of adults smoke")

Conditional CPT: Lung Cancer | Smoking
Smoking | Cancer = Yes | Cancer = No
Yes     | 0.10         | 0.90
No      | 0.01         | 0.99

Source: Medical studies (e.g., "10% of smokers develop lung cancer vs. 1% of non-smokers")

Important: Each row in a CPT must sum to 1.0. This ensures valid probability distributions.

Pro Tips for CPT Specification
  • Start with extreme cases: What happens when parent = always true or always false?
  • Use noisy-OR for combining influences: When multiple parents independently cause effect
  • Validate with experts: Do the numbers "feel right" to domain experts?
  • Test edge cases: Check network behavior with known scenarios
  • Sensitivity analysis: How much do small CPT changes affect results?
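
To make the noisy-OR tip concrete, here is a minimal sketch that expands per-parent "causal strengths" into a full CPT column. Under noisy-OR, the effect is absent only if every active cause (and an optional background "leak") independently fails to produce it. The function name, the strengths, and the leak value are all hypothetical.

```python
from itertools import product

def noisy_or_cpt(strengths, leak=0.0):
    """Return {parent_states: P(effect = yes)} for all on/off parent combinations."""
    names = list(strengths)
    cpt = {}
    for states in product([True, False], repeat=len(names)):
        p_no_effect = 1.0 - leak  # the background cause fails to trigger
        for name, active in zip(names, states):
            if active:
                p_no_effect *= 1.0 - strengths[name]  # this cause fails too
        cpt[states] = 1.0 - p_no_effect
    return cpt

# Hypothetical strengths; keys of the result are (LungCancer, Bronchitis) flags.
cough = noisy_or_cpt({"LungCancer": 0.80, "Bronchitis": 0.85}, leak=0.10)

for states, p in cough.items():
    print(states, round(p, 3))
```

The appeal is that n parents need only n strengths (plus a leak) instead of 2ⁿ separately elicited rows, at the cost of assuming the causes act independently.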

5. Common Pitfalls & Best Practices

Learn from Common Mistakes

Even experienced modelers make these mistakes. Knowing them helps you avoid them!

Common Pitfalls
Pitfall 1: Too Many Variables

Problem: Including every possible variable makes the network unmanageable.

Solution: Focus on relevant variables for your specific inference task. Ask: "Does this variable affect what I'm trying to predict?"

Pitfall 2: Confusing Causation with Correlation

Problem: Adding edges between correlated variables that aren't causally related.

Solution: Ask "Does X cause Y, or do they just happen together?" If they share a common cause, model the common cause instead.

Pitfall 3: Creating Cycles

Problem: Adding edges that create loops (A → B → C → A).

Solution: BNs must be acyclic (DAG). If you need cycles, consider Dynamic Bayesian Networks or break the cycle by introducing time steps.
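
A cycle check is easy to automate. The sketch below relies on Python's standard `graphlib`, whose `TopologicalSorter.prepare()` raises `CycleError` when the graph is not acyclic; the helper name `is_dag` is an assumption made for this example.

```python
from graphlib import CycleError, TopologicalSorter

def is_dag(edges):
    """Return True if the directed (parent, child) edges contain no cycle."""
    sorter = TopologicalSorter()
    for parent, child in edges:
        sorter.add(child, parent)  # register parent as a predecessor of child
    try:
        sorter.prepare()  # raises CycleError if a cycle exists
    except CycleError:
        return False
    return True

print(is_dag([("A", "B"), ("B", "C"), ("C", "A")]))                    # cyclic
print(is_dag([("Rain", "Grass Wet"), ("Sprinkler", "Grass Wet")]))     # acyclic
```

Running a check like this after every added edge catches the problem at the moment it is introduced.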

Pitfall 4: Wrong Causal Direction

Problem: Putting arrows backwards (Effect → Cause instead of Cause → Effect).

Solution: Always ask "What causes what?" not "What is correlated?" Follow causal ordering: root causes first, effects last.

Pitfall 5: Incomplete CPTs

Problem: Missing entries or rows that don't sum to 1.0.

Solution: Every possible parent combination needs probabilities. Always verify rows sum to 1.0. Use software validation tools.
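
If no dedicated tool is at hand, a small validator covers both failure modes (missing rows and rows that do not sum to 1). This is a sketch under an assumed CPT layout, a dict keyed by tuples of parent values; the values are taken from the Cough CPT in the worked example.

```python
import math
from itertools import product

def validate_cpt(cpt, parent_states, child_states):
    """Return a list of problems; an empty list means the CPT passed all checks."""
    problems = []
    for combo in product(*parent_states):
        row = cpt.get(combo)
        if row is None:
            problems.append(f"missing row for parent values {combo}")
            continue
        if set(row) != set(child_states):
            problems.append(f"row {combo} must list exactly {child_states}")
        if any(not 0.0 <= p <= 1.0 for p in row.values()):
            problems.append(f"row {combo} has a probability outside [0, 1]")
        if not math.isclose(sum(row.values()), 1.0, abs_tol=1e-9):
            problems.append(f"row {combo} sums to {sum(row.values()):.3f}, not 1.0")
    return problems

yes_no = ("Yes", "No")
cough_cpt = {  # keyed by (Lung Cancer, Bronchitis)
    ("Yes", "Yes"): {"Yes": 0.98, "No": 0.02},
    ("Yes", "No"): {"Yes": 0.80, "No": 0.20},
    ("No", "Yes"): {"Yes": 0.85, "No": 0.15},
    ("No", "No"): {"Yes": 0.10, "No": 0.90},
}
ok = validate_cpt(cough_cpt, [yes_no, yes_no], yes_no)

# Deliberately broken copy: one row removed, one row that sums to 1.03.
broken = dict(cough_cpt)
del broken[("No", "No")]
broken[("Yes", "Yes")] = {"Yes": 0.98, "No": 0.05}
issues = validate_cpt(broken, [yes_no, yes_no], yes_no)

print(ok)
print(issues)
```

For a root node, pass an empty list of parent states; the CPT then has a single row keyed by the empty tuple.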

Best Practices
Best Practice 1: Start Simple

Begin with a minimal network containing only essential variables. Add complexity incrementally. Test at each step. It's easier to expand than to simplify!

Best Practice 2: Use Domain Experts

Collaborate with people who understand the domain deeply. They can identify causal relationships and validate your structure. Show them the network visually — they can spot errors you'd miss.

Best Practice 3: Follow Causal Ordering

Always order variables causally (causes → effects). This makes edge selection natural and produces compact, interpretable networks. This is the single most important best practice!

Best Practice 4: Test with Known Cases

Validate your network with known scenarios. Example: "If patient is a heavy smoker, what's P(lung cancer)?" If results don't match expert expectations, revise the network.

Best Practice 5: Document Your Assumptions

Write down why you added each edge and where CPT values came from. This helps with debugging, explaining to others, and maintaining the network over time.

Best Practice 6: Iterate and Refine

Your first network won't be perfect. Iterate: build → test → refine → repeat. As you learn more about the domain, update the network structure and parameters.

6. Complete Example: Building a Medical Diagnosis Network

Scenario

Goal: Build a BN to diagnose respiratory diseases based on patient history and symptoms.
Domain: Medical diagnosis for lung conditions (Lung Cancer, Bronchitis)

Step 1: Identify Variables
Variable     | Type               | Values            | Why Include?
Smoking      | Risk Factor        | Yes / No          | Major cause of lung diseases
Lung Cancer  | Disease (Query)    | Yes / No          | What we want to diagnose
Bronchitis   | Disease (Query)    | Yes / No          | Alternative diagnosis
Cough        | Symptom (Evidence) | Yes / No          | Observable symptom, helps diagnosis
Fatigue      | Symptom (Evidence) | Yes / No          | Observable symptom for lung cancer
X-ray Result | Test (Evidence)    | Abnormal / Normal | Strong diagnostic indicator

Variables NOT included: Patient age, genetics, pollution exposure (simplifying assumption: their effects are not modeled separately and are treated as absorbed into the probabilities conditioned on smoking status)

Step 2: Choose Causal Ordering
Layer 1 (Root Causes): Smoking
Not caused by other variables in our model
Layer 2 (Diseases): Lung Cancer, Bronchitis
Caused by smoking
Layer 3 (Observations): Cough, Fatigue, X-ray
Caused by diseases (effects we observe)
Final Ordering: Smoking → Lung Cancer, Bronchitis → Cough, Fatigue, X-ray
Step 3: Add Edges (Apply Chain of Thought)

Adding Variable 1: Smoking

Diagram: Smoking (single node, no edges)

Decision: No previous variables → No edges to add

Adding Variable 2: Lung Cancer

Question: Does Smoking directly influence Lung Cancer?

Answer: YES ✓ — Smoking is a direct cause of lung cancer.

Diagram (edges): Smoking → Lung Cancer

Decision: Add edge Smoking → Lung Cancer

Adding Variable 3: Bronchitis

Question 1: Does Smoking directly influence Bronchitis?

Answer: YES ✓ — Smoking causes bronchitis.

Question 2: Does Lung Cancer directly influence Bronchitis?

Answer: NO ✗ (neither disease causes the other. Both are caused by smoking, so they are correlated overall, but they are conditionally independent given Smoking.)

Diagram (edges): Smoking → Lung Cancer, Smoking → Bronchitis

Decision: Add edge Smoking → Bronchitis (NOT Lung Cancer → Bronchitis)

Adding Variable 4: Cough

Question 1: Does Smoking directly influence Cough?

Answer: NO ✗ — Given both Lung Cancer and Bronchitis status, smoking doesn't add information about cough.

Note: Smoking affects cough through two paths: Smoking → LC → Cough and Smoking → Bronchitis → Cough. Only when we know both disease states are those paths blocked, making smoking conditionally independent of cough.

Question 2: Does Lung Cancer directly influence Cough?

Answer: YES ✓ — Lung cancer causes persistent coughing.

Question 3: Does Bronchitis directly influence Cough?

Answer: YES ✓ — Bronchitis also causes coughing.

Diagram (edges): Smoking → Lung Cancer, Smoking → Bronchitis, Lung Cancer → Cough, Bronchitis → Cough

Decision: Add edges: Lung Cancer → Cough, Bronchitis → Cough

Adding Variable 5: Fatigue

Question 1: Does Smoking directly influence Fatigue?

Answer: NO ✗ — Given Lung Cancer status, smoking doesn't add information.

Question 2: Does Lung Cancer directly influence Fatigue?

Answer: YES ✓ — Lung cancer causes systemic fatigue.

Question 3: Does Bronchitis directly influence Fatigue?

Answer: NO ✗ — Bronchitis typically doesn't cause systemic fatigue.

Diagram (edges): Smoking → Lung Cancer, Smoking → Bronchitis, Lung Cancer → Cough, Bronchitis → Cough, Lung Cancer → Fatigue

Decision: Add edge: Lung Cancer → Fatigue (NOT Smoking → Fatigue, NOT Bronchitis → Fatigue)

Adding Variable 6: X-ray Result

Question 1: Does Smoking directly influence X-ray Result?

Answer: NO ✗ — Given Lung Cancer status, smoking doesn't add information.

Question 2: Does Lung Cancer directly influence X-ray Result?

Answer: YES ✓ — Lung cancer causes abnormal X-ray results.

Question 3: Does Bronchitis directly influence X-ray Result?

Answer: NO ✗ — Bronchitis doesn't typically show abnormalities on X-rays.

Question 4: Does Cough directly influence X-ray Result?

Answer: NO ✗ — Cough is a symptom, not a physical abnormality visible on X-rays.

Question 5: Does Fatigue directly influence X-ray Result?

Answer: NO ✗ — Fatigue doesn't affect X-ray images.

Diagram (edges): Smoking → Lung Cancer, Smoking → Bronchitis, Lung Cancer → Cough, Bronchitis → Cough, Lung Cancer → Fatigue, Lung Cancer → X-ray Result

Decision: Add edge: Lung Cancer → X-ray Result (only Lung Cancer directly affects X-ray)
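
With the structure fixed, the network represents the joint distribution as a product of one factor per variable given its parents. Written out for this network (the abbreviations S, LC, B, C, F, X are introduced here for brevity):

```latex
P(S, LC, B, C, F, X)
  = P(S)\,P(LC \mid S)\,P(B \mid S)\,P(C \mid LC, B)\,P(F \mid LC)\,P(X \mid LC)
```

Each factor on the right corresponds to exactly one conditional probability table, so the six edges chosen above determine which six tables need to be specified.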

Step 4: Specify CPTs

Now we need probabilities for each variable given its parents. The values below are illustrative; in a real project they would come from medical literature, data, or expert elicitation:

CPT 1: Smoking (Prior)
Value | P(Smoking)
Yes   | 0.30
No    | 0.70

CPT 2: Lung Cancer | Smoking
Smoking | LC=Yes | LC=No
Yes     | 0.10   | 0.90
No      | 0.01   | 0.99

CPT 3: Bronchitis | Smoking
Smoking | B=Yes | B=No
Yes     | 0.60  | 0.40
No      | 0.30  | 0.70

CPT 4: Cough | LC, Bronchitis
LC  | Bronchitis | Cough=Yes | Cough=No
Yes | Yes        | 0.98      | 0.02
Yes | No         | 0.80      | 0.20
No  | Yes        | 0.85      | 0.15
No  | No         | 0.10      | 0.90

CPT 5: Fatigue | LC
Lung Cancer | Fatigue=Yes | Fatigue=No
Yes         | 0.70        | 0.30
No          | 0.20        | 0.80

CPT 6: X-ray | LC
Lung Cancer | X-ray=Abnormal | X-ray=Normal
Yes         | 0.90           | 0.10
No          | 0.05           | 0.95
Step 5: Validate & Test
Test Case 1: Heavy Smoker with Symptoms

Evidence:

  • Smoking = Yes
  • Cough = Yes
  • Fatigue = Yes
  • X-ray = Abnormal

Expected: High P(Lung Cancer)

Result: P(LC|Evidence) ≈ 0.92 ✓

Test Case 2: Non-Smoker with Cough Only

Evidence:

  • Smoking = No
  • Cough = Yes
  • Fatigue = No
  • X-ray = Normal

Expected: Low P(Lung Cancer), Higher P(Bronchitis)

Result: P(LC|Evidence) ≈ 0.001, P(Bronchitis|Evidence) ≈ 0.78 ✓
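
Test cases like these can be checked by brute-force enumeration, since six binary variables give only 64 joint assignments. The sketch below hard-codes the CPTs from Step 4 and sums the joint distribution over everything consistent with the evidence. The encoding (True for Yes/Abnormal) and the function names are choices made for this example, and enumeration is only practical for small networks.

```python
from itertools import product

# P(variable = Yes/Abnormal | parents), copied from the CPTs in Step 4.
P_S = 0.30
P_LC = {True: 0.10, False: 0.01}   # keyed by Smoking
P_B = {True: 0.60, False: 0.30}    # keyed by Smoking
P_C = {(True, True): 0.98, (True, False): 0.80,
       (False, True): 0.85, (False, False): 0.10}  # keyed by (LC, B)
P_F = {True: 0.70, False: 0.20}    # keyed by LC
P_X = {True: 0.90, False: 0.05}    # keyed by LC

VARS = ("S", "LC", "B", "C", "F", "X")

def bern(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(a):
    """Joint probability of a full assignment: product of one factor per node."""
    return (bern(P_S, a["S"])
            * bern(P_LC[a["S"]], a["LC"])
            * bern(P_B[a["S"]], a["B"])
            * bern(P_C[(a["LC"], a["B"])], a["C"])
            * bern(P_F[a["LC"]], a["F"])
            * bern(P_X[a["LC"]], a["X"]))

def posterior(query, evidence):
    """P(query = True | evidence), summing the joint over all assignments."""
    num = den = 0.0
    for values in product([True, False], repeat=len(VARS)):
        a = dict(zip(VARS, values))
        if any(a[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(a)
        den += p
        if a[query]:
            num += p
    return num / den

case1 = {"S": True, "C": True, "F": True, "X": True}
case2 = {"S": False, "C": True, "F": False, "X": False}

p_lc_case1 = posterior("LC", case1)
p_lc_case2 = posterior("LC", case2)
p_b_case2 = posterior("B", case2)

print(f"Case 1: P(LC | evidence) = {p_lc_case1:.3f}")
print(f"Case 2: P(LC | evidence) = {p_lc_case2:.4f}")
print(f"Case 2: P(B  | evidence) = {p_b_case2:.3f}")
```

For larger networks, a dedicated library with variable elimination or junction-tree inference would be the more realistic choice.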

Validation passed: the network produces medically plausible results on both test cases. Before deployment, test it on more scenarios and review the results with domain experts.

Complete Network Summary
Diagram (edges): Smoking → Lung Cancer, Smoking → Bronchitis, Lung Cancer → Cough, Bronchitis → Cough, Lung Cancer → Fatigue, Lung Cancer → X-ray Result
Node annotations: Smoking (30% prevalence), Lung Cancer (10% if smoker), Bronchitis (60% if smoker), Cough (multiple causes), Fatigue (70% if LC), X-ray Abnormal (90% if LC)
  • 6 variables (1 risk factor, 2 diseases, 3 observations)
  • 6 edges (minimal necessary dependencies)
  • 13 total parameters (much less than 2⁶ − 1 = 63 for full joint!)
  • Can perform: Diagnosis (P(Disease|Symptoms)), Prediction (P(Symptoms|Disease)), Risk Assessment (P(Disease|Smoking))
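
The parameter and edge counts in this summary can be reproduced from the parent lists alone. A minimal sketch, assuming all variables are binary (so each CPT row needs one free parameter and a node with k parents has 2ᵏ rows):

```python
# Parents of each binary variable in the final network.
parents = {
    "Smoking": [],
    "Lung Cancer": ["Smoking"],
    "Bronchitis": ["Smoking"],
    "Cough": ["Lung Cancer", "Bronchitis"],
    "Fatigue": ["Lung Cancer"],
    "X-ray": ["Lung Cancer"],
}

# One free parameter per CPT row; a node with k binary parents has 2**k rows.
bn_parameters = sum(2 ** len(p) for p in parents.values())

# A full joint table over n binary variables has 2**n - 1 free parameters.
full_joint_parameters = 2 ** len(parents) - 1

edge_count = sum(len(p) for p in parents.values())

print(bn_parameters, full_joint_parameters, edge_count)
```

The gap widens quickly with network size: the joint table doubles with every added variable, while the BN grows only with the number of parents each node has.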

Summary & Key Takeaways

What We Learned
  1. The 5-step process: Variables → Ordering → Edges → CPTs → Validation
  2. Causal ordering is crucial: Always order causes before effects
  3. Direct influences only: Ask "Does Y directly affect X given other parents?"
  4. CPTs from data or experts: Multiple sources for probabilities
  5. Avoid common pitfalls: Too many variables, wrong directions, cycles
  6. Iterate and validate: Test with known cases, refine as needed
Critical Success Factors
  • Domain expertise: Work with people who know the domain
  • Causal thinking: Think "what causes what?" not "what's correlated?"
  • Simplicity first: Start minimal, add complexity incrementally
  • Conditional independence: Leverage it to avoid unnecessary edges
  • Validation is key: Always test with known scenarios
  • Document everything: Record assumptions and data sources
The Art and Science of BN Construction

Building Bayesian Networks is both an art and a science.
The science is the systematic 5-step process.
The art is knowing which variables matter, understanding causal relationships, and making good modeling decisions.
Master both, and you'll build effective probabilistic models for any domain!