MDP Fundamentals

Understanding Markov Decision Process Components

What is a Markov Decision Process?

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

MDP Components:
  • States (𝑆): All possible situations
  • Actions (𝐴): Choices available in each state
  • Transitions 𝑇(𝑠,𝑎,𝑠′): Probability of reaching 𝑠′ from 𝑠 via 𝑎
  • Rewards 𝑅(𝑠,𝑎,𝑠′): Immediate payoff for the transition
  • Discount 𝛾: How much we value future rewards (0 ≤ 𝛾 ≤ 1)

Key Insight:

Unlike search problems, where each action has a single deterministic successor, MDPs model stochastic outcomes: taking action 𝑎 in state 𝑠 yields a probability distribution over next states, not a single next state!
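This distinction can be sketched in a few lines of Python. The transition table below is a hypothetical illustration (the state and action names match the dice game in Demo 1), not any particular library's API:

```python
import random

# Hypothetical transition table for a two-state MDP (names match Demo 1):
# T(s, a) maps to a list of (s', probability) pairs -- a distribution,
# not a single successor state.
T = {
    ("in", "stay"): [("in", 2 / 3), ("end", 1 / 3)],
    ("in", "quit"): [("end", 1.0)],
}

def sample_next_state(s, a):
    """Sample s' from the distribution T(s, a, .)."""
    successors, probs = zip(*T[(s, a)])
    return random.choices(successors, weights=probs)[0]

# Unlike a deterministic successor function, repeated calls can disagree:
samples = [sample_next_state("in", "stay") for _ in range(1000)]
print(samples.count("in") / 1000)  # roughly 2/3
```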

Demo 1: Interactive Dice Game 🎲

📊 MDP State-Action Diagram
Visualizing the dice game as a Markov Decision Process: from state "in", STAY keeps you in "in" with probability 2/3 or moves you to "end" with probability 1/3 (reward $4 either way), while QUIT moves to "end" with probability 1 (reward $10). Discount factor γ = 1.00.
MDP Components
States: {in, end}
Actions: {stay, quit}
STAY action:
  • P(continue) = 2/3, R = $4
  • P(end) = 1/3, R = $4
QUIT action:
  • P(end) = 1, R = $10
Expected values:
  • V(stay) = R / (1 - γ·p_continue)
  • V(quit) = $10 (immediate)
  • Optimal: STAY!
Discount factor γ:
  • γ = 1: future rewards count as much as present ones
  • γ < 1: future rewards are worth less
  • γ → 0: only immediate rewards matter
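The V(stay) formula comes from the recursion V = R + γ·p_continue·V, which rearranges to the closed form above. A small check (the function name `v_stay` is ours):

```python
# Geometric-series value of always choosing STAY: V = R + gamma*p_continue*V,
# so V(stay) = R / (1 - gamma*p_continue).
def v_stay(reward=4.0, p_continue=2 / 3, gamma=1.0):
    return reward / (1 - gamma * p_continue)

print(round(v_stay(), 3))           # 12.0 -- beats V(quit) = 10, so STAY wins
print(round(v_stay(gamma=0.5), 3))  # 6.0  -- heavy discounting favors QUIT
```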
🎲 Play the Dice Game!
Experience the MDP in action: choose STAY or QUIT each round. The demo lets you slide γ from 0 ("only now matters") to 1 ("future = present"), shows the resulting expected values (V(stay) = 12 versus V(quit) = 10 at γ = 1, so STAY is optimal), and tracks each round's roll, the raw total Σ rₜ, and the discounted total Σ γᵗ·rₜ in a running game history.
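The raw versus discounted totals the game tracks both come from the return formula Σ γᵗ·rₜ; a minimal sketch:

```python
# Discounted return: sum of gamma**t * r_t over a reward sequence,
# which is what the demo's "Discounted Total" readout accumulates.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three rounds of STAY at $4 each:
print(discounted_return([4, 4, 4], gamma=1.0))  # 12.0 (equals the raw total)
print(discounted_return([4, 4, 4], gamma=0.5))  # 7.0  (4 + 2 + 1)
```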

Demo 2: Robot Warehouse Navigation 🏭

🤖 Navigate the warehouse | ⚠️ Slippery floor = random moves!
🤖 Robot | 📦 Package | 🎯 Delivery | ⚫ Obstacle | Movement cost: -1
Set the slip probability (default 0.1), pick a mode, and click "Run" to compute the optimal policy while the iteration counter advances.
MDP Definition
𝑆: {(0,0), (0,1), ..., (4,4)} = 25 states
𝐴(𝑠): {N, E, S, W}
𝑇(𝑠,𝑎,𝑠′): slips to a random direction with probability 0.1
𝑅(𝑠,𝑎,𝑠′): +100 (delivery), -50 (collision), -1 (move)
𝛾: 0.9
Compare!
Switch between deterministic and stochastic mode. Notice how a high slip probability makes the robot take safer paths! 🧠
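An optimal policy for a grid like this can be computed by value iteration. The sketch below uses the rewards and γ from the MDP definition above, but the map layout, obstacle position, and slip model are illustrative assumptions, not the demo's exact implementation:

```python
# Value-iteration sketch for a slippery 5x5 grid in the spirit of the demo.
GAMMA, SLIP, N = 0.9, 0.1, 5
GOAL, OBSTACLES = (4, 4), {(2, 2)}          # illustrative map, not the demo's
ACTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def step(s, move):
    """Deterministic effect of one move: (next state, reward)."""
    r, c = s[0] + move[0], s[1] + move[1]
    if not (0 <= r < N and 0 <= c < N) or (r, c) in OBSTACLES:
        return s, -50.0       # collision: bounce back with a penalty
    if (r, c) == GOAL:
        return (r, c), 100.0  # successful delivery
    return (r, c), -1.0       # ordinary movement cost

def q_value(V, s, a):
    """Expected value of taking a in s: the intended move happens with
    probability 1 - SLIP; otherwise a uniformly random direction."""
    q = 0.0
    for a2, move in ACTIONS.items():
        p = SLIP / len(ACTIONS) + ((1 - SLIP) if a2 == a else 0.0)
        s2, reward = step(s, move)
        q += p * (reward + GAMMA * V[s2])
    return q

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(100):  # sweeps; gamma < 1 guarantees convergence
    V = {s: 0.0 if s == GOAL else max(q_value(V, s, a) for a in ACTIONS)
         for s in V}
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a))
          for s in V if s != GOAL}
print(policy[(0, 0)])  # best first move from the start corner
```

Raising SLIP in this sketch reproduces the demo's lesson: the -50 collision penalty makes cells next to walls and obstacles expensive in expectation, so the computed policy routes around them.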

Demo 3: MDP Component Builder 🔧

🔧 Build your own MDP from scratch!
Add states, define actions, set transition probabilities, and assign rewards. The builder visualizes the MDP as you construct it and summarizes the number of states, actions, and transitions along with the chosen start and end states.
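The same build-then-validate workflow can be expressed programmatically. This `MDPBuilder` class is a hypothetical sketch of the idea, not the demo's implementation; the validation check is our addition:

```python
# A hypothetical MDPBuilder mirroring the demo's workflow: collect states,
# actions, and transitions, then sanity-check the transition distributions.
class MDPBuilder:
    def __init__(self):
        self.states, self.actions = set(), set()
        self.transitions = {}  # (s, a) -> [(s', prob, reward), ...]

    def add_state(self, s):
        self.states.add(s)

    def add_action(self, a):
        self.actions.add(a)

    def add_transition(self, s, a, s2, prob, reward):
        self.transitions.setdefault((s, a), []).append((s2, prob, reward))

    def validate(self):
        """Each P(. | s, a) must sum to 1 (our own sanity check)."""
        for (s, a), outs in self.transitions.items():
            total = sum(p for _, p, _ in outs)
            if abs(total - 1.0) > 1e-9:
                raise ValueError(f"P(.|{s},{a}) sums to {total}, not 1")

# Rebuild the dice-game MDP from Demo 1:
b = MDPBuilder()
for s in ("in", "end"):
    b.add_state(s)
for a in ("stay", "quit"):
    b.add_action(a)
b.add_transition("in", "stay", "in", 2 / 3, 4)
b.add_transition("in", "stay", "end", 1 / 3, 4)
b.add_transition("in", "quit", "end", 1.0, 10)
b.validate()  # passes: both action distributions sum to 1
```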

Demo 4: Search Problem → MDP Converter 🔄

🔄 Transform a deterministic search problem into a stochastic MDP!
Adjust the uncertainty and simulate the agent's traversal through the chain. With the default 80% success / 20% failure probability per move, the simulation starts at state A and tracks the current state, steps taken, failed attempts, and the expected number of steps (3.75) in a running log.
Key Insight
Deterministic:
  • Always 3 steps
  • Cost = 3
  • Predictable path
Stochastic:
  • Expected: 3.75 steps
  • Variable outcome
  • May take longer!
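The 3.75 figure follows because each of the 3 links takes a geometric number of attempts with success probability 0.8, so E[steps] = 3 / 0.8 = 3.75. A quick check by simulation (a sketch with our own function names):

```python
import random

# Each link takes 1/p attempts on average, so the 3-link chain takes 3/p.
def expected_steps(links=3, p=0.8):
    return links / p

def simulate(links=3, p=0.8, rng=None):
    """Count attempted moves until the agent crosses every link."""
    rng = rng or random.Random()
    steps = 0
    for _ in range(links):
        while True:          # retry this link until a move succeeds
            steps += 1
            if rng.random() < p:
                break
    return steps

print(round(expected_steps(), 2))  # 3.75
avg = sum(simulate(rng=random.Random(i)) for i in range(10000)) / 10000
print(round(avg, 2))               # close to 3.75
```

This also makes the "may take longer" bullet concrete: individual runs vary (any single simulation can take far more than 3 steps), even though the average settles at 3.75.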