MDP Fundamentals

Understanding Markov Decision Process Components

What is a Markov Decision Process?

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

MDP Components:
  • States (𝑆): All possible situations
  • Actions (𝐴): Choices available in each state
  • Transitions 𝑇(𝑠,𝑎,𝑠′): Probability of reaching 𝑠′ from 𝑠 via 𝑎
  • Rewards 𝑅(𝑠,𝑎,𝑠′): Immediate payoff for the transition
  • Discount 𝛾: How much we value future rewards (0 ≤ 𝛾 ≤ 1)

Key Insight:

Unlike search problems, where each action has a single deterministic successor, MDPs model stochastic outcomes: taking action 𝑎 in state 𝑠 yields a probability distribution over next states, not a single next state!
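This distinction can be sketched in a few lines of Python. The transition table below is a hypothetical illustration (the state and action names match the dice game in Demo 1), not any particular library's API:

```python
import random

# Hypothetical transition table for a two-state MDP (names match Demo 1):
# T(s, a) maps to a list of (s', probability) pairs -- a distribution,
# not a single successor state.
T = {
    ("in", "stay"): [("in", 2 / 3), ("end", 1 / 3)],
    ("in", "quit"): [("end", 1.0)],
}

def sample_next_state(s, a):
    """Sample s' from the distribution T(s, a, .)."""
    successors, probs = zip(*T[(s, a)])
    return random.choices(successors, weights=probs)[0]

# Unlike a deterministic successor function, repeated calls can disagree:
samples = [sample_next_state("in", "stay") for _ in range(1000)]
print(samples.count("in") / 1000)  # roughly 2/3
```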

Demo 1: Interactive Dice Game 🎲

📊 MDP State-Action Diagram
Visualizing the dice game as a Markov Decision Process: from state "in", STAY keeps you in "in" with probability 2/3 or moves you to "end" with probability 1/3 (reward $4 either way), while QUIT moves to "end" with probability 1 (reward $10). Discount factor γ = 1.00.
MDP Components
States: {in, end}
Actions: {stay, quit}
STAY action:
  • P(continue) = 2/3, R = $4
  • P(end) = 1/3, R = $4
QUIT action:
  • P(end) = 1, R = $10
Expected values:
  • V(stay) = R / (1 - γ·p_continue)
  • V(quit) = $10 (immediate)
  • Optimal: STAY!
Discount factor γ:
  • γ = 1: future rewards count as much as present ones
  • γ < 1: future rewards are worth less
  • γ → 0: only immediate rewards matter
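The V(stay) formula comes from the recursion V = R + γ·p_continue·V, which rearranges to the closed form above. A small check (the function name `v_stay` is ours):

```python
# Geometric-series value of always choosing STAY: V = R + gamma*p_continue*V,
# so V(stay) = R / (1 - gamma*p_continue).
def v_stay(reward=4.0, p_continue=2 / 3, gamma=1.0):
    return reward / (1 - gamma * p_continue)

print(round(v_stay(), 3))           # 12.0 -- beats V(quit) = 10, so STAY wins
print(round(v_stay(gamma=0.5), 3))  # 6.0  -- heavy discounting favors QUIT
```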
🎲 Play the Dice Game!
Experience the MDP in action: choose STAY or QUIT each round. The demo lets you slide γ from 0 ("only now matters") to 1 ("future = present"), shows the resulting expected values (V(stay) = 12 versus V(quit) = 10 at γ = 1, so STAY is optimal), and tracks each round's roll, the raw total Σ rₜ, and the discounted total Σ γᵗ·rₜ in a running game history.
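The raw versus discounted totals the game tracks both come from the return formula Σ γᵗ·rₜ; a minimal sketch:

```python
# Discounted return: sum of gamma**t * r_t over a reward sequence,
# which is what the demo's "Discounted Total" readout accumulates.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three rounds of STAY at $4 each:
print(discounted_return([4, 4, 4], gamma=1.0))  # 12.0 (equals the raw total)
print(discounted_return([4, 4, 4], gamma=0.5))  # 7.0  (4 + 2 + 1)
```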

Demo 2: Robot Warehouse Navigation 🏭

🤖 Navigate the warehouse | ⚠️ Slippery floor = random moves!
🤖 Robot | 📦 Package | 🎯 Delivery | ⚫ Obstacle | Movement cost: -1
Set the slip probability (default 0.1), pick a mode, and click "Run" to compute the optimal policy while the iteration counter advances.
MDP Definition
𝑆: {(0,0), (0,1), ..., (4,4)} = 25 states
𝐴(𝑠): {N, E, S, W}
𝑇(𝑠,𝑎,𝑠′): slips to a random direction with probability 0.1
𝑅(𝑠,𝑎,𝑠′): +100 (delivery), -50 (collision), -1 (move)
𝛾: 0.9
Compare!
Switch between deterministic and stochastic mode. Notice how a high slip probability makes the robot take safer paths! 🧠
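An optimal policy for a grid like this can be computed by value iteration. The sketch below uses the rewards and γ from the MDP definition above, but the map layout, obstacle position, and slip model are illustrative assumptions, not the demo's exact implementation:

```python
# Value-iteration sketch for a slippery 5x5 grid in the spirit of the demo.
GAMMA, SLIP, N = 0.9, 0.1, 5
GOAL, OBSTACLES = (4, 4), {(2, 2)}          # illustrative map, not the demo's
ACTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def step(s, move):
    """Deterministic effect of one move: (next state, reward)."""
    r, c = s[0] + move[0], s[1] + move[1]
    if not (0 <= r < N and 0 <= c < N) or (r, c) in OBSTACLES:
        return s, -50.0       # collision: bounce back with a penalty
    if (r, c) == GOAL:
        return (r, c), 100.0  # successful delivery
    return (r, c), -1.0       # ordinary movement cost

def q_value(V, s, a):
    """Expected value of taking a in s: the intended move happens with
    probability 1 - SLIP; otherwise a uniformly random direction."""
    q = 0.0
    for a2, move in ACTIONS.items():
        p = SLIP / len(ACTIONS) + ((1 - SLIP) if a2 == a else 0.0)
        s2, reward = step(s, move)
        q += p * (reward + GAMMA * V[s2])
    return q

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(100):  # sweeps; gamma < 1 guarantees convergence
    V = {s: 0.0 if s == GOAL else max(q_value(V, s, a) for a in ACTIONS)
         for s in V}
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a))
          for s in V if s != GOAL}
print(policy[(0, 0)])  # best first move from the start corner
```

Raising SLIP in this sketch reproduces the demo's lesson: the -50 collision penalty makes cells next to walls and obstacles expensive in expectation, so the computed policy routes around them.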

Demo 3: MDP Component Builder 🔧

🔧 Build your own MDP from scratch!
Add states, define actions, set transition probabilities, and assign rewards. The builder visualizes the MDP as you construct it and summarizes the number of states, actions, and transitions along with the chosen start and end states.
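The same build-then-validate workflow can be expressed programmatically. This `MDPBuilder` class is a hypothetical sketch of the idea, not the demo's implementation; the validation check is our addition:

```python
# A hypothetical MDPBuilder mirroring the demo's workflow: collect states,
# actions, and transitions, then sanity-check the transition distributions.
class MDPBuilder:
    def __init__(self):
        self.states, self.actions = set(), set()
        self.transitions = {}  # (s, a) -> [(s', prob, reward), ...]

    def add_state(self, s):
        self.states.add(s)

    def add_action(self, a):
        self.actions.add(a)

    def add_transition(self, s, a, s2, prob, reward):
        self.transitions.setdefault((s, a), []).append((s2, prob, reward))

    def validate(self):
        """Each P(. | s, a) must sum to 1 (our own sanity check)."""
        for (s, a), outs in self.transitions.items():
            total = sum(p for _, p, _ in outs)
            if abs(total - 1.0) > 1e-9:
                raise ValueError(f"P(.|{s},{a}) sums to {total}, not 1")

# Rebuild the dice-game MDP from Demo 1:
b = MDPBuilder()
for s in ("in", "end"):
    b.add_state(s)
for a in ("stay", "quit"):
    b.add_action(a)
b.add_transition("in", "stay", "in", 2 / 3, 4)
b.add_transition("in", "stay", "end", 1 / 3, 4)
b.add_transition("in", "quit", "end", 1.0, 10)
b.validate()  # passes: both action distributions sum to 1
```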

Demo 4: Search Problem → MDP Converter 🔄

🔄 Transform a deterministic search problem into a stochastic MDP!
Adjust the uncertainty and simulate the agent's traversal through the chain. With the default 80% success / 20% failure probability per move, the simulation starts at state A and tracks the current state, steps taken, failed attempts, and the expected number of steps (3.75) in a running log.
Key Insight
Deterministic:
  • Always 3 steps
  • Cost = 3
  • Predictable path
Stochastic:
  • Expected: 3.75 steps
  • Variable outcome
  • May take longer!
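The 3.75 figure follows because each of the 3 links takes a geometric number of attempts with success probability 0.8, so E[steps] = 3 / 0.8 = 3.75. A quick check by simulation (a sketch with our own function names):

```python
import random

# Each link takes 1/p attempts on average, so the 3-link chain takes 3/p.
def expected_steps(links=3, p=0.8):
    return links / p

def simulate(links=3, p=0.8, rng=None):
    """Count attempted moves until the agent crosses every link."""
    rng = rng or random.Random()
    steps = 0
    for _ in range(links):
        while True:          # retry this link until a move succeeds
            steps += 1
            if rng.random() < p:
                break
    return steps

print(round(expected_steps(), 2))  # 3.75
avg = sum(simulate(rng=random.Random(i)) for i in range(10000)) / 10000
print(round(avg, 2))               # close to 3.75
```

This also makes the "may take longer" bullet concrete: individual runs vary (any single simulation can take far more than 3 steps), even though the average settles at 3.75.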