📊 MDP State-Action Diagram
Visualizing the Dice Game as a Markov Decision Process
MDP Components
States: {in, end}
Actions: {stay, quit}
STAY Action:
• P(continue) = 2/3, R = $4
• P(end) = 1/3, R = $4
QUIT Action:
• P(end) = 1, R = $10
Expected Values:
• V(stay) = R / (1 - γ·P(continue)) = $4 / (1 - 2/3) = $12
• V(quit) = $10 (immediate)
• Optimal: STAY!
Discount Factor γ:
• γ = 1: Future = Present value
• γ < 1: Future rewards worth less
• γ → 0: Only immediate matters
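The expected values above follow from the recursion V = R + γ·P(continue)·V. A quick numerical check (a minimal sketch; the function and parameter names are mine):

```python
# V(stay) solves V = R + gamma * p_cont * V  (a geometric series),
# giving the closed form V(stay) = R / (1 - gamma * p_cont).
# V(quit) is just the one-shot quit reward.
def v_stay(r=4.0, gamma=1.0, p_cont=2/3):
    return r / (1 - gamma * p_cont)

def v_quit(r_quit=10.0):
    return r_quit

print(v_stay())           # 12.0: staying beats quitting at gamma = 1
print(v_stay(gamma=0.5))  # 6.0: heavy discounting flips the decision
print(v_quit())           # 10.0
```

Note how lowering γ shrinks V(stay) toward the per-round reward, which is exactly the "only immediate matters" limit above.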
🎯 Dynamic Policy Comparison: π₁ vs π₂
Interactive exploration with path simulation - values update from game settings!
What is a Policy?
A policy π is a complete strategy that tells the agent what action to take in every state. In the dice game:
π₁: "Always Stay"
π₁(in) = stay → Keep playing until dice ends game
π₂: "Always Quit"
π₂(in) = quit → Take guaranteed reward and exit
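In code, a policy is just a lookup from state to action. A minimal sketch (the state and action strings follow the diagram above):

```python
# A policy maps every non-terminal state to an action.
# The dice game has a single non-terminal state, "in".
pi1 = {"in": "stay"}   # pi1: always stay
pi2 = {"in": "quit"}   # pi2: always quit

def act(policy, state):
    """Return the action the policy prescribes in this state."""
    return policy[state]

print(act(pi1, "in"))  # stay
print(act(pi2, "in"))  # quit
```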
Live Parameters (From Game Settings)
Stay Reward
$4
Quit Reward
$10
P(end)
33.3%
P(continue)
66.7%
Discount γ
1.00
Policy π₁: Always STAY
π₁(in) = stay
Possible Paths:
Infinitely many paths: the game can last any number of rounds!
Expected Value Calculation
Vπ₁(in) =
$12
Policy π₂: Always QUIT
π₂(in) = quit
Possible Paths:
Deterministic! This policy gives exactly $10 every time.
Expected Value
Vπ₂(in) =
$10
No calculation needed - deterministic!
Live Path Simulation
Click a "Simulate" button to see a random path unfold!
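The "Simulate" button can be approximated offline. A hedged sketch, assuming the rules above: $4 per round under "stay", and a 1/3 chance each round that the dice ends the game:

```python
import random

def simulate_stay(r_stay=4.0, p_end=1/3, gamma=1.0):
    """One random episode of pi1 (always stay); returns the discounted total."""
    total, discount = 0.0, 1.0
    while True:
        total += discount * r_stay   # collect the stay reward this round
        if random.random() < p_end:  # with prob. p_end the game ends
            return total
        discount *= gamma

random.seed(0)
n = 100_000
avg = sum(simulate_stay() for _ in range(n)) / n
print(avg)  # close to 12, the computed V_pi1(in)
```

Averaging many simulated paths recovers the expected value, which is exactly what policy evaluation computes analytically.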
Dynamic Policy Comparison
π₁ (Always Stay)
$12
Risky but higher EV
π₁ is OPTIMAL!
π₂ (Always Quit)
$10
Safe and guaranteed
Key Insight: Policy evaluation lets us compare policies mathematically!
The math:
Vπ₁ = 12 > 10 = Vπ₂
📐 Policy Evaluation Algorithm
Compute Vπ and Qπ using Bellman equations - Interactive step-by-step
Dice Game MDP
Bellman Equations
\[ V_\pi(s) = \begin{cases} 0 & \text{IsEnd}(s) \\ Q_\pi(s, \pi(s)) & \text{else} \end{cases} \]
\[ Q_\pi(s,a) = \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V_\pi(s')] \]
Select Policy π:
Stay R:
Quit R:
P(end): 33%
γ: 1.00
ITERATIVE POLICY EVALUATION
Let's evaluate the "stay" policy:
\( \pi(\text{in}) = \text{stay} \)
γ = 1
1
Iteration 0
\[ V_\pi^{(0)}(\text{in}) = 0 \]
2
Iteration 1
\[ V_\pi^{(1)}(\text{in}) = \tfrac{2}{3}(4 + 0) + \tfrac{1}{3}(4 + 0) = 4 \]
3
Iteration 2
\[ V_\pi^{(2)}(\text{in}) = \tfrac{2}{3}(4 + 4) + \tfrac{1}{3}(4 + 0) \approx 6.67 \]
4
Iteration 3
\[ V_\pi^{(3)}(\text{in}) = \tfrac{2}{3}(4 + 6.67) + \tfrac{1}{3}(4 + 0) \approx 8.44 \]
∞
Convergence
\[ V_\pi(\text{in}) \rightarrow 12 \quad \checkmark \]
Values converge rapidly to the true expected utility! 🚀
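The iterations above can be generated by a short loop. A sketch under the same parameters, with V(end) = 0 held fixed throughout:

```python
def evaluate_stay(r=4.0, p_end=1/3, gamma=1.0, iters=100):
    """Iterative policy evaluation for pi(in) = stay.
    Update: V(in) <- p_cont*(R + gamma*V(in)) + p_end*(R + gamma*V(end)),
    where V(end) = 0."""
    p_cont = 1 - p_end
    v = 0.0           # iteration 0
    history = [v]
    for _ in range(iters):
        v = p_cont * (r + gamma * v) + p_end * (r + 0.0)
        history.append(v)
    return v, history

v, hist = evaluate_stay()
print([round(x, 2) for x in hist[:4]])  # [0.0, 4.0, 6.67, 8.44]
print(round(v, 4))                      # 12.0
```

Each sweep shrinks the error by the factor γ·P(continue) = 2/3, which is why convergence is fast.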
State Value
\( V_\pi(\text{in}) = \$12 \)
Action Values
\( Q(\text{stay}) = \$12 \)
\( Q(\text{quit}) = \$10 \)
Insight
Q(stay) > Q(quit) → STAY is better!
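Given the converged V, the action values follow from one application of the Q equation. A sketch (again with V(end) = 0):

```python
def q_in(v_in=12.0, r_stay=4.0, r_quit=10.0, p_end=1/3, gamma=1.0):
    """Q(in, a) = sum over s' of T(in,a,s') * [R + gamma * V(s')]."""
    p_cont = 1 - p_end
    q_stay = p_cont * (r_stay + gamma * v_in) + p_end * (r_stay + 0.0)
    q_quit = r_quit  # quit moves to 'end' with certainty, V(end) = 0
    return {"stay": q_stay, "quit": q_quit}

q = q_in()
print(q)                  # stay = 12, quit = 10
print(max(q, key=q.get))  # stay
```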
🎲 Play the Dice Game!
Experience the MDP in action - choose STAY or QUIT each round
Game Settings
Discount γ: 1.00 (slider from 0 = only now matters, to 1 = future = present)
Click to toggle ending values
Expected Value
V(stay) = 12
V(quit) = 10
Optimal: STAY
Last Roll
🎲
Round
0
Dice
🎲
Round $
0
Total $
0
Click "Start" to begin!
Raw Total
$0
Σ rewards
Discounted Total
$0
Σ γᵗ · rₜ
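The two running totals differ only by the discount weights. A sketch of both sums (function names are mine):

```python
def raw_total(rewards):
    """Sum of r_t: the 'Raw Total' counter."""
    return sum(rewards)

def discounted_total(rewards, gamma=1.0):
    """Sum of gamma**t * r_t: the 'Discounted Total' counter."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rolls = [4, 4, 4]                    # three rounds of staying
print(raw_total(rolls))              # 12
print(discounted_total(rolls, 0.5))  # 4 + 2 + 1 = 7.0
```

At γ = 1 (the current setting) the two counters coincide, which is why V(stay) equals the plain expected sum of $4 rewards.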
Cumulative Rewards (with discount)
Game History
Start playing to see history...