Policy Utility & Evaluation

Computing Vπ and Qπ - Understanding Policy Value

What is Policy Evaluation?

Policy Evaluation is the process of computing the value function Vπ(s) or Qπ(s,a) for a fixed policy π. Given a policy, how good is each state or state-action pair?

Key Concepts:
  • Vπ(s): Expected total reward from state s following policy π
  • Qπ(s,a): Expected total reward from (s,a) then following π
  • Bellman Expectation: Vπ(s) = Σs' T(s,π(s),s')[R(s,π(s),s') + γVπ(s')]
  • Iterative method: V(k+1) ← update based on V(k)

Why Evaluate?

Before improving a policy, we need to know how good it is! Policy evaluation gives us the "score" for each state under the current policy. This is crucial for policy iteration algorithms.
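The Bellman expectation backup behind these definitions can be sketched in Python. This is a minimal sketch: the MDP encoding (lists of `(next_state, prob, reward)` tuples) is our own illustrative choice, with the dice-game numbers used later in this page.

```python
# One Bellman expectation backup:
#   V_pi(s) = Q_pi(s, pi(s))
#   Q_pi(s, a) = sum_{s'} T(s, a, s') * [R(s, a, s') + gamma * V_pi(s')]
# The transition encoding (lists of (next_state, prob, reward)) is an
# illustrative assumption, not a fixed API.

def q_value(transitions, state, action, V, gamma):
    """Q_pi(s, a) computed from the current value estimate V."""
    return sum(p * (r + gamma * V[s2])
               for s2, p, r in transitions[(state, action)])

def backup(transitions, policy, V, gamma):
    """One synchronous Bellman backup of V under a fixed policy pi
    (states not covered by the policy keep their current value)."""
    new_V = dict(V)
    for s, a in policy.items():
        new_V[s] = q_value(transitions, s, a, V, gamma)
    return new_V

# Dice game: from "in", STAY continues with prob 2/3 (reward $4 either way),
# QUIT ends the game for a guaranteed $10.
transitions = {
    ("in", "stay"): [("in", 2/3, 4), ("end", 1/3, 4)],
    ("in", "quit"): [("end", 1, 10)],
}
policy = {"in": "stay"}
V = {"in": 0.0, "end": 0.0}
V = backup(transitions, policy, V, gamma=1.0)
print(V["in"])  # first iterate: V^(1)(in) ≈ 4
```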

Demo 1: Interactive Dice Game MDP 🎲

📊 MDP State-Action Diagram
Visualizing the Dice Game as a Markov Decision Process
[State-action diagram: from "in", STAY → "in" with prob 2/3 ($4) or "end" with prob 1/3 ($4); QUIT → "end" with prob 1 ($10). Discount Factor γ = 1.00]
MDP Components
States: {in, end}
Actions: {stay, quit}
STAY Action:
• P(continue) = 2/3, R = $4
• P(end) = 1/3, R = $4
QUIT Action:
• P(end) = 1, R = $10
Expected Values:
• V(stay) = R / (1 - γ·pCont)
• V(quit) = $10 (immediate)
Optimal: STAY!
Discount Factor γ:
• γ = 1: Future = Present value
• γ < 1: Future rewards worth less
• γ → 0: Only immediate matters
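The closed-form expected values above can be sanity-checked in a few lines of Python (a sketch; the variable names are ours):

```python
# Closed-form check of the expected values above, with gamma = 1 and
# p_cont = 2/3 as in the diagram.
# "Always stay" satisfies V = R + gamma * p_cont * V, so
#   V(stay) = R / (1 - gamma * p_cont)   (valid while gamma * p_cont < 1).
gamma, p_cont = 1.0, 2 / 3
stay_reward, quit_reward = 4.0, 10.0

v_stay = stay_reward / (1 - gamma * p_cont)
v_quit = quit_reward  # quit ends the game immediately, deterministically

print(v_stay, v_quit)  # stay has the higher expected value
```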
🎯 Dynamic Policy Comparison: π₁ vs π₂
Interactive exploration with path simulation - values update from game settings!
What is a Policy?

A policy π is a complete strategy that tells the agent what action to take in every state. In the dice game:

π₁: "Always Stay"
π₁(in) = stay → Keep playing until dice ends game
π₂: "Always Quit"
π₂(in) = quit → Take guaranteed reward and exit
Live Parameters (From Game Settings)
Stay Reward
$4
Quit Reward
$10
P(end)
33.3%
P(continue)
66.7%
Discount γ
1.00
Policy π₁: Always STAY
π₁(in) = stay
Possible Paths:
Infinite paths possible!

Expected Value Calculation
Vπ₁(in) = 4 + (2/3)·γ·Vπ₁(in) ⟹ Vπ₁(in) = 4 / (1 − 2/3) = $12 (with γ = 1)
Policy π₂: Always QUIT
π₂(in) = quit
Possible Paths:
Deterministic! This policy gives exactly $10 every time.

Expected Value
Vπ₂(in) = $10
No calculation needed - deterministic!
Live Path Simulation
Click a "Simulate" button to see a random path unfold!
Dynamic Policy Comparison
π₁ (Always Stay)
$12
Risky but higher EV
π₁ is OPTIMAL!
π₂ (Always Quit)
$10
Safe and guaranteed
Key Insight: Policy evaluation lets us compare policies mathematically!
The math:
Vπ₁ = 12 > 10 = Vπ₂
📐 Policy Evaluation Algorithm
Compute Vπ and Qπ using Bellman equations - Interactive step-by-step
Dice Game MDP
[State-action diagram: from "in", STAY → "in" (0.67, $4) or "end" (0.33, $4); QUIT → "end" (1, $10). Discount Factor γ = 1.00]
Bellman Equations
\[ V_\pi(s) = \begin{cases} 0 & \text{IsEnd}(s) \\ Q_\pi(s, \pi(s)) & \text{else} \end{cases} \]
\[ Q_\pi(s,a) = \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V_\pi(s')] \]
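Because the policy is fixed, the Bellman expectation equation is a linear system, so Vπ can also be solved directly as Vπ = (I − γPπ)⁻¹ rπ over the non-terminal states, rather than by iteration. A NumPy sketch; for the dice game under "stay" there is only one non-terminal state:

```python
import numpy as np

# For a fixed policy, V_pi solves the linear system (I - gamma * P) V = r,
# with P and r restricted to non-terminal states (terminal states have V = 0).
# Dice game under pi(in) = stay: "in" is the only non-terminal state.
gamma = 1.0
P = np.array([[2 / 3]])  # P(in -> in) under "stay"
r = np.array([4.0])      # expected one-step reward from "in": (2/3)*4 + (1/3)*4

V = np.linalg.solve(np.eye(1) - gamma * P, r)
print(V[0])  # matches the iterative limit of 12
```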
Select Policy π:
Stay R:
Quit R:
P(end): 33%
γ: 1.00
ITERATIVE POLICY EVALUATION
Let's evaluate the "stay" policy, \( \pi(\text{in}) = \text{stay} \), with γ = 1.
Iteration 0
\[ V_\pi^{(0)}(\text{in}) = 0 \]
Iteration 1
\[ V_\pi^{(1)}(\text{in}) = \tfrac{2}{3}(4 + 0) + \tfrac{1}{3}(4 + 0) = 4 \]
Iteration 2
\[ V_\pi^{(2)}(\text{in}) = \tfrac{2}{3}(4 + 4) + \tfrac{1}{3}(4 + 0) \approx 6.67 \]
Iteration 3
\[ V_\pi^{(3)}(\text{in}) = \tfrac{2}{3}(4 + 6.67) + \tfrac{1}{3}(4 + 0) \approx 8.44 \]
Convergence
\[ V_\pi(\text{in}) \rightarrow 12 \quad \checkmark \]
Values converge rapidly to the true expected utility! 🚀
State Value
\( V_\pi(\text{in}) = \$12 \)
Action Values
\( Q(\text{stay}) = \$12 \)
\( Q(\text{quit}) = \$10 \)
Insight
Q(stay) > Q(quit) → STAY is better!
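The iteration above can be reproduced in a few lines; a sketch with the same parameters (γ = 1, stay reward $4, continue probability 2/3):

```python
# Iterative policy evaluation for the dice game, pi(in) = stay:
#   V^(k+1)(in) = (2/3)(4 + gamma * V^(k)(in)) + (1/3)(4 + gamma * 0)
gamma, p_cont, reward = 1.0, 2 / 3, 4.0

v = 0.0  # V^(0)(in)
for k in range(1, 200):
    v_new = p_cont * (reward + gamma * v) + (1 - p_cont) * (reward + 0.0)
    if abs(v_new - v) < 1e-9:  # converged
        break
    v = v_new

q_stay = v_new   # Q(in, stay) equals V_pi(in) under pi = "always stay"
q_quit = 10.0    # Q(in, quit): deterministic $10
print(round(v_new, 2), q_stay > q_quit)  # converges near 12; stay wins
```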
🎲 Play the Dice Game!
Experience the MDP in action - choose STAY or QUIT each round
Game Settings
Expected Value
V(stay) = 12
V(quit) = 10
Optimal: STAY
[Interactive play area: tracks the current round, dice roll, round reward, Raw Total (Σ rₜ), and Discounted Total (Σ γᵗ × rₜ), with a cumulative-reward chart and game history.]

Demo 1: Policy Value Calculator (Vπ & Qπ)

Compute Vπ and Qπ with Bellman equations!
Define a simple chain MDP and a policy, then compute values
Chain MDP: A → B → Goal
Transitions:
T(A, go, B) = 0.8, T(A, go, A) = 0.2
T(B, go, Goal) = 0.9, T(B, go, B) = 0.1
T(Goal, -, Goal) = 1.0
Rewards: R(Goal) = +100, others = -1
γ: 0.9
Policy π
π(A) = go
π(B) = go
π(Goal) = stay
Computed Values
Click "Compute" to see results...
Bellman Expectation Equations
Vπ(s) = Σs' T(s,π(s),s')[R + γVπ(s')]
Qπ(s,a) = Σs' T(s,a,s')[R + γVπ(s')]

Example: Vπ(A) = 0.8×(-1 + γVπ(B)) + 0.2×(-1 + γVπ(A))
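A sketch that iterates the equations above to convergence. It assumes the +100 is earned on the transition into Goal and that Goal is terminal (Vπ(Goal) = 0), a reading the reward description leaves slightly open:

```python
# Policy evaluation for the chain MDP A -> B -> Goal under pi = "go".
# Assumptions (not fully pinned down by the text): the +100 is earned on
# the transition INTO Goal, every other step costs -1, and Goal is
# terminal with V(Goal) = 0.
gamma = 0.9
V = {"A": 0.0, "B": 0.0, "Goal": 0.0}

for _ in range(500):
    V_new = {
        "A": 0.8 * (-1 + gamma * V["B"]) + 0.2 * (-1 + gamma * V["A"]),
        "B": 0.9 * (100 + gamma * V["Goal"]) + 0.1 * (-1 + gamma * V["B"]),
        "Goal": 0.0,
    }
    if max(abs(V_new[s] - V[s]) for s in V) < 1e-9:  # converged
        V = V_new
        break
    V = V_new

print(round(V["A"], 2), round(V["B"], 2))  # B is nearer Goal, so V(B) > V(A)
```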

Demo 2: Iterative Policy Evaluation Animation

Watch V(0), V(1), V(2), ... converge to Vπ!
Step-by-step iterative policy evaluation with convergence tracking
Iteration 0
V(0)(A) = 0.00, V(0)(B) = 0.00, V(0)(C) = 0.00, V(0)(Goal) = 0.00
Click "Start" to begin iterative evaluation!
Algorithm
1. Initialize: V(0)(s) = 0 for all s
2. Iterate: For k = 0, 1, 2, ...
   V(k+1)(s) ← Σs' T(s,π(s),s')[R + γV(k)(s')]
3. Stop: When |V(k+1) - V(k)| < ε
Convergence Metrics
Max Change: -
Threshold ε: 0.01
Status: Not started
Key Insight
Values "propagate" backwards from terminal states! Goal's value of +100 flows back through the chain.

Demo 3: Coffee Delivery Robot

Evaluate a fixed policy on a 4x4 grid world!
Robot delivers coffee following a pre-set policy
Discount γ: 0.9
Legend: 🤖 Robot Start | ☕ Coffee Station | 🎯 Delivery Goal | ⚫ Obstacle
Current Policy
Direct Policy: always move toward the goal along the shortest path. May hit obstacles!
Policy Score
-
Vπ(start) - Expected total reward
Evaluation Details
Click "Evaluate" to see detailed results...
Compare Policies!
Try different policies and see which one gives the highest Vπ(start). This is the essence of policy comparison!
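The comparison this demo invites can be sketched on a small deterministic grid: evaluate the single trajectory each fixed policy induces from the start and compare the discounted returns. The layout, rewards, obstacle position, and both policies are illustrative assumptions, not the demo's exact settings:

```python
# Compare two deterministic policies on a 4x4 grid by computing the
# discounted return of the trajectory each induces from the start.
# Layout, rewards (-1 per step, +10 on reaching the goal), obstacle
# position, and both policies are illustrative assumptions.
GOAL, OBSTACLE, START, GAMMA = (3, 3), (0, 2), (0, 0), 0.9

def v_start(policy, max_steps=50):
    """Discounted return of the deterministic trajectory from START."""
    (r, c), total, discount = START, 0.0, 1.0
    for _ in range(max_steps):
        if (r, c) == GOAL:
            break
        dr, dc = policy[(r, c)]
        nr, nc = r + dr, c + dc
        if (nr, nc) == OBSTACLE or not (0 <= nr < 4 and 0 <= nc < 4):
            nr, nc = r, c  # blocked: stay put, still pay the step cost
        total += discount * (10.0 if (nr, nc) == GOAL else -1.0)
        discount *= GAMMA
        r, c = nr, nc
    return total

# "Direct" policy: right along the top row, then down the last column
# (runs into the obstacle at (0, 2) and gets stuck).
direct = {(r, c): (0, 1) if c < 3 else (1, 0)
          for r in range(4) for c in range(4)}
# "Down-first" policy: down the first column, then right along the bottom.
down_first = {(r, c): (1, 0) if r < 3 else (0, 1)
              for r in range(4) for c in range(4)}

print(v_start(direct), v_start(down_first))  # pick the policy with higher V
```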