Policy Utility & Evaluation

Computing Vπ and Qπ - Understanding Policy Value

What is Policy Evaluation?

Policy Evaluation is the process of computing the value function Vπ(s) or Qπ(s,a) for a fixed policy π. Given a policy, how good is each state or state-action pair?

Key Concepts:
  • Vπ(s): Expected total reward from state s following policy π
  • Qπ(s,a): Expected total reward from (s,a) then following π
  • Bellman Expectation: Vπ(s) = Σs' T(s,π(s),s')[R(s,π(s),s') + γVπ(s')]
  • Iterative method: V(k+1) ← update based on V(k)

Why Evaluate?

Before improving a policy, we need to know how good it is! Policy evaluation gives us the "score" for each state under the current policy. This is crucial for policy iteration algorithms.
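The Bellman expectation backup behind these definitions can be sketched in Python. This is a minimal sketch: the MDP encoding (lists of `(next_state, prob, reward)` tuples) is our own illustrative choice, with the dice-game numbers used later in this page.

```python
# One Bellman expectation backup:
#   V_pi(s) = Q_pi(s, pi(s))
#   Q_pi(s, a) = sum_{s'} T(s, a, s') * [R(s, a, s') + gamma * V_pi(s')]
# The transition encoding (lists of (next_state, prob, reward)) is an
# illustrative assumption, not a fixed API.

def q_value(transitions, state, action, V, gamma):
    """Q_pi(s, a) computed from the current value estimate V."""
    return sum(p * (r + gamma * V[s2])
               for s2, p, r in transitions[(state, action)])

def backup(transitions, policy, V, gamma):
    """One synchronous Bellman backup of V under a fixed policy pi
    (states not covered by the policy keep their current value)."""
    new_V = dict(V)
    for s, a in policy.items():
        new_V[s] = q_value(transitions, s, a, V, gamma)
    return new_V

# Dice game: from "in", STAY continues with prob 2/3 (reward $4 either way),
# QUIT ends the game for a guaranteed $10.
transitions = {
    ("in", "stay"): [("in", 2/3, 4), ("end", 1/3, 4)],
    ("in", "quit"): [("end", 1, 10)],
}
policy = {"in": "stay"}
V = {"in": 0.0, "end": 0.0}
V = backup(transitions, policy, V, gamma=1.0)
print(V["in"])  # first iterate: V^(1)(in) ≈ 4
```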

Demo 1: Interactive Dice Game MDP 🎲

📊 MDP State-Action Diagram
Visualizing the Dice Game as a Markov Decision Process
[State-action diagram: from "in", STAY → "in" with prob 2/3 ($4) or "end" with prob 1/3 ($4); QUIT → "end" with prob 1 ($10). Discount Factor γ = 1.00]
MDP Components
States: {in, end}
Actions: {stay, quit}
STAY Action:
• P(continue) = 2/3, R = $4
• P(end) = 1/3, R = $4
QUIT Action:
• P(end) = 1, R = $10
Expected Values:
• V(stay) = R / (1 - γ·pCont)
• V(quit) = $10 (immediate)
Optimal: STAY!
Discount Factor γ:
• γ = 1: Future = Present value
• γ < 1: Future rewards worth less
• γ → 0: Only immediate matters
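The closed-form expected values above can be sanity-checked in a few lines of Python (a sketch; the variable names are ours):

```python
# Closed-form check of the expected values above, with gamma = 1 and
# p_cont = 2/3 as in the diagram.
# "Always stay" satisfies V = R + gamma * p_cont * V, so
#   V(stay) = R / (1 - gamma * p_cont)   (valid while gamma * p_cont < 1).
gamma, p_cont = 1.0, 2 / 3
stay_reward, quit_reward = 4.0, 10.0

v_stay = stay_reward / (1 - gamma * p_cont)
v_quit = quit_reward  # quit ends the game immediately, deterministically

print(v_stay, v_quit)  # stay has the higher expected value
```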
🎯 Dynamic Policy Comparison: π₁ vs π₂
Interactive exploration with path simulation - values update from game settings!
What is a Policy?

A policy π is a complete strategy that tells the agent what action to take in every state. In the dice game:

π₁: "Always Stay"
π₁(in) = stay → Keep playing until dice ends game
π₂: "Always Quit"
π₂(in) = quit → Take guaranteed reward and exit
Live Parameters (From Game Settings)
Stay Reward
$4
Quit Reward
$10
P(end)
33.3%
P(continue)
66.7%
Discount γ
1.00
Policy π₁: Always STAY
π₁(in) = stay
Possible Paths:
Infinite paths possible!

Expected Value Calculation
Vπ₁(in) = 4 + (2/3)·γ·Vπ₁(in) ⟹ Vπ₁(in) = 4 / (1 − 2/3) = $12 (with γ = 1)
Policy π₂: Always QUIT
π₂(in) = quit
Possible Paths:
Deterministic! This policy gives exactly $10 every time.

Expected Value
Vπ₂(in) = $10
No calculation needed - deterministic!
Live Path Simulation
Click a "Simulate" button to see a random path unfold!
Dynamic Policy Comparison
π₁ (Always Stay)
$12
Risky but higher EV
π₁ is OPTIMAL!
π₂ (Always Quit)
$10
Safe and guaranteed
Key Insight: Policy evaluation lets us compare policies mathematically!
The math:
Vπ₁ = 12 > 10 = Vπ₂
📐 Policy Evaluation Algorithm
Compute Vπ and Qπ using Bellman equations - Interactive step-by-step
Dice Game MDP
[State-action diagram: from "in", STAY → "in" (0.67, $4) or "end" (0.33, $4); QUIT → "end" (1, $10). Discount Factor γ = 1.00]
Bellman Equations
\[ V_\pi(s) = \begin{cases} 0 & \text{IsEnd}(s) \\ Q_\pi(s, \pi(s)) & \text{else} \end{cases} \]
\[ Q_\pi(s,a) = \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V_\pi(s')] \]
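Because the policy is fixed, the Bellman expectation equation is a linear system, so Vπ can also be solved directly as Vπ = (I − γPπ)⁻¹ rπ over the non-terminal states, rather than by iteration. A NumPy sketch; for the dice game under "stay" there is only one non-terminal state:

```python
import numpy as np

# For a fixed policy, V_pi solves the linear system (I - gamma * P) V = r,
# with P and r restricted to non-terminal states (terminal states have V = 0).
# Dice game under pi(in) = stay: "in" is the only non-terminal state.
gamma = 1.0
P = np.array([[2 / 3]])  # P(in -> in) under "stay"
r = np.array([4.0])      # expected one-step reward from "in": (2/3)*4 + (1/3)*4

V = np.linalg.solve(np.eye(1) - gamma * P, r)
print(V[0])  # matches the iterative limit of 12
```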
Select Policy π:
Stay R:
Quit R:
P(end): 33%
γ: 1.00
ITERATIVE POLICY EVALUATION
Let's evaluate the "stay" policy, \( \pi(\text{in}) = \text{stay} \), with γ = 1.
Iteration 0
\[ V_\pi^{(0)}(\text{in}) = 0 \]
Iteration 1
\[ V_\pi^{(1)}(\text{in}) = \tfrac{2}{3}(4 + 0) + \tfrac{1}{3}(4 + 0) = 4 \]
Iteration 2
\[ V_\pi^{(2)}(\text{in}) = \tfrac{2}{3}(4 + 4) + \tfrac{1}{3}(4 + 0) \approx 6.67 \]
Iteration 3
\[ V_\pi^{(3)}(\text{in}) = \tfrac{2}{3}(4 + 6.67) + \tfrac{1}{3}(4 + 0) \approx 8.44 \]
Convergence
\[ V_\pi(\text{in}) \rightarrow 12 \quad \checkmark \]
Values converge rapidly to the true expected utility! 🚀
State Value
\( V_\pi(\text{in}) = \$12 \)
Action Values
\( Q(\text{stay}) = \$12 \)
\( Q(\text{quit}) = \$10 \)
Insight
Q(stay) > Q(quit) → STAY is better!
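The iteration above can be reproduced in a few lines; a sketch with the same parameters (γ = 1, stay reward $4, continue probability 2/3):

```python
# Iterative policy evaluation for the dice game, pi(in) = stay:
#   V^(k+1)(in) = (2/3)(4 + gamma * V^(k)(in)) + (1/3)(4 + gamma * 0)
gamma, p_cont, reward = 1.0, 2 / 3, 4.0

v = 0.0  # V^(0)(in)
for k in range(1, 200):
    v_new = p_cont * (reward + gamma * v) + (1 - p_cont) * (reward + 0.0)
    if abs(v_new - v) < 1e-9:  # converged
        break
    v = v_new

q_stay = v_new   # Q(in, stay) equals V_pi(in) under pi = "always stay"
q_quit = 10.0    # Q(in, quit): deterministic $10
print(round(v_new, 2), q_stay > q_quit)  # converges near 12; stay wins
```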
🎲 Play the Dice Game!
Experience the MDP in action - choose STAY or QUIT each round
Game Settings
Expected Value
V(stay) = 12
V(quit) = 10
Optimal: STAY
[Interactive play area: tracks the current round, dice roll, round reward, Raw Total (Σ rₜ), and Discounted Total (Σ γᵗ × rₜ), with a cumulative-reward chart and game history.]

Demo 1: Policy Value Calculator (Vπ & Qπ)

Compute Vπ and Qπ with Bellman equations!
Define a simple chain MDP and a policy, then compute values
Chain MDP: A → B → Goal
Transitions:
T(A, go, B) = 0.8, T(A, go, A) = 0.2
T(B, go, Goal) = 0.9, T(B, go, B) = 0.1
T(Goal, -, Goal) = 1.0
Rewards: R(Goal) = +100, others = -1
γ: 0.9
Policy π
π(A) = go
π(B) = go
π(Goal) = stay
Computed Values
Click "Compute" to see results...
Bellman Expectation Equations
Vπ(s) = Σs' T(s,π(s),s')[R + γVπ(s')]
Qπ(s,a) = Σs' T(s,a,s')[R + γVπ(s')]

Example: Vπ(A) = 0.8×(-1 + γVπ(B)) + 0.2×(-1 + γVπ(A))
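A sketch that iterates the equations above to convergence. It assumes the +100 is earned on the transition into Goal and that Goal is terminal (Vπ(Goal) = 0), a reading the reward description leaves slightly open:

```python
# Policy evaluation for the chain MDP A -> B -> Goal under pi = "go".
# Assumptions (not fully pinned down by the text): the +100 is earned on
# the transition INTO Goal, every other step costs -1, and Goal is
# terminal with V(Goal) = 0.
gamma = 0.9
V = {"A": 0.0, "B": 0.0, "Goal": 0.0}

for _ in range(500):
    V_new = {
        "A": 0.8 * (-1 + gamma * V["B"]) + 0.2 * (-1 + gamma * V["A"]),
        "B": 0.9 * (100 + gamma * V["Goal"]) + 0.1 * (-1 + gamma * V["B"]),
        "Goal": 0.0,
    }
    if max(abs(V_new[s] - V[s]) for s in V) < 1e-9:  # converged
        V = V_new
        break
    V = V_new

print(round(V["A"], 2), round(V["B"], 2))  # B is nearer Goal, so V(B) > V(A)
```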

Demo 2: Iterative Policy Evaluation Animation

Watch V(0), V(1), V(2), ... converge to Vπ!
Step-by-step iterative policy evaluation with convergence tracking
Iteration 0
V(0)(A) = 0.00, V(0)(B) = 0.00, V(0)(C) = 0.00, V(0)(Goal) = 0.00
Click "Start" to begin iterative evaluation!
Algorithm
1. Initialize: V(0)(s) = 0 for all s
2. Iterate: For k = 0, 1, 2, ...
   V(k+1)(s) ← Σs' T(s,π(s),s')[R + γV(k)(s')]
3. Stop: When |V(k+1) - V(k)| < ε
Convergence Metrics
Max Change: -
Threshold ε: 0.01
Status: Not started
Key Insight
Values "propagate" backwards from terminal states! Goal's value of +100 flows back through the chain.

Demo 3: Coffee Delivery Robot

Evaluate a fixed policy on a 4x4 grid world!
Robot delivers coffee following a pre-set policy
Discount γ: 0.9
Legend: 🤖 Robot Start | ☕ Coffee Station | 🎯 Delivery Goal | ⚫ Obstacle
Current Policy
Direct Policy: always move toward the goal along the shortest path. May hit obstacles!
Policy Score
-
Vπ(start) - Expected total reward
Evaluation Details
Click "Evaluate" to see detailed results...
Compare Policies!
Try different policies and see which one gives the highest Vπ(start). This is the essence of policy comparison!
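The comparison this demo invites can be sketched on a small deterministic grid: evaluate the single trajectory each fixed policy induces from the start and compare the discounted returns. The layout, rewards, obstacle position, and both policies are illustrative assumptions, not the demo's exact settings:

```python
# Compare two deterministic policies on a 4x4 grid by computing the
# discounted return of the trajectory each induces from the start.
# Layout, rewards (-1 per step, +10 on reaching the goal), obstacle
# position, and both policies are illustrative assumptions.
GOAL, OBSTACLE, START, GAMMA = (3, 3), (0, 2), (0, 0), 0.9

def v_start(policy, max_steps=50):
    """Discounted return of the deterministic trajectory from START."""
    (r, c), total, discount = START, 0.0, 1.0
    for _ in range(max_steps):
        if (r, c) == GOAL:
            break
        dr, dc = policy[(r, c)]
        nr, nc = r + dr, c + dc
        if (nr, nc) == OBSTACLE or not (0 <= nr < 4 and 0 <= nc < 4):
            nr, nc = r, c  # blocked: stay put, still pay the step cost
        total += discount * (10.0 if (nr, nc) == GOAL else -1.0)
        discount *= GAMMA
        r, c = nr, nc
    return total

# "Direct" policy: right along the top row, then down the last column
# (runs into the obstacle at (0, 2) and gets stuck).
direct = {(r, c): (0, 1) if c < 3 else (1, 0)
          for r in range(4) for c in range(4)}
# "Down-first" policy: down the first column, then right along the bottom.
down_first = {(r, c): (1, 0) if r < 3 else (0, 1)
              for r in range(4) for c in range(4)}

print(v_start(direct), v_start(down_first))  # pick the policy with higher V
```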