Value Iteration

Computing the Optimal Value Function $V^*$ and Optimal Policy $\pi^*$

What is Value Iteration?

Value Iteration is an algorithm that computes the optimal value function $V^*(s)$ and the optimal policy $\pi^*$. Unlike policy evaluation, which computes values for one fixed policy, value iteration finds the BEST policy!

Bellman Optimality Equation
$$V^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$
$$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$

where: $V^*(s)$ = optimal value at state $s$, $\pi^*(s)$ = optimal policy, $T$ = transition probability, $R$ = reward, $\gamma$ = discount factor
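As a concrete instance of the Bellman optimality equation above, here is a single backup on a tiny two-state MDP; the states, actions, and all numbers are made up for illustration:

```python
# One Bellman optimality backup on a tiny hypothetical 2-state MDP.
# States: 0 and 1; actions: "stay" and "move". All numbers are illustrative.
gamma = 0.9

# T[s][a] = list of (next_state, prob); R[s][a][next_state] = reward
T = {
    0: {"stay": [(0, 1.0)], "move": [(1, 0.8), (0, 0.2)]},
    1: {"stay": [(1, 1.0)], "move": [(0, 1.0)]},
}
R = {
    0: {"stay": {0: 0.0}, "move": {1: 5.0, 0: 0.0}},
    1: {"stay": {1: 1.0}, "move": {0: 0.0}},
}

V = {0: 0.0, 1: 0.0}  # current value estimate

def backup(s):
    # max over actions of expected reward plus discounted next value
    return max(
        sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in T[s][a])
        for a in T[s]
    )

print(backup(0))  # "move" gives 0.8 * 5.0 = 4.0; "stay" gives 0.0, so 4.0
```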

Algorithm Steps
  1. Initialize: $V^{(0)}(s) = 0$ for all $s$
  2. Iterate: $V^{(k+1)}(s) \leftarrow \max_a Q^{(k)}(s,a)$, where $Q^{(k)}(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{(k)}(s') \right]$
  3. Converge: stop when $\max_s |V^{(k+1)}(s) - V^{(k)}(s)| < \varepsilon$
  4. Extract: $\pi^*(s) = \arg\max_a Q(s,a)$ using the converged values
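The four steps above can be sketched as one loop. The dict-of-dicts MDP encoding here (`T[s][a]` as a list of `(next_state, prob)` pairs) is an assumed layout for illustration, not a fixed API:

```python
# Value iteration sketch: V^(0) = 0, repeated max-backups until the
# largest per-state change drops below eps, then greedy policy extraction.
def value_iteration(states, actions, T, R, gamma=0.9, eps=0.01):
    V = {s: 0.0 for s in states}                      # step 1: initialize
    while True:
        V_new = {}
        for s in states:
            # step 2: compute Q(s, a) for every action
            Q = {a: sum(p * (R[s][a][s2] + gamma * V[s2])
                        for s2, p in T[s][a]) for a in actions(s)}
            V_new[s] = max(Q.values())                # step 3: take maximum
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:                               # step 3: converged?
            break
    # step 4: extract the greedy policy from the converged values
    pi = {}
    for s in states:
        pi[s] = max(actions(s),
                    key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                      for s2, p in T[s][a]))
    return V, pi
```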
Key Insight

Value iteration combines policy evaluation and policy improvement into a single update step by always taking the maximum over all actions.

Demo 1: Value Iteration Algorithm Stepper

Watch $V^{(k)}$ converge step-by-step!
Visual execution of value iteration on a 3×3 grid world
How Value Iteration Works
  1. Initialize: set $V^{(0)}(s) = 0$ for all states
  2. Compute Q-values: for each action, compute its expected value
  3. Take maximum: $V^{(k+1)}(s) = \max_a Q(s,a)$
  4. Repeat: until the values converge

🎯 Goal State (Reward: +10)
Obstacle (Reward: -10)
➡️ Movement (Cost: -0.1)
Iteration Progress
The panel shows the current iteration number and the maximum value change per sweep; the algorithm stops at the convergence threshold $\varepsilon = 0.01$.

What's Happening

Before Starting:

All values are initialized to 0. Click "Step" to update values using the Bellman equation, or "Auto Run" to watch automatic convergence.

Algorithm Step
For each state $s$:
  $V_{new}(s) \leftarrow \max_a Q(s,a)$
  $\pi(s) \leftarrow \arg\max_a Q(s,a)$

Demo 2: Policy Evolution Tracker

Watch policy change as values converge!
See how optimal policy emerges from value updates
The demo tracks the iteration number and lists which states change their greedy action in each sweep.
Policy Stability
Observation
Policy typically stabilizes before values fully converge! This is why policy iteration can be more efficient.
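That observation can be checked numerically. The sketch below runs value iteration on a simple six-state chain (an illustrative stand-in for the demo's grid, with assumed rewards of +10 at the goal and -0.1 per move) and records when the greedy policy last changed versus when the values meet the threshold:

```python
# Track when the greedy policy stabilizes vs. when the values converge,
# on a 1-D chain MDP: states 0..5, state 5 is terminal with reward +10,
# every other move costs -0.1. All numbers are illustrative assumptions.
gamma, eps = 0.9, 0.01
n = 6

def step(s, a):                     # deterministic moves; a in {-1, +1}
    return max(0, min(n - 1, s + a))

V = [0.0] * n
policy, policy_stable_at, it = [None] * n, 0, 0
while True:
    it += 1
    V_new, pi_new = [0.0] * n, [None] * n
    for s in range(n - 1):          # terminal state keeps value 0
        Q = {a: (10.0 if step(s, a) == n - 1 else -0.1) + gamma * V[step(s, a)]
             for a in (-1, +1)}
        pi_new[s] = max(Q, key=Q.get)
        V_new[s] = Q[pi_new[s]]
    if pi_new != policy:
        policy_stable_at = it       # last iteration the policy changed
    delta = max(abs(a - b) for a, b in zip(V, V_new))
    V, policy = V_new, pi_new
    if delta < eps:
        break

print(f"policy last changed at iteration {policy_stable_at}, "
      f"values converged at iteration {it}")
```

In this run the policy settles before the value test fires; on larger or stochastic MDPs the gap is typically wider.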

Demo 3: Convergence Analyzer

Analyze convergence rate with different γ values!
Compare convergence speed: cyclic vs acyclic MDPs
Parameters: discount factor $\gamma$ (default 0.9)
Results: shown after running the test
Theory
Convergence Rate:
The error contracts by a factor of $\gamma$ each iteration, so after $k$ iterations it is $O(\gamma^k)$
Higher γ → slower convergence
Acyclic MDPs:
Converge in a finite number of iterations, equal to the topological depth of the transition graph
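The $O(\gamma^k)$ rate is easy to see on a one-state MDP with a self-loop and reward 1 per step, where $V^* = 1/(1-\gamma)$ and each backup shrinks the error by exactly a factor of $\gamma$ (a toy model, not the demo's MDPs):

```python
# Count iterations until the per-step change drops below eps, for a
# single self-loop state with reward 1: V <- 1 + gamma * V.
# The change after k backups is gamma^(k-1), so higher gamma is slower.
def iterations_to_converge(gamma, eps=1e-4):
    V, k = 0.0, 0
    while True:
        V_new = 1.0 + gamma * V     # the only Bellman backup available
        k += 1
        if abs(V_new - V) < eps:
            return k
        V = V_new

for gamma in (0.5, 0.9, 0.99):
    print(gamma, iterations_to_converge(gamma))
```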

Demo 4: Agricultural Planning MDP (Saudi Context)

Optimal crop selection with weather uncertainty!
Saudi farm planning: dates, wheat, or vegetables?
Weather States
Dry Season: Low rainfall, high temperature
Moderate: Average conditions
Wet Season: Higher rainfall

Transitions: Based on historical Saudi weather patterns
Crop Actions
🌴 Dates: High profit in dry, stable
🌾 Wheat: Reliable, moderate profit
🥬 Vegetables: High profit in wet, risky in dry
Optimal Policy
Click "Compute" to see results...
State Transitions
Expected Rewards
Dry → Dates: +50 SAR
Dry → Wheat: +30 SAR
Dry → Vegetables: +10 SAR
Wet → Vegetables: +60 SAR
Wet → Wheat: +35 SAR
Wet → Dates: +40 SAR
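The farm MDP can be sketched end-to-end with value iteration. Only the Dry/Wet rewards listed above come from the demo; the weather-transition probabilities below are made-up placeholders (the demo's historical data is not given), and the Moderate state is omitted because its rewards are not listed:

```python
# Value iteration on a reduced 2-state farm MDP (Dry / Wet weather).
# Rewards (SAR) are from the table above; the transition probabilities
# are assumed placeholders, NOT historical Saudi weather data.
gamma, eps = 0.9, 1e-6

states = ["Dry", "Wet"]
actions = ["Dates", "Wheat", "Vegetables"]

# P[s] = {next_state: prob} -- assumed, action-independent weather dynamics
P = {"Dry": {"Dry": 0.7, "Wet": 0.3}, "Wet": {"Dry": 0.4, "Wet": 0.6}}

# Expected rewards from the list above
R = {("Dry", "Dates"): 50, ("Dry", "Wheat"): 30, ("Dry", "Vegetables"): 10,
     ("Wet", "Dates"): 40, ("Wet", "Wheat"): 35, ("Wet", "Vegetables"): 60}

V = {s: 0.0 for s in states}
while True:
    V_new = {s: max(R[(s, a)]
                    + gamma * sum(p * V[s2] for s2, p in P[s].items())
                    for a in actions)
             for s in states}
    delta = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    if delta < eps:
        break

pi = {s: max(actions, key=lambda a: R[(s, a)]
             + gamma * sum(p * V[s2] for s2, p in P[s].items()))
      for s in states}
print(pi)  # greedy crop for each weather state
```

Because the assumed weather transitions do not depend on the chosen crop, the future term is identical across actions, so the optimal policy reduces to picking the highest-reward crop in each state: Dates in Dry, Vegetables in Wet.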