Value Iteration

Computing the Optimal Value Function $V^*$ and Optimal Policy $\pi^*$

What is Value Iteration?

Value Iteration is an algorithm that computes the optimal value function $V^*(s)$ and the optimal policy $\pi^*$. Unlike policy evaluation, which computes values for one fixed policy, value iteration finds the BEST policy!

Bellman Optimality Equation
$$V^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$
$$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$

where: $V^*(s)$ = optimal value at state $s$, $\pi^*(s)$ = optimal policy, $T$ = transition probability, $R$ = reward, $\gamma$ = discount factor
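As a concrete instance of the Bellman optimality equation above, here is a single backup on a tiny two-state MDP; the states, actions, and all numbers are made up for illustration:

```python
# One Bellman optimality backup on a tiny hypothetical 2-state MDP.
# States: 0 and 1; actions: "stay" and "move". All numbers are illustrative.
gamma = 0.9

# T[s][a] = list of (next_state, prob); R[s][a][next_state] = reward
T = {
    0: {"stay": [(0, 1.0)], "move": [(1, 0.8), (0, 0.2)]},
    1: {"stay": [(1, 1.0)], "move": [(0, 1.0)]},
}
R = {
    0: {"stay": {0: 0.0}, "move": {1: 5.0, 0: 0.0}},
    1: {"stay": {1: 1.0}, "move": {0: 0.0}},
}

V = {0: 0.0, 1: 0.0}  # current value estimate

def backup(s):
    # max over actions of expected reward plus discounted next value
    return max(
        sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in T[s][a])
        for a in T[s]
    )

print(backup(0))  # "move" gives 0.8 * 5.0 = 4.0; "stay" gives 0.0, so 4.0
```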

Algorithm Steps
  1. Initialize: $V^{(0)}(s) = 0$ for all $s$
  2. Iterate: $V^{(k+1)}(s) \leftarrow \max_a Q^{(k)}(s,a)$, where $Q^{(k)}(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{(k)}(s') \right]$
  3. Converge: stop when $\max_s |V^{(k+1)}(s) - V^{(k)}(s)| < \varepsilon$
  4. Extract: $\pi^*(s) = \arg\max_a Q(s,a)$ using the converged values
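The four steps above can be sketched as one loop. The dict-of-dicts MDP encoding here (`T[s][a]` as a list of `(next_state, prob)` pairs) is an assumed layout for illustration, not a fixed API:

```python
# Value iteration sketch: V^(0) = 0, repeated max-backups until the
# largest per-state change drops below eps, then greedy policy extraction.
def value_iteration(states, actions, T, R, gamma=0.9, eps=0.01):
    V = {s: 0.0 for s in states}                      # step 1: initialize
    while True:
        V_new = {}
        for s in states:
            # step 2: compute Q(s, a) for every action
            Q = {a: sum(p * (R[s][a][s2] + gamma * V[s2])
                        for s2, p in T[s][a]) for a in actions(s)}
            V_new[s] = max(Q.values())                # step 3: take maximum
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:                               # step 3: converged?
            break
    # step 4: extract the greedy policy from the converged values
    pi = {}
    for s in states:
        pi[s] = max(actions(s),
                    key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                      for s2, p in T[s][a]))
    return V, pi
```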
Key Insight

Value iteration combines policy evaluation and policy improvement into a single update step by always taking the maximum over all actions.

Demo 1: Value Iteration Algorithm Stepper

Watch $V^{(k)}$ converge step-by-step!
Visual execution of value iteration on a 3×3 grid world
How Value Iteration Works
  1. Initialize: set $V^{(0)}(s) = 0$ for all states
  2. Compute Q-values: for each action, compute its expected value
  3. Take maximum: $V^{(k+1)}(s) = \max_a Q(s,a)$
  4. Repeat: until the values converge

🎯 Goal State (Reward: +10)
Obstacle (Reward: -10)
➡️ Movement (Cost: -0.1)
Iteration Progress
The panel shows the current iteration number and the maximum value change per sweep; the algorithm stops at the convergence threshold $\varepsilon = 0.01$.

What's Happening

Before Starting:

All values are initialized to 0. Click "Step" to update values using the Bellman equation, or "Auto Run" to watch automatic convergence.

Algorithm Step
For each state $s$:
  $V_{new}(s) \leftarrow \max_a Q(s,a)$
  $\pi(s) \leftarrow \arg\max_a Q(s,a)$

Demo 2: Policy Evolution Tracker

Watch policy change as values converge!
See how optimal policy emerges from value updates
The demo tracks the iteration number and lists which states change their greedy action in each sweep.
Policy Stability
Observation
Policy typically stabilizes before values fully converge! This is why policy iteration can be more efficient.
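That observation can be checked numerically. The sketch below runs value iteration on a simple six-state chain (an illustrative stand-in for the demo's grid, with assumed rewards of +10 at the goal and -0.1 per move) and records when the greedy policy last changed versus when the values meet the threshold:

```python
# Track when the greedy policy stabilizes vs. when the values converge,
# on a 1-D chain MDP: states 0..5, state 5 is terminal with reward +10,
# every other move costs -0.1. All numbers are illustrative assumptions.
gamma, eps = 0.9, 0.01
n = 6

def step(s, a):                     # deterministic moves; a in {-1, +1}
    return max(0, min(n - 1, s + a))

V = [0.0] * n
policy, policy_stable_at, it = [None] * n, 0, 0
while True:
    it += 1
    V_new, pi_new = [0.0] * n, [None] * n
    for s in range(n - 1):          # terminal state keeps value 0
        Q = {a: (10.0 if step(s, a) == n - 1 else -0.1) + gamma * V[step(s, a)]
             for a in (-1, +1)}
        pi_new[s] = max(Q, key=Q.get)
        V_new[s] = Q[pi_new[s]]
    if pi_new != policy:
        policy_stable_at = it       # last iteration the policy changed
    delta = max(abs(a - b) for a, b in zip(V, V_new))
    V, policy = V_new, pi_new
    if delta < eps:
        break

print(f"policy last changed at iteration {policy_stable_at}, "
      f"values converged at iteration {it}")
```

In this run the policy settles before the value test fires; on larger or stochastic MDPs the gap is typically wider.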

Demo 3: Convergence Analyzer

Analyze convergence rate with different γ values!
Compare convergence speed: cyclic vs acyclic MDPs
Parameters: discount factor $\gamma$ (default 0.9)
Results: shown after running the test
Theory
Convergence Rate:
The error contracts by a factor of $\gamma$ each iteration, so after $k$ iterations it is $O(\gamma^k)$
Higher γ → slower convergence
Acyclic MDPs:
Converge in a finite number of iterations, equal to the topological depth of the transition graph
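The $O(\gamma^k)$ rate is easy to see on a one-state MDP with a self-loop and reward 1 per step, where $V^* = 1/(1-\gamma)$ and each backup shrinks the error by exactly a factor of $\gamma$ (a toy model, not the demo's MDPs):

```python
# Count iterations until the per-step change drops below eps, for a
# single self-loop state with reward 1: V <- 1 + gamma * V.
# The change after k backups is gamma^(k-1), so higher gamma is slower.
def iterations_to_converge(gamma, eps=1e-4):
    V, k = 0.0, 0
    while True:
        V_new = 1.0 + gamma * V     # the only Bellman backup available
        k += 1
        if abs(V_new - V) < eps:
            return k
        V = V_new

for gamma in (0.5, 0.9, 0.99):
    print(gamma, iterations_to_converge(gamma))
```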

Demo 4: Agricultural Planning MDP (Saudi Context)

Optimal crop selection with weather uncertainty!
Saudi farm planning: dates, wheat, or vegetables?
Weather States
Dry Season: Low rainfall, high temperature
Moderate: Average conditions
Wet Season: Higher rainfall

Transitions: Based on historical Saudi weather patterns
Crop Actions
🌴 Dates: High profit in dry, stable
🌾 Wheat: Reliable, moderate profit
🥬 Vegetables: High profit in wet, risky in dry
Optimal Policy
Click "Compute" to see results...
State Transitions
Expected Rewards
Dry → Dates: +50 SAR
Dry → Wheat: +30 SAR
Dry → Vegetables: +10 SAR
Wet → Vegetables: +60 SAR
Wet → Wheat: +35 SAR
Wet → Dates: +40 SAR
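The farm MDP can be sketched end-to-end with value iteration. Only the Dry/Wet rewards listed above come from the demo; the weather-transition probabilities below are made-up placeholders (the demo's historical data is not given), and the Moderate state is omitted because its rewards are not listed:

```python
# Value iteration on a reduced 2-state farm MDP (Dry / Wet weather).
# Rewards (SAR) are from the table above; the transition probabilities
# are assumed placeholders, NOT historical Saudi weather data.
gamma, eps = 0.9, 1e-6

states = ["Dry", "Wet"]
actions = ["Dates", "Wheat", "Vegetables"]

# P[s] = {next_state: prob} -- assumed, action-independent weather dynamics
P = {"Dry": {"Dry": 0.7, "Wet": 0.3}, "Wet": {"Dry": 0.4, "Wet": 0.6}}

# Expected rewards from the list above
R = {("Dry", "Dates"): 50, ("Dry", "Wheat"): 30, ("Dry", "Vegetables"): 10,
     ("Wet", "Dates"): 40, ("Wet", "Wheat"): 35, ("Wet", "Vegetables"): 60}

V = {s: 0.0 for s in states}
while True:
    V_new = {s: max(R[(s, a)]
                    + gamma * sum(p * V[s2] for s2, p in P[s].items())
                    for a in actions)
             for s in states}
    delta = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    if delta < eps:
        break

pi = {s: max(actions, key=lambda a: R[(s, a)]
             + gamma * sum(p * V[s2] for s2, p in P[s].items()))
      for s in states}
print(pi)  # greedy crop for each weather state
```

Because the assumed weather transitions do not depend on the chosen crop, the future term is identical across actions, so the optimal policy reduces to picking the highest-reward crop in each state: Dates in Dry, Vegetables in Wet.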