Summary & Applications

Putting It All Together: From Theory to Practice

Course Summary

We've covered Markov Decision Processes (MDPs) - a powerful framework for sequential decision-making under uncertainty. Now let's compare approaches and see real-world applications!

Core Concepts:
  • States, Actions, Transitions, Rewards
  • Value Functions: Vπ, V*, Q
  • Bellman Equations
  • Policies: fixed vs optimal
Algorithms:
  • Policy Evaluation
  • Value Iteration
  • Policy Iteration
  • Model-based planning
Applications:
  • Robotics & Navigation
  • Healthcare & Treatment
  • Autonomous Vehicles
  • Reinforcement Learning
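For reference, the Bellman equations that tie these concepts together (standard definitions, written here in LaTeX):

```latex
% Bellman expectation equation for a fixed policy \pi:
V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]

% Bellman optimality equation:
V^{*}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right]

% Q-values relate to V* via:
Q^{*}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right],
\qquad V^{*}(s) = \max_{a} Q^{*}(s, a)
```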

Demo 1: Algorithm Comparison Dashboard

Compare Policy Evaluation vs Value Iteration side-by-side!
Run both algorithms on the same MDP and compare performance
Policy Evaluation
  • Input: fixed policy π
  • Output: Vπ(s) for all s
  • Use case: evaluate a given policy
Value Iteration
  • Input: MDP dynamics
  • Output: V*(s) and π*
  • Use case: find the optimal policy
Policy Iteration
  • Input: MDP dynamics
  • Output: π* (converges fast)
  • Use case: when the policy stabilizes early
Key Differences
Policy Evaluation: Fixed policy, computes Vπ
Value Iteration: Implicit policy, computes V*
Policy Iteration: Explicit policy update, often faster
Complexity:
Value iteration and policy iteration both cost O(|S|²|A|) per iteration
Policy iteration typically needs fewer iterations to converge
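A minimal side-by-side sketch of policy evaluation and value iteration on a hypothetical two-state, two-action MDP (all transition probabilities and rewards below are made-up illustration values, not from the course demo):

```python
import numpy as np

# Hypothetical MDP: T[s, a, s'] = transition probability, R[s, a] = expected reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_evaluation(pi, tol=1e-8):
    """Iteratively compute V^pi for a fixed deterministic policy pi."""
    V = np.zeros(2)
    iters = 0
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * T[s, pi[s]] @ V for s in range(2)])
        iters += 1
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, iters
        V = V_new

def value_iteration(tol=1e-8):
    """Compute V* and a greedy optimal policy via Bellman backups."""
    V = np.zeros(2)
    iters = 0
    while True:
        Q = R + gamma * (T @ V)        # Q[s, a] = R[s, a] + gamma * sum_s' T[s,a,s'] V[s']
        V_new = Q.max(axis=1)
        iters += 1
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1), iters
        V = V_new

V_pi, n1 = policy_evaluation(pi=np.array([0, 0]))   # evaluate "always action 0"
V_star, pi_star, n2 = value_iteration()
print("V_pi:", V_pi, "in", n1, "iterations")
print("V*:", V_star, "policy:", pi_star, "in", n2, "iterations")
```

As expected, V*(s) ≥ Vπ(s) in every state: the optimal value dominates the value of any fixed policy.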

Demo 2: Autonomous Vehicle Lane Changing

Should the car change lanes? MDP decides!
Lane changing with traffic uncertainty
Traffic Scenario
States: {Left Lane, Center Lane, Right Lane}
Actions: {Stay, Change Left, Change Right}
Uncertainty: Other vehicles may block lane changes
Rewards
Stay in lane: +1 (safe)
Successful change: +5 (faster)
Blocked change: -10 (dangerous)
Collision risk: -50 (critical)
Real-World Insight
Autonomous vehicles use MDPs (or POMDPs for partial observability) to make safe driving decisions. The policy balances speed (changing to faster lane) with safety (risk of collision). High traffic → more conservative policy!
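The lane-changing scenario can be sketched as a small value-iteration problem. The reward values mirror the demo's reward table; the blocking probability, discount factor, and state layout are assumptions for illustration:

```python
import numpy as np

# Lanes: 0 = Left, 1 = Center, 2 = Right; actions: 0 = Stay, 1 = Change Left, 2 = Change Right.
def lane_change_policy(p_block=0.3, gamma=0.9, tol=1e-8):
    """Solve the lane-change MDP; p_block = chance another car blocks the change (assumed)."""
    n = 3
    stay_r, change_r, blocked_r = 1.0, 5.0, -10.0   # rewards from the demo's table
    V = np.zeros(n)
    while True:
        Q = np.full((n, 3), -np.inf)                # -inf marks unavailable actions
        for s in range(n):
            Q[s, 0] = stay_r + gamma * V[s]                       # Stay
            if s > 0:                                             # Change Left
                Q[s, 1] = ((1 - p_block) * (change_r + gamma * V[s - 1])
                           + p_block * (blocked_r + gamma * V[s]))
            if s < n - 1:                                         # Change Right
                Q[s, 2] = ((1 - p_block) * (change_r + gamma * V[s + 1])
                           + p_block * (blocked_r + gamma * V[s]))
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

print(lane_change_policy(p_block=0.1))   # light traffic: changes are worth the risk
print(lane_change_policy(p_block=0.9))   # heavy traffic: staying put dominates
```

Varying p_block reproduces the insight above: under heavy traffic the optimal policy stays in lane everywhere, while under light traffic lane changes become attractive.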

Demo 3: Medical Treatment Planning

Optimal treatment sequence for chronic disease!
Balance treatment efficacy vs side effects
Disease States
Healthy (goal) ↕ Mild (monitor) ↕ Severe (urgent)
Transitions occur between adjacent severity levels.
Treatment Actions
No Treatment: Monitor only
Medication A: Effective, mild side effects
Medication B: Very effective, strong side effects
Surgery: High efficacy, high risk
Clinical Decision Support
MDPs help doctors choose treatment plans that maximize patient quality of life while accounting for treatment efficacy, side effects, and uncertainty in disease progression.
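One way to sketch the treatment-planning MDP with policy iteration. All transition probabilities and quality-of-life rewards below are invented illustrative numbers, not clinical data:

```python
import numpy as np

# States: 0 = Healthy, 1 = Mild, 2 = Severe.
# Actions: 0 = No Treatment, 1 = Medication A, 2 = Medication B, 3 = Surgery.
T = np.zeros((3, 4, 3))
T[0, 0] = [0.90, 0.10, 0.00]; T[1, 0] = [0.10, 0.60, 0.30]; T[2, 0] = [0.00, 0.10, 0.90]  # no treatment: tends to progress
T[0, 1] = [0.95, 0.05, 0.00]; T[1, 1] = [0.40, 0.50, 0.10]; T[2, 1] = [0.10, 0.40, 0.50]  # Medication A: moderate improvement
T[0, 2] = [0.95, 0.05, 0.00]; T[1, 2] = [0.60, 0.35, 0.05]; T[2, 2] = [0.20, 0.50, 0.30]  # Medication B: stronger improvement
T[0, 3] = [0.90, 0.10, 0.00]; T[1, 3] = [0.70, 0.20, 0.10]; T[2, 3] = [0.60, 0.20, 0.20]  # Surgery: best for severe disease

# Reward = quality of life in the current state minus treatment burden (side effects / risk).
state_r = np.array([10.0, 4.0, -5.0])
action_cost = np.array([0.0, 1.0, 3.0, 8.0])
R = state_r[:, None] - action_cost[None, :]      # R[s, a]
gamma = 0.95

def policy_iteration():
    pi = np.zeros(3, dtype=int)
    while True:
        # Exact policy evaluation: solve (I - gamma * T_pi) V = R_pi.
        T_pi = T[np.arange(3), pi]
        R_pi = R[np.arange(3), pi]
        V = np.linalg.solve(np.eye(3) - gamma * T_pi, R_pi)
        # Policy improvement: greedy with respect to the current value function.
        pi_new = (R + gamma * (T @ V)).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration()
print("policy:", pi_star, "values:", V_star)
```

Because policy evaluation is done exactly here (a linear solve), this variant typically converges in only a handful of improvement steps.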

Demo 4: Bridge to Reinforcement Learning

From MDPs (model-based) to RL (model-free)!
Preview of Q-learning: learning the optimal policy without knowing T or R
Model-Based (MDP)
Given: T(s,a,s'), R(s,a,s'), γ
Compute: V*(s), π*(s)
Methods: Value iteration, policy iteration
Pros: Guaranteed optimal, fast if model known
Cons: Need accurate model
Model-Free (RL)
Given: Ability to interact with environment
Learn: Q*(s,a) from experience
Methods: Q-learning, SARSA, Actor-Critic
Pros: No model needed, learns from data
Cons: Slower, needs exploration
Q-Learning Algorithm
Initialize: Q(s,a) = 0 for all s, a
Repeat:
  Take action a, observe reward r and next state s'
  Q(s,a) ← Q(s,a) + α[r + γ·max_{a'} Q(s',a') − Q(s,a)]
Policy: π(s) = argmax_a Q(s,a)
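The update rule above can be made concrete on a toy corridor environment. The environment, ε-greedy exploration, hyperparameters, and episode cap are all assumptions for illustration; note the agent only ever samples transitions, never reads T or R:

```python
import random

# Toy corridor: states 0..3, actions 0 = left / 1 = right.
# Entering state 3 yields reward +10 and ends the episode.
N, GOAL = 4, 3
alpha, gamma, eps = 0.5, 0.9, 0.1

def step(s, a):
    """Environment dynamics, hidden from the agent (it only samples them)."""
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    reward = 10.0 if s2 == GOAL else 0.0
    return s2, reward, s2 == GOAL

def greedy(Q, s):
    best = max(Q[s])
    return random.choice([a for a in (0, 1) if Q[s][a] == best])  # break ties randomly

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N)]
for _ in range(500):                                  # episodes
    s = 0
    for _ in range(100):                              # cap episode length
        # epsilon-greedy behavior policy
        a = random.randrange(2) if random.random() < eps else greedy(Q, s)
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])         # TD update from the slide
        if done:
            break
        s = s2

policy = [greedy(Q, s) for s in range(N)]
print("learned policy:", policy)
```

After training, the greedy policy heads right toward the goal from every non-terminal state, even though the agent never saw the transition or reward model.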
Next Steps in AI
Deep RL:
Q-learning + Neural Networks
Learn from raw pixels/sensors
Applications:
Game AI (AlphaGo, Dota 2)
Robotics control
Resource optimization
Advanced Topics:
POMDPs (partial observability)
Multi-agent RL
Transfer learning