Course outcomes
CO1: Understand the basics of reinforcement learning, its elements, and its limitations.
CO2: Understand finite Markov decision processes.
CO3: Understand temporal-difference learning and its advantages.
CO4: Understand Sarsa, maximization bias, and double learning.
Introduction: Reinforcement Learning, Elements of Reinforcement Learning, Limitations and Scope, An Extended Example: Tic-Tac-Toe. Multi-armed Bandits: A k-armed Bandit Problem, Action-value Methods, The 10-armed Testbed, Incremental Implementation, Tracking a Nonstationary Problem, Optimistic Initial Values, Upper-Confidence-Bound Action Selection, Gradient Bandit Algorithms.
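To ground the action-value methods and incremental implementation listed in this unit, here is a minimal sketch of an epsilon-greedy agent on a k-armed bandit; the arm means, step count, and epsilon value are illustrative assumptions, not prescribed by the syllabus.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Sample-average action-value estimation with epsilon-greedy selection."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)   # estimated action values
    N = np.zeros(k)   # number of times each arm was pulled
    rewards = []
    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the greedy arm.
        if rng.random() < epsilon:
            a = int(rng.integers(k))
        else:
            a = int(np.argmax(Q))
        r = rng.normal(true_means[a], 1.0)  # noisy reward from the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample-average update
        rewards.append(r)
    return Q, float(np.mean(rewards))

Q, avg = epsilon_greedy_bandit(true_means=[0.2, 0.8, 0.5])
print("estimated values:", Q.round(2), "average reward:", round(avg, 3))
```

Replacing the sample-average step size 1/N[a] with a constant step size gives the update used for tracking a nonstationary problem, another topic in this unit.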
Finite Markov Decision Processes: The Agent–Environment Interface, Goals and Rewards, Returns and Episodes, Unified Notation for Episodic and Continuing Tasks, Policies and Value Functions, Optimal Policies and Optimal Value Functions, Optimality and Approximation.
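A small worked sketch of the discounted return underlying the value-function definitions in this unit; the reward sequence and discount factor below are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```

The backwards recursion G_t = R_{t+1} + gamma * G_{t+1} is the same decomposition that leads to the Bellman equations for policies and value functions.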
Review of Markov Processes and Dynamic Programming.
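The dynamic-programming review pairs naturally with a worked example. Below is a minimal sketch of iterative policy evaluation on a hypothetical two-state Markov reward process; the transition matrix, rewards, and tolerance are assumptions for illustration only.

```python
import numpy as np

def policy_evaluation(P, R, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for a fixed policy.

    P[s, s'] : state-transition probabilities under the policy
    R[s]     : expected immediate reward in state s under the policy
    Solves V = R + gamma * P @ V by repeated Bellman expectation backups.
    """
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V   # one sweep of Bellman backups
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state chain: state 0 usually stays put, state 1 usually returns to 0.
P = np.array([[0.9, 0.1],
              [0.8, 0.2]])
R = np.array([1.0, 0.0])
print(policy_evaluation(P, R).round(3))
```

Because gamma < 1 makes the backup a contraction, the sweep converges to the unique fixed point V = R + gamma * P @ V.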
Temporal-Difference Learning: TD Prediction, Advantages of TD Prediction Methods, Optimality of TD(0), Sarsa: On-policy TD Control, Q-learning: Off-policy TD Control, Expected Sarsa, Maximization Bias and Double Learning.
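As a concrete instance of the off-policy TD control method in this unit, here is a minimal sketch of tabular Q-learning on a hypothetical one-dimensional corridor; the environment, rewards, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States 0..n_states-1; actions 0 = left, 1 = right.
    Reaching the rightmost state yields reward 1 and ends the episode.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            # epsilon-greedy behaviour policy
            a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # off-policy target: bootstrap from the greedy action in s_next
            # (terminal Q-values are never updated, so they stay 0)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q

print(q_learning().round(2))
```

Using the actually selected next action in the target instead of the max gives Sarsa; averaging the target over the policy gives Expected Sarsa; and maintaining two independent tables whose roles alternate in the max gives double Q-learning, which addresses the maximization bias listed at the end of this unit.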