Dissecting “Reinforcement Learning” by Richard S. Sutton with custom Python implementations, Episode V
In our previous post, we wrapped up the introductory series on fundamental reinforcement learning (RL) techniques by exploring Temporal-Difference (TD) learning. TD methods merge the strengths of Dynamic Programming (DP) and Monte Carlo (MC) methods, combining their best features to form some of the most important RL algorithms, such as Q-learning.
Building on that foundation, this post delves into n-step TD learning, a versatile approach introduced in Chapter 7 of Sutton’s book [1]. This method bridges the gap between classical TD and MC techniques. Like TD, n-step methods use bootstrapping (leveraging prior estimates), but they also incorporate the next n rewards, offering a unique blend of short-term and long-term learning. In a future post, we’ll generalize this idea even further with eligibility traces.
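To make the idea concrete, here is a minimal sketch of how an n-step return could be computed: the first n observed rewards are discounted and summed, and the value estimate of the state reached after n steps is used to bootstrap the rest. The function name and signature are illustrative assumptions, not the book’s pseudocode or the repository’s actual code.

```python
def n_step_return(rewards, v_bootstrap, gamma, n):
    """Illustrative n-step return G_{t:t+n}.

    rewards:      the next n observed rewards [R_{t+1}, ..., R_{t+n}]
    v_bootstrap:  current value estimate V(S_{t+n}) used for bootstrapping
    gamma:        discount factor
    n:            number of real rewards to include before bootstrapping
    """
    # Discounted sum of the first n real rewards.
    g = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    # Bootstrap with the estimated value of the state reached after n steps.
    return g + gamma**n * v_bootstrap

# Example: with n = 3, three real rewards are used before bootstrapping.
print(n_step_return([1.0, 0.0, 2.0], v_bootstrap=0.5, gamma=0.9, n=3))
```

Setting n = 1 recovers the classical TD target, while letting n cover the whole episode (with no bootstrapping term) recovers the Monte Carlo return.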
We’ll follow a structured approach, starting with the prediction problem before moving on to control. Along the way, we’ll:
- Introduce n-step Sarsa,
- Extend it to off-policy learning,
- Explore the n-step tree backup algorithm, and
- Present a unifying perspective with n-step Q(σ).
As always, you can find all accompanying code on GitHub. Let’s dive in!