Indeed, RL provides useful solutions to a variety of sequential decision-making problems. Temporal-Difference Learning (TD learning) methods are a popular subset of RL algorithms. TD learning methods combine key aspects of Monte Carlo and Dynamic Programming methods to accelerate learning without requiring a perfect model of the environment dynamics.
In this article, we'll compare different kinds of TD algorithms in a custom Grid World. The design of the experiment highlights the importance of continual exploration as well as the individual characteristics of the tested algorithms: Q-learning, Dyna-Q, and Dyna-Q+.
The outline of this post contains:
- Description of the environment
- Temporal-Difference (TD) Learning
- Model-free TD methods (Q-learning) and model-based TD methods (Dyna-Q and Dyna-Q+)
- Parameters
- Performance comparisons
- Conclusion
The full code allowing the reproduction of the results and the plots is available here: https://github.com/RPegoud/Temporal-Difference-learning
The Environment
The environment we'll use in this experiment is a grid world with the following features:
- The grid is 12 by 8 cells.
- The agent starts in the bottom-left corner of the grid; the goal is to reach the treasure located in the top-right corner (a terminal state with reward 1).
- The blue portals are connected: going through the portal located on cell (10, 6) leads to cell (11, 0). The agent cannot take the portal again after its first transition.
- The purple portal only appears after 100 episodes but allows the agent to reach the treasure faster. This encourages continually exploring the environment.
- The red portals are traps (terminal states with reward 0) and end the episode.
- Bumping into a wall causes the agent to remain in the same state.

This experiment aims to compare the behaviour of Q-learning, Dyna-Q, and Dyna-Q+ agents in a changing environment. Indeed, after 100 episodes, the optimal policy is bound to change and the optimal number of steps during a successful episode will decrease from 17 to 12.
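As a rough, hypothetical sketch (the coordinate convention and the unspecified positions below are illustrative assumptions, not the repository's actual implementation), the layout described above could be encoded as a handful of constants:

```python
# Hypothetical encoding of the grid world described above, using (column, row)
# coordinates with (0, 0) at the bottom-left corner.
GRID_WIDTH, GRID_HEIGHT = 12, 8
START_STATE = (0, 0)                 # bottom-left corner
TREASURE_STATE = (11, 7)             # top-right corner, terminal state, reward 1
BLUE_PORTAL = {(10, 6): (11, 0)}     # one-way link, usable once per episode
PURPLE_PORTAL_EPISODE = 100          # the purple shortcut only appears after 100 episodes
TRAP_STATES = set()                  # red portals (terminal, reward 0); positions omitted here
```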

Introduction to Temporal-Difference Learning:
Temporal-Difference Learning is a combination of Monte Carlo (MC) and Dynamic Programming (DP) methods:
- Like MC methods, TD methods can learn from experience without requiring a model of the environment's dynamics.
- Like DP methods, TD methods update estimates after every step based on other learned estimates without waiting for the final outcome (this is called bootstrapping).
One particularity of TD methods is that they update their value estimates every time step, as opposed to MC methods that wait until the end of an episode.
Indeed, both methods have different update targets. MC methods aim to update towards the return Gt, which is only available at the end of an episode. Instead, TD methods target:

Rt+1 + γV(St+1)
Where V is an estimate of the true value function Vπ.
TD methods therefore combine the sampling of MC (by using an estimate of the true value) and the bootstrapping of DP (by updating V based on estimates that rely on further estimates).
The simplest version of temporal-difference learning is called TD(0), or one-step TD; a practical implementation of TD(0) could look like the following sketch.
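This minimal Python version is a sketch of tabular TD(0) prediction, not the repository's implementation; it assumes a hypothetical Gym-style environment exposing reset() and step(action), and a fixed policy function:

```python
from collections import defaultdict

def td0_prediction(env, policy, n_episodes, alpha=0.25, gamma=0.9):
    """Tabular TD(0) policy evaluation (sketch, hypothetical env interface)."""
    V = defaultdict(float)  # value estimates, initialised to 0
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step TD update: move V(S) towards the target R + γ·V(S')
            td_target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```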

When transitioning from a state S to a new state S', the TD(0) algorithm computes a backed-up value and updates V(S) accordingly. This backed-up value is called the TD error: the difference between the observed reward R plus the discounted value of the new state γV(St+1) and the current value estimate V(S):

δt = Rt+1 + γV(St+1) − V(St)
In conclusion, TD methods present several advantages:
- They do not require a perfect model of the environment's dynamics p
- They are implemented in an online fashion, updating the target after each time step
- TD(0) is guaranteed to converge for any fixed policy π if α (the learning rate or step size) follows stochastic approximation conditions (for more detail, see page 55, "Tracking a Nonstationary Problem", of [4]), as recalled below
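For reference, these stochastic approximation conditions (the Robbins-Monro conditions) require the sequence of step sizes αn to satisfy ∑ αn = ∞ and ∑ αn² < ∞. Note that a constant step size such as the α = 0.25 used below does not satisfy the second condition, which is precisely what lets the estimates keep tracking a non-stationary problem like this changing grid world.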
Implementation details:
The following sections explore the main characteristics of several TD algorithms and their performances on the grid world.
The same parameters were used for all models for the sake of simplicity (they are grouped in the sketch after this list):
- Epsilon (ε) = 0.1: probability of selecting a random action in ε-greedy policies
- Gamma (γ) = 0.9: discount factor applied to future rewards or value estimates
- Alpha (α) = 0.25: learning rate restricting the size of the Q-value updates
- Planning steps = 100: for Dyna-Q and Dyna-Q+, the number of planning steps executed for each direct interaction
- Kappa (κ) = 0.001: for Dyna-Q+, the weight of the bonus rewards applied during planning steps
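Grouped together, these shared hyperparameters could be sketched as a simple dictionary (an illustrative structure, not the repository's actual configuration):

```python
# Shared hyperparameters, identical for all agents in this experiment
PARAMS = {
    "epsilon": 0.1,         # ε: exploration rate of the ε-greedy policy
    "gamma": 0.9,           # γ: discount factor
    "alpha": 0.25,          # α: learning rate / step size
    "planning_steps": 100,  # n: Dyna-Q and Dyna-Q+ only
    "kappa": 0.001,         # κ: Dyna-Q+ only, weight of the exploration bonus
}
```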
The performances of each algorithm are first presented for a single run of 400 episodes (sections: Q-learning, Dyna-Q, and Dyna-Q+) and then averaged over 100 runs of 250 episodes in the "Summary and Algorithms Comparison" section.
Q-learning
The first algorithm we implement here is the famous Q-learning (Watkins, 1989):

Q(St, At) ← Q(St, At) + α[Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]
Q-learning is called an off-policy algorithm, as its goal is to approximate the optimal value function directly, instead of the value function of π, the policy followed by the agent.
In practice, Q-learning still relies on a policy, often called the 'behaviour policy', to select which state-action pairs are visited and updated. However, Q-learning is off-policy because it updates its Q-values based on the best estimate of future rewards, regardless of whether the selected actions follow the current policy π.
Compared to the previous TD learning pseudo-code, there are three main differences:
- We need to initialize the Q function for all states and actions, and Q(terminal) should be 0
- The actions are chosen from a policy based on the Q values (for instance the ε-greedy policy with respect to the Q values)
- The update targets the action-value function Q rather than the state-value function V (see the sketch below)
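These differences translate into a short update function. The sketch below assumes Q is a table mapping each state to a NumPy array of action values, initialised to zero so that terminal states keep a value of 0 (the number of actions is an assumption):

```python
import numpy as np
from collections import defaultdict

N_ACTIONS = 4  # up, down, left, right (assumption)
Q = defaultdict(lambda: np.zeros(N_ACTIONS))

def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.25, gamma=0.9):
    """One tabular Q-learning update (sketch)."""
    # Off-policy target: bootstrap on the greedy (max) action value of the next state
    target = reward + gamma * np.max(Q[next_state]) * (not done)
    Q[state][action] += alpha * (target - Q[state][action])
```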

Now that we have our first learning algorithm ready for testing, we can start the training phase. Our agent will navigate the Grid World using its ε-greedy policy with respect to the Q values. This policy selects the action with the highest Q-value with probability (1 − ε) and chooses a random action with probability ε. After each action, the agent updates its Q-value estimates.
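As a quick sketch, ε-greedy action selection with respect to the Q values can be written as follows (again a hypothetical helper, not the repository's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """ε-greedy action selection with respect to the current Q values (sketch)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: random action, probability ε
    return int(np.argmax(Q[state]))          # exploit: greedy action, probability 1 - ε
```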
We can visualize the evolution of the estimated maximum action value Q(S, a) of each cell of the Grid World using a heatmap. Here the agent plays 400 episodes. As there is only one update per step, the evolution of the Q values is quite slow and a large part of the states remains unmapped:

Upon completion of the 400 episodes, an analysis of the total number of visits to each cell provides us with a decent estimate of the agent's average route. As depicted on the right-hand plot below, the agent seems to have converged to a sub-optimal route, avoiding cell (4, 4) and consistently following the lower wall.

As a result of this sub-optimal strategy, the agent needs a minimum of 21 steps per episode, following the path outlined in the "number of total visits" plot. Variations in step counts can be attributed to the ε-greedy policy, which introduces a 10% chance of random actions. Given this policy, following the lower wall is a decent strategy to limit the potential disruptions caused by random actions.

In conclusion, the Q-learning agent converged to a sub-optimal strategy, as mentioned previously. Moreover, a portion of the environment remains unexplored by the Q-function, which prevents the agent from finding the new optimal path when the purple portal appears after the 100th episode.
These performance limitations can be attributed to the relatively low number of training episodes (400), which limits the possibilities of interaction with the environment, and to the exploration induced by the ε-greedy policy.
Planning, an essential component of model-based reinforcement learning methods, is particularly useful to improve sample efficiency and the estimation of action values. Dyna-Q and Dyna-Q+ are good examples of TD algorithms incorporating planning steps.
Dyna-Q
The Dyna-Q algorithm (Dynamic Q-learning) is a combination of model-based RL and TD learning.
Model-based RL algorithms rely on a model of the environment to incorporate planning as their primary way of updating value estimates. In contrast, model-free algorithms rely on direct learning.
"A model of the environment is anything that an agent can use to predict how the environment will respond to its actions" — Reinforcement Learning: An Introduction.
Within the scope of this article, the model can be seen as an approximation of the transition dynamics p(s', r|s, a). Here, p returns a single next-state and reward pair given the current state-action pair.
In environments where p is stochastic, we distinguish between distribution models and sample models: the former returns a distribution of the next states and rewards while the latter returns a single pair, sampled from the estimated distribution.
Models are especially useful to simulate episodes, and therefore to train the agent by replacing real-world interactions with planning steps, i.e. interactions with the simulated environment.
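In this deterministic grid world, a sample model can be as simple as a lookup table filled from real transitions; the structure below is an assumption for illustration, not the repository's implementation:

```python
import random

# Deterministic tabular model: (state, action) -> (reward, next_state)
model = {}

def update_model(state, action, reward, next_state):
    """Model learning: memorise the observed transition for later planning."""
    model[(state, action)] = (reward, next_state)

def sample_model():
    """Sample a previously observed (state, action) pair and its predicted outcome."""
    state, action = random.choice(list(model.keys()))
    reward, next_state = model[(state, action)]
    return state, action, reward, next_state
```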
Agents implementing the Dyna-Q algorithm are part of the class of planning agents: agents that combine direct reinforcement learning and model learning. They use direct interactions with the environment to update their value function (as in Q-learning) and also to learn a model of the environment. After each direct interaction, they can additionally perform planning steps to update their value function using simulated interactions.
A quick chess example
Imagine playing a good game of chess. After playing each move, your opponent's response allows you to assess the quality of your move. This is similar to receiving a positive or negative reward, which allows you to "update" your strategy. If your move leads to a blunder, you probably wouldn't play it again, given the same configuration of the board. So far, this is comparable to direct reinforcement learning.
Now let's add planning to the mix. Imagine that after each of your moves, while your opponent is thinking, you mentally go back over each of your previous moves to reassess their quality. You might find weaknesses that you neglected at first sight or find out that specific moves were better than you thought. These thoughts may also allow you to update your strategy. This is exactly what planning is about: updating the value function without interacting with the real environment but rather with a model of said environment.

Dyna-Q therefore contains some additional steps compared to Q-learning:
- After each direct update of the Q values, the model stores the state-action pair as well as the reward and next-state that were observed. This step is called model training.
- After model training, Dyna-Q performs n planning steps:
- A random state-action pair is selected from the model buffer (i.e. this state-action pair was observed during direct interactions)
- The model generates the simulated reward and next-state
- The value function is updated using the simulated observation (s, a, r, s'), as sketched in the code below
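Putting these pieces together, one full Dyna-Q step could look like the following sketch, reusing the hypothetical epsilon_greedy, q_learning_update, update_model, and sample_model helpers sketched earlier (terminal handling during planning is omitted, since the values of terminal states stay at 0):

```python
def dyna_q_step(Q, env, state, n_planning_steps=100, epsilon=0.1):
    """One direct interaction followed by n planning updates (sketch)."""
    action = epsilon_greedy(Q, state, n_actions=len(Q[state]), epsilon=epsilon)
    next_state, reward, done = env.step(action)

    # (a) Direct RL: update Q from the real transition
    q_learning_update(Q, state, action, reward, next_state, done)
    # (b) Model learning: store the observed transition
    update_model(state, action, reward, next_state)
    # (c) Planning: replay n simulated transitions sampled from the model
    for _ in range(n_planning_steps):
        s, a, r, s_next = sample_model()
        q_learning_update(Q, s, a, r, s_next, done=False)

    return next_state, done
```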

We now replicate the learning process with the Dyna-Q algorithm, using n = 100. This means that after each direct interaction with the environment, we use the model to perform 100 planning steps (i.e. updates).
The following heatmap shows the fast convergence of the Dyna-Q model. In fact, it only takes the algorithm around 10 episodes to find an optimal path. This is due to the fact that every step leads to 101 updates of the Q values (instead of 1 for Q-learning).

Another benefit of planning steps is a better estimation of action values across the grid. As the indirect updates target random transitions stored inside the model, states that are far away from the goal also get updated.
In contrast, the action values only slowly propagate from the goal in Q-learning, leading to an incomplete mapping of the grid.

Using Dyna-Q, we find an optimal path allowing the resolution of the grid world in 17 steps, as depicted by the red bars on the plot below. Optimal performance is attained regularly, despite the occasional interference of ε-greedy actions for the sake of exploration.
Finally, while Dyna-Q may appear more convincing than Q-learning due to its incorporation of planning, it's essential to remember that planning introduces a tradeoff between computational cost and real-world exploration.

Dyna-Q+
So far, neither of the tested algorithms managed to find the optimal path that appears after episode 100 (the purple portal). Indeed, both algorithms rapidly converged to an optimal solution that remained fixed until the end of the training phase. This highlights the need for continual exploration throughout training.
Dyna-Q+ is essentially similar to Dyna-Q but adds a small twist to the algorithm. Indeed, Dyna-Q+ constantly tracks the number of time steps elapsed since each state-action pair was last tried in real interaction with the environment.
In particular, consider a transition yielding a reward r that has not been tried in τ time steps. Dyna-Q+ performs planning as if the reward for this transition was r + κ√τ, with κ small enough (0.001 in this experiment).
This change in reward design encourages the agent to continually explore the environment. It assumes that the longer a state-action pair hasn't been tried, the greater the chance that the dynamics of this pair have changed or that the model of it is incorrect.
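A minimal sketch of how this bonus could be applied during planning (the tau table and helper below are assumptions about one possible implementation):

```python
import math

# tau[(state, action)]: time steps elapsed since this pair was last tried for real
tau = {}

def bonus_reward(state, action, reward, kappa=0.001):
    """Dyna-Q+ planning reward: r + κ·√τ favours long-untried transitions."""
    elapsed = tau.get((state, action), 0)
    return reward + kappa * math.sqrt(elapsed)
```

In this sketch, every real time step would increment all counters in tau and reset the counter of the state-action pair that was actually taken.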

As depicted by the following heatmap, Dyna-Q+ is much more active with its updates compared to the previous algorithms. Before episode 100, the agent explores the whole grid and finds the blue portal and the first optimal route.
The action values for the rest of the grid decrease before slowly rising again, as state-action pairs in the top-left corner are not explored for a while.
As soon as the purple portal appears in episode 100, the agent finds the new shortcut and the value of the whole area rises. Until the completion of the 400 episodes, the agent keeps updating the action value of each state-action pair while maintaining occasional exploration of the grid.

Thanks to the bonus added to the model rewards, we finally obtain a full mapping of the Q function (each state or cell has an action value).
Combined with continual exploration, the agent manages to find the new best route (i.e. the new optimal policy) as soon as it appears, while retaining the previous solution.

However, the exploration-exploitation trade-off in Dyna-Q+ does come with a cost. When state-action pairs have not been visited for a sufficiently long time, the exploration bonus encourages the agent to revisit those states, which can temporarily decrease its immediate performance. This exploration behaviour prioritizes updating the model in order to improve long-term decision-making.
This explains why some episodes played by Dyna-Q+ can be up to 70 steps long, compared to at most 35 and 25 steps for Q-learning and Dyna-Q, respectively. The longer episodes in Dyna-Q+ reflect the agent's willingness to invest extra steps in exploration to gather more information about the environment and refine its model, even if it results in short-term performance reductions.
In contrast, Dyna-Q+ regularly achieves optimal performance (depicted by the green bars on the plot below) that the previous algorithms did not attain.

Summary and Algorithms Comparison
In order to compare the key differences between the algorithms, we use two metrics (keep in mind that the results depend on the input parameters, which were identical for all models for the sake of simplicity):
- Number of steps per episode: this metric characterizes the rate of convergence of the algorithms towards an optimal solution. It also describes the behaviour of the algorithm after convergence, notably in terms of exploration.
- Average cumulative reward: the proportion of episodes leading to a positive reward
Analyzing the number of steps per episode (see the plot below) reveals several aspects of model-based and model-free methods:
- Model-based efficiency: model-based algorithms (Dyna-Q and Dyna-Q+) tend to be more sample-efficient in this particular Grid World (a property that is also observed more generally in RL). This is because they can plan ahead using the learned model of the environment, which can lead to quicker convergence to near-optimal or optimal solutions.
- Q-learning convergence: Q-learning, while eventually converging to a near-optimal solution, requires more episodes (125) to do so. It's important to highlight that Q-learning performs only one update per step, which contrasts with the multiple updates performed by Dyna-Q and Dyna-Q+.
- Multiple updates: Dyna-Q and Dyna-Q+ execute 101 updates per step, which contributes to their faster convergence. However, the tradeoff for this sample efficiency is computational cost (see the runtime section in the table below).
- Complex environments: in more complex or stochastic environments, the advantage of model-based methods might diminish. Models can introduce errors or inaccuracies, which can lead to suboptimal policies. Therefore, this comparison should be seen as an overview of the strengths and weaknesses of the different approaches rather than a direct performance comparison.

We now introduce the average cumulative reward (ACR), which represents the proportion of episodes where the agent reaches the goal (as the reward is 1 for reaching the goal and 0 for triggering a trap). The ACR is then simply given by:

ACR = (1 / (N · K)) · Σk Σn Rn,k

with N the number of episodes (250), K the number of independent runs (100), and Rn,k the cumulative reward for episode n in run k.
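As a quick sketch, the ACR boils down to the mean of a (K × N) array of episode rewards:

```python
import numpy as np

def average_cumulative_reward(rewards):
    """rewards: array of shape (K runs, N episodes) holding 0/1 episode returns."""
    # Mean over runs and episodes = fraction of successful episodes
    return float(np.mean(rewards))
```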
Here's a breakdown of the performance of all algorithms:
- Dyna-Q converges rapidly and achieves the highest overall return, with an ACR of 87%. This means that it efficiently learns and reaches the goal in a significant portion of episodes.
- Q-learning also reaches a similar level of performance but requires more episodes to converge, which explains its slightly lower ACR, at 70%.
- Dyna-Q+ promptly finds a good policy, reaching a cumulative reward of 0.8 after only 15 episodes. However, the variability and exploration induced by the bonus reward reduce its performance until episode 100, after which it starts to improve as the agent discovers the new optimal path. This short-term exploration compromises its overall performance, resulting in an ACR of 79%, lower than Dyna-Q but higher than Q-learning.

Conclusion
In this article, we presented the fundamental concepts of Temporal-Difference learning and applied Q-learning, Dyna-Q, and Dyna-Q+ to a custom grid world. The design of this grid world helps emphasize the importance of continual exploration as a way to discover and exploit new optimal policies in changing environments. The difference in performances (evaluated using the number of steps per episode and the cumulative reward) illustrates the strengths and weaknesses of these algorithms.
In summary, model-based methods (Dyna-Q, Dyna-Q+) benefit from increased sample efficiency compared to model-free methods (Q-learning), at the cost of computational efficiency. However, in stochastic or more complex environments, inaccuracies in the model may hinder performance and lead to sub-optimal policies.

References:
[1] Demis Hassabis, AlphaFold reveals the structure of the protein universe (2022), DeepMind
[2] Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun & Davide Scaramuzza, Champion-level drone racing using deep reinforcement learning (2023), Nature
[3] Nathan Lambert, Louis Castricato, Leandro von Werra & Alex Havrilla, Illustrating Reinforcement Learning from Human Feedback (RLHF), Hugging Face
[4] Richard S. Sutton & Andrew G. Barto, Reinforcement Learning: An Introduction (2018), Cambridge (Mass.): The MIT Press
[5] Christopher J. C. H. Watkins & Peter Dayan, Q-learning (1992), Machine Learning, Springer Link