Temporal difference learning
Temporal difference (TD) learning refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic programming methods.[1]
While Monte Carlo methods only adjust their estimates once the outcome is known, TD methods adjust predictions to match later, more accurate, predictions about the future before the outcome is known.[2] This is a form of bootstrapping, as illustrated with the following example:
Suppose you wish to predict the weather for Saturday, and you have some model that predicts Saturday's weather, given the weather of each day in the week. In the standard case, you would wait until Saturday and then adjust all your models. However, when it is, for example, Friday, you should have a pretty good idea of what the weather would be on Saturday – and thus be able to change, say, Saturday's model before Saturday arrives.[2]
Temporal difference methods are related to the temporal difference model of animal learning.[3][4][5][6][7]
Mathematical formulation
The tabular TD(0) method is one of the simplest TD methods. It is a special case of more general stochastic approximation methods. It estimates the state value function of a finite-state Markov decision process (MDP) under a policy π. Let V^π denote the state value function of the MDP with states S_t, rewards R_t and discount rate[8] γ under the policy π:[9]

V^π(s) = E_π[ ∑_{t=0}^∞ γ^t R_{t+1} | S_0 = s ]

We drop the action from the notation for convenience. V^π satisfies the Hamilton-Jacobi-Bellman equation:

V^π(s) = E_π[ R_1 + γ V^π(S_1) | S_0 = s ]

so R_1 + γ V^π(S_1) is an unbiased estimate for V^π(s). This observation motivates the following algorithm for estimating V^π.

The algorithm starts by initializing a table V(s) arbitrarily, with one value for each state of the MDP. A positive learning rate α is chosen.

We then repeatedly evaluate the policy π, obtain a reward r and update the value function for the current state using the rule:[10]

V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) - V(S_t) ]

where S_t and S_{t+1} are the current and next states, respectively. The value R_{t+1} + γ V(S_{t+1}) is known as the TD target, and R_{t+1} + γ V(S_{t+1}) - V(S_t) is known as the TD error.
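A minimal sketch of this tabular update in Python; the dictionary V and the values of alpha and gamma are illustrative choices rather than part of the method itself:

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # One tabular TD(0) step: move V[s] toward the TD target r + gamma * V[s_next].
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

# Example: a single observed transition s -> s_next with reward 1.0.
V = {"s": 0.0, "s_next": 0.5}
delta = td0_update(V, "s", 1.0, "s_next")
print(V["s"], delta)  # V["s"] moves a step of size alpha toward the target 1.45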
TD-Lambda
TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel.[11] This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players.[12]
The lambda (λ) parameter refers to the trace decay parameter, with 0 ≤ λ ≤ 1. Higher settings lead to longer-lasting traces; that is, a larger proportion of credit from a reward can be given to more distant states and actions when λ is higher, with λ = 1 producing learning that parallels Monte Carlo RL algorithms.[13]
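A small sketch, assuming example values for γ and λ, of how much credit a state visited k steps in the past receives when its trace decays by γλ per step:

gamma = 0.9
for lam in (0.0, 0.5, 0.9, 1.0):
    # Weight applied to the TD error for a state visited k steps ago: (gamma * lam) ** k
    weights = [(gamma * lam) ** k for k in range(5)]
    print(f"lambda={lam}: " + ", ".join(f"{w:.3f}" for w in weights))
# lambda = 0 credits only the most recently visited state (one-step TD), while
# lambda = 1 decays only by gamma, approaching Monte Carlo-style credit assignment.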
In neuroscience
The TD algorithm has also received attention in the field of neuroscience. Researchers discovered that the firing rate of dopamine neurons in the ventral tegmental area (VTA) and substantia nigra (SNc) appears to mimic the error function in the algorithm.[3][4][5][6][7] The error function reports back the difference between the estimated reward at any given state or time step and the actual reward received. The larger the error function, the larger the difference between the expected and actual reward. When this is paired with a stimulus that accurately reflects a future reward, the error can be used to associate the stimulus with the future reward.
Dopamine cells appear to behave in a similar manner. In one experiment, measurements of dopamine cells were made while training a monkey to associate a stimulus with the reward of juice.[14] Initially the dopamine cells increased firing rates when the monkey received juice, indicating a difference in expected and actual rewards. Over time this increase in firing propagated back to the earliest reliable stimulus for the reward. Once the monkey was fully trained, there was no increase in firing rate upon presentation of the predicted reward. Subsequently, the firing rate for the dopamine cells decreased below normal activation when the expected reward was not produced. This closely mimics how the error function in TD is used for reinforcement learning.
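The back-propagation of the error signal can be reproduced with a toy TD(0) model of a conditioning trial; the trial length, reward time, learning rate, and number of trials below are arbitrary illustrative choices, not parameters taken from the cited experiments:

import numpy as np

REWARD_T, alpha, gamma = 6, 0.2, 1.0   # reward delivered 6 steps after stimulus onset (arbitrary)
V = np.zeros(REWARD_T + 2)             # value of each in-trial time step; the final entry stays 0

def run_trial(V):
    # TD error at stimulus onset: transition from an unlearnable pre-trial baseline (value 0).
    delta_stim = gamma * V[0] - 0.0
    deltas = np.zeros(REWARD_T + 1)
    for t in range(REWARD_T + 1):
        r = 1.0 if t == REWARD_T else 0.0
        deltas[t] = r + gamma * V[t + 1] - V[t]
        V[t] += alpha * deltas[t]
    return delta_stim, deltas[REWARD_T]

for _ in range(300):
    d_stim, d_reward = run_trial(V)

print(f"after training: TD error at stimulus ~ {d_stim:.2f}, at reward ~ {d_reward:.2f}")
# Early in training the error is large at reward time; after training it has moved
# back to the stimulus, qualitatively matching the dopamine recordings described above.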
The relationship between the model and potential neurological function has produced research attempting to use TD to explain many aspects of behavioral research.[15][16] It has also been used to study conditions such as schizophrenia or the consequences of pharmacological manipulations of dopamine on learning.[17]
Notes
- ^ Sutton & Barto (2018), p. 133.
- ^ a b Sutton, Richard S. (1 August 1988). "Learning to predict by the methods of temporal differences". Machine Learning. 3 (1): 9–44. doi:10.1007/BF00115009. ISSN 1573-0565. S2CID 207771194.
- ^ a b Schultz, W, Dayan, P & Montague, PR. (1997). "A neural substrate of prediction and reward". Science. 275 (5306): 1593–1599. CiteSeerX 10.1.1.133.6176. doi:10.1126/science.275.5306.1593. PMID 9054347. S2CID 220093382.
- ^ a b Montague, P. R.; Dayan, P.; Sejnowski, T. J. (1996-03-01). "A framework for mesencephalic dopamine systems based on predictive Hebbian learning" (PDF). The Journal of Neuroscience. 16 (5): 1936–1947. doi:10.1523/JNEUROSCI.16-05-01936.1996. ISSN 0270-6474. PMC 6578666. PMID 8774460.
- ^ a b Montague, P.R.; Dayan, P.; Nowlan, S.J.; Pouget, A.; Sejnowski, T.J. (1993). "Using aperiodic reinforcement for directed self-organization" (PDF). Advances in Neural Information Processing Systems. 5: 969–976.
- ^ a b Montague, P. R.; Sejnowski, T. J. (1994). "The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms". Learning & Memory. 1 (1): 1–33. doi:10.1101/lm.1.1.1. ISSN 1072-0502. PMID 10467583. S2CID 44560099.
- ^ a b Sejnowski, T.J.; Dayan, P.; Montague, P.R. (1995). "Predictive Hebbian learning". Proceedings of the eighth annual conference on Computational learning theory - COLT '95. pp. 15–18. doi:10.1145/225298.225300. ISBN 0897917235. S2CID 1709691.
- ^ Discount rate parameter allows for a time preference toward more immediate rewards, and away from distant future rewards
- ^ Sutton & Barto (2018), p. 134.
- ^ Sutton & Barto (2018), p. 135.
- ^ Sutton & Barto (2018), p. 130?.
- ^ Tesauro (1995).
- ^ Sutton & Barto (2018), p. 175.
- ^ Schultz, W. (1998). "Predictive reward signal of dopamine neurons". Journal of Neurophysiology. 80 (1): 1–27. CiteSeerX 10.1.1.408.5994. doi:10.1152/jn.1998.80.1.1. PMID 9658025. S2CID 52857162.
- ^ Dayan, P. (2001). "Motivated reinforcement learning" (PDF). Advances in Neural Information Processing Systems. 14. MIT Press: 11–18. Archived from the original (PDF) on 2012-05-25. Retrieved 2009-03-03.
- ^ Tobia, M. J.; et al. (2016). "Altered behavioral and neural responsiveness to counterfactual gains in the elderly". Cognitive, Affective, & Behavioral Neuroscience. 16 (3): 457–472. doi:10.3758/s13415-016-0406-7. PMID 26864879. S2CID 11299945.
- ^ Smith, A.; Li, M.; Becker, S.; Kapur, S. (2006). "Dopamine, prediction error, and associative learning: a model-based account". Network: Computation in Neural Systems. 17 (1): 61–84. doi:10.1080/09548980500361624. PMID 16613795. S2CID 991839.
Works cited
- Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement Learning: An Introduction (2nd ed.). Cambridge, MA: MIT Press.
- Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3): 58–68. doi:10.1145/203330.203343. S2CID 6023746.
Further reading
- Meyn, S. P. (2007). Control Techniques for Complex Networks. Cambridge University Press. ISBN 978-0521884419. See final chapter and appendix.
- Sutton, R. S.; Barto, A. G. (1990). "Time Derivative Models of Pavlovian Reinforcement" (PDF). Learning and Computational Neuroscience: Foundations of Adaptive Networks: 497–537. Archived from the original (PDF) on 2017-03-30. Retrieved 2017-03-29.
External links
- Connect Four TDGravity Applet Archived 2012-07-24 at the Wayback Machine (+ mobile phone version) – self-learned using TD-Leaf method (combination of TD-Lambda with shallow tree search)
- Self Learning Meta-Tic-Tac-Toe Archived 2014-03-19 at the Wayback Machine Example web app showing how temporal difference learning can be used to learn state evaluation constants for a minimax AI playing a simple board game.
- Reinforcement Learning Problem, document explaining how temporal difference learning can be used to speed up Q-learning
- TD-Simulator Temporal difference simulator for classical conditioning
Temporal difference learning
Introduction
Definition and Overview
Temporal difference (TD) learning is a foundational class of model-free reinforcement learning methods that learns value functions by updating estimates based on the differences between temporally successive predictions, rather than depending on complete episodes or final outcomes.[1] This approach allows agents to refine their understanding of expected future rewards directly from raw experience in an environment, without requiring an explicit model of its dynamics.[2] By focusing on incremental adjustments driven by prediction errors, TD learning bridges concepts from Monte Carlo evaluation and dynamic programming, enabling efficient adaptation in sequential decision-making tasks.[2]

Central to TD learning are the core elements of the reinforcement learning framework: states, which capture the agent's current situation in the environment; actions, the possible choices available in each state; and rewards, the immediate scalar feedback signaling progress or setback after an action.[2] The method approximates value functions to guide behavior, including the state-value function V^π(s), which estimates the expected cumulative reward starting from state s under a given policy, and the action-value function Q^π(s, a), which estimates the expected return for selecting action a in state s and thereafter following the policy.[2] At a high level, TD updates propagate these prediction errors backward through time by bootstrapping from the current estimate of the next state or action's value, combined with the immediate reward, to incrementally correct earlier predictions and improve overall accuracy.[1]

A simple illustration of TD learning's incremental process appears in a random walk scenario, resembling a one-dimensional gridworld with five non-terminal states labeled A through E, flanked by terminal positions left of A (reward 0) and right of E (reward +1), where the agent begins at C, moving left or right with equal probability.[2] As the agent experiences transitions, such as from C to B or D, and receives rewards, TD learning adjusts value estimates step by step; for instance, the value of C gradually converges toward 0.5, while A approaches 0.167, B 0.333, D 0.667, and E 0.833, reflecting the probabilities of eventually reaching the rewarding terminal state, all without waiting for full episodes to complete.[2]

TD learning provides key advantages over dynamic programming techniques, including the ability to learn online by updating estimates in real time as interactions occur, rather than in offline batches.[2] It also demonstrates superior sample efficiency, achieving reliable value approximations with fewer environmental interactions and lower computational demands, as it avoids the need for exhaustive state evaluations or a complete transition model.[1] These features position TD learning as a versatile tool for practical reinforcement learning applications in uncertain, dynamic settings.[2]
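A compact sketch of the five-state random walk just described, using tabular TD(0); the step size, the number of episodes, and the 0.5 initialization are illustrative assumptions:

import random

states = ["A", "B", "C", "D", "E"]       # terminal positions lie off both ends
V = {s: 0.5 for s in states}
alpha, gamma = 0.05, 1.0                 # undiscounted episodic task

for _ in range(2000):
    i = 2                                # every episode starts at C
    while 0 <= i < len(states):
        j = i + random.choice([-1, 1])
        reward = 1.0 if j == len(states) else 0.0          # +1 only when stepping off the right end
        bootstrap = V[states[j]] if 0 <= j < len(states) else 0.0
        V[states[i]] += alpha * (reward + gamma * bootstrap - V[states[i]])
        i = j

print({s: round(v, 2) for s, v in V.items()})
# Drifts toward the true values 1/6, 2/6, ..., 5/6 (about 0.17, 0.33, 0.5, 0.67, 0.83).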
Historical Development

Temporal difference (TD) learning originated from efforts to model prediction errors in animal conditioning, drawing heavily on the Rescorla-Wagner model proposed in 1972, which formalized how organisms update expectations based on discrepancies between predicted and actual outcomes. This model influenced early computational approaches to learning by emphasizing error-driven adjustments, a core idea later extended in temporal frameworks to handle sequential predictions.[4]

The formal invention of TD learning is credited to Richard S. Sutton during his 1984 PhD thesis at the University of Massachusetts Amherst, where he developed it as a component of adaptive critic architectures for solving temporal credit assignment problems in reinforcement learning.[5] Sutton's work built on these foundations to create methods that bootstrap value estimates from ongoing experience rather than complete episodes, enabling more efficient learning in dynamic environments. In his seminal 1988 paper, Sutton introduced TD methods as a general class of incremental prediction algorithms, separating them from control tasks and highlighting their applicability beyond immediate rewards.[1]

During the 1990s, TD learning gained prominence through its integration into practical reinforcement learning systems, most notably in Gerald Tesauro's TD-Gammon program, developed in the early 1990s, which achieved expert-level performance in backgammon by 1992 through self-play and updating value functions via TD errors, with subsequent versions demonstrating superhuman play.[6] This breakthrough illustrated TD's power in high-dimensional games, spurring wider adoption in the field and contributing to foundational texts like Sutton and Barto's 1998 book on reinforcement learning.[2]

Post-2000, TD methods evolved with the rise of deep reinforcement learning, contributing to advancements in systems like AlphaGo, which in 2016 defeated the top professional Go player Lee Sedol through self-play and neural network-based value estimation.[7] This marked a shift toward scalable, end-to-end learning in complex domains. In the 2020s, TD learning has seen extensions for multi-agent settings, such as decentralized adaptive algorithms over time-varying networks that enable coordinated learning without central communication, and in continual learning paradigms, exemplified by temporal-difference variational methods that mitigate catastrophic forgetting by recursively updating posteriors across tasks.[8][9] These developments, up to 2025, underscore TD's ongoing relevance in distributed and lifelong AI systems.

Core Principles
Prediction vs. Control in Reinforcement Learning
In reinforcement learning, the prediction problem focuses on estimating the value function associated with a given policy, where the value function V^π(s) or Q^π(s, a) quantifies the expected cumulative future rewards starting from a state or state-action pair while following policy π, without modifying the policy's behavior.[2] This task is central to evaluating how effective a fixed policy is in achieving long-term rewards, serving as a foundational step for understanding agent performance in stochastic environments.[2]

In contrast, the control problem seeks to determine the optimal policy π* that maximizes the expected rewards over time, leveraging value estimates to guide action selection and policy improvement.[2] Temporal difference learning addresses control by iteratively refining value functions to inform decisions, such as choosing actions that lead to states with higher estimated values, thereby converging toward optimality.[2]

Both prediction and control are framed within Markov decision processes (MDPs), which model the environment as a tuple (S, A, P, R, γ) consisting of state space S, action space A, transition probabilities P, reward function R, and discount factor γ that weights future rewards.[2] MDPs assume stationarity, meaning transition probabilities and rewards remain constant over time, enabling consistent learning despite uncertainty in outcomes.[2] Temporal difference methods bridge prediction and control by using bootstrapped value updates to refine estimates, which in turn support policy enhancements through mechanisms like greedy action selection.[2]

A simple illustrative example is a chain MDP with states arranged linearly (e.g., states 1 through 5, where state 5 is terminal with reward +1 and others yield 0), actions to move left or right (with absorbing barriers at the ends), and a discount factor γ. For prediction, policy evaluation under a fixed policy, such as always moving right from states 1-4, computes the value of each state as the discounted sum of future rewards, yielding decreasing values from state 4 backward. For control, policy improvement applies greedy selection based on these values, potentially shifting the policy to avoid suboptimal moves like unnecessary left actions, maximizing overall returns.[2]
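A brief sketch contrasting the two problems on a chain MDP like the one above; the discount factor, the zero-reward left terminal, and the "always move right" evaluation policy are assumptions made for illustration:

GAMMA = 0.9
TERMINAL_RIGHT, TERMINAL_LEFT = 4, -1            # states 0..3 are non-terminal

def step(s, action):
    s2 = s + (1 if action == "right" else -1)
    return s2, (1.0 if s2 == TERMINAL_RIGHT else 0.0)

# Prediction: evaluate the fixed "always right" policy by backing values up the chain.
V = {TERMINAL_RIGHT: 0.0, TERMINAL_LEFT: 0.0}
for s in range(3, -1, -1):
    s2, r = step(s, "right")
    V[s] = r + GAMMA * V[s2]
print({s: round(V[s], 2) for s in range(4)})      # decreasing values away from the goal

# Control: greedy improvement with respect to V picks the better action in each state.
policy = {}
for s in range(4):
    q = {}
    for a in ("left", "right"):
        s2, r = step(s, a)
        q[a] = r + GAMMA * V[s2]
    policy[s] = max(q, key=q.get)
print(policy)                                     # "right" everywhere: already optimal here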
Bootstrapping and Temporal Difference Error

Bootstrapping in temporal difference (TD) learning refers to the process of updating value estimates for states using other current estimates rather than complete sample returns from full episodes, allowing for incremental and online adjustments without requiring the environment to terminate.[10] This approach contrasts with Monte Carlo methods, which rely on averaging complete episode returns, by enabling immediate updates based on partial experience, thus facilitating learning in continuing or non-episodic tasks.[1] Introduced as a core efficiency mechanism in TD methods, bootstrapping leverages the existing value function to approximate future rewards, promoting faster convergence in sequential prediction problems.[1]

The temporal difference error, denoted δ_t, quantifies the discrepancy between the current value estimate and a bootstrapped target, serving as the primary signal for driving learning updates in TD methods.[10] For state-value prediction, it is computed as δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t), where R_{t+1} is the immediate reward, γ is the discount factor, V(S_{t+1}) is the estimated value of the next state, and V(S_t) is the current state's value estimate.[1] This error is available immediately after each transition, without needing to observe the full trajectory, and its sign indicates whether the current estimate under- or over-predicts the true value.[10]

The TD error enables temporal credit assignment by propagating value corrections backward through time steps, attributing the impact of rewards to earlier states and actions even when outcomes are delayed, without waiting for episode completion.[10] In environments with long horizons, this mechanism distributes the reinforcement signal efficiently, resolving the credit assignment problem inherent in reinforcement learning by linking distant events via successive prediction errors.[1] Unlike methods that require terminal rewards to backpropagate, TD updates occur at every step, making it suitable for ongoing interactions.[10]

While analogous to error-correction in supervised learning, such as the Rescorla-Wagner model where predictions are adjusted based on discrepancies, TD learning operates in an unsupervised manner within reinforcement learning, deriving targets from self-generated bootstraps rather than external labeled examples.[10] This unsupervised aspect emphasizes evaluative feedback from the environment's dynamics, without a teacher providing direct targets, highlighting TD's adaptation for sequential decision-making under uncertainty.[1]

Consider a simple looping environment where an agent starts in state A, transitions to state B with reward 0, and from B loops back to A with reward 1 on every cycle.[10] Initial value estimates might be V(A) = 0 and V(B) = 0; after the first loop, the TD error at B becomes δ = 1 + γ V(A) - V(B) = 1, updating V(B) toward 1, and then at A, δ = 0 + γ V(B) - V(A) = γ V(B), propagating the value backward to V(A).[10] Repeated transitions yield converging estimates, with V(A) and V(B) approaching the true discounted infinite sums, demonstrating how successive TD errors refine predictions in cyclic settings without episode ends.[1]
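A short numerical sketch of the looping A-B example, with assumed values for the step size and discount factor:

alpha, gamma = 0.1, 0.9
V = {"A": 0.0, "B": 0.0}

for _ in range(5000):
    V["A"] += alpha * (0.0 + gamma * V["B"] - V["A"])   # A -> B, reward 0
    V["B"] += alpha * (1.0 + gamma * V["A"] - V["B"])   # B -> A, reward 1

print(round(V["A"], 2), round(V["B"], 2))
# Converges to the discounted infinite sums V(A) = gamma / (1 - gamma^2) ≈ 4.74
# and V(B) = 1 / (1 - gamma^2) ≈ 5.26.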
Mathematical Formulation
TD(0) for Value Prediction
Temporal difference learning includes methods for predicting state values under a given policy, with TD(0) being the simplest form that performs one-step updates.[2] In TD(0), the value function V^π(s) estimates the expected discounted return starting from state s following policy π, defined as

V^π(s) = E_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ],

where γ is the discount factor (0 ≤ γ ≤ 1) and rewards are observed along the trajectory.[2] The core update bootstraps the current estimate using the immediate reward and the estimated value of the successor state, enabling online learning without waiting for episode completion.[2]

The TD(0) update rule is

V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) - V(S_t) ],

where α (0 < α ≤ 1) is the learning rate, and the term in brackets is the temporal difference (TD) error, δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t), measuring the discrepancy between the predicted and observed value.[2] This update adjusts the value of the current state toward the target R_{t+1} + γ V(S_{t+1}), blending actual experience with bootstrapped estimates.[2]

For episodic tasks, where episodes terminate in a finite number of steps, terminal states have a fixed value of zero and are not updated. The TD(0) algorithm initializes V(s) arbitrarily for all non-terminal states (e.g., to zero) and proceeds as follows; a Python sketch of this loop appears after the list:

1. For each episode, initialize S (e.g., a starting state).
2. Repeat until S is terminal:
   - Take action A according to policy π, and observe R and S'.
   - Update: V(S) ← V(S) + α [ R + γ V(S') - V(S) ].
   - Set S ← S'.
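A minimal Python sketch of the listed steps; it assumes a generic environment object exposing reset() and step(action) methods and a caller-supplied policy function, in the style of common RL toolkits rather than any specific library:

from collections import defaultdict

def td0_prediction(env, policy, episodes=500, alpha=0.1, gamma=0.99):
    # Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    V = defaultdict(float)                      # arbitrary initialization: 0 for every state
    for _ in range(episodes):
        s = env.reset()                         # initialize S
        done = False
        while not done:                         # repeat until S is terminal
            a = policy(s)                       # take action A according to pi
            s_next, r, done = env.step(a)       # observe R and S'
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])     # TD(0) update
            s = s_next                          # S <- S'
    return V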
TD Methods for Control: Expected SARSA and Q-Learning
Temporal difference (TD) methods for control extend the prediction framework to select actions in reinforcement learning environments, focusing on action-value functions Q^π(s, a) that estimate the expected return starting from state s, taking action a, and following a policy π thereafter. These methods learn optimal policies by updating Q-values based on the TD error, balancing exploration and exploitation to solve Markov decision processes. Unlike passive value prediction, control involves active policy improvement, often using greedy or soft-max policies for action selection.

Q-learning is an off-policy TD control algorithm that learns the optimal action-value function independently of the agent's behavior policy. The update rule is

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) ],

where α is the learning rate, γ is the discount factor, R_{t+1} is the reward, and the max operator selects the best action in the next state. This bootstrap uses the maximum Q-value over all actions, assuming optimal future behavior, enabling off-policy learning where the target policy (greedy) differs from the behavior policy (e.g., ε-greedy for exploration). Q-learning converges to the optimal Q* under standard conditions like infinite visits to state-action pairs.[11]

SARSA, an on-policy TD control method, updates Q-values based on actions actually selected by the current policy, making it sensitive to exploration strategies. The standard SARSA update incorporates the next action A_{t+1} sampled from the policy π:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) ]

This ensures the learned Q aligns with the policy's behavior, including exploratory actions like those from an ε-greedy policy, where the agent chooses the best action with probability 1 - ε and a random action otherwise. SARSA converges to the Q-values of the policy under similar conditions to Q-learning.[12]

Expected SARSA addresses variance in standard SARSA by using the expected value over the policy's action distribution instead of a single sample:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ ∑_{a'} π(a' | S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t) ]

This on-policy variant reduces estimation variance while maintaining policy evaluation, often leading to more stable learning, especially with softmax policies. It converges faster than standard SARSA in some environments due to the lower variance of its update target.[13]

The key difference between Q-learning and (Expected) SARSA lies in their policy dependence: Q-learning's off-policy nature allows decoupling learning from exploration, promoting aggressive value estimates, while SARSA's on-policy updates incorporate exploration directly, yielding more conservative policies. Both typically employ ε-greedy exploration, with ε decaying over time to shift toward exploitation.

In episodic settings, Q-learning pseudocode initializes Q(s, a) for all state-action pairs and repeats episodes until convergence:

For each episode:
    Initialize s
    While s is not terminal:
        Choose a from s using ε-greedy(Q)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
        s ← s'
    ε ← ε * decay (optional)
Expected SARSA follows the same episodic structure but selects the next action from the current policy and replaces the max over next-state actions with the expectation under that policy:

For each episode:
    Initialize s
    Choose a from s using policy π (e.g., ε-greedy(Q))
    While s is not terminal:
        Take action a, observe r, s'
        Compute expected_Q = ∑_{a'} π(a'|s') Q(s', a')
        Q(s, a) ← Q(s, a) + α [r + γ expected_Q - Q(s, a)]
        Choose a' from s' using π
        s ← s'; a ← a'
    ε ← ε * decay (optional)
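As a concrete companion to the pseudocode above, here is a minimal tabular implementation of both update targets; the reset()/step() environment interface and the hyperparameter defaults are assumptions for illustration:

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def td_control(env, actions, method="q_learning", episodes=1000,
               alpha=0.1, gamma=0.99, eps=0.1):
    # Tabular Q-learning or Expected SARSA with an epsilon-greedy behavior policy.
    # Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done).
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)
            s2, r, done = env.step(a)
            if done:
                target = r
            elif method == "q_learning":        # off-policy: bootstrap on the greedy max
                target = r + gamma * max(Q[(s2, b)] for b in actions)
            else:                               # expected SARSA: expectation under eps-greedy
                best = max(actions, key=lambda b: Q[(s2, b)])
                probs = {b: eps / len(actions) + (1 - eps) * (b == best) for b in actions}
                target = r + gamma * sum(p * Q[(s2, b)] for b, p in probs.items())
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q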
Advanced Techniques
TD(λ) and Multi-Step Prediction
Temporal difference learning with parameter λ, denoted TD(λ), generalizes the one-step TD(0) method by incorporating multi-step predictions, allowing for a flexible trade-off between bias and variance in value function estimation. Introduced by Richard Sutton, this approach unifies one-step TD methods and full Monte Carlo evaluation through a single parameter λ ∈ [0, 1], enabling more efficient learning in environments with varying episode lengths or prediction horizons.[1]

Central to TD(λ) is the λ-return, a weighted average of n-step returns that blends short- and long-term predictions:

G_t^λ = (1 - λ) ∑_{n=1}^∞ λ^{n-1} G_t^{(n)},

where G_t^{(n)} denotes the n-step return starting from time t, defined as the discounted sum of rewards over n steps followed by the current value estimate for bootstrapping. This formulation, derived from the forward-view perspective, serves as the target for updating the value function V(s_t) toward G_t^λ using a learning rate α, as in the update ΔV(s_t) = α (G_t^λ - V(s_t)). The backward-view formulation equivalently achieves the same expected updates through incremental adjustments propagated over past states, weighted by λ.[1][2]

The parameter λ controls the method's behavior: when λ = 0, TD(λ) reduces to the one-step TD(0) method, relying solely on immediate bootstrapping with low variance but higher bias; when λ = 1, it becomes equivalent to Monte Carlo evaluation, using complete episode returns for unbiased but high-variance updates; intermediate values of λ provide a bias-variance balance, often accelerating convergence by incorporating partial multi-step lookahead without waiting for full episodes. Optimal λ depends on the task's temporal structure, with higher values favoring longer horizons in persistent environments.[1]

A refined version, true online TD(λ), addresses limitations in earlier online implementations by fully accounting for changes in value estimates during an episode, ensuring unbiased updates even in non-stationary settings; this algorithm, building on foundational TD(λ) principles, was formalized in the 2010s and demonstrates superior performance in empirical tests.[14] For instance, in a classic 19-state random walk task where an agent moves left or right with equal probability until absorbing barriers, learning curves for TD(λ) show markedly faster mean squared error reduction compared to TD(0) or Monte Carlo methods, with an optimal λ around 0.5-0.7 achieving convergence in fewer episodes by effectively balancing local and global predictions.[2]
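A small forward-view sketch that computes the λ-return for one time step of an already-completed episode; the reward sequence, value estimates, and λ below are arbitrary illustrative inputs:

def n_step_return(rewards, values, t, n, gamma):
    # G_t^(n): n discounted rewards followed by a bootstrap from the value estimate.
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                                   # bootstrap only if the episode has not ended
        g += gamma ** (end - t) * values[end]
    return g

def lambda_return(rewards, values, t, gamma, lam):
    # G_t^lambda = (1 - lam) * sum_n lam^(n-1) G_t^(n); the remaining weight goes to the full return.
    T = len(rewards)
    g, weight = 0.0, (1 - lam)
    for n in range(1, T - t):
        g += weight * n_step_return(rewards, values, t, n, gamma)
        weight *= lam
    g += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g

rewards = [0.0, 0.0, 1.0]          # a three-step episode (illustrative)
values = [0.2, 0.4, 0.8]           # current value estimates for the visited states
print(lambda_return(rewards, values, t=0, gamma=0.9, lam=0.5))   # blends 1-, 2-, and 3-step returns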
Eligibility Traces

Eligibility traces serve as an efficient computational mechanism for implementing the TD(λ) algorithm in both prediction and control tasks within temporal difference learning. They accumulate credit assignment for recently visited states or state-action pairs, enabling the propagation of TD errors backward in time without explicitly computing multi-step returns. The eligibility trace e(s, a) for a state-action pair is initialized to zero and updated incrementally: upon visiting (s, a), it is incremented (in accumulating traces) or set to a fixed value like 1 (in replacing traces), and subsequently decays by the factor γλ at each step, where γ is the discount factor and λ controls the trace persistence. This decay balances local one-step updates (λ = 0) with full backward-looking credit assignment (λ = 1).[1]

The core update process begins with computing the one-step TD error δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t) for prediction, or analogously δ_t = r_{t+1} + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) for control. The value function is then updated as ΔV(s) = α δ_t e(s) (or ΔQ(s, a) = α δ_t e(s, a)) for all states or state-action pairs with non-zero traces, where α is the learning rate. Following the update, all traces are decayed uniformly: e(s) ← γλ e(s) for all s (or e(s, a) ← γλ e(s, a) for all s, a). This mechanism ensures that errors influence past states in proportion to their recency and eligibility, accelerating convergence over single-step TD methods.[1]

Two primary variants of eligibility traces differ in how visits are handled, affecting learning stability particularly in off-policy settings. In accumulating traces, the trace increments additively upon each visit (e(s, a) ← e(s, a) + 1), allowing multiple visits to build higher eligibility, but this can lead to instability and slower convergence due to over-accumulation. In contrast, replacing traces reset the trace to 1 upon a new visit (e(s, a) ← 1 if visited, else γλ e(s, a)), preventing buildup and promoting more reliable credit assignment, especially when function approximation is used; empirical results show replacing traces reduce variance and improve performance on tasks like mountain-car by up to twofold in episodes to convergence.

For off-policy control, Watkins's Q(λ) employs replacing traces with a key modification: traces are reset to zero following non-greedy actions to avoid crediting exploratory deviations, ensuring updates align with the greedy policy. The following pseudocode illustrates this algorithm:

Initialize Q(s, a) ← arbitrary (e.g., 0) for all s, a
For each episode:
    Initialize e(s, a) ← 0 for all s, a
    Obtain initial state s
    While s is not a terminal state:
        Select action a using an ε-greedy policy derived from Q(s, ·)
        Execute a in s, observe reward r and next state s'
        δ ← r + γ max_{a'} Q(s', a') - Q(s, a)
        e(s, a) ← 1                                  // replacing trace for the visited pair
        For all state-action pairs (x, b):
            Q(x, b) ← Q(x, b) + α δ e(x, b)
        If a was the greedy action in s (i.e., a = argmax_{a'} Q(s, a')):
            For all (x, b): e(x, b) ← γ λ e(x, b)    // decay all traces
        else:
            For all (x, b): e(x, b) ← 0              // cut traces after an exploratory action
        s ← s'
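To complement the pseudocode, here is a compact NumPy sketch of the same control algorithm with replacing traces and trace cutting after exploratory actions; the integer state/action indexing, the reset()/step() environment interface, and the hyperparameters are assumptions for illustration:

import numpy as np

def watkins_q_lambda(env, n_states, n_actions, episodes=500,
                     alpha=0.1, gamma=0.99, lam=0.8, eps=0.1):
    # Assumes env.reset() -> state index and env.step(a) -> (next_state, reward, done).
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        E = np.zeros_like(Q)                      # eligibility traces, cleared each episode
        s, done = env.reset(), False
        while not done:
            greedy = int(np.argmax(Q[s]))
            a = np.random.randint(n_actions) if np.random.rand() < eps else greedy
            s2, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * np.max(Q[s2])) - Q[s, a]
            E[s, a] = 1.0                         # replacing trace for the visited pair
            Q += alpha * delta * E                # update every traced pair at once
            if a == greedy:
                E *= gamma * lam                  # decay traces after a greedy action
            else:
                E[:] = 0.0                        # cut traces after an exploratory action
            s = s2
    return Q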
