Q-learning
View on Wikipedia| Part of a series on |
| Machine learning and data mining |
|---|
Q-learning is a reinforcement learning algorithm that trains an agent to assign values to its possible actions based on its current state, without requiring a model of the environment (model-free). It can handle problems with stochastic transitions and rewards without requiring adaptations.[1]
For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, Q-learning might assign a higher value to moving right than left if right gets to the exit faster, improving this choice by trying both directions over time.
For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state.[2] Q-learning can identify an optimal action-selection policy for any given finite Markov decision process, given infinite exploration time and a partly random policy.[2]
"Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state.[3]
Reinforcement learning
[edit]Reinforcement learning involves an agent, a set of states , and a set of actions per state. By performing an action , the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).
The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state.[1]
As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as they open, minimizing the initial wait time for yourself. If the train is crowded, however, then you will have a slow entry after the initial action of entering the door as people are fighting you to depart the train as you attempt to board. The total boarding time, or cost, is then:
- 0 seconds wait time + 15 seconds fight time
On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now:
- 5 second wait time + 0 second fight time
Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy.
Algorithm
[edit]
After steps into the future the agent will decide some next step. The weight for this step is calculated as , where (the discount factor) is a number between 0 and 1 (). Assuming , it has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). may also be interpreted as the probability to succeed (or survive) at every step .
The algorithm, therefore, has a function that calculates the quality of a state–action combination:
- .
Before learning begins, is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time the agent selects an action , observes a reward , enters a new state (that may depend on both the previous state and the selected action), and is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information:[4]
where is the reward received when moving from the state to the state , and is the learning rate .
Note that is the sum of three terms:
- : the current value (weighted by one minus the learning rate)
- : the reward to obtain if action is taken when in state (weighted by learning rate)
- : the maximum reward that can be obtained from state (weighted by learning rate and discount factor)
An episode of the algorithm ends when state is a final or terminal state. However, Q-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops or paths.
For all final states , is never updated, but is set to the reward value observed for state . In most cases, can be taken to equal zero.
Influence of variables
[edit]Learning rate
[edit]The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as for all .[5]
Discount factor
[edit]The discount factor determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For , without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.[6] Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network.[7] In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.[8]
Initial conditions (Q0)
[edit]Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions",[9] can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternative, thus increasing their choice probability. The first reward can be used to reset the initial conditions.[10] According to this idea, the first time an action is taken the reward is used to set the value of . This allows immediate learning in case of fixed deterministic rewards. A model that incorporates reset of initial conditions (RIC) is expected to predict participants' behavior better than a model that assumes any arbitrary initial condition (AIC).[10] RIC seems to be consistent with human behaviour in repeated binary choice experiments.[10]
Implementation
[edit]Q-learning at its simplest stores data in tables. This approach falters with increasing numbers of states/actions since the likelihood of the agent visiting a particular state and performing a particular action is increasingly small.
Function approximation
[edit]Q-learning can be combined with function approximation.[11] This makes it possible to apply the algorithm to larger problems, even when the state space is continuous.
One solution is to use an (adapted) artificial neural network as a function approximator.[12] Another possibility is to integrate Fuzzy Rule Interpolation (FRI) and use sparse fuzzy rule-bases[13] instead of discrete Q-tables or ANNs, which has the advantage of being a human-readable knowledge representation form. Function approximation may speed up learning in finite problems, due to the fact that the algorithm can generalize earlier experiences to previously unseen states.
Quantization
[edit]Another technique to decrease the state/action space quantizes possible values. Consider the example of learning to balance a stick on a finger. To describe a state at a certain point in time involves the position of the finger in space, its velocity, the angle of the stick and the angular velocity of the stick. This yields a four-element vector that describes one state, i.e. a snapshot of one state encoded into four values. The problem is that infinitely many possible states are present. To shrink the possible space of valid actions multiple values can be assigned to a bucket. The exact distance of the finger from its starting position (-Infinity to Infinity) is not known, but rather whether it is far away or not (Near, Far).[14]
History
[edit]Q-learning was introduced by Chris Watkins in 1989.[15] A convergence proof was presented by Watkins and Peter Dayan in 1992,[16] building on Watkins' doctoral dissertation, Learning from Delayed Rewards. Eight years earlier in 1981, the same problem, under the name of "Delayed reinforcement learning," was solved by Bozinovski's Crossbar Adaptive Array (CAA).[17][18] The memory matrix was the same as the eight years later Q-table of Q-learning. The architecture introduced the term "state evaluation" in reinforcement learning. The crossbar learning algorithm, written in mathematical pseudocode in the paper, in each iteration performs the following computation:
- In state s perform action a;
- Receive consequence state s';
- Compute state evaluation ;
- Update crossbar value .
The term "secondary reinforcement" is borrowed from animal learning theory, to model state values via backpropagation: the state value of the consequence situation is backpropagated to the previously encountered situations. CAA computes state values vertically and actions horizontally (the "crossbar"). Demonstration graphs showing delayed reinforcement learning contained states (desirable, undesirable, and neutral states), which were computed by the state evaluation function. This learning system was a forerunner of the Q-learning algorithm.[19]
In 2014, Google DeepMind patented[20] an application of Q-learning to deep learning, entitled "deep reinforcement learning" or "deep Q-learning," that can play Atari 2600 games at expert human levels.
Variants
[edit]Deep Q-learning
[edit]The DeepMind system used a deep convolutional neural network, with layers of tiled convolutional filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy of the agent and the data distribution, and the correlations between Q and the target values. The method can be used for stochastic search in various domains and applications.[1][21]
The technique used experience replay, a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed.[3] This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target.[22]
Double Q-learning
[edit]Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in current action selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called Double Q-learning was proposed to correct this. Double Q-learning[23] is an off-policy reinforcement learning algorithm, where a different policy is used for value evaluation than what is used to select the next action.
In practice, two separate value functions and are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:
- , and
Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue.
This algorithm was later modified in 2015 and combined with deep learning,[24] as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.[25]
Others
[edit]Delayed Q-learning is an alternative implementation of the online Q-learning algorithm, with probably approximately correct (PAC) learning.[26]
Greedy GQ is a variant of Q-learning to use in combination with (linear) function approximation.[27] The advantage of Greedy GQ is that convergence is guaranteed even when function approximation is used to estimate the action values.
Distributional Q-learning is a variant of Q-learning which seeks to model the distribution of returns rather than the expected return of each action. It has been observed to facilitate estimate by deep neural networks and can enable alternative control methods, such as risk-sensitive control.[28]
Multi-agent learning
[edit]Q-learning has been proposed in the multi-agent setting (see Section 4.1.2 in [29]). One approach consists in pretending the environment is passive.[30] Littman proposes the minimax Q learning algorithm.[31]
Limitations
[edit]The standard Q-learning algorithm (using a table) applies only to discrete action and state spaces. Discretization of these values leads to inefficient learning, largely due to the curse of dimensionality. However, there are adaptations of Q-learning that attempt to solve this problem such as Wire-fitted Neural Network Q-Learning.[32]
See also
[edit]References
[edit]- ^ a b c Li, Shengbo (2023). Reinforcement Learning for Sequential Decision and Optimal Control (First ed.). Springer Verlag, Singapore. pp. 1–460. doi:10.1007/978-981-19-7784-8. ISBN 978-9-811-97783-1. S2CID 257928563.
{{cite book}}: CS1 maint: location missing publisher (link) - ^ a b Melo, Francisco S. "Convergence of Q-learning: a simple proof" (PDF). Archived from the original (PDF) on 2017-11-18. Retrieved 2017-08-08.
- ^ a b Matiisen, Tambet (December 19, 2015). "Demystifying Deep Reinforcement Learning". neuro.cs.ut.ee. Computational Neuroscience Lab. Archived from the original on 2018-04-07. Retrieved 2018-04-06.
- ^ Dietterich, Thomas G. (21 May 1999). "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition". arXiv:cs/9905014.
- ^ Sutton, Richard; Barto, Andrew (1998). Reinforcement Learning: An Introduction. MIT Press.
- ^ Russell, Stuart J.; Norvig, Peter (2010). Artificial Intelligence: A Modern Approach (Third ed.). Prentice Hall. p. 649. ISBN 978-0136042594.
- ^ Baird, Leemon (1995). "Residual algorithms: Reinforcement learning with function approximation" (PDF). ICML: 30–37.
- ^ François-Lavet, Vincent; Fonteneau, Raphael; Ernst, Damien (2015-12-07). "How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies". arXiv:1512.02011 [cs.LG].
- ^ Sutton, Richard S.; Barto, Andrew G. "2.7 Optimistic Initial Values". Reinforcement Learning: An Introduction. Archived from the original on 2013-09-08. Retrieved 2013-07-18.
- ^ a b c Shteingart, Hanan; Neiman, Tal; Loewenstein, Yonatan (May 2013). "The role of first impression in operant learning" (PDF). Journal of Experimental Psychology: General. 142 (2): 476–488. doi:10.1037/a0029550. ISSN 1939-2222. PMID 22924882.
- ^ Hasselt, Hado van (5 March 2012). "Reinforcement Learning in Continuous State and Action Spaces". In Wiering, Marco; Otterlo, Martijn van (eds.). Reinforcement Learning: State-of-the-Art. Springer Science & Business Media. pp. 207–251. ISBN 978-3-642-27645-3.
- ^ Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3): 58–68. doi:10.1145/203330.203343. S2CID 8763243. Retrieved 2010-02-08.
- ^ Vincze, David (2017). "Fuzzy rule interpolation and reinforcement learning" (PDF). 2017 IEEE 15th International Symposium on Applied Machine Intelligence and Informatics (SAMI). IEEE. pp. 173–178. doi:10.1109/SAMI.2017.7880298. ISBN 978-1-5090-5655-2. S2CID 17590120.
- ^ Krishnan, Srivatsan; Lam, Maximilian; Chitlangia, Sharad; Wan, Zishen; Barth-Maron, Gabriel; Faust, Aleksandra; Reddi, Vijay Janapa (13 November 2022). "QuaRL: Quantization for Fast and Environmentally Sustainable Reinforcement Learning". arXiv:1910.01055 [cs.LG].
- ^ Watkins, C.J.C.H. (1989). Learning from Delayed Rewards (PDF) (Ph.D. thesis). University of Cambridge. EThOS uk.bl.ethos.330022.
- ^ Watkins, Chris; Dayan, Peter (1992). "Q-learning". Machine Learning. 8 (3–4): 279–292. doi:10.1007/BF00992698. hdl:21.11116/0000-0002-D738-D.
- ^ Bozinovski, S. (15 July 1999). "Crossbar Adaptive Array: The first connectionist network that solved the delayed reinforcement learning problem". In Dobnikar, Andrej; Steele, Nigel C.; Pearson, David W.; Albrecht, Rudolf F. (eds.). Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference in Portorož, Slovenia, 1999. Springer Science & Business Media. pp. 320–325. ISBN 978-3-211-83364-3.
- ^ Bozinovski, S. (1982). "A self learning system using secondary reinforcement". In Trappl, Robert (ed.). Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research. North Holland. pp. 397–402. ISBN 978-0-444-86488-8.
- ^ Barto, A. (24 February 1997). "Reinforcement learning". In Omidvar, Omid; Elliott, David L. (eds.). Neural Systems for Control. Elsevier. ISBN 978-0-08-053739-9.
- ^ "Methods and Apparatus for Reinforcement Learning, US Patent #20150100530A1" (PDF). US Patent Office. 9 April 2015. Retrieved 28 July 2018.
- ^ Matzliach B.; Ben-Gal I.; Kagan E. (2022). "Detection of Static and Mobile Targets by an Autonomous Agent with Deep Q-Learning Abilities". Entropy. 24 (8): 1168. Bibcode:2022Entrp..24.1168M. doi:10.3390/e24081168. PMC 9407070. PMID 36010832.
- ^ Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K. (Feb 2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. Bibcode:2015Natur.518..529M. doi:10.1038/nature14236. ISSN 0028-0836. PMID 25719670. S2CID 205242740.
- ^ van Hasselt, Hado (2011). "Double Q-learning" (PDF). Advances in Neural Information Processing Systems. 23: 2613–2622.
- ^ van Hasselt, Hado; Guez, Arthur; Silver, David (8 December 2015). "Deep Reinforcement Learning with Double Q-learning". arXiv:1509.06461 [cs.LG].
- ^ van Hasselt, Hado; Guez, Arthur; Silver, David (2015). "Deep reinforcement learning with double Q-learning" (PDF). AAAI Conference on Artificial Intelligence: 2094–2100. arXiv:1509.06461.
- ^ Strehl, Alexander L.; Li, Lihong; Wiewiora, Eric; Langford, John; Littman, Michael L. (2006). "Pac model-free reinforcement learning" (PDF). Proc. 22nd ICML: 881–888.
- ^ Maei, Hamid; Szepesvári, Csaba; Bhatnagar, Shalabh; Sutton, Richard (2010). "Toward off-policy learning control with function approximation in Proceedings of the 27th International Conference on Machine Learning" (PDF). pp. 719–726. Archived from the original (PDF) on 2012-09-08. Retrieved 2016-01-25.
- ^ Hessel, Matteo; Modayil, Joseph; van Hasselt, Hado; Schaul, Tom; Ostrovski, Georg; Dabney, Will; Horgan, Dan; Piot, Bilal; Azar, Mohammad; Silver, David (February 2018). "Rainbow: Combining Improvements in Deep Reinforcement Learning". Proceedings of the AAAI Conference on Artificial Intelligence. 32. arXiv:1710.02298. doi:10.1609/aaai.v32i1.11796. S2CID 19135734.
- ^ Shoham, Yoav; Powers, Rob; Grenager, Trond (1 May 2007). "If multi-agent learning is the answer, what is the question?". Artificial Intelligence. 171 (7): 365–377. doi:10.1016/j.artint.2006.02.006. ISSN 0004-3702. Retrieved 4 April 2023.
- ^ Sen, Sandip; Sekaran, Mahendra; Hale, John (1 August 1994). "Learning to coordinate without sharing information". Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence. AAAI Press: 426–431. Retrieved 4 April 2023.
- ^ Littman, Michael L. (10 July 1994). "Markov games as a framework for multi-agent reinforcement learning". Proceedings of the Eleventh International Conference on International Conference on Machine Learning. Morgan Kaufmann Publishers Inc.: 157–163. ISBN 9781558603356. Retrieved 4 April 2023.
- ^ Gaskett, Chris; Wettergreen, David; Zelinsky, Alexander (1999). "Q-Learning in Continuous State and Action Spaces" (PDF).
External links
[edit]- Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
- Strehl, Li, Wiewiora, Langford, Littman (2006). PAC model-free reinforcement learning
- Reinforcement Learning: An Introduction by Richard Sutton and Andrew S. Barto, an online textbook. See "6.5 Q-Learning: Off-Policy TD Control".
- Piqle: a Generic Java Platform for Reinforcement Learning
- Reinforcement Learning Maze, a demonstration of guiding an ant through a maze using Q-learning
- Q-learning work by Gerald Tesauro
Q-learning
View on GrokipediaBackground and Fundamentals
Reinforcement Learning Overview
Reinforcement learning (RL) is a paradigm in machine learning where an intelligent agent learns to make sequential decisions by interacting with an environment, aiming to maximize the total cumulative reward over time.[4] The agent receives feedback in the form of scalar rewards or penalties after each action, which guides its learning through trial and error rather than direct instruction.[5] This process enables the agent to discover optimal behaviors in complex, dynamic settings without prior knowledge of the environment's rules. Central to RL are several key components: the agent, which is the decision-maker; the environment, which responds to the agent's actions; the state $ S $, representing the current situation; the action $ A $, chosen by the agent; and the reward $ R $, a numerical signal indicating the immediate desirability of the action taken.[4] The agent's behavior is governed by a policy $ \pi $, a strategy that maps states to actions, either deterministically or probabilistically.[4] Additionally, the value function $ V $ estimates the expected long-term reward starting from a given state under the policy, while the action-value function $ Q $ extends this to evaluate state-action pairs, aiding in action selection.[4] RL differs fundamentally from supervised learning, which uses labeled input-output pairs to train models for prediction or classification, and unsupervised learning, which identifies hidden structures in unlabeled data without explicit feedback.[5] In contrast, RL operates without labeled examples of correct actions, relying instead on sparse, delayed rewards that may arrive long after an action is taken, requiring the agent to balance exploration of new strategies with exploitation of known rewarding ones.[5] RL tasks are classified as episodic or continuing based on their structure.[4] Episodic tasks naturally divide into finite episodes, each starting from an initial state and ending at a terminal state, such as a single playthrough of a board game where the agent resets after each game to learn from repeated episodes.[4] Continuing tasks, however, have no terminal states and extend indefinitely, like perpetual inventory management, where the agent must sustain performance over an ongoing horizon.[4]Markov Decision Processes
A Markov decision process (MDP) provides the mathematical foundation for modeling sequential decision-making problems in reinforcement learning, where an agent interacts with an environment to maximize cumulative rewards. Formally, an MDP is defined as a tuple , consisting of a state space representing all possible states of the environment, an action space denoting the set of actions available to the agent in each state, a transition probability function that gives the probability of transitioning to state given current state and action , a reward function specifying the immediate reward received after taking action in state , and a discount factor that weights the importance of future rewards relative to immediate ones. Central to the MDP framework is the Markov property, which assumes that the future state and reward depend only on the current state and action, rendering the history of prior states and actions irrelevant for prediction. This property simplifies the modeling of decision processes by focusing solely on the present context, enabling efficient computation and analysis in environments like games or robotic control tasks. The objective in an MDP is to determine an optimal policy , a mapping from states to actions (or distributions over actions), that maximizes the expected discounted return starting from any state at time , defined as . For a given policy , the state-value function represents the expected return when starting in state and following thereafter:Core Algorithm
Algorithm Description
Q-learning is a temporal-difference (TD) learning method that approximates the optimal action-value function , which represents the maximum expected discounted return starting from state , taking action , and following the optimal policy thereafter, without constructing an explicit model of the environment. This approach enables the agent to learn directly from interactions with the environment in the form of state-action-reward-next state tuples.[4] As an off-policy algorithm, Q-learning derives the optimal policy from the learned values independently of the policy used to generate the experience, allowing the behavior policy—such as an -greedy strategy that balances exploration and exploitation—to differ from the target optimal policy.[4] This separation facilitates learning the optimal action values even when exploratory actions are taken, making it suitable for environments requiring thorough exploration. The core of the algorithm involves iteratively updating the estimates through a TD update rule. The following pseudocode outlines the standard tabular Q-learning procedure:Initialize Q(s, a) arbitrarily for all s, a (e.g., Q(s, a) = 0)
For each episode:
Initialize a random starting state s
While s is not a terminal state:
Choose action a from s using policy derived from Q (e.g., ε-greedy)
Take action a, observe reward r and next state s'
Update Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
Set s ← s'
This procedure assumes Q-learning solves infinite-horizon discounted Markov decision processes.[4] The hyperparameters (learning rate) and (discount factor) control the update magnitude and future reward weighting, respectively.[4]
The key intuition behind the update lies in bootstrapping: the term serves as a biased but unbiased-in-expectation target for , allowing the estimate to improve incrementally toward based on current knowledge of future values.[4] By relying solely on sampled transitions rather than modeling transition probabilities or rewards , Q-learning maintains its model-free nature, adapting to unknown dynamics through repeated interaction.
Mathematical Formulation
Q-learning is grounded in the theory of Markov decision processes (MDPs), where the goal is to learn an optimal action-value function that satisfies the Bellman optimality equation:Hyperparameters
Learning Rate
The learning rate, denoted as , is a key hyperparameter in Q-learning that governs the magnitude of updates to the action-value function . It is defined within the interval , where implies no updates occur and the Q-values remain fixed, while results in complete replacement of the prior Q-value with the new estimate derived from the temporal difference (TD) error. In the update rule, scales the contribution of the new information relative to the existing estimate, as seen in the standard Q-learning update: .[2] The choice of significantly influences the learning dynamics. A high facilitates rapid incorporation of new experiences, enabling quick adaptation, but it can introduce instability, such as oscillations in Q-value estimates or divergence from optimal policies, particularly in environments with noisy rewards or sparse feedback. In contrast, a low ensures more gradual updates, promoting stable convergence toward optimal values at the cost of slower overall learning progress. This trade-off is especially pronounced when interacting with the TD error, where larger errors amplify the update size via , allowing faster corrections for substantial discrepancies but risking overshooting with elevated .[11] Optimal schedules for depend on the environment's characteristics. In stationary Markov decision processes (MDPs), a decreasing schedule is required for almost-sure convergence to the optimal Q-values, satisfying the conditions , , and for all state-action pairs visited infinitely often; a harmonic schedule like meets these criteria. For non-stationary environments, where the transition dynamics or rewards may change over time, a constant (e.g., 0.1) is typically employed to maintain ongoing adaptation without the Q-values converging prematurely to outdated optima.[2][4] Tuning is often performed through empirical methods tailored to the setting. In tabular Q-learning, grid search over discrete values (e.g., {0.01, 0.1, 0.5}) evaluates performance on validation episodes to balance speed and stability. In function approximation variants like deep Q-networks (DQN), adaptive optimizers such as RMSProp or Adam incorporate implicitly, with initial values around to selected via hyperparameter optimization to handle high-dimensional state spaces.Discount Factor
The discount factor, denoted , is a hyperparameter in Q-learning with that scales the contribution of future rewards to the agent's decision-making process.[12] It discounts rewards received in the future relative to immediate ones, with rewards steps ahead valued at times their nominal amount.[12] When , the agent becomes fully myopic, prioritizing only immediate rewards and ignoring any future consequences, resulting in a purely greedy policy. Conversely, when , the formulation is undiscounted, treating all rewards equally regardless of when they occur, which is suitable for tasks without a natural horizon but requires modifications for convergence.[12] The discount factor fundamentally shapes the definition of the return in Q-learning, defined as the expected discounted sum of future rewards:Initial Q-Values and Exploration
In Q-learning, the Q-values are typically initialized to zero for all state-action pairs, which serves as a neutral starting point and is a common practice in implementations. This zero initialization is considered optimistic in environments with positive rewards, as it encourages the agent to explore by treating all actions as equally promising initially, prompting the discovery of rewarding paths. Alternatively, Q-values can be set to arbitrary values or informed priors based on domain knowledge, though such choices may introduce specific biases in early learning phases.[4][14] The choice of initial Q-values influences the early policy by favoring actions with higher starting estimates, potentially accelerating convergence toward optimal behavior if the initialization aligns with the environment's reward structure. However, under the theoretical guarantees of Q-learning, the algorithm converges to the optimal Q-function regardless of the initial values, provided that all state-action pairs are visited infinitely often and the learning rate satisfies standard conditions. Optimistic initializations, such as setting Q-values to a positive constant like the estimated maximum reward, further promote exploration by inflating estimates of undiscovered actions, which is particularly effective in sparse-reward settings.[15][4][14] Exploration in Q-learning is essential to discover high-reward actions and avoid suboptimal policies, and the ε-greedy strategy is the most widely adopted method for balancing exploration and exploitation. In ε-greedy, the agent selects a random action with probability ε (exploration) or the action maximizing the current Q-value with probability 1-ε (exploitation); ε is often initialized at 1.0 for full exploration at the start and decayed multiplicatively or linearly to a small value like 0.1 over episodes. This decay schedule ensures thorough initial probing of the environment while gradually shifting toward exploitation as the Q-values become more reliable.[4][15] Alternative exploration strategies include softmax (or Boltzmann) selection, which probabilistically chooses actions based on their Q-values exponentiated and normalized by a temperature parameter τ, allowing softer preferences that decrease as τ anneals over time. Upper confidence bound (UCB) methods extend optimistic principles by adding a bonus term to Q-values that accounts for uncertainty, such as the number of visits to a state-action pair, prioritizing actions with high estimated value and low experience. These approaches, while less common than ε-greedy in basic Q-learning, can improve sample efficiency in certain domains by directing exploration more intelligently.[4] The tuning of exploration parameters, such as ε or τ, is crucial to prevent premature convergence to suboptimal policies, with higher initial values recommended for longer episodes or larger state spaces to ensure adequate coverage. In practice, ε is adjusted empirically based on episode length, often decaying after a fixed number of steps to balance the exploration-exploitation trade-off effectively.[4]Implementation Aspects
Tabular Methods
In tabular Q-learning, the action-value function is represented exactly using a lookup table, typically implemented as a two-dimensional array of size , where is the number of states and is the number of actions. This structure allows direct storage and retrieval of Q-values for each state-action pair, with updates applied to specific table entries following the Q-learning rule: , where is the learning rate, is the reward, is the discount factor, and is the next state.[4][6] This method is particularly suitable for environments with small, finite, and discrete state and action spaces, such as gridworld tasks, where it provides an exact representation of the optimal action-value function . In such settings, the table can capture the full policy without approximation errors, enabling convergence to the optimal Q-values under standard assumptions like infinite exploration and appropriate learning rates.[4][6] Implementation begins with initializing the Q-table, often to zero or small random values, to ensure uniform exploration. During training episodes, the agent observes the current state , selects an action (e.g., via -greedy policy), executes it to receive reward and next state , then updates the table entry for using the temporal-difference error. Over multiple episodes, the table entries are iteratively refined, indexing directly by state and action indices for efficient lookups and updates, until convergence to the optimal table representing .[4] A representative example is the FrozenLake environment, a 4x4 gridworld where the agent navigates from a start position (S) to a goal (G) while avoiding holes (H) on slippery ice tiles (F). The state space has 16 discrete positions, and actions are up, down, left, right (). Initially, the Q-table is all zeros:| State | Up | Down | Left | Right |
|---|---|---|---|---|
| 0 (S) | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 15 (G) | 0.0 | 0.0 | 0.0 | 0.0 |
Function Approximation
In high-dimensional or continuous state spaces, maintaining a tabular representation of the Q-function becomes computationally infeasible due to the exponential growth in the number of states, often referred to as the curse of dimensionality.[4] Function approximation mitigates this by parameterizing the Q-function as $ Q(s, a; \theta) $, where $ \theta $ represents a set of adjustable parameters, allowing generalization across similar states without explicitly storing values for every state-action pair.[17] Linear function approximation methods represent the Q-function as a linear combination of features: $ Q(s, a) = \phi(s, a)^T w $, where $ \phi(s, a) $ is a feature vector derived from the state-action pair and $ w $ is the weight vector. Common feature constructions include tile coding, which partitions the state space into overlapping grids (tilings) and activates binary features for tiles containing the current state, enabling controlled generalization and efficient updates.[18] Radial basis functions (RBFs) provide an alternative, using continuous Gaussian-like features centered at predefined points in the state space to produce smooth approximations.[19] The parameters $ w $ are updated using semi-gradient descent on the temporal-difference (TD) error, minimizing the mean squared error between the current Q-estimate and the TD target $ r + \gamma \max_{a'} Q(s', a'; w) $, with the update rule $ w \leftarrow w + \alpha \delta \phi(s, a) $, where $ \delta $ is the TD error and $ \alpha $ is the learning rate.[20] Nonlinear function approximators, such as deep neural networks, extend this capability by modeling complex, non-linear dependencies in $ Q(s, a; \theta) $, where $ \theta $ denotes the network weights. These networks are trained by minimizing the mean squared error (MSE) loss on TD targets, typically using stochastic gradient descent with backpropagation.[21] A major challenge in function approximation for Q-learning arises from the "deadly triad" of combining function approximation with bootstrapping (TD updates) and off-policy learning, which can lead to instability and divergence of the value estimates.[22] This instability stems from correlated updates and non-stationary targets in off-policy settings, potentially causing the algorithm to oscillate or explode. Solutions include the use of target networks, which maintain a separate, periodically updated copy of the Q-network to stabilize the TD targets and mitigate divergence.[23] Unlike tabular Q-learning, which converges to the optimal Q-function under standard conditions, function approximation offers no such theoretical guarantees, as the approximation may not span the true Q-function and can introduce bias.[20] However, empirical success has been demonstrated in complex environments, such as Atari 2600 games, where deep Q-networks with function approximation achieved human-level or superhuman performance on several tasks by learning directly from pixel inputs.[21]Discretization Techniques
Discretization techniques in Q-learning address the challenge of applying the algorithm to environments with continuous state or action spaces, or excessively large discrete ones, by mapping them to a finite, manageable set of categories. This process involves partitioning the continuous variables into discrete bins or regions, allowing the standard tabular Q-learning update rule to be applied on the resulting pseudo-discrete space. For instance, continuous state coordinates can be rounded to the nearest integer or assigned to predefined intervals, transforming the problem into one with a countable state-action space. Common discretization methods include uniform binning, where the space is divided into equally spaced intervals, such as creating a grid that segments each dimension of the state vector into fixed-width bins. This approach is straightforward and computationally efficient but assumes uniform importance across the space. Another technique employs k-means clustering to partition the state space, where data points from observed states are grouped into clusters based on similarity, with each cluster center representing a discrete state; this method adaptively identifies natural groupings without assuming uniformity. Adaptive grid methods further refine partitions dynamically to allocate finer resolution where needed.[24] These techniques involve inherent trade-offs: coarser discretizations reduce the size of the Q-table and accelerate convergence but introduce approximation errors by aggregating dissimilar states or actions, potentially leading to suboptimal policies; finer grids minimize such errors and better approximate the true continuous dynamics but exponentially increase the state-action space dimensionality, raising storage and update costs. For example, in the CartPole balancing task, the continuous four-dimensional state (cart position, velocity, pole angle, angular velocity) can be discretized into a manageable Q-table using 10 bins for each dimension.[25][26] After discretization, the resulting finite space is treated using standard tabular Q-learning, where the Q-values are updated via the Bellman equation on the binned representations, enabling convergence guarantees under typical assumptions like sufficient exploration.Historical Context
Origins and Development
Q-learning was invented by Christopher J. C. H. Watkins as part of his PhD research at the University of Cambridge.[7] In his 1989 thesis titled Learning from Delayed Rewards, Watkins introduced the algorithm as a method for agents to learn optimal behavior in environments with delayed feedback.[7] The primary motivation for developing Q-learning stemmed from the need to extend temporal-difference (TD) learning methods, originally proposed by Richard S. Sutton in 1988, to support off-policy, model-free control in Markov decision processes (MDPs).[27] Sutton's TD learning focused on prediction tasks, estimating value functions from experience without a model of the environment, but it was primarily on-policy.[4] Q-learning advanced this by enabling learning of action-value functions under any policy while converging to the optimal policy, addressing control problems in stochastic environments without requiring knowledge of transition probabilities or rewards.[7] The name "Q-learning" derives from the action-value function, denoted as $ Q(s, a) $, which estimates the expected return for taking action $ a $ in state $ s $ and following an optimal policy thereafter.[7] In his thesis, Watkins provided a proof of convergence for Q-learning in deterministic MDPs under suitable conditions, such as infinite exploration and learning rates summing to infinity while their squares sum to a finite value.[7] This was followed by the seminal publication "Q-learning" by Watkins and Peter Dayan in 1992, which extended the convergence proof to stochastic MDPs and formalized the algorithm's theoretical guarantees.[2] Q-learning's foundations also drew from the Rescorla-Wagner model in psychology, a 1972 theory of classical conditioning that inspired TD learning's error-driven updates, and early AI planning techniques involving MDPs, as pioneered by Richard Bellman in the 1950s through dynamic programming for sequential decision-making.[4] These influences positioned Q-learning as a bridge between psychological learning models and computational approaches to optimal control in unknown environments.[7]Key Publications and Milestones
Q-learning's development in the 1990s built upon foundational work by Chris Watkins, with key advancements in function approximation explored by Rummery and Niranjan, who introduced on-line Q-learning using connectionist systems to handle continuous state spaces through neural networks.[28] Early applications during this period extended to robotics, where Q-learning was used for tasks like mobile robot navigation and control in uncertain environments.[29] In the 2000s, Q-learning saw increased adoption in game AI, particularly for multi-agent scenarios, as evidenced by the development of Nash Q-learning for general-sum stochastic games, which enabled agents to converge to equilibria in competitive settings like poker.[30] Theoretical progress included the analysis by Tsitsiklis and Van Roy on temporal-difference learning with function approximation, which quantified approximation errors and convergence bounds, highlighting challenges in high-dimensional spaces.[31] The 2010s marked a pivotal milestone with the introduction of Deep Q-Networks (DQN) by Mnih et al., whose 2013 arXiv preprint demonstrated scaling Q-learning to Atari games using convolutional neural networks for raw pixel inputs, achieving human-level performance without domain-specific knowledge.[21] This was further advanced in their 2015 Nature paper, incorporating experience replay and target networks to stabilize training, establishing DQN as a cornerstone of deep reinforcement learning.[32] Q-learning's principles influenced hybrid systems like AlphaGo in 2016, where value networks provided state value estimates to guide Monte Carlo tree search in Go. In the 2020s, extensions addressed offline reinforcement learning, with Fujimoto et al.'s conservative Q-learning (CQL) mitigating overestimation in static datasets by penalizing out-of-distribution actions, yielding 2-5 times better performance on benchmarks like D4RL.[33] Integration with transformers enhanced representation learning, as seen in works like QT-TDM (2024), which combined autoregressive Q-learning with transformer dynamics models for efficient long-term planning in sequential tasks.[34] By 2025, Q-learning remained foundational in libraries such as Stable Baselines3, providing reliable implementations of variants like DQN for practical reinforcement learning workflows.[35]Variants and Extensions
Double Q-Learning
Standard Q-learning exhibits an overestimation bias in its value updates, primarily because the target value $ r + \gamma \max_{a'} Q(s', a') $ relies on the same Q-function for both action selection (via the max operator) and evaluation, which tends to select noisy, overestimated Q-values in stochastic environments, leading to positively biased estimates that propagate and degrade performance. To address this issue, Double Q-learning introduces two independent Q-functions, denoted $ Q^A $ and $ Q^B $, which are updated alternately using a decoupled approach: one Q-function is used to select the action (via argmax), while the other evaluates the corresponding Q-value in the target, thereby reducing the correlation that causes overestimation in the standard algorithm. The update rule proceeds as follows: with equal probability, either update $ Q^A $ using the target $ r + \gamma Q^B(s', \arg\max_{a'} Q^A(s', a')) $, or update $ Q^B $ using the target $ r + \gamma Q^A(s', \arg\max_{a'} Q^B(s', a')) $, following the standard temporal-difference form with learning rate $ \alpha $:Deep Q-Networks
Deep Q-Networks (DQN) integrate Q-learning with deep neural networks to scale reinforcement learning to high-dimensional state spaces, particularly raw pixel inputs from environments like video games. Developed by Mnih et al. in 2015, DQN uses a convolutional neural network (CNN) to approximate the action-value function $ Q(s, a; \theta) $, where $ s $ denotes the state (e.g., stacked image frames), $ a $ is the discrete action, and $ \theta $ represents the network parameters. This end-to-end learning approach eliminates the need for hand-crafted features, enabling agents to process visual observations directly and achieve control policies in complex domains.[32] A core challenge in applying deep networks to Q-learning is training instability due to non-stationary targets and correlated sequential data. DQN addresses this through two key innovations: experience replay and a target network. Experience replay stores agent experiences as transitions $ (s, a, r, s') $ in a large buffer $ \mathcal{D} $, from which random minibatches are sampled to perform independent and identically distributed (i.i.d.) updates, decorrelating samples and improving data efficiency. The target network, a periodic copy of the main Q-network denoted $ Q(s', a'; \theta^-) $, provides stable target values for the Bellman update, with $ \theta^- $ updated infrequently (e.g., every few thousand steps) to mitigate divergence during optimization. These techniques allow gradient-based training via backpropagation on the entire network.[32] The training objective minimizes the mean squared error (MSE) between the predicted Q-value and the TD target:Other Single-Agent Variants
Prioritized experience replay enhances standard experience replay in Q-learning by sampling transitions from the replay buffer based on the magnitude of their temporal-difference (TD) errors, prioritizing those that are more surprising or informative to accelerate learning. This approach addresses the inefficiency of uniform sampling, which treats all experiences equally regardless of their learning potential. When integrated into deep Q-networks, prioritized replay has demonstrated superior performance, outperforming uniform replay on 41 out of 49 Atari games by achieving higher scores and faster convergence.[37] Dueling deep Q-networks extend Q-learning by modifying the neural network architecture to separately estimate the state value function V(s) and the action advantage function A(s, a), with the Q-value computed as their combination: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a')). This decomposition allows the network to better represent the relative importance of actions while sharing a common feature extractor for the state value, leading to more efficient learning in environments where many actions have similar values. Empirical evaluations on Atari benchmarks show that dueling architectures consistently outperform standard deep Q-networks across a wide range of games, with improvements in both final performance and sample efficiency.[38] Noisy networks introduce parametric noise directly into the weights of the neural networks used in Q-learning to enable integrated exploration, obviating the need for separate ε-greedy or entropy-based mechanisms. By adding learnable noise parameters to the network layers, the agent can adaptively explore by sampling different noisy versions during training and inference, promoting diverse behavior without explicit decay schedules. This method has been shown to yield state-of-the-art results on Atari games when combined with deep Q-networks, surpassing prior exploration strategies in terms of average scores and robustness to hyperparameter choices.[39] Conservative Q-learning (CQL) adapts Q-learning for offline settings by adding a conservatism penalty to the Q-function update, which discourages overestimation of out-of-distribution (OOD) actions not present in the fixed dataset. The penalty is computed by minimizing the Q-values of actions sampled from a distribution broader than the data, ensuring the learned policy remains grounded in the observed behavior. In benchmarks like D4RL, CQL achieves competitive performance with model-free methods while providing theoretical guarantees on policy improvement, making it a stable choice for offline single-agent reinforcement learning tasks.[33] Recent advancements (2023–2025) have explored integrating world models into Q-learning to augment planning and improve sample efficiency in model-based offline reinforcement learning. For instance, lower expectile Q-learning combines a learned dynamics model with a conservative Q-value estimator that uses lower expectiles to bound overestimation errors, enabling effective policy learning from static datasets without online interaction. Evaluations on standard offline benchmarks demonstrate that this approach outperforms prior model-based methods across multiple domains, achieving higher normalized scores by leveraging model rollouts for planning.[40]Multi-Agent Adaptations
In multi-agent environments, Q-learning faces significant challenges due to non-stationarity, as the policies of other agents evolve over time, altering the transition dynamics and rewards from the perspective of any individual learner and violating the Markov assumption underlying standard Q-learning.[41] One straightforward adaptation is Independent Q-Learning (IQL), where each agent maintains its own Q-function and updates it independently, treating the actions of other agents as stochastic elements of the environment rather than modeling their strategic behavior.[42] This approach, introduced in early multi-agent reinforcement learning work, allows for decentralized execution but often struggles with coordination and convergence in competitive or cooperative settings due to the lack of explicit interaction modeling.[42] To address these issues, paradigms like Centralized Training with Decentralized Execution (CTDE) have been developed, enabling agents to learn decentralized policies while leveraging centralized information during training. A prominent example is QMIX, which factorizes the joint action-value function into individual Q-values using a monotonic mixing network, ensuring that the optimal joint action aligns with optimal individual actions under cooperative objectives.[43] QMIX has demonstrated superior performance in cooperative tasks, such as unit micromanagement in the StarCraft II environment, by allowing non-linear mixing of per-agent values conditioned on global state information during training.[43] Opponent modeling extends Q-learning by augmenting the state representation with estimates of other agents' Q-functions or policies, enabling learners to anticipate and adapt to adversaries' strategies. This technique, applied in deep reinforcement learning frameworks, involves training a separate network to predict opponent actions, which informs the primary Q-network's updates and improves robustness in competitive scenarios.[44] Recent advances from 2020 to 2025 have focused on multi-agent deep Q-networks (MADQN) for cooperative tasks, particularly in partially observable environments like StarCraft micromanagement challenges, where extensions handle non-stationarity through parameter sharing and recurrent architectures. These methods, building on benchmarks such as the StarCraft Multi-Agent Challenge (SMAC), have achieved state-of-the-art results in scaling to dozens of agents while managing partial observability via centralized critics or opponent-aware value functions.[41] Despite these progresses, multi-agent Q-learning adaptations face limitations in scalability with increasing agent numbers, as the joint action space grows exponentially, and convergence remains more challenging than in single-agent settings due to persistent non-stationarity and credit assignment problems.[41]Applications
Classic Examples
One of the foundational demonstrations of Q-learning is the gridworld environment, a discrete grid where an agent moves between cells to reach a goal while avoiding obstacles. In the classic cliff walking variant, the grid measures 4 rows by 12 columns, with the agent starting at the bottom-left cell (3,0) and the goal at the bottom-right (3,11); a "cliff" occupies cells (3,1) to (3,10), incurring a -100 reward and reset to start if entered, while all moves yield -1 reward otherwise. Tabular Q-learning enables the agent to learn the optimal policy of traveling along the top row and descending at the end, requiring 47 steps for a total reward of -47, converging to this policy after roughly 500 episodes with parameters α=0.5, ε=0.1 (decaying), and γ=0.9.[4] The taxi problem serves as another illustrative case for Q-learning in structured navigation tasks with multiple subgoals. Here, a taxi operates in a 5x5 grid, with states defined by the taxi's position (25 possibilities), passenger location (including in taxi, 6 options), and destination (4 fixed locations), yielding 500 discrete states; actions include moving north, south, east, or west, picking up, or dropping off the passenger. Rewards are -10 for invalid pick-up/drop-off, -1 per step, and +20 for successful drop-off. Q-learning iteratively improves the policy from random movements to efficient routes that minimize steps and achieve the high reward, often succeeding in under 100 episodes with tabular updates. Frozen Lake highlights Q-learning's handling of stochastic transitions and the need for balanced exploration. The environment is an 4x4 or 8x8 grid of tiles, where the agent starts at one corner and aims for the opposite goal tile; "frozen" tiles allow forward movement with 1/3 probability each to intended, left, or right directions due to slipperiness, while "hole" tiles end the episode with 0 reward, and the goal yields +1. Using an ε-greedy policy (e.g., ε starting at 1.0 and decaying to 0.1), Q-learning explores risky paths initially to discover safe routes, converging to near-optimal performance in 1000-5000 episodes depending on grid size and slipperiness, emphasizing how high initial ε promotes discovery of hole-free paths. Cliff walking further illustrates the influence of the discount factor γ on policy selection in Q-learning. With γ=0.9, the agent favors the high-reward short path despite cliff risk, as future rewards are sufficiently valued to offset potential falls; lowering γ to 0.7 shifts preference toward safer, longer routes (e.g., staying inland), reducing total expected steps but increasing cumulative negative rewards, demonstrating how γ tunes risk-aversion in off-policy learning.[4][45] A simple Python implementation of tabular Q-learning for a basic 3x3 gridworld (goal at (2,2), no obstacles, rewards -1/step, +10 goal) using NumPy for the Q-table is as follows, based on standard algorithmic pseudocode adapted for discrete states:import numpy as np
import random
# Environment parameters
n_rows, n_cols = 3, 3
n_actions = 4 # 0: up, 1: down, 2: left, 3: right
goal = (2, 2)
gamma = 0.9
alpha = 0.1
epsilon = 0.1
episodes = 1000
# Actions: delta row, delta col
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
def is_valid(pos, row, col):
return 0 <= row < n_rows and 0 <= col < n_cols
# Initialize Q-table: states as flattened index (row*cols + col)
q_table = np.zeros((n_rows * n_cols, n_actions))
def get_state(row, col):
return row * n_cols + col
def get_reward(state, action):
row, col = divmod(state, n_cols)
drow, dcol = actions[action]
new_row, new_col = row + drow, col + dcol
if not is_valid((new_row, new_col)):
return -1 # Invalid move
new_state = get_state(new_row, new_col)
if (new_row, new_col) == goal:
return 10
return -1
for episode in range(episodes):
state = get_state(0, 0) # Start at (0,0)
done = False
while not done:
if random.uniform(0, 1) < epsilon:
action = random.choice(range(n_actions)) # Explore
else:
action = np.argmax(q_table[state]) # Exploit
reward = get_reward(state, action)
row, col = divmod(state, n_cols)
drow, dcol = actions[action]
new_row, new_col = row + drow, col + dcol
if not is_valid((new_row, new_col)):
new_state = state # Stay if invalid
else:
new_state = get_state(new_row, new_col)
if (new_row, new_col) == goal:
done = True
# Q-update
best_next = np.max(q_table[new_state])
td_target = reward + gamma * best_next
q_table[state, action] += alpha * (td_target - q_table[state, action])
state = new_state
# Learned policy
policy = {}
for i in range(n_rows * n_cols):
row, col = divmod(i, n_cols)
policy[(row, col)] = actions[np.argmax(q_table[i])]
print("Learned policy:", policy)
This code initializes a Q-table, applies ε-greedy action selection, and performs off-policy updates, typically converging to the optimal path in under 500 episodes.[4]