Q-learning

Q-learning is a reinforcement learning algorithm that trains an agent to assign values to its possible actions based on its current state, without requiring a model of the environment (model-free). It can handle problems with stochastic transitions and rewards without requiring adaptations.

For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, Q-learning might assign a higher value to moving right than left if right gets to the exit faster, improving this choice by trying both directions over time.

For any finite Markov decision process, Q-learning finds an optimal action-selection policy, in the sense of maximizing the expected value of the total reward over all successive steps starting from the current state, provided it is given infinite exploration time and a partly random policy.
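As an illustration of how such a policy can be learned, here is a minimal sketch of tabular Q-learning using the standard update $Q(s,a) \leftarrow Q(s,a) + \alpha\,(r + \gamma \max_{a'} Q(s',a') - Q(s,a))$. The one-dimensional corridor, the exit reward of 10, the function name `q_learning`, and the values of `alpha`, `gamma`, and `epsilon` are assumptions chosen for the example, not part of the text above.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States are 0..n_states-1; the exit is the last state and pays 10.
    Actions: 0 = move left, 1 = move right.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]

    for _ in range(episodes):
        state = 0
        while state != goal:
            # Partly random policy: explore with probability epsilon,
            # otherwise take the action with the higher current estimate.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 10.0 if next_state == goal else 0.0
            # The exit is terminal, so it contributes no future value.
            future = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q

q_table = q_learning()
```

In this sketch, every state before the exit should end up with a higher value for moving right than for moving left, which mirrors the junction in the maze example.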

"Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state.

Reinforcement learning involves an agent, a set of states $\mathcal{S}$, and a set $\mathcal{A}$ of actions per state. By performing an action $a \in \mathcal{A}$, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).

The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state.
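The weighted sum of future rewards can be sketched in a few lines. The version below assumes the common choice of geometric weights, where a reward received $k$ steps ahead is multiplied by $\gamma^k$ for a discount factor $\gamma$ between 0 and 1; the name `discounted_return` and the sample reward sequence are illustrative only.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards weighted by gamma**k, where k is the number of steps ahead."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future total.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# A reward of 10 that arrives two steps from now, discounted with gamma = 0.5.
value = discounted_return([0, 0, 10], gamma=0.5)
```

A smaller `gamma` makes the agent favor immediate rewards, while a value close to 1 makes distant rewards count almost as much as near ones.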

As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as it opens, minimizing your initial wait time. If the train is crowded, however, entry is slow after that initial action, because departing passengers are pushing past you as you attempt to board. The total boarding time, or cost, is then the initial wait time plus the time spent fighting through the departing passengers.

On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time (wait time plus time spent fighting) is now shorter.
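The comparison between the two days can be sketched as follows. The timings are hypothetical values chosen only to illustrate the arithmetic, and the helper `boarding_reward` is not part of any library.

```python
def boarding_reward(wait_seconds, fight_seconds):
    """Reward is the negative of the total boarding time."""
    return -(wait_seconds + fight_seconds)

# Hypothetical timings in seconds, chosen only for illustration.
rush_in = boarding_reward(wait_seconds=0, fight_seconds=15)    # day 1: enter immediately
wait_first = boarding_reward(wait_seconds=5, fight_seconds=0)  # day 2: let others depart

# The strategy with the higher (less negative) reward is preferred.
best = max(("rush in", rush_in), ("wait first", wait_first), key=lambda pair: pair[1])
```

Under these assumed timings, waiting first yields the higher reward, even though its initial wait is longer.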
