MuZero
MuZero is a computer program developed by artificial intelligence research company DeepMind to master games without knowing their rules.[1][2][3] Its release in 2019 included benchmarks of its performance in Go, chess, shogi, and a standard suite of Atari games. The algorithm uses an approach similar to AlphaZero. It matched AlphaZero's performance in chess and shogi, improved on its performance in Go, and improved on the state of the art in mastering a suite of 57 Atari games (the Arcade Learning Environment), a visually complex domain.
MuZero was trained via self-play, with no access to rules, opening books, or endgame tablebases. The trained algorithm used the same convolutional and residual architecture as AlphaZero, but with 20 percent fewer computation steps per node in the search tree.[4]
History
MuZero really is discovering for itself how to build a model and understand it just from first principles.
— David Silver, DeepMind, Wired[5]
On November 19, 2019, the DeepMind team released a preprint introducing MuZero.
Derivation from AlphaZero
MuZero (MZ) combines the high-performance planning of the AlphaZero (AZ) algorithm with approaches to model-free reinforcement learning. The combination allows for more efficient training in classical planning regimes, such as Go, while also handling domains with much more complex inputs at each stage, such as visual video games.
MuZero was derived directly from AZ code, sharing its rules for setting hyperparameters. Differences between the approaches include:[6]
- AZ's planning process uses a simulator that knows the rules of the game and has to be explicitly programmed. A neural network then predicts the policy and value of a future position. Perfect knowledge of the game rules is used in modeling state transitions in the search tree, the actions available at each node, and the termination of a branch of the tree. MZ does not have access to the rules and instead learns a model of them with neural networks (a sketch after this list illustrates the contrast).
- AZ has a single model for the game (from board state to predictions); MZ has separate models for representation of the current state (from board state into its internal embedding), dynamics of states (how actions change representations of board states), and prediction of policy and value of a future position (given a state's representation).
- MZ's hidden model may be complex, and it may turn out that it can host computation; examining the details of the hidden model in a trained instance of MZ is a topic for future work.
- MZ does not assume a two-player, winner-take-all game. It works with standard reinforcement-learning scenarios, including single-agent environments with continuous intermediate rewards, possibly of arbitrary magnitude and with time discounting. AZ was designed for two-player games that could be won, drawn, or lost.
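To make the first two differences concrete, the following interface-level sketch contrasts a single tree-expansion step in each algorithm. All names here (rules, network, model and their methods) are hypothetical placeholders rather than DeepMind's actual code; the sketch only mirrors the division of labor described in the list above.

```python
# Illustrative interfaces only, not DeepMind's published code; the argument
# objects (rules, network, model) are hypothetical stand-ins for the pieces
# described in the list above.

def alphazero_expand(rules, network, state, action):
    """AZ tree expansion: a hand-coded simulator supplies the exact successor
    position, which a single policy/value network then evaluates."""
    next_state = rules.next_state(state, action)   # perfect, programmed rules
    legal = rules.legal_actions(next_state)        # rules also give legality...
    terminal = rules.is_terminal(next_state)       # ...and branch termination
    policy, value = network.predict(next_state)
    return next_state, policy, value, legal, terminal

def muzero_expand(model, hidden_state, action):
    """MZ tree expansion: no rules are available, so a learned dynamics network
    steps an internal embedding and predicts the reward; legality and
    termination are never queried during the search."""
    next_hidden, reward = model.dynamics(hidden_state, action)
    policy, value = model.prediction(next_hidden)
    return next_hidden, reward, policy, value
```

The essential difference is the first line of each function: AZ asks a programmed simulator for the true successor position, whereas MZ asks a learned network for the next internal embedding and a predicted reward.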
Comparison with R2D2
The previous state-of-the-art technique for learning to play the suite of Atari games was R2D2, the Recurrent Replay Distributed DQN.[7]
MuZero surpassed both R2D2's mean and median performance across the suite of games, though it did not do better in every game.
Training and results
For board games, MuZero used 16 third-generation tensor processing units (TPUs) for training and 1,000 TPUs for self-play, with 800 simulations per step; for Atari games, it used 8 TPUs for training and 32 TPUs for self-play, with 50 simulations per step.
AlphaZero used 64 second-generation TPUs for training, and 5,000 first-generation TPUs for self-play. As TPU design has improved (third-generation chips are individually twice as powerful as second-generation chips, with further advances in bandwidth and networking across chips in a pod), these are comparable training setups.
R2D2 was trained for 5 days through 2 million training steps.
Initial results
MuZero matched AlphaZero's performance in chess and shogi after roughly 1 million training steps. It matched AZ's performance in Go after 500,000 training steps and surpassed it by 1 million steps. It matched R2D2's mean and median performance across the Atari game suite after 500,000 training steps and surpassed it by 1 million steps, though it never performed well on six games in the suite.
Reactions and related work
MuZero was viewed as a significant advancement over AlphaZero, and a generalizable step forward in unsupervised learning techniques.[8][9] The work was seen as advancing understanding of how to compose systems from smaller components, a systems-level development more than a pure machine-learning development.[10]
While the development team released only pseudocode, Werner Duvaud produced an open-source implementation based on it.[11]
MuZero has been used as a reference implementation in other work, for instance as a way to generate model-based behavior.[12]
In late 2021, a more efficient variant of MuZero was proposed, named EfficientZero. It "achieves 194.3 percent mean human performance and 109.0 percent median performance on the Atari 100k benchmark with only two hours of real-time game experience".[13]
In early 2022, a variant of MuZero called Stochastic MuZero was proposed for stochastic games (for example, 2048 and backgammon); it uses afterstate dynamics and chance codes to account for the stochastic nature of the environment when training the dynamics network.[14]
References
- ^ Wiggers, Kyle (20 November 2019). "DeepMind's MuZero teaches itself how to win at Atari, chess, shogi, and Go". VentureBeat. Retrieved 22 July 2020.
- ^ Friedel, Frederic. "MuZero figures out chess, rules and all". ChessBase GmbH. Retrieved 22 July 2020.
- ^ Rodriguez, Jesus. "DeepMind Unveils MuZero, a New Agent that Mastered Chess, Shogi, Atari and Go Without Knowing the Rules". KDnuggets. Retrieved 22 July 2020.
- ^ Schrittwieser, Julian; Antonoglou, Ioannis; Hubert, Thomas; Simonyan, Karen; Sifre, Laurent; Schmitt, Simon; Guez, Arthur; Lockhart, Edward; Hassabis, Demis; Graepel, Thore; Lillicrap, Timothy (2020). "Mastering Atari, Go, chess and shogi by planning with a learned model". Nature. 588 (7839): 604–609. arXiv:1911.08265. Bibcode:2020Natur.588..604S. doi:10.1038/s41586-020-03051-4. PMID 33361790. S2CID 208158225.
- ^ "What AlphaGo Can Teach Us About How People Learn". Wired. ISSN 1059-1028. Retrieved 2020-12-25.
- ^ Silver, David; Hubert, Thomas; Schrittwieser, Julian; Antonoglou, Ioannis; Lai, Matthew; Guez, Arthur; Lanctot, Marc; Sifre, Laurent; Kumaran, Dharshan; Graepel, Thore; Lillicrap, Timothy; Simonyan, Karen; Hassabis, Demis (5 December 2017). "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm". arXiv:1712.01815 [cs.AI].
- ^ Kapturowski, Steven; Ostrovski, Georg; Quan, John; Munos, Remi; Dabney, Will. "Recurrent Experience Replay in Distributed Reinforcement Learning". ICLR 2019 – via OpenReview.
- ^ Shah, Rohin (27 November 2019). "[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee - LessWrong 2.0". www.lesswrong.com. Retrieved 2020-06-07.
- ^ Wu, Jun. "Reinforcement Learning, Deep Learning's Partner". Forbes. Retrieved 2020-07-15.
- ^ "Machine Learning & Robotics: My (biased) 2019 State of the Field". cachestocaches.com. Retrieved 2020-07-15.
- ^ Duvaud, Werner (2020-07-15). werner-duvaud/muzero-general. Retrieved 2020-07-15.
- ^ van Seijen, Harm; Nekoei, Hadi; Racah, Evan; Chandar, Sarath (2020-07-06). "The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning". arXiv:2007.03158 [stat.ML].
- ^ Ye, Weirui; Liu, Shaohuai; Kurutach, Thanard; Abbeel, Pieter; Gao, Yang (2021-12-11). "Mastering Atari Games with Limited Data". arXiv:2111.00210 [cs.LG].
- ^ Antonoglou, Ioannis; Schrittwieser, Julian; Ozair, Sherjil; Hubert, Thomas; Silver, David (2022-01-28). "Planning in Stochastic Environments with a Learned Model". Retrieved 2023-12-12.
MuZero
Background
Reinforcement Learning Foundations
Reinforcement learning (RL) is a paradigm in machine learning where an intelligent agent learns to select actions in an environment through trial-and-error interactions, with the goal of maximizing the expected cumulative reward over time. Unlike supervised learning, which relies on labeled data, or unsupervised learning, which seeks patterns without explicit feedback, RL emphasizes sequential decision-making under uncertainty, where actions influence future states and rewards. This framework draws from optimal control and behavioral psychology, enabling agents to discover optimal behaviors autonomously.

At the core of RL is the Markov Decision Process (MDP), a mathematical model that formalizes the agent's interaction with the environment. An MDP is defined by a tuple $(S, A, P, R, \gamma)$, where $S$ is the set of states representing the agent's situation, $A$ is the set of possible actions, $P(s' \mid s, a)$ denotes the transition probabilities to next states $s'$ given state $s$ and action $a$, $R(s, a)$ is the reward function providing immediate feedback, and $\gamma \in [0, 1)$ is the discount factor prioritizing near-term rewards. Central to solving MDPs are policies $\pi(a \mid s)$, which map states to action probabilities, and value functions: the state-value function $V^{\pi}(s)$ estimating expected discounted returns from state $s$ under policy $\pi$, and the action-value function $Q^{\pi}(s, a)$ for state-action pairs. These elements enable the agent to evaluate and improve decision-making strategies.

RL algorithms are broadly categorized into model-free and model-based approaches. Model-free methods, exemplified by Q-learning for value estimation and policy gradient techniques for direct policy optimization, learn policies or value functions solely from sampled experiences (state-action-reward-next-state tuples) without explicitly modeling the environment's dynamics. This simplicity allows them to operate in unknown environments but often at the cost of requiring extensive data. Model-based RL, conversely, involves learning approximations of the transition function $P$ and reward function $R$, which can then support planning algorithms to simulate trajectories and derive better policies, potentially enhancing efficiency in data-scarce settings.[5]

A foundational principle in model-based RL and dynamic programming for MDPs is the Bellman equation, which expresses the optimal value function recursively:

$$
V^{*}(s) = \max_{a}\left[\, R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \,\right]
$$

This equation decomposes the value of a state into the immediate reward plus the discounted value of the best subsequent state, satisfying the Bellman optimality principle. Value iteration, a dynamic programming algorithm, solves it by initializing $V_0$ and iteratively applying the Bellman update operator until convergence to $V^{*}$, providing the basis for optimal policy derivation via $\pi^{*}(s) = \arg\max_{a}\left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]$ (a short code sketch appears at the end of this subsection).

Despite its strengths, RL encounters significant challenges, including sample inefficiency (where agents must generate vast amounts of interaction data to achieve reliable performance, limiting applicability to real-world systems with costly or risky trials) and partial observability (where the agent observes only incomplete state information, necessitating extensions like partially observable MDPs, or POMDPs, to maintain the Markov property through belief states). These issues underscore the need for hybrid approaches that balance exploration, generalization, and robustness. Advanced RL applications, such as AlphaZero in board games, highlight how overcoming these hurdles can yield superhuman performance in structured domains.[6][7]
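The Bellman update and value-iteration procedure described above can be sketched in a few lines of Python. The small MDP below (transition tensor P, reward matrix R) is made-up illustrative data, not drawn from any benchmark; the loop is a minimal textbook implementation rather than anything specific to MuZero.

```python
import numpy as np

# A tiny, made-up MDP: 3 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
R = np.array([
    [0.0, 0.0],
    [0.0, 0.0],
    [1.0, 1.0],   # reward collected in the absorbing goal state
])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Iteratively apply the Bellman optimality operator until convergence."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values, greedy policy
        V = V_new

V_star, pi_star = value_iteration(P, R, gamma)
print("V* =", V_star, "pi* =", pi_star)
```

Each sweep applies the Bellman optimality operator to every state at once; because the operator is a contraction for $\gamma < 1$, the loop converges to $V^{*}$, and the greedy policy extracted from the final $Q$ values is optimal.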
Predecessors: AlphaZero and Related Algorithms

AlphaZero, developed by DeepMind in 2017, represents a landmark in reinforcement learning by combining deep neural networks with Monte Carlo Tree Search (MCTS) to achieve superhuman performance in complex board games including chess, shogi, and Go through self-play.[7] The algorithm employs a single neural network that outputs both a policy function (approximating the probability distribution over actions) and a value function (estimating the expected outcome from a given state), trained end-to-end using data generated from games played against versions of itself.[7] During gameplay, MCTS uses the neural network to guide search (the selection rule is written out after the comparison table below), simulating thousands of possible futures based on the known game rules to select high-value actions, enabling tabula rasa learning without human knowledge or domain-specific heuristics.[7] This approach demonstrated dramatic efficiency, mastering chess in under 24 hours of training on a single machine cluster, far surpassing traditional engines reliant on handcrafted evaluations.[7]

Despite its successes, AlphaZero assumes access to a perfect model of the environment's transition dynamics, as it requires explicit knowledge of the rules to perform MCTS simulations, limiting its applicability to domains with fully specified mechanics.[7] In environments like Atari games, where dynamics are unknown or observations are raw pixels without predefined rules, AlphaZero's reliance on simulation becomes inefficient or infeasible, as constructing an accurate world model from scratch is computationally prohibitive.[7] These constraints highlight the need for algorithms that can learn effective representations of dynamics implicitly while retaining planning capabilities.

To address challenges in model-free settings, DeepMind introduced R2D2 in 2019, a distributed reinforcement learning agent designed for partially observable environments like Atari-57, leveraging recurrent neural networks (RNNs) to maintain hidden states across frames and handle temporal dependencies.[8] R2D2 extends distributional Q-learning with prioritized experience replay across multiple actors, enabling off-policy training on diverse trajectories while using LSTMs to process sequential pixel inputs, achieving state-of-the-art scores on 52 of 57 Atari games through scalable distributed training.[8] However, as a purely model-free method, R2D2 lacks explicit planning mechanisms like MCTS, relying instead on value estimation for action selection, which can hinder performance in tasks requiring long-term strategic foresight or sparse rewards.[8]

The table below compares key aspects of AlphaZero and R2D2, illustrating their complementary strengths and the motivation for hybrid approaches like MuZero that integrate learned models with planning in unknown environments.

| Aspect | AlphaZero | R2D2 |
|---|---|---|
| Learning Paradigm | Model-based (uses known transition rules for MCTS simulations) | Model-free (direct policy/value learning from experience replay) |
| Core Components | Neural policy/value network + MCTS for planning | Recurrent distributional Q-network + distributed prioritized replay |
| Domains | Board games (e.g., chess, Go) with discrete, rule-based states | Atari games with pixel observations and partial observability |
| Strengths | Superior long-term planning via search; superhuman in strategic games | Scalable to high-dimensional inputs; handles temporal abstraction |
| Limitations | Requires explicit environment model; inefficient for unknown dynamics | No built-in planning; struggles with sparse rewards and strategy |
| Path to MuZero | MuZero learns implicit dynamics to enable planning without rules | MuZero adds model learning and MCTS to enhance strategic capabilities |
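For reference, the way the policy and value network guides AlphaZero's search can be written as a single selection rule, the PUCT formula: at each node the simulation follows the action that maximizes an exploitation term plus a prior-weighted exploration bonus. Here $Q(s,a)$ is the mean value of action $a$ at node $s$ estimated from earlier simulations, $P(s,a)$ is the network's prior probability, $N(s,a)$ is the visit count, and $c_{\mathrm{puct}}$ is an exploration constant:

$$
a^{*} = \arg\max_{a}\left[\, Q(s,a) + c_{\mathrm{puct}}\, P(s,a)\, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \,\right]
$$

MuZero applies the same kind of rule inside its learned latent space, with the constant coefficient replaced by one that grows slowly with the parent's total visit count, and with $Q$ built from predicted rewards and values rather than simulated game outcomes.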
Development
Origins at DeepMind
MuZero was developed at DeepMind, an artificial intelligence research laboratory and subsidiary of Google, by a team led by researchers Julian Schrittwieser and David Silver.[1] The project emerged as a natural extension of DeepMind's prior breakthroughs in reinforcement learning, particularly AlphaZero, which had demonstrated superhuman performance in board games through self-play and Monte Carlo tree search.[1]

The algorithm's initial public announcement came on November 19, 2019, via a preprint uploaded to arXiv titled "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", authored by Schrittwieser and collaborators including Ioannis Antonoglou, Thomas Hubert, and others.[2] This marked the first detailed disclosure of MuZero, with no significant prior public leaks or announcements from DeepMind before late 2019.[3]

The core motivations for MuZero's creation stemmed from the limitations of existing planning algorithms in handling environments with unknown or complex dynamics, such as real-world scenarios where rules cannot be hardcoded.[1] DeepMind aimed to create a more general reinforcement learning system capable of learning predictive models directly from observations, mirroring human-like adaptation without explicit domain knowledge.[1] Internal development involved experiments that integrated learned models with tree-based search techniques, building toward the system's ability to master diverse domains.[9]

Following the preprint, MuZero was presented at the NeurIPS 2019 conference and elaborated in a full peer-reviewed paper published in Nature on December 23, 2020, solidifying its place in the progression of model-based reinforcement learning.[1][3]

Key Innovations Over Prior Work
MuZero represents a significant advancement in reinforcement learning by introducing a learned model that predicts future outcomes implicitly, without relying on explicit rules or domain knowledge about the environment's dynamics. Unlike prior algorithms such as AlphaZero, which required predefined transition and reward functions for perfect-information games, MuZero learns a model through three core components: a representation function that encodes observations into latent states, a dynamics function that simulates transitions in this latent space and predicts rewards, and a prediction function that estimates policies and values. This implicit modeling allows the agent to anticipate the consequences of actions solely from interaction data, enabling rule-agnostic performance across diverse domains.[2]

A key innovation lies in MuZero's hybrid architecture, which merges model-based planning (exemplified by AlphaZero's MCTS) with model-free learning strategies akin to those in R2D2, particularly for handling partial observability in environments like Atari games. By integrating a learned model into the planning process, MuZero performs lookahead simulations in the latent space during decision-making, while the model itself is trained end-to-end using model-free techniques on self-play trajectories. This combination achieves superhuman performance in both board games (matching AlphaZero's results in Go, chess, and shogi) and Atari, without needing human-provided rules, thus broadening applicability to real-world scenarios with imperfect information.[2]

MuZero addresses partial observability by deriving latent state representations directly from raw image or video inputs, transforming sequential observations into a compact hidden state that captures the underlying environment dynamics. This approach generalizes seamlessly from fully observable board games to partially observable video games, where history-dependent states are inferred without explicit belief-state maintenance. The conceptual flow proceeds from raw observations to a latent model that simulates future states and rewards, culminating in informed planning that guides action selection, reducing reliance on human expertise and enabling efficient mastery with fewer assumptions about the environment. As outlined in the 2019 DeepMind paper introducing the algorithm, this framework yields efficiency gains, such as outperforming prior model-free methods on 57 Atari games while requiring no game-specific engineering.[2]

To illustrate the high-level process (a code sketch follows the list):
- Observation Encoding: Raw inputs (e.g., board positions or pixel frames) are mapped to a latent state via the representation function.
- Latent Simulation: The dynamics function predicts subsequent latent states and immediate rewards based on actions.
- Outcome Prediction: The prediction function evaluates values and policies from simulated states, informing tree-based planning.
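The three components can be sketched end to end as follows. The code below is a minimal illustration of the data flow only: the "networks" are untrained random linear maps with hypothetical names and dimensions (representation, dynamics, prediction, OBS_DIM, and so on), not MuZero's actual architecture, and no training or tree search is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, NUM_ACTIONS = 64, 32, 4

# Untrained stand-ins for MuZero's three learned functions (hypothetical shapes).
W_h = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1                   # representation
W_g = rng.normal(size=(LATENT_DIM, LATENT_DIM + NUM_ACTIONS)) * 0.1  # dynamics
w_r = rng.normal(size=LATENT_DIM + NUM_ACTIONS) * 0.1                # reward head
W_p = rng.normal(size=(NUM_ACTIONS, LATENT_DIM)) * 0.1               # policy head
w_v = rng.normal(size=LATENT_DIM) * 0.1                              # value head

def representation(observation):
    """h: raw observation -> initial latent state."""
    return np.tanh(W_h @ observation)

def dynamics(latent, action):
    """g: (latent state, action) -> (next latent state, predicted reward)."""
    x = np.concatenate([latent, np.eye(NUM_ACTIONS)[action]])
    return np.tanh(W_g @ x), float(w_r @ x)

def prediction(latent):
    """f: latent state -> (policy distribution, value estimate)."""
    logits = W_p @ latent
    policy = np.exp(logits - logits.max())
    return policy / policy.sum(), float(w_v @ latent)

# Unroll the learned model for a few hypothetical actions, never touching
# the real environment after the initial observation has been encoded.
observation = rng.normal(size=OBS_DIM)
latent = representation(observation)
for step, action in enumerate([0, 2, 1]):
    latent, reward = dynamics(latent, action)
    policy, value = prediction(latent)
    print(f"step {step}: reward={reward:+.3f}, value={value:+.3f}, "
          f"policy={np.round(policy, 2)}")
```

In the full algorithm, MCTS would call the dynamics and prediction functions many times per move to build a search tree entirely in latent space, and all three functions would be trained jointly so that the predicted rewards, values, and policies match the outcomes actually observed in the environment.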