MuZero
from Wikipedia

MuZero is a computer program developed by artificial intelligence research company DeepMind to master games without knowing their rules.[1][2][3] Its release in 2019 included benchmarks of its performance in Go, chess, shogi, and a standard suite of Atari games. The algorithm uses an approach similar to AlphaZero. It matched AlphaZero's performance in chess and shogi, improved on its performance in Go, and improved on the state of the art in mastering a suite of 57 Atari games (the Arcade Learning Environment), a visually complex domain.

MuZero was trained via self-play, with no access to rules, opening books, or endgame tablebases. The trained algorithm used the same convolutional and residual architecture as AlphaZero, but with 20 percent fewer computation steps per node in the search tree.[4]

History

MuZero really is discovering for itself how to build a model and understand it just from first principles.

— David Silver, DeepMind, Wired[5]

On November 19, 2019, the DeepMind team released a preprint introducing MuZero.

Derivation from AlphaZero

MuZero (MZ) is a combination of the high-performance planning of the AlphaZero (AZ) algorithm with approaches to model-free reinforcement learning. The combination allows for more efficient training in classical planning regimes, such as Go, while also handling domains with much more complex inputs at each stage, such as visual video games.

MuZero was derived directly from AZ code, sharing its rules for setting hyperparameters. Differences between the approaches include:[6]

  • AZ's planning process uses a simulator that knows the rules of the game and must be explicitly programmed. A neural network then predicts the policy and value of a future position. Perfect knowledge of the game rules is used in modeling state transitions in the search tree, the actions available at each node, and the termination of a branch of the tree. MZ does not have access to the rules, and instead learns a model of them with its neural networks.
  • AZ has a single model for the game (from board state to predictions); MZ has separate models for representation of the current state (from board state into its internal embedding), dynamics of states (how actions change representations of board states), and prediction of policy and value of a future position (given a state's representation); see the interface sketch after this list.
  • MZ's hidden model may be complex, and it may turn out it can host computation; exploring the details of the hidden model in a trained instance of MZ is a topic for future exploration.
  • MZ does not expect a two-player game where winners take all. It works with standard reinforcement-learning scenarios, including single-agent environments with continuous intermediate rewards, possibly of arbitrary magnitude and with time discounting. AZ was designed for two-player games that could be won, drawn, or lost.
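The decomposition described in the list above can be made concrete as function signatures. The sketch below is illustrative only, with hypothetical class and method names; it shows the shape of AlphaZero's single model versus MuZero's three learned functions, not DeepMind's released pseudocode.

```python
from typing import List, Tuple

Policy = List[float]        # probability per legal action
Value = float               # expected outcome / return estimate
HiddenState = List[float]   # latent embedding; need not correspond to a real board state


class AlphaZeroNet:
    """AZ: one network evaluates real positions; the rules simulator supplies transitions."""

    def predict(self, board_state) -> Tuple[Policy, Value]:
        ...


class MuZeroNets:
    """MZ: three learned functions replace the rules simulator inside the search."""

    def representation(self, observation) -> HiddenState:
        # h: encode the current observation into an internal embedding
        ...

    def dynamics(self, state: HiddenState, action: int) -> Tuple[HiddenState, float]:
        # g: predict the next embedding and the immediate reward for an action
        ...

    def prediction(self, state: HiddenState) -> Tuple[Policy, Value]:
        # f: predict policy and value from an embedding
        ...
```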

Comparison with R2D2

The previous state-of-the-art technique for learning to play the suite of Atari games was R2D2, the Recurrent Replay Distributed DQN.[7]

MuZero surpassed both R2D2's mean and median performance across the suite of games, though it did not do better in every game.

Training and results

For board games, MuZero used 16 third-generation tensor processing units (TPUs) for training and 1,000 TPUs for self-play, with 800 simulations per step; for Atari games, it used 8 TPUs for training and 32 TPUs for self-play, with 50 simulations per step.

AlphaZero used 64 second-generation TPUs for training and 5,000 first-generation TPUs for self-play. As TPU design has improved (third-generation chips are roughly twice as powerful per chip as second-generation chips, with further advances in bandwidth and networking across chips in a pod), these are comparable training setups.

R2D2 was trained for 5 days through 2M training steps.

Initial results

MuZero matched AlphaZero's performance in chess and shogi after roughly 1 million training steps. It matched AZ's performance in Go after 500,000 training steps and surpassed it by 1 million steps. It matched R2D2's mean and median performance across the Atari game suite after 500,000 training steps and surpassed it by 1 million steps, though it never performed well on 6 games in the suite.

Reactions and related work

MuZero was viewed as a significant advancement over AlphaZero, and a generalizable step forward in unsupervised learning techniques.[8][9] The work was seen as advancing understanding of how to compose systems from smaller components, a systems-level development more than a pure machine-learning development.[10]

While only pseudocode was released by the development team, Werner Duvaud produced an open source implementation based on that.[11]

MuZero has been used as a reference implementation in other work, for instance as a way to generate model-based behavior.[12]

In late 2021, a more efficient variant of MuZero was proposed, named EfficientZero. It "achieves 194.3 percent mean human performance and 109.0 percent median performance on the Atari 100k benchmark with only two hours of real-time game experience".[13]

In early 2022, a variant of MuZero was proposed to play stochastic games (for example 2048, backgammon), called Stochastic MuZero, which uses afterstate dynamics and chance codes to account for the stochastic nature of the environment when training the dynamics network.[14]

from Grokipedia
MuZero is a model-based reinforcement learning algorithm developed by researchers at DeepMind that achieves superhuman performance in a variety of challenging games, including the perfect-information board games Go, chess, and shogi, as well as the visually complex video games of the Atari 2600 suite, without requiring explicit knowledge of the game rules or environment dynamics. Introduced in a preliminary form in 2019 and detailed in a 2020 Nature paper, MuZero learns an internal representation of the environment through self-play, using a neural network to predict key elements such as rewards, action policies, and state values, which enable effective planning via integration with Monte Carlo tree search (MCTS).

Unlike its predecessor AlphaZero, which relies on a known model of the environment's transition function to simulate future states, MuZero's innovation lies in its ability to implicitly learn these dynamics alongside the policy and value functions, making it applicable to domains where rules are unknown, partially observable, or computationally expensive to simulate. The algorithm consists of three core components: a representation network that encodes observations into hidden states, a dynamics network that predicts subsequent hidden states and rewards from action choices, and a prediction network that outputs policy and value estimates from hidden states, all trained end-to-end using temporal difference learning and self-play trajectories.

In terms of achievements, MuZero matches or exceeds AlphaZero's superhuman performance in Go (with Elo ratings over 3,000), chess, and shogi after equivalent training compute, while setting new state-of-the-art scores on the Atari benchmark across 57 games, achieving an average human-normalized score of 99.7% when trained on 200 million frames per game and 100.7% on a subset with 20 billion frames. These results demonstrate MuZero's sample efficiency and scalability, as it improves performance dramatically with additional planning steps during inference; for instance, it gains over 1,000 Elo points in Go when search time increases from 0.1 to 50 seconds per move.

Beyond gaming, MuZero's principles of learning latent models for planning have been extended to real-world applications, such as optimizing video compression in YouTube's infrastructure, where it outperformed human-engineered heuristics by reducing bandwidth usage while maintaining quality. This adaptability highlights MuZero's potential in broader artificial general intelligence pursuits, emphasizing model-based planning in unknown environments.

Background

Reinforcement Learning Foundations

Reinforcement learning (RL) is a paradigm in machine learning where an intelligent agent learns to select actions in an environment through trial-and-error interactions, with the goal of maximizing the expected cumulative reward over time. Unlike supervised learning, which relies on labeled data, or unsupervised learning, which seeks patterns without explicit feedback, RL emphasizes sequential decision-making under uncertainty, where actions influence future states and rewards. This framework draws from optimal control and behavioral psychology, enabling agents to discover optimal behaviors autonomously.

At the core of RL is the Markov Decision Process (MDP), a mathematical model that formalizes the agent's interaction with the environment. An MDP is defined by a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is the set of states representing the agent's situation, $A$ is the set of possible actions, $P(s'|s,a)$ denotes the transition probabilities to next states $s'$ given state $s$ and action $a$, $R(s,a)$ is the reward function providing immediate feedback, and $\gamma \in [0,1)$ is the discount factor prioritizing near-term rewards. Central to solving MDPs are policies $\pi(a|s)$, which map states to action probabilities, and value functions: the state-value function $V^\pi(s)$ estimating expected discounted returns from state $s$ under policy $\pi$, and the action-value function $Q^\pi(s,a)$ for state-action pairs. These elements enable the agent to evaluate and improve decision-making strategies.

RL algorithms are broadly categorized into model-free and model-based approaches. Model-free methods, exemplified by Q-learning for value estimation and policy gradient techniques for direct policy optimization, learn policies or value functions solely from sampled experiences (state-action-reward-next state tuples) without explicitly modeling the environment's dynamics. This simplicity allows them to operate in unknown environments but often at the cost of requiring extensive data. Model-based RL, conversely, involves learning approximations of the transition function $P$ and reward function $R$, which can then support planning algorithms to simulate trajectories and derive better policies, potentially enhancing efficiency in data-scarce settings.

A foundational principle in model-based RL and dynamic programming for MDPs is the Bellman equation, which expresses the optimal value function recursively:

$$
V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]
$$

This equation decomposes the value of a state into the immediate reward plus the discounted value of the best subsequent state, satisfying the Bellman optimality principle. Value iteration, a dynamic programming algorithm, solves it by initializing $V^0(s) = 0$ and iteratively applying the Bellman update operator until convergence to $V^*$, providing the basis for optimal policy derivation via $\pi^*(s) = \arg\max_a Q^*(s,a)$.

Despite its strengths, RL faces significant challenges. One is sample inefficiency: agents must generate vast amounts of interaction data to achieve reliable performance, limiting applicability to real-world systems with costly or risky trials. Another is partial observability: the agent observes only incomplete state information, necessitating extensions such as partially observable MDPs (POMDPs) to maintain the Markov property through belief states.
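As a concrete illustration of the Bellman update and value iteration described above, the following is a minimal tabular sketch in Python. The data layout (transition lists of (next_state, probability) pairs and a dense reward table) is an assumption made for readability, not part of any particular library.

```python
def value_iteration(num_states, num_actions, P, R, gamma=0.9, tol=1e-6):
    """Iteratively apply the Bellman optimality update until convergence."""
    V = [0.0] * num_states                      # V^0(s) = 0 for all states
    while True:
        delta = 0.0
        for s in range(num_states):
            # Bellman update: best action's immediate reward plus discounted future value
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in range(num_actions)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                         # approximately converged to V*
            return V


def greedy_policy(num_states, num_actions, P, R, V, gamma=0.9):
    # pi*(s) = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    return [
        max(range(num_actions),
            key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
        for s in range(num_states)
    ]


# Tiny 2-state, 2-action example: action 1 in state 0 reaches a rewarding state.
P = [[[(0, 1.0)], [(1, 1.0)]],      # P[s][a] = [(next_state, prob), ...]
     [[(1, 1.0)], [(1, 1.0)]]]
R = [[0.0, 0.0],                    # R[s][a]
     [1.0, 1.0]]
V = value_iteration(2, 2, P, R)
print(greedy_policy(2, 2, P, R, V))  # -> [1, 0] (in state 1 both actions tie; the first wins)
```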
These challenges of sample efficiency and partial observability underscore the need for hybrid approaches that balance exploration, generalization, and robustness. Advanced RL applications, such as AlphaZero in board games, highlight how overcoming these hurdles can yield superhuman performance in structured domains.

AlphaZero, developed by DeepMind in 2017, represents a landmark in reinforcement learning, combining deep neural networks with Monte Carlo Tree Search (MCTS) to achieve superhuman performance in complex board games, including chess, shogi, and Go, through self-play. The algorithm employs a single neural network that outputs both a policy function, approximating the probability distribution over actions, and a value function, estimating the expected outcome from a given state, trained end-to-end using data generated from games played against versions of itself. During gameplay, MCTS uses the neural network to guide search, simulating thousands of possible futures based on the known game rules to select high-value actions, enabling tabula rasa learning without human knowledge or domain-specific heuristics. This approach demonstrated dramatic efficiency, mastering chess in under 24 hours of training, far surpassing traditional engines reliant on handcrafted evaluations.

Despite its successes, AlphaZero assumes access to a perfect model of the environment's transition dynamics: it requires explicit knowledge of the rules to perform MCTS simulations, limiting its applicability to domains with fully specified mechanics. In environments like Atari games, where dynamics are unknown or observations are raw pixels without predefined rules, AlphaZero's reliance on simulation becomes inefficient or infeasible, as constructing an accurate world model from scratch is computationally prohibitive. These constraints highlight the need for algorithms that can learn effective representations of dynamics implicitly while retaining planning capabilities.

To address challenges in model-free settings, DeepMind introduced R2D2 in 2019, a distributed reinforcement learning agent designed for partially observable environments like Atari-57, leveraging recurrent neural networks (RNNs) to maintain hidden states across frames and handle temporal dependencies. R2D2 extends distributional Q-learning with prioritized experience replay across multiple actors, enabling off-policy training on diverse trajectories while using LSTMs to process sequential pixel inputs, achieving state-of-the-art scores on 52 of 57 Atari games through scalable distributed training. However, as a purely model-free method, R2D2 lacks explicit planning mechanisms like MCTS, relying instead on value estimation for action selection, which can hinder performance in tasks requiring long-term strategic foresight or sparse rewards.

The table below compares key aspects of AlphaZero and R2D2, illustrating their complementary strengths and the motivation for hybrid approaches like MuZero that integrate learned models with planning in unknown environments.
| Aspect | AlphaZero | R2D2 |
|---|---|---|
| Learning Paradigm | Model-based (uses known transition rules for MCTS simulations) | Model-free (direct policy/value learning from experience replay) |
| Core Components | Neural policy/value network + MCTS for planning | Recurrent distributional Q-network + distributed prioritized replay |
| Domains | Board games (e.g., chess, Go) with discrete, rule-based states | Atari games with pixel observations and partial observability |
| Strengths | Superior long-term planning via search; superhuman in strategic games | Scalable to high-dimensional inputs; handles temporal abstraction |
| Limitations | Requires explicit environment model; inefficient for unknown dynamics | No built-in planning; struggles with sparse rewards and strategy |
| Path to MuZero | MuZero learns implicit dynamics to enable planning without rules | MuZero adds model learning and MCTS to enhance strategic capabilities |
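To make the planning column concrete, below is a hedged sketch of the PUCT-style action selection used at each node of an AlphaZero-style MCTS; MuZero retains this kind of search but simulates transitions with its learned dynamics function instead of the game rules. The constant c_puct and the exact exploration term vary across papers, so the values here are illustrative rather than the published hyperparameters.

```python
import math


def puct_select(Q, P, N, c_puct=1.25):
    """Pick an action at a search node.

    Q[a]: mean value of action a from previous simulations
    P[a]: prior probability of a from the policy network
    N[a]: visit count of a at this node
    """
    total_visits = sum(N)

    def score(a):
        # Exploration bonus grows with the prior and shrinks as the action is visited.
        exploration = c_puct * P[a] * math.sqrt(total_visits + 1) / (1 + N[a])
        return Q[a] + exploration

    return max(range(len(P)), key=score)


# Example: an unvisited action with a high prior is explored before a
# frequently visited action with a mediocre value estimate.
print(puct_select(Q=[0.1, 0.0], P=[0.2, 0.7], N=[10, 0]))  # -> 1
```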

Development

Origins at DeepMind

MuZero was developed at DeepMind, an artificial intelligence research laboratory and subsidiary of Google, by a team led by researchers Julian Schrittwieser and David Silver. The project emerged as a natural extension of DeepMind's prior breakthroughs in reinforcement learning, particularly AlphaZero, which had demonstrated superhuman performance in board games through self-play and Monte Carlo tree search. The algorithm's initial public announcement came on November 19, 2019, via a preprint uploaded to arXiv titled "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model," authored by Schrittwieser and collaborators including Ioannis Antonoglou, Thomas Hubert, and others. This marked the first detailed disclosure of MuZero, with no significant prior public leaks or announcements from DeepMind before late 2019. The core motivations for MuZero's creation stemmed from the limitations of existing planning algorithms in handling environments with unknown or complex dynamics, such as real-world scenarios where rules cannot be hardcoded. DeepMind aimed to create a more general reinforcement learning system capable of learning predictive models directly from observations, mirroring human-like adaptation without explicit domain knowledge. Internal development involved experiments that integrated learned models with tree-based search techniques, building toward the system's ability to master diverse domains. Following the preprint, MuZero was presented at the NeurIPS 2019 conference and elaborated in a full peer-reviewed paper published in Nature on December 23, 2020, solidifying its place in the progression of model-based reinforcement learning.

Key Innovations Over Prior Work

MuZero represents a significant advancement in reinforcement learning by introducing a learned model that predicts future outcomes implicitly, without relying on explicit rules or domain knowledge about the environment's dynamics. Unlike prior algorithms such as AlphaZero, which required predefined transition and reward functions for perfect-information games, MuZero learns a model through three core components: a representation function that encodes observations into latent states, a dynamics function that simulates transitions in this latent space, and a prediction function that estimates policies and values. This implicit modeling allows the agent to anticipate the consequences of actions solely from interaction data, enabling rule-agnostic performance across diverse domains.

A key innovation lies in MuZero's hybrid architecture, which merges model-based planning (exemplified by AlphaZero's Monte Carlo Tree Search) with model-free learning strategies akin to those in R2D2, particularly for handling partial observability in environments like Atari games. By integrating a learned model into the planning process, MuZero performs lookahead simulations in the latent space during decision-making, while the model itself is trained end-to-end using model-free techniques on self-play trajectories. This combination achieves superhuman performance in both board games (matching AlphaZero's results in Go, chess, and shogi) and Atari, without needing human-provided rules, thus broadening applicability to real-world scenarios with imperfect information.

MuZero addresses partial observability by deriving latent state representations directly from raw image or video inputs, transforming sequential observations into a compact hidden state that captures the underlying environment dynamics. This approach generalizes seamlessly from fully observable board games to partially observable video games, where history-dependent states are inferred without explicit belief-state maintenance. The conceptual flow proceeds from raw observations to a latent model that simulates future states and rewards, culminating in informed planning that guides action selection, reducing reliance on human expertise and enabling efficient mastery with fewer assumptions about the environment. As outlined in the 2019 DeepMind paper introducing the algorithm, this framework yields efficiency gains, such as outperforming prior model-free methods across the 57-game Atari suite while requiring no game-specific engineering. To illustrate the high-level process:
  1. Observation Encoding: Raw inputs (e.g., board positions or pixel frames) are mapped to a latent state via the representation function.
  2. Latent Simulation: The dynamics function predicts subsequent latent states and immediate rewards based on actions.
  3. Outcome Prediction: The prediction function evaluates values and policies from simulated states, informing tree-based planning.
This streamlined pipeline underscores MuZero's departure from explicit modeling, fostering greater flexibility and scalability in reinforcement learning applications.
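Putting the three numbered steps together, the sketch below rolls the learned model forward in latent space for a fixed action sequence, producing the per-step policy, value, and reward predictions that planning (or a training loss) would consume. It reuses the hypothetical MuZeroNets interface sketched earlier in the article and is illustrative, not DeepMind's pseudocode.

```python
def latent_rollout(nets, observation, actions):
    """Encode `observation` once (step 1), then simulate `actions` purely in
    latent space (step 2), reading off policy/value at each step (step 3)."""
    state = nets.representation(observation)          # s_0 = h(observation)
    reward = 0.0                                      # no reward is predicted for the root
    outputs = []
    for action in actions:
        policy, value = nets.prediction(state)        # p_k, v_k = f(s_k)
        outputs.append((policy, value, reward))
        state, reward = nets.dynamics(state, action)  # s_{k+1}, r_{k+1} = g(s_k, a_{k+1})
    policy, value = nets.prediction(state)            # predictions at the final unrolled state
    outputs.append((policy, value, reward))
    return outputs
```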

Technical Architecture

Learned Model Components

MuZero's learned model consists of three primary neural network functions, representation, dynamics, and prediction, that collectively approximate the environment's dynamics without explicit rules, enabling planning in latent space. These components share a single set of parameters $\theta$, allowing the agent to predict future states, rewards, policies, and values based on observations and actions. By learning these representations end-to-end, MuZero achieves superhuman performance across diverse domains like board games and Atari, surpassing model-free methods while avoiding the need for a full environment simulator.

The representation function $h_\theta$ encodes raw observations into an initial hidden state $s_0$, capturing the current environment configuration in a compact latent form. For Atari games, it processes a stack of the last 4 grayscale frames (resized to 84x84 pixels) using a convolutional residual network with 16 blocks and 256 planes, with downsampling in the residual blocks before the prediction head. In board games such as Go, it stacks the last 8 planes (19x19 for Go), while chess uses 100 planes to represent the longer history. This function initializes the latent trajectory for subsequent predictions.

The dynamics function $g_\theta$ models state transitions by predicting the next hidden state $s_k$ and immediate reward $r_k$ given the current state $s_{k-1}$ and action $a_k$: $r_k, s_k = g_\theta(s_{k-1}, a_k)$. It employs the same residual architecture as the representation function (16 blocks, 256 planes), with the action encoded as a one-hot vector concatenated to the input state. This deterministic update allows MuZero to simulate multi-step trajectories in the latent space, learning implicit dynamics without reconstructing observable states.

The prediction function $f_\theta$ estimates the policy (action probabilities) and value (expected future rewards) from any latent state $s_k$: $p_k, v_k = f_\theta(s_k)$. It uses a lighter convolutional head similar to AlphaZero's, followed by fully connected layers, to output the policy as a softmax over actions and the value as a scalar (or a distribution for Atari). For Atari, rewards and values undergo an invertible scaling transformation

$$
h(x) = \operatorname{sign}(x)\left(\sqrt{|x| + 1} - 1\right) + \epsilon x, \qquad \epsilon = 0.001.
$$
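The scaling transform and its inverse can be written out directly. The sketch below follows the form given in the MuZero paper's appendix (with epsilon = 0.001) and is plain Python for illustration rather than the authors' implementation.

```python
import math

EPS = 0.001  # epsilon used in the MuZero paper's value/reward scaling


def scale(x: float) -> float:
    """h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x"""
    return math.copysign(1.0, x) * (math.sqrt(abs(x) + 1.0) - 1.0) + EPS * x


def unscale(y: float) -> float:
    """Inverse of h, recovering the raw value/reward from its scaled form."""
    return math.copysign(1.0, y) * (
        ((math.sqrt(1.0 + 4.0 * EPS * (abs(y) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2 - 1.0
    )


# Round-trip check: unscale(scale(x)) ~= x over a range of magnitudes.
for x in (-100.0, -1.0, 0.0, 2.5, 1000.0):
    assert abs(unscale(scale(x)) - x) < 1e-6 * max(1.0, abs(x))
```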