Vision-language-action model
from Wikipedia

The general architecture of a vision-language-action model. The model receives as input a text instruction and an image observation that are encoded in a latent representation. The action decoder receives this representation and generates a sequence of low-level robot actions.

In robot learning, a vision-language-action model (VLA) is a class of multimodal foundation models that integrates vision, language and actions. Given an input image (or video) of the robot's surroundings and a text instruction, a VLA directly outputs low-level robot actions that can be executed to accomplish the requested task.[1]

VLAs are generally constructed by fine-tuning a vision-language model (VLM, i.e. a large language model extended with vision capabilities) on a large-scale dataset that pairs visual observations and language instructions with robot trajectories.[2] These models combine a vision-language encoder (typically a vision transformer), which maps an image observation and a natural language description into a distribution in a latent space, with an action decoder that transforms this representation into continuous output actions directly executable on the robot.[3]

The concept was pioneered in July 2023 by Google DeepMind with RT-2, a VLM adapted for end-to-end manipulation tasks, capable of unifying perception, reasoning and control.[4]

Overview of architecture


VLAs share a common high-level architecture organized in two stages:

  • In the first stage, a pre-trained VLM serves as the perception and reasoning core. It encodes one or more camera images together with a language instruction into a sequence of language tokens in a shared latent space. VLMs are specifically trained on large multimodal datasets and can perform a variety of tasks such as image understanding, visual-question answering and reasoning. In order to directly control robots, VLMs must be extended to output robot actions.[5]
  • In the second stage, an action decoder maps those tokens to discrete symbols that are then de-tokenised into continuous robot commands. These output actions are represented in the same way as language tokens, but their dimensionality reflects the number of degrees of freedom (DoF) of the robot's end effector. For a 6-DoF end effector, the action space usually includes end-effector displacements (positional and rotational) and gripper positions. For instance, in RT-2, each action vector covers the 6 DoF in addition to the gripper state and a termination flag, all quantized into 256 bins.[2]
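This quantization and de-tokenisation step can be sketched as follows, assuming illustrative workspace bounds (not RT-2's actual limits) and a 7-dimensional action (6-DoF displacement plus gripper):

```python
import numpy as np

# Hypothetical per-dimension bounds for a 6-DoF end-effector delta
# (x, y, z; roll, pitch, yaw), plus a gripper opening in [0, 1].
LOW  = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
HIGH = np.array([ 0.1,  0.1,  0.1,  0.5,  0.5,  0.5, 1.0])
N_BINS = 256

def tokenize(action):
    """Quantize a continuous action vector into integer tokens in [0, 255]."""
    frac = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def detokenize(tokens):
    """De-tokenise: map tokens back to the centre of their bin."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

a = np.array([0.05, -0.02, 0.0, 0.1, 0.0, -0.3, 1.0])
toks = tokenize(a)
recovered = detokenize(toks)
# Round-trip error is bounded by half a bin width in each dimension.
```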

VLAs usually rely on off-the-shelf VLMs, giving the robot a prior understanding of images and text. During the training process, the model is then fine-tuned on data in the form of (text instruction, visual observation, action trajectory), and so it learns to map visual observations and text instructions to robot actions. The training dataset consists of robot demonstrations which may be gathered from real robots, human teleoperation, or even synthetically generated in a simulation environment. Due to end-to-end learning, VLAs inherently learn to associate high-level concepts (e.g. object categories and spatial relations) with low-level actions, eliminating the partitioning typical of traditional robotic systems.[2][6]

Action representation


A crucial design choice for the architecture of a VLA is the format in which robot actions are encoded.

'Discrete Token Output' is the most common approach, used by VLAs such as RT-2 and OpenVLA, and it represents each motion primitive as a sequence of discrete tokens. In this way, the model encodes the robot actions as an action string, and the VLA model learns to generate these sequences just as a language model generates text. This token-based approach keeps the same output layer and makes training straightforward. However, converting continuous trajectories into vocabulary symbols can limit spatial accuracy or temporal resolution. RT-2 demonstrates that this can be mitigated using special tokens that, for instance, mark the end of an action segment.[2][7]
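For illustration, an action-string round trip might look like this (the token values are arbitrary):

```python
# Hypothetical token ids for one discretized action vector.
tokens = [1, 128, 91, 241, 5, 101, 127]

# The language model is trained to emit the action as a plain text string...
action_string = " ".join(str(t) for t in tokens)
print(action_string)  # "1 128 91 241 5 101 127"

# ...which the robot-side runtime parses back into integer tokens
# before de-tokenising them into continuous commands.
parsed = [int(t) for t in action_string.split()]
```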

'Continuous Output' (diffusion/flow) is an alternative approach used by VLAs such as π0 that, to achieve accurate dexterity and high-frequency control, forgo discrete tokens and directly output continuous actions. This is achieved with diffusion models or flow-matching networks acting as the action decoder. π0 uses this strategy to output continuous joint trajectories at up to 50 Hz. In practice, continuous output tends to scale better to robots with many degrees of freedom, where discretizing every DoF would be impractical.[8]
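As a toy sketch of continuous-output generation via flow matching: here the "learned" velocity field is assumed to point toward one fixed 50-step, 7-DoF action chunk, whereas a real action expert is a neural network conditioned on the VLM's latent representation:

```python
import numpy as np

# Degenerate stand-in for a learned flow: all probability mass sits on
# one target action chunk of shape (horizon, DoF) = (50, 7).
TARGET = np.tile(np.linspace(-1.0, 1.0, 7), (50, 1))

def velocity(x, t):
    """Conditional flow-matching field carrying x toward TARGET as t -> 1."""
    return (TARGET - x) / (1.0 - t)

def sample(n_steps=10, seed=0):
    """Euler-integrate the flow from Gaussian noise (t=0) to actions (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(TARGET.shape)
    for k in range(n_steps):
        t = k / n_steps
        x = x + velocity(x, t) / n_steps
    return x

chunk = sample()
print(np.allclose(chunk, TARGET))  # True: the toy field reconstructs the chunk
```

With a trained network in place of `velocity`, the same integration loop yields a different, multimodal action chunk per noise sample.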

Single-model versus dual-system design

Comparison between single- and dual-system architectures in a vision-language-action model. A single-system VLA (top) is an end-to-end architecture that couples a pre-trained VLM with an action decoder; the model takes text, images, and robot state as input and outputs actions. A dual-system VLA (bottom) is a modular architecture in which the pre-trained VLM and the action decoder are two separate subsystems that communicate through a shared latent space. Each system can run independently, even on different GPUs.

VLAs can be organized either as a single end-to-end network or as a dual-system that employs two coupled models.

The single-model design, employed by RT-2, OpenVLA and π0, simultaneously understands the scene and the language instruction to produce robot actions in a single forward pass, keeping the architecture simple and reducing latency.[2][7][8]

The dual-system design, adopted by Helix and Groot N1, decouples the architecture into two components. The first component is usually slower and handles image observation and text instructions received as input. The second component runs at a faster rate and produces the robot's actions. The two components are trained end-to-end to communicate. This split improves dexterity and latency at the cost of increased computational complexity.[9][10]
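The dual-system split can be sketched as two loops running at different rates. The rates, dimensions, and stand-in functions below are illustrative, not the actual components of Helix or GR00T N1:

```python
import numpy as np

S2_HZ, S1_HZ = 8, 200          # illustrative rates: slow VLM vs. fast policy
LATENT_DIM, ACTION_DIM = 4, 7  # illustrative sizes

def s2_vlm(step):
    """Stand-in for System 2: encodes images + instruction into a latent."""
    return np.full(LATENT_DIM, float(step))

def s1_policy(latent, rng):
    """Stand-in for System 1: maps the latest latent to a low-level action."""
    return np.tanh(rng.standard_normal(ACTION_DIM) + latent.mean())

rng = np.random.default_rng(0)
latent = s2_vlm(0)
actions = []
for tick in range(S1_HZ):                # one simulated second of control
    if tick % (S1_HZ // S2_HZ) == 0:     # S2 refreshes the latent at ~8 Hz
        latent = s2_vlm(tick)
    actions.append(s1_policy(latent, rng))  # S1 acts on every 200 Hz tick

print(len(actions))  # 200 actions driven by only 8 latent updates
```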

History


2023


Robotic Transformer 2 (RT-2)


Robotic Transformer 2 (RT-2) was developed by Google DeepMind in mid-2023 and established the vision-language-action model paradigm in robotics. It builds on two state-of-the-art VLMs, PaLI-X[11] and PaLM-E,[12] by fine-tuning them on real robot demonstration data. RT-2 takes as input camera images paired with a text description and outputs robot actions encoded as discrete tokens. Compared to its predecessor RT-1,[13] which was trained only on robotic data, RT-2 exhibits stronger generalization to new tasks and is also able to perform multi-step reasoning using chain-of-thought.[4]

2024


OpenVLA

OpenVLA model architecture. Starting from an image observation and a natural language description of a task, the system generates 7D robot actions.[7]

OpenVLA is a 7B-parameter open-source VLA model introduced in June 2024 by researchers at Stanford. It was trained on the Open X-Embodiment dataset, a collaboration between 21 institutions that collected over one million episodes on 22 different embodiments. The model fuses image features from DINOv2[14] and SigLIP with a Llama-2 language backbone, and outputs discrete action tokens. Despite being smaller than Google DeepMind's RT-2, OpenVLA outperforms it on a suite of manipulation tasks. It also supports parameter-efficient fine-tuning methods and quantization for resource-constrained deployment.[7][15][16]

Octo (Open Generalist Policy)


Octo is a lightweight open-source generalist robot policy from UC Berkeley. Trained on Open X-Embodiment, it was released in smaller configurations (27M and 93M parameters). Octo encodes text instructions with a language model and image observations with a lightweight convolutional neural network. Instead of an autoregressive decoder, Octo uses a diffusion policy that outputs continuous joint trajectories, enabling smoother motion and fast task adaptation. During fine-tuning, the block-wise attention structure of Octo's architecture allows new observations to be added without modifying model parameters.[17]

TinyVLA


TinyVLA is a compact VLA designed for fast inference and efficient training. TinyVLA addresses the computational requirements and the heavy reliance on large datasets of its predecessors by initializing the policy with a smaller multimodal backbone and then fine-tuning on robotics data. This work demonstrated potential for more efficient VLAs, focusing on architecture and data curation without the computational cost of very large models.[18]

π0 (pi-zero)


π0 (pi-zero) is a large-scale generalist VLA, announced in late 2024 by the startup Physical Intelligence.[8][better source needed] π0 incorporates PaliGemma[19] as a pre-trained VLM backbone, built from SigLIP[20] and Gemma[21] encoders, together with an action expert trained on robot trajectories, including data from Open X-Embodiment. Trained on trajectories from 8 different embodiments, it is able to generalize across embodiments, control different robotic arms (single-arm, dual-arm) and tackle a wide variety of tasks. π0 introduced a flow-matching action expert that generates continuous actions at high frequency, up to 50 Hz.[22][23] π0-FAST, an extension of π0, uses Frequency-space Action Sequence Tokenization (FAST),[24] a novel time-series compression approach that transforms continuous actions from the time domain to the frequency domain using the discrete cosine transform.
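The frequency-domain idea behind FAST can be sketched as follows. The trajectory and coefficient budget are illustrative, and the published FAST tokenizer additionally quantizes and compresses the retained coefficients rather than simply zeroing the rest:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix; the inverse transform is its transpose."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0] = np.sqrt(1.0 / n)
    return c

def compress(traj, keep):
    """Project a 1-D action trajectory onto the DCT basis and keep only
    the `keep` lowest-frequency coefficients."""
    coeffs = dct_matrix(len(traj)) @ traj
    coeffs[keep:] = 0.0
    return coeffs

def decompress(coeffs):
    """Invert the transform (orthonormal basis, so inverse = transpose)."""
    return dct_matrix(len(coeffs)).T @ coeffs

# A smooth 50-step trajectory built from 5 low-frequency components:
# nearly all of its energy lives in the first few DCT coefficients.
n = 50
true_coeffs = np.zeros(n)
true_coeffs[:5] = [0.2, 0.5, -0.3, 0.1, 0.05]
traj = dct_matrix(n).T @ true_coeffs
recon = decompress(compress(traj, keep=8))
print(np.allclose(recon, traj))  # True: the kept band contains all the energy
```

The same decay of high-frequency coefficients holds approximately for any smooth robot trajectory, which is what makes the representation compact.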

2025


Helix


Helix, unveiled in February 2025 by Figure AI, is a generalist VLA specifically tailored for humanoid robots. It is the first VLA able to control the entire upper body of a humanoid (arms, hands, torso, head, and fingers) at high frequency. It uses a dual-system architecture, with two complementary systems trained end-to-end to communicate. System 2 (S2) is an internet-scale VLM specialized in scene understanding and language comprehension, while System 1 (S1) is a visuomotor policy that translates the latent representations produced by S2 into continuous robot actions. This decoupled architecture achieves both broad generalization and fast low-level control. Helix is trained on ~500 hours of robot teleoperation paired with automatically generated text descriptions, and underscored the ability of VLAs to scale to complex embodiments such as humanoids.[9]

GR00T N1


GR00T N1, released by NVIDIA in March 2025, is a VLA for humanoid robots that adopts the same dual-system architecture employed by Helix. It is composed of a System 2, a VLM responsible for perceiving the environment, and a System 1, which generates motor actions. Unlike other VLAs, it is trained on a heterogeneous mixture of data comprising robot trajectories, human videos and synthetic datasets.[10]

Gemini Robotics


Gemini Robotics, introduced in 2025 by Google DeepMind, is a VLA that builds on the capabilities of Gemini 2.0. While Gemini is inherently able to process multimodal data such as text, images, videos and audio, Gemini Robotics extends these capabilities to the physical world, allowing robots to take actions. The reasoning capabilities of the Gemini 2.0 VLM backbone, paired with learned low-level robot actions, allow the robot to perform highly dexterous tasks such as folding origami or playing card games. The model exhibits a high degree of generalization and is able to adapt to entirely new platforms. In June 2025, the authors released Gemini Robotics On-Device, a lightweight version of the model optimized to run locally on a robot with low latency and high reliability while preserving dexterity.[6][25]

SmolVLA


SmolVLA is an open-source compact VLA with 450 million parameters released by Hugging Face, representing an effort to democratize research on VLAs. It was trained entirely on LeRobot, an open-source dataset collected and curated by the community. Despite its compact size, SmolVLA achieves performance comparable to much larger VLAs such as Octo, OpenVLA and π0. The architecture of SmolVLA employs flow matching for continuous control, and asynchronous inference to decouple the VLM backbone from action execution. SmolVLA can be fine-tuned and used on a single consumer GPU.[26][27][28]

from Grokipedia
Vision-language-action (VLA) models are a class of multimodal foundation models that unify visual perception, natural language understanding, and action generation within a single architecture, enabling end-to-end mapping from images or videos paired with language instructions to robotic actions in real-world environments. These models primarily target embodied AI applications in robotics, where they support generalist policies capable of handling diverse tasks, objects, embodiments, and environments with minimal task-specific adaptation.

The concept gained prominence in 2023 with the introduction of RT-2, a vision-language-action model developed by Google DeepMind that co-fine-tunes pre-trained vision-language models on web-scale data and robotics demonstrations to translate visual scenes and natural language commands directly into robotic actions. RT-2 demonstrated emergent capabilities such as semantic reasoning, symbol understanding, and generalization to novel objects and instructions, achieving high success rates on both seen and out-of-distribution tasks compared to prior robotic policies.

Subsequent advancements have expanded the field significantly. In 2024, OpenVLA emerged as a prominent open-source 7-billion-parameter VLA model trained on 970,000 real-world robot demonstrations from the Open X-Embodiment dataset, outperforming larger closed-source models like RT-2-X in multi-task success rates across diverse robot embodiments while enabling efficient fine-tuning on consumer hardware. By integrating strong visual encoders (such as DINOv2 and SigLIP) with large language model backbones (such as Llama 2), OpenVLA and similar models have facilitated robust visuomotor control and broad generalization in manipulation tasks.

The rapid proliferation of VLA research has produced numerous models by 2025, ranging from end-to-end architectures to hierarchical and specialized variants that incorporate additional modalities like tactile sensing or egocentric views. These developments build on large-scale datasets such as Open X-Embodiment and benchmarks including CALVIN and RoboCasa, addressing challenges in data efficiency, cross-embodiment transfer, and real-world deployment. Ongoing work focuses on enhancing generalization, improving action tokenization, and enabling scalable training to support progress toward general-purpose robotic systems capable of flexible, language-directed behavior in unstructured settings.

Overview

Definition

Vision-language-action (VLA) models are a class of multimodal foundation models that integrate visual perception, natural language understanding, and action generation within a single unified architecture. These models are designed primarily for embodied AI applications, such as robotics and autonomous agents, enabling systems to process visual inputs (such as images or videos) alongside natural language instructions and produce actionable outputs for real-world control. VLA models extend vision-language models by incorporating direct action prediction, allowing end-to-end mapping from multimodal observations to physical behaviors like motor commands for robotic manipulation or navigation. This unification supports generalizable policies that leverage large-scale multimodal data to handle diverse tasks, objects, embodiments, and environments in embodied intelligence.

Key characteristics

Vision-language-action (VLA) models are distinguished by their ability to process multimodal inputs consisting of visual observations (such as images or video streams) and natural language instructions to directly produce action outputs suitable for robotic control. This integration allows a single model to perceive the environment, interpret commands, and execute physical behaviors without intermediate hand-engineered representations.

A core characteristic is their end-to-end learning paradigm for visuomotor control, in which perception, language understanding, and action generation are jointly optimized within a unified architecture rather than relying on separate modules for perception and policy. This approach enables the model to learn direct mappings from raw multimodal observations to actions, reducing brittleness associated with modular pipelines and facilitating seamless transfer of knowledge across modalities.

VLA models exhibit strong generalization across diverse tasks, object categories, embodiments, and environments, largely attributable to large-scale pretraining on web-scale vision-language data combined with robotic trajectory data. This pretraining enables the models to adapt to novel scenarios, including unseen objects or instructions absent from the robotic training set, by leveraging broad semantic knowledge acquired from Internet-scale sources.

These models are inherently language-conditioned, meaning their behavior is guided by natural language inputs that enable precise instruction following and emergent reasoning capabilities. Language serves as a flexible interface for specifying tasks, allowing the model to interpret complex commands, perform semantic disambiguation, and in some cases engage in rudimentary multi-step reasoning to select appropriate actions in context.

Relation to vision-language models

Vision-language-action (VLA) models extend vision-language models (VLMs) by incorporating the capability to generate actions for embodied agents, building directly on the perceptual and reasoning foundations of VLMs to enable end-to-end control in physical environments. Many VLA models use pre-trained VLMs as their core backbone, adapting architectures originally designed for tasks such as visual question answering, image captioning, and object recognition. For example, the seminal RT-2 model adapts Pathways Language and Image model (PaLI-X) and Pathways Language model Embodied (PaLM-E) through co-fine-tuning on both web-scale vision-language data and robotic trajectory data, preserving the generalization abilities acquired from internet-scale pretraining while adding robotic control. The key extension from VLMs to VLA involves shifting the output modality from natural language text to robot actions. In VLMs, the model generates sequences of language tokens based on visual and textual inputs; in VLA models, actions are represented as discrete tokens in the same vocabulary space, allowing the model to predict action sequences directly from multimodal inputs rather than intermediate text descriptions. This mapping enables direct translation of visual observations and language instructions into executable robotic commands. This architectural shift changes the primary objective from perception and reasoning in VLMs to embodied decision-making and control in VLA models. While VLMs focus on understanding and describing visual scenes through language, VLA models aim to produce physical actions that achieve task goals in real-world settings, leveraging the semantic understanding of their VLM foundations to improve generalization to novel objects, instructions, and environments. 
Subsequent open-source efforts, such as OpenVLA, continue this pattern by building on VLM backbones and fine-tuning on large-scale robot demonstration datasets to produce generalist visuomotor policies capable of handling diverse manipulation tasks across embodiments.

History

Emergence in 2023

The vision-language-action (VLA) paradigm emerged in mid-2023 with the introduction of Robotic Transformer 2 (RT-2) by Google DeepMind. This marked the first explicit formulation of a unified model that directly combines visual perception, natural language understanding, and action generation to enable end-to-end robotic control from multimodal inputs. RT-2 pioneered the approach of co-fine-tuning large pre-trained vision-language models on both web-scale vision-language data and robotic trajectory data, allowing the transfer of semantic and generalization capabilities from Internet knowledge to real-world robotic tasks. By representing robot actions as text tokens integrated into the model's output vocabulary, RT-2 established direct mapping from robot camera images and language instructions to action sequences, enabling generalization beyond training data and emergent reasoning skills such as multi-stage task decomposition. This work built upon prior Google DeepMind efforts, including Robotic Transformer 1 (RT-1) for multi-task robotic demonstrations and vision-language models such as PaLM-E and PaLI-X for large-scale multimodal pre-training. The introduction of RT-2 on July 28, 2023, defined the VLA paradigm as an end-to-end architecture for embodied AI, shifting focus toward models that jointly learn perception, language, and action in a single system for robotics applications.

Open-source and scaling in 2024

In 2024, vision-language-action (VLA) models experienced rapid progress in open-source development and scaling, driven by accessible large-scale datasets and collaborative research efforts. A pivotal release was OpenVLA in June 2024, a 7B-parameter open-source model pretrained on 970k real-world robot episodes from the Open X-Embodiment dataset. Developed by researchers from Stanford, UC Berkeley, Google DeepMind, Toyota Research Institute, and others, OpenVLA established strong baseline performance for generalist robotic manipulation across diverse embodiments, outperforming several prior open and closed models in multi-task settings while enabling efficient fine-tuning on consumer hardware. Subsequent open-source contributions in late 2024 included TinyVLA, a family of compact models emphasizing fast inference and data efficiency for robotic manipulation. Octo, an earlier 2024 open-source generalist policy trained on 800k episodes from the same dataset, also contributed to the ecosystem by demonstrating scalable transformer-based approaches for language-instructed manipulation. This wave of open-source VLAs was facilitated by the Open X-Embodiment dataset, the largest public real-robot collection with over 1 million trajectories spanning 22 embodiments, which enabled improved cross-embodiment generalization and reduced barriers to training high-capacity models.

Humanoid and efficient models in 2025

In 2025, vision-language-action (VLA) models advanced significantly toward humanoid robotics and efficient deployment, with key releases emphasizing generalist capabilities for full-body control and compact architectures suitable for on-device inference. Figure AI introduced Helix in February 2025 as a generalist VLA model designed specifically for humanoid robots, unifying visual perception, language understanding, and learned control to enable full-upper-body continuous control, including wrists and fingers, at high rates for dexterous tasks in real-world settings. This model supported applications in household chores and logistics, demonstrating coordinated multi-robot behaviors. NVIDIA released Isaac GR00T N1 in March 2025 as the first open foundation model for generalized humanoid robots, employing a dual-system architecture to enhance dexterity and reasoning in complex physical interactions. The model integrated diverse training data, including egocentric human videos and simulated trajectories, to support broad humanoid skills and accelerate development through open-source frameworks. Google DeepMind advanced Gemini Robotics in 2025 as a capable VLA model that translates visual inputs and natural language instructions into motor commands for robotic tasks, with extensions supporting on-device execution and advanced reasoning for physical agents. In June 2025, Hugging Face released SmolVLA, a compact 450-million-parameter open-source VLA model optimized for efficiency, enabling competitive performance on consumer-grade hardware such as single GPUs or laptops while supporting robotics deployment. This design prioritized reduced computational costs and on-device inference, broadening accessibility for real-world robotic applications. 
These 2025 developments emphasized dual-system designs in several models to balance fast reactive control with deliberative reasoning for improved dexterity in humanoid manipulation, while efficient architectures like SmolVLA facilitated on-device deployment for practical humanoid and autonomous agent use cases.

Architecture

Vision and perception components

The vision and perception components of vision-language-action (VLA) models process visual inputs, primarily images from robotic sensors, to extract features that support scene understanding and grounding of natural language instructions in embodied environments. Most VLA architectures rely on pretrained vision encoders to generate robust visual representations, leveraging large-scale pretraining for strong generalization in robotic tasks. Common pretrained backbones include SigLIP for semantic understanding and DINOv2 for low-level spatial reasoning, which are frequently fused to combine complementary strengths. In OpenVLA, a fused visual encoder processes input images (typically resized to 224×224 pixels) by passing image patches separately through SigLIP and DINOv2 backbones, then concatenating their output feature vectors channel-wise to produce a unified representation that enhances both semantic and spatial awareness critical for manipulation. A small two-layer MLP projector subsequently maps these fused features into the embedding space of the language model backbone, enabling joint processing of vision and language. Early influential models such as RT-2 build on pretrained vision-language models (e.g., PaLI-X), which incorporate their own vision encoders (ViT-based) to process images directly as part of the multimodal input, allowing transfer of web-scale visual knowledge to robotic perception without custom visual pretraining. While many VLA models focus on single-image observations for efficiency in real-time robotic control, some architectures extend to video inputs by encoding sequences of frames to capture temporal dynamics in dynamic scenes, though single-frame processing remains dominant in prominent open-source implementations. This visual processing pipeline ensures that perceptual features are effectively aligned with the language backbone for end-to-end action generation.
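As a scaled-down sketch of the fused-encoder pipeline described above (the widths, random stand-in features, and ReLU activation are illustrative placeholders, not OpenVLA's actual components):

```python
import numpy as np

rng = np.random.default_rng(0)

N_PATCHES = 256                 # e.g. a 16x16 patch grid from a 224x224 image
D_SIGLIP, D_DINO, D_LLM = 64, 64, 128   # scaled-down illustrative widths

# Stand-ins for per-patch features from the two frozen vision backbones.
siglip_feats = rng.standard_normal((N_PATCHES, D_SIGLIP))
dino_feats = rng.standard_normal((N_PATCHES, D_DINO))

# Channel-wise concatenation fuses semantic (SigLIP) and spatial (DINOv2) cues.
fused = np.concatenate([siglip_feats, dino_feats], axis=-1)

# Two-layer MLP projector mapping fused patch features into the language
# model's embedding space, so each patch becomes one "visual token".
W1 = rng.standard_normal((D_SIGLIP + D_DINO, D_LLM)) * 0.01
W2 = rng.standard_normal((D_LLM, D_LLM)) * 0.01
visual_tokens = np.maximum(fused @ W1, 0.0) @ W2

print(visual_tokens.shape)  # (256, 128): one LLM-space token per image patch
```

These projected visual tokens are then concatenated with the tokenized language instruction and fed to the language backbone.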

Language backbone integration

In vision-language-action (VLA) models, the language backbone serves as the core reasoning engine, typically a large language model (LLM) or pretrained vision-language model (VLM) that integrates visual embeddings with natural language instructions to enable task understanding and action generation. Early VLA architectures, such as those in RT-2, adapted vision-language models such as PaLI-X and PaLM-E as the language backbone, co-fine-tuning them to process interleaved image and text tokens directly and output action sequences autoregressively. Subsequent open-source models, such as OpenVLA, employ an LLM backbone like Llama-2 7B, which receives projected visual features concatenated with tokenized language instructions. Visual inputs are first processed by a vision encoder to produce patch embeddings, which a lightweight projector—often a multi-layer perceptron—maps into the LLM's embedding space, enabling the backbone to treat visual tokens as part of the input sequence alongside language tokens. The LLM backbone then autoregressively predicts the next token in the sequence, leveraging its pretrained reasoning capabilities to interpret multimodal context and generate action tokens or latents conditioned on the combined visual-language input. This integration leverages the LLM's reasoning capabilities for embodied tasks, with the predicted tokens decoded into executable actions (detailed in action representation methods).

Action representation methods

Action representation in vision-language-action (VLA) models determines how the model outputs robot control signals, with approaches broadly divided into discrete token-based methods and continuous generative methods. Early prominent VLA models, such as RT-2 and OpenVLA, employ discrete tokenization to integrate action prediction seamlessly into the autoregressive framework of pretrained vision-language models. In RT-2, continuous action dimensions (typically 7, including end-effector position and rotation deltas and gripper state) are discretized into 256 uniform bins per dimension. These discrete values are mapped to specific tokens in the language model's vocabulary, often by associating integer bin indices with existing numeric tokens or overwriting low-frequency tokens to create an action-specific vocabulary. The model then generates a sequence of these tokens as text output, such as "1 128 91 241 5 101 127", which is decoded back to continuous actions for execution; a special termination token signals task completion.

OpenVLA follows a similar discrete approach, discretizing each action dimension into 256 bins (0-255) based on data quantiles to avoid outlier effects, then overwriting the 256 least-used tokens in its Llama-2 tokenizer to represent these bins. The model predicts sequences of discrete action tokens autoregressively, which are subsequently decoded into continuous actions suitable for robot control. This token-based strategy leverages the language model's next-token prediction objective for unified training across vision, language, and action modalities.

In contrast, more recent models like π0 adopt continuous action representations to achieve greater precision and support high-frequency control. π0 uses flow matching (a technique related to diffusion models) to model continuous action distributions directly, generating action chunks that predict multiple future timesteps at once (with a horizon of 50 steps). This enables high-frequency output, reaching up to 50 Hz for dexterous tasks such as laundry folding, while handling variable robot embodiments by padding action spaces to a common dimension. Unlike discrete methods, continuous approaches avoid quantization errors and better capture multimodal action distributions required for complex manipulation.

Some discrete approaches incorporate special tokens in action sequences, such as end-of-segment or termination indicators, to delineate action chunks or signal episode completion during low-level control. Certain architectures combine these with dual-system designs for continuous refinement, though single-model end-to-end prediction dominates current VLA paradigms.
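The quantile-based binning used by OpenVLA can be sketched as follows; the synthetic data distribution is illustrative, and the comparison shows why quantile bins resist outliers better than uniform min-max bins:

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 256

# Illustrative training distribution for one action dimension: tightly
# clustered around zero, with two extreme outlier demonstrations.
data = np.concatenate([rng.normal(0.0, 0.02, 10_000), [5.0, -5.0]])

# Quantile bin edges: each bin holds the same fraction of the data, so
# outliers cannot stretch the bins that cover the dense region.
edges = np.quantile(data, np.linspace(0.0, 1.0, N_BINS + 1))
centers = (edges[:-1] + edges[1:]) / 2

def encode(x):
    return int(np.clip(np.searchsorted(edges, x, side="right") - 1,
                       0, N_BINS - 1))

def decode(token):
    return centers[token]

# Compare round-trip error against naive uniform min-max binning.
x = 0.015
lo, hi = data.min(), data.max()
uniform_token = min(int((x - lo) / (hi - lo) * N_BINS), N_BINS - 1)
uniform_recon = lo + (uniform_token + 0.5) / N_BINS * (hi - lo)

err_quantile = abs(decode(encode(x)) - x)
err_uniform = abs(uniform_recon - x)
print(err_quantile < err_uniform)  # True: quantile bins are finer where data is dense
```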

Single-model versus dual-system designs

Two primary architectural paradigms have emerged in vision-language-action (VLA) models: single-model end-to-end designs and dual-system designs. Single-model designs integrate visual perception, natural language processing, and action generation within a unified neural network, mapping multimodal inputs directly to actions in a single forward pass. This approach simplifies the architecture and supports lower inference latency, facilitating real-time robotic control. Dual-system designs decouple high-level reasoning from low-level execution, typically with a slower System 2 module (often a vision-language model operating at 7-9 Hz) handling scene understanding, language comprehension, and planning, while a faster System 1 module (a reactive visuomotor policy operating at rates around 200 Hz) generates precise continuous actions. This separation enables high-frequency, dexterous control over high-dimensional action spaces, such as full upper-body humanoid manipulation with dozens of degrees of freedom.

Single-model approaches prioritize architectural simplicity and efficiency, making them suitable for straightforward tasks but potentially limiting performance in scenarios requiring rapid, fine-grained adjustments. Dual-system approaches resolve the fundamental tradeoff between generalization (strong in slow, general VLMs) and speed (strong in specialized fast policies) by training the components end-to-end while keeping them distinct, yielding superior dexterity and adaptability in complex embodied tasks at the expense of greater architectural complexity and computational demands.

Training and datasets

Pretraining approaches

Pretraining of vision-language-action (VLA) models typically involves fine-tuning large pretrained vision-language models (VLMs) on extensive robot demonstration datasets to enable end-to-end mapping from multimodal inputs to actions. This approach leverages the strong visual and linguistic representations learned during VLM pretraining on Internet-scale data, then adapts them to predict robot actions by treating action sequences as discrete tokens compatible with the language modeling objective. A seminal example is RT-2, which co-fine-tunes a state-of-the-art VLM on a mixture of Internet-scale vision-language data and robotic trajectory data, representing actions as text strings (e.g., sequences of discretized action values) to allow seamless integration into the autoregressive language modeling framework. This enables the model to retain broad web knowledge while learning robotic control, resulting in improved generalization to novel objects and instructions.

Subsequent open-source efforts have scaled this paradigm significantly. OpenVLA fine-tunes a 7B-parameter Prismatic VLM on 970,000 real-world robot episodes from the Open X-Embodiment dataset, spanning diverse tasks, scenes, and robot embodiments to promote generalist capabilities across varied environments. The use of such large and heterogeneous trajectory collections helps the model learn robust action priors from diverse demonstrations.

More recent approaches explore pretraining directly from large-scale human activity videos to bootstrap dexterous manipulation capabilities. For instance, VITRA pretrains VLA models on 1 million episodes derived from unscripted egocentric human videos, treating human hand motions as proxy robot actions to generate a dataset with broad coverage of objects and tasks, demonstrating scalable performance gains and strong zero-shot transfer to robotic observations.
Overall, pretraining operates at scales of hundreds of thousands to millions of episodes, and dataset diversity across robot embodiments, tasks, and environmental conditions plays a critical role in developing generalist models capable of broad generalization.
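The co-fine-tuning recipe above mixes two data sources in each training batch. A minimal sketch of such mixture sampling follows; the 50/50 ratio and the stand-in datasets are illustrative assumptions, not published values:

```python
import random

random.seed(0)

web_data = [("caption", i) for i in range(1000)]       # stand-in VL examples
robot_data = [("trajectory", i) for i in range(1000)]  # stand-in robot episodes

def sample_batch(batch_size, robot_fraction=0.5):
    """Draw a co-fine-tuning batch mixing both sources at a fixed ratio,
    so the model sees web knowledge and robot control data side by side."""
    batch = []
    for _ in range(batch_size):
        source = robot_data if random.random() < robot_fraction else web_data
        batch.append(random.choice(source))
    return batch

batch = sample_batch(10_000)
robot_share = sum(1 for kind, _ in batch if kind == "trajectory") / len(batch)
assert 0.45 < robot_share < 0.55   # close to the requested mixture ratio
```

Keeping web vision-language examples in the mixture is what lets the fine-tuned model retain its pretrained semantic knowledge rather than overfitting to robot trajectories alone.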

Key datasets

The Open X-Embodiment dataset represents the primary large-scale resource for training vision-language-action models, containing over 1 million real robot trajectories aggregated from 60 existing datasets contributed by 34 robotic research labs worldwide. This collection spans 22 different robot embodiments—including single-arm, bimanual, and quadruped platforms—and covers 527 skills across 160,266 tasks, incorporating diverse scenes, household objects, and behaviors. Models such as OpenVLA have been pretrained on 970,000 episodes from this dataset, while others like Octo have utilized substantial portions of it to develop generalist robotic policies. Other approaches rely on heterogeneous mixtures of data sources to enable broader generalization, particularly for humanoid applications. GR00T N1, for instance, trains on a combination of real-robot trajectories, human videos, and synthetically generated datasets. High-quality teleoperation datasets have also emerged as key resources for efficient VLA training. Helix, for example, uses approximately 500 hours of supervised teleoperated demonstrations collected across multiple robots and operators, with natural language instructions auto-generated via a vision-language model to pair video segments with hindsight text descriptions. These datasets emphasize quality and diversity over sheer scale, supporting generalization to complex tasks without extensive fine-tuning.

Fine-tuning techniques

Fine-tuning vision-language-action (VLA) models adapts pretrained multimodal foundation models to specific robotic embodiments, tasks, or environments using limited, task-specific datasets, enabling generalization to new robots with minimal additional data. Parameter-efficient fine-tuning techniques, especially Low-Rank Adaptation (LoRA), have become standard for this purpose due to their ability to update only a small fraction of parameters while preserving performance and drastically reducing computational and memory demands. LoRA inserts low-rank trainable matrices into attention layers, typically reducing trainable parameters by orders of magnitude compared to full updates. In OpenVLA, LoRA is applied instead of full fine-tuning, avoiding model sharding and enabling training on 8 A100 or H100 GPUs in 1-2 days for adaptation to new tasks and embodiments. LoRA-based approaches support efficient adaptation on small datasets, as seen in real-world deployments where they achieve reliable manipulation (e.g., 74-76% success on button-pressing tasks) using just hundreds of demonstrations on low-cost robotic arms. Full fine-tuning, which updates all model parameters, remains an option for scenarios demanding maximum adaptation fidelity to novel embodiments, though its high resource requirements make it less practical for large VLAs compared to parameter-efficient methods. Quantization, such as 4-bit NormalFloat4 (NF4), is often combined with LoRA during or after fine-tuning to further optimize inference efficiency without substantial performance degradation (typically <2% accuracy loss). This reduces memory usage by 3-8x, enabling real-time operation (e.g., 20 Hz inference with ~45 ms latency) on consumer-grade GPUs with limited VRAM, facilitating deployment on resource-constrained robotic platforms.
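The parameter savings from LoRA can be made concrete with a single-layer sketch. The hidden size, rank, and scaling factor below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 4096, 16                      # hidden size and LoRA rank (illustrative)
alpha = 32                           # assumed LoRA scaling factor

W = rng.standard_normal((d, d))      # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))                 # B starts at zero: the adapter is a no-op

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B are trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# At initialization the adapted layer matches the base model exactly.
assert np.allclose(lora_forward(x), W @ x)

full_params = d * d                  # parameters a full update would touch
lora_params = 2 * d * r              # parameters LoRA actually trains
assert lora_params / full_params < 0.01   # under 1% of the layer's weights
```

With rank 16 on a 4096-wide layer, the trainable adapter is roughly 0.8% of the full weight matrix, which is the orders-of-magnitude reduction the text refers to.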

Notable models

Robotic Transformer 2 (RT-2)

Robotic Transformer 2 (RT-2) is a vision-language-action (VLA) model developed by Google DeepMind and introduced in July 2023. It represents an early foundational example of end-to-end VLA systems that unify visual perception, language understanding, and action generation in a single architecture for robotic control. RT-2 builds on large pre-trained vision-language models, specifically PaLI-X (55B parameters) and PaLM-E (12B parameters), which are co-fine-tuned on a combination of Internet-scale vision-language data and robotic trajectory data collected from real-world demonstrations. This approach transfers semantic knowledge from web-scale pretraining to robotic tasks, enabling the model to process robot camera images and natural language instructions as input and produce actions directly as output. Actions are represented as discretized tokenized strings (e.g., sequences of numbers indicating episode continuation/termination, end-effector position, rotation, and gripper state), which are treated like natural language tokens and processed using standard text tokenizers. Compared to its predecessor RT-1, RT-2 shows substantially improved generalization to novel objects, backgrounds, and task variations, with approximately 3x better performance in certain evaluations and roughly 2x improvement in out-of-distribution generalization. It also demonstrates emergent multi-step reasoning capabilities through chain-of-thought fine-tuning, where the model generates intermediate natural language planning steps before predicting action tokens, enabling it to handle complex commands requiring semantic understanding or rudimentary planning (such as selecting an object for a specific purpose based on web-derived knowledge). RT-2 achieves strong results on benchmarks like the language-table suite (90% success rate in simulation) and exhibits robust real-world performance on previously unseen objects and instructions not present in its robotic training data.

OpenVLA

OpenVLA is an open-source 7-billion-parameter vision-language-action model developed by a collaboration of researchers from Stanford University, UC Berkeley, Google DeepMind, the Toyota Research Institute, and other institutions. Released in June 2024, it was pretrained on 970,000 real-world robot demonstration episodes from the Open X-Embodiment dataset, enabling robust visuomotor control from language instructions and camera images. The model outperforms the closed-source RT-2-X (55 billion parameters) by 16.5% in absolute task success rate across 29 evaluation tasks spanning multiple robot embodiments, achieving state-of-the-art performance among generalist manipulation policies despite using seven times fewer parameters. OpenVLA demonstrates strong generalization across diverse conditions, including unseen visual backgrounds and distractors, novel object positions and orientations, variations in object sizes and shapes, and semantic changes such as new target objects or instructions. It also supports efficient fine-tuning, with techniques like LoRA matching full fine-tuning performance while updating only 1.4% of parameters, allowing rapid adaptation to new robot setups and tasks. As a fully open-source release—with model checkpoints available on Hugging Face and a modular PyTorch training codebase on GitHub—OpenVLA has enabled broader accessibility and community-driven advancements in embodied AI research.

Octo

Octo is an open-source generalist robot policy developed by researchers from UC Berkeley and collaborators in 2024. It serves as a lightweight foundation for robotic manipulation, available in two sizes: Octo-Small with 27 million parameters and Octo-Base with 93 million parameters. Octo employs a transformer-based diffusion policy that generates continuous actions through diffusion decoding, supporting multi-modal action distributions. The model was pretrained on 800,000 robot episodes from the Open X-Embodiment dataset, which includes diverse embodiments, tasks, sensors, and language annotations. Its block-wise attention structure enables flexible finetuning to new observation and action spaces, such as adding proprioceptive inputs, alternative camera configurations, or different control modes, often requiring only a few hours on consumer GPUs. Octo supports natural language instructions and goal image conditioning, facilitating adaptation across varied robotic setups and tasks.

π0 (pi-zero)

π0 (pronounced pi-zero) is a vision-language-action model developed by Physical Intelligence and released in late 2024. It serves as a generalist robot policy designed for high-frequency continuous control in dexterous tasks across diverse embodiments. The model uses PaliGemma, a 3 billion parameter pre-trained vision-language model, as its backbone to inherit Internet-scale semantic knowledge. This is augmented with a flow-matching architecture that generates continuous action distributions, enabling the model to output motor commands at 50 Hz for real-time, precise control suitable for tasks requiring dexterity, such as laundry folding. Flow matching supports modeling of multimodal and high-precision action sequences, with actions produced as chunks of 50 future timesteps during inference. π0 demonstrates strong generalization across multiple robot embodiments, including single-arm, dual-arm, and mobile manipulators, through training on diverse datasets exceeding 10,000 hours from seven robot configurations and open-source sources like Open X-Embodiment. This cross-embodiment approach allows zero-shot performance on various tasks and efficient fine-tuning for new skills or platforms. Variants such as π0-FAST adopt an efficient discretized action-token representation (FAST) as an alternative to flow matching while retaining core capabilities. The model's continuous action representation via flow matching facilitates smooth, high-frequency control distinct from discrete-token approaches.
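At inference time, flow matching produces an action chunk by integrating a learned velocity field from Gaussian noise toward the data distribution. The sketch below uses an idealized field whose target is known in advance, so the integration can be checked exactly; the chunk shape, step count, and field are illustrative assumptions standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 50, 7                     # action chunk: 50 timesteps of 7-DoF commands

target = rng.uniform(-1, 1, size=(H, D))   # stand-in for the "true" chunk

def velocity(x, t):
    """Idealized learned velocity field that transports any point toward
    `target` along a straight probability path: v = (target - x) / (1 - t)."""
    return (target - x) / (1.0 - t)

def generate_chunk(steps=10):
    x = rng.standard_normal((H, D))        # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):                 # Euler integration of dx/dt = v
        t = k * dt
        x = x + dt * velocity(x, t)
    return x

chunk = generate_chunk()
assert chunk.shape == (H, D)
assert np.allclose(chunk, target)          # the ideal field reaches the target
```

In a real model the velocity field is a neural network conditioned on images and language, so the integration only approximates the action distribution; the sketch just shows why a handful of Euler steps suffices to turn noise into a full 50-step action chunk.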

Helix

Helix is a vision-language-action (VLA) model developed by Figure AI and introduced on February 20, 2025. It represents an early 2025 advancement in humanoid VLA systems, designed specifically to enable generalist control of humanoid robots through unified perception, language understanding, and precise action output. Helix employs a dual-system architecture to address the conflicting requirements of high-level semantic reasoning and low-latency motor control in humanoid embodiments. The model consists of System 2, an onboard pretrained vision-language model that operates at 7-9 Hz to process monocular images, robot state, and natural language instructions into semantic latent representations, and System 1, a fast transformer-based visuomotor policy that runs at 200 Hz to translate those representations into continuous actions. This decoupled design allows each component to operate at its optimal timescale while enabling end-to-end training with gradients flowing from System 1 back to System 2 via a latent communication vector. The model focuses on full upper-body control for humanoid robots, coordinating a 35-degree-of-freedom action space that includes individual finger movements, wrist poses, torso orientation, head gaze, and end-effector trajectories. It outputs high-rate continuous control signals at 200 Hz, enabling dexterous manipulation of diverse objects in response to natural language prompts. Helix was trained end-to-end on approximately 500 hours of high-quality teleoperated data collected from multi-robot, multi-operator demonstrations of diverse behaviors. Natural language conditioning was incorporated by using an auto-labeling vision-language model to generate hindsight instructions from segmented video clips, creating paired training examples that map observed actions back to plausible language commands. 
This relatively modest dataset size, combined with the model's architecture, supports strong zero-shot generalization to novel objects and tasks without task-specific fine-tuning.

GR00T N1

GR00T N1 is an open foundation model developed by NVIDIA for generalist humanoid robots, announced in March 2025 and described by NVIDIA as the world's first open humanoid robot foundation model. As a vision-language-action (VLA) model, it integrates multimodal perception, language understanding, and action generation to enable versatile autonomy in human environments. The model employs a dual-system architecture inspired by principles of human cognition. System 2, a slow-thinking vision-language module, processes environmental observations and language instructions to perform deliberate reasoning and planning. System 1, a fast-thinking diffusion transformer module, generates precise, continuous motor actions in real time, enabling fluid execution of planned behaviors. GR00T N1 is trained end-to-end on a heterogeneous mix of data sources, including real-robot trajectories, egocentric human videos, and large-scale synthetic data. This diverse training approach supports generalization across tasks and embodiments. The synthetic data, generated using NVIDIA's Isaac GR00T Blueprint, augments real data to enhance performance significantly. Designed specifically for dexterous humanoid control, GR00T N1 excels at language-conditioned bimanual manipulation tasks, such as grasping objects, moving items with one or both arms, transferring objects between arms, and performing multistep household activities. It demonstrates strong capabilities on real humanoid platforms like the Fourier GR-1, where it achieves effective performance with high data efficiency. In simulation benchmarks across multiple robot embodiments, it outperforms state-of-the-art imitation learning baselines, highlighting its effectiveness for complex, precise manipulation in humanoid robotics. The model is openly available with permissive licenses, allowing customization and fine-tuning for specific humanoid robots and tasks.

SmolVLA

SmolVLA is a compact, open-source Vision-Language-Action (VLA) model developed by Hugging Face's LeRobot team and released in June 2025. With 450 million parameters, it combines a vision-language model backbone (based on SmolVLM2 with SigLIP vision encoder and SmolLM2 language decoder) and a flow-matching Transformer action expert that generates continuous action chunks. The model employs flow-matching as its training objective, enabling direct, non-autoregressive prediction of actions for precise real-time control. SmolVLA incorporates an asynchronous inference stack that decouples perception/language understanding from action execution, yielding 30% faster response times and up to 2× higher task throughput compared to synchronous approaches. It was trained exclusively on LeRobot Community Datasets, a collection of 487 high-quality robotics datasets (primarily from the SO100 robotic arm) shared on the Hugging Face Hub, comprising approximately 10 million frames at 30 FPS. Despite its modest size, SmolVLA achieves performance comparable to or exceeding much larger VLAs on both simulated benchmarks (such as LIBERO and Meta-World) and real-world tasks on the SO100 and SO101 arms, even with fewer than 30,000 training episodes. It can be trained on a single consumer GPU or even a MacBook and deployed on affordable hardware like the SO-100/SO-101 arms or CPU-only setups, making advanced robotics accessible without high-end compute resources. The model and full training/inference recipes are openly available through the LeRobot library.
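The benefit of decoupling inference from execution can be shown with a toy timing model. The durations and chunk count below are illustrative assumptions, not measured SmolVLA figures:

```python
# Toy timing comparison of synchronous vs. asynchronous chunked inference.
# All numbers are illustrative assumptions.

INFER_MS = 300       # time to predict one action chunk
EXEC_MS = 500        # time to execute that chunk on the robot
CHUNKS = 10

def synchronous_total():
    """Robot idles during every inference pass: predict, then act, repeat."""
    return CHUNKS * (INFER_MS + EXEC_MS)

def asynchronous_total():
    """The next chunk is predicted while the current one executes, so only
    the first inference pass is exposed (since INFER_MS <= EXEC_MS here)."""
    return INFER_MS + CHUNKS * EXEC_MS

sync_t, async_t = synchronous_total(), asynchronous_total()
assert async_t < sync_t
assert sync_t / async_t > 1.4   # >40% higher throughput in this toy setting
```

Whenever a chunk takes longer to execute than the next one takes to predict, the inference cost disappears entirely from the robot's timeline after the first pass, which is the effect the asynchronous stack exploits.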

Applications

Robotic manipulation

Vision-language-action (VLA) models have emerged as a powerful approach for robotic manipulation, enabling end-to-end control where robots interpret natural language instructions to perform object interactions such as grasping, stacking, and placement in real-world settings. RT-2, an early influential VLA model, integrates web-scale vision-language pretraining to enable language-conditioned manipulation with emergent reasoning capabilities. It translates natural language commands into actions, demonstrating tasks such as picking up objects described semantically (e.g., "the extinct animal") and performing placement based on symbolic or relational cues, with generalization to novel objects and scenes through 6,000 real-robot trials showing substantial performance gains over prior methods. OpenVLA, a 7B-parameter open-source model trained on 970k real-world robot trajectories, achieves state-of-the-art results among generalist manipulation policies, supporting language-conditioned grasping, stacking, and placement across diverse robot embodiments. It handles multi-object and multi-step tasks reliably, such as "Put Eggplant into Pot," "Put Blue Cup on Plate," and "Stack Cups," while generalizing to unseen backgrounds, distractors, object variations, and semantic instructions, outperforming RT-2-X by 16.5% in absolute success rate across 29 tasks and enabling zero-shot operation on platforms like WidowX and Google Robot arms. Octo, a transformer-based diffusion policy pretrained on 800k episodes from the Open X-Embodiment dataset, provides robust language-conditioned control for grasping, stacking, and precise multi-object tasks across nine real-robot setups. 
It excels in zero-shot generalization and efficient fine-tuning for new environments, achieving high success rates (e.g., averaging 0.72 across six evaluation setups post-finetuning) and outperforming baselines like RT-1-X in multi-step scenarios involving long horizons or dual-arm coordination. These models demonstrate that VLA architectures enable robots to handle complex, language-instructed manipulation in unstructured settings, with strong generalization to unseen objects, scenes, and task variations when trained on diverse real-world data.

Humanoid robot control

Vision-language-action (VLA) models have increasingly been applied to humanoid robot control, enabling high-frequency continuous actions for dexterous, full-body coordination in real-world settings. These models support natural motion by producing fluid motor commands from multimodal inputs, addressing the demands of complex humanoid tasks such as bimanual manipulation and multi-robot collaboration. In 2025, Figure AI introduced Helix, a VLA model specialized for generalist upper-body control in humanoids. Helix controls the entire upper body, including wrists, torso, head, and individual fingers, across 35 degrees of freedom at a high rate of 200 Hz for the low-level visuomotor policy. This enables precise, continuous actions that support responsive and natural dexterity, such as picking up diverse household objects from natural language prompts and coordinating with other robots on tasks like collaborative grocery storage. Its dual-system architecture separates high-level semantic processing (operating at 7-9 Hz) from fast reactive control, allowing generalization to novel objects and environments without task-specific fine-tuning. NVIDIA's GR00T N1, an open foundation VLA model also released in 2025, extends to full humanoid policies through a dual-system architecture combining vision-language understanding with action generation. Trained on diverse data including egocentric human videos, real and simulated robot trajectories, and synthetic sources, GR00T N1 supports language-conditioned bimanual manipulation and has been deployed on humanoids such as the Fourier GR-1, demonstrating strong performance in household tasks and generalization across embodiments. These 2025 advancements highlight the rapid emergence of VLA models for humanoid control in complex, unstructured environments, leveraging high-frequency continuous actions to achieve more human-like dexterity and adaptability.

Autonomous agents and interactive systems

Vision-language-action (VLA) models extend beyond stationary robotic manipulation to enable autonomous agents and interactive systems capable of operating in dynamic, unstructured environments. These models support navigation, long-horizon task execution, and natural human-robot interaction by unifying visual perception, language understanding, and action generation in a single architecture, allowing agents to follow natural language instructions while adapting to changing conditions. In navigation and mobile autonomy, VLA frameworks have been adapted for autonomous driving, where they process visual inputs and language-based goals to produce end-to-end driving actions in real-time dynamic scenarios. Such applications position VLA as a promising approach for building general-purpose autonomous agents that integrate perception, reasoning, and control for spatial navigation and decision-making in open-world settings. For long-horizon tasks, models like LoHoVLA introduce unified frameworks that decompose high-level language instructions into sequences of sub-tasks while jointly predicting actions, enabling embodied agents to handle multi-step processes with closed-loop adaptation to failures or environmental changes. This approach enhances robustness in dynamic environments through hierarchical control that re-plans sub-tasks as needed and supports generalization across diverse long-horizon scenarios. Adaptive reasoning mechanisms further advance interactive capabilities, as seen in OneTwoVLA, which dynamically switches between explicit reasoning for critical moments and direct action generation otherwise. This supports long-horizon planning, error detection and recovery, and natural human-robot interaction by allowing the agent to respond to human queries or interruptions during task execution using language-grounded reasoning. 
Language grounding plays a central role in dynamic instruction following, where VLA models interpret natural language commands in the context of visual observations to produce contextually appropriate actions. Benchmarks like VLABench highlight this by evaluating long-horizon reasoning under implicit human intentions and common-sense requirements, underscoring the potential for VLA-based agents in interactive settings that demand semantic understanding and multi-step coordination. Emerging extensions of VLA to digital interactive environments, such as GUI-based agents, demonstrate the model's versatility in non-physical domains, where vision-language-action pipelines enable goal-directed interaction with software interfaces through screen observations and action outputs. These developments point toward broader applications in autonomous agents that operate across physical and virtual interactive systems.

Challenges and limitations

Generalization issues

Vision-language-action (VLA) models encounter substantial generalization challenges, primarily stemming from their reliance on finite training datasets that fail to fully capture the variability of real-world embodied environments. A core limitation lies in visual generalization, where models often overfit to specific visual conditions observed during training, such as particular backgrounds, lighting, table textures, or distractor objects. Direct fine-tuning of pretrained vision-language models on robotics data commonly degrades the quality of visual representations, leading to reduced robustness when encountering out-of-distribution (OOD) visuals. For instance, performance significantly drops in simulated environments with randomized visual variants, as pretrained features lose their separability and semantic structure. Semantic generalization also remains limited, with models struggling to handle variations in natural language instructions, such as paraphrases of the same command (e.g., "grasp the can" versus "get the can"). This gap arises because fine-tuning disrupts the alignment between language understanding and action outputs, hindering the transfer of pretrained semantic knowledge to novel task formulations. Physical and motion generalization poses further difficulties, particularly in adapting to novel objects, motions, or task combinations not present in training data. Evaluations on benchmarks designed to test cross-task zero-shot capabilities, such as AGNOSTOS, reveal that leading VLA models achieve low average success rates on unseen manipulation tasks (e.g., around 17.5% for π₀ and 15.6% for VoxPoser across 23 tasks). Models frequently fail completely on Level-2 tasks involving entirely novel objects and motions, underscoring inadequate compositional reasoning and extrapolation beyond seen distributions. These issues are exacerbated by dependence on training data diversity. 
Datasets like Open X-Embodiment provide large-scale cross-embodiment demonstrations but still contain coverage gaps in rare embodiments, environmental variations, or long-horizon interactions, leading to overfitting to common patterns and poor transfer to novel settings. Limited availability of high-quality, action-labeled data across diverse scenarios further constrains models' ability to achieve broad generalization without degrading pretrained capabilities.

Real-time performance and latency

Real-time performance and latency remain critical challenges for deploying vision-language-action (VLA) models in embodied robotics, where low-latency inference is essential for closed-loop control at high frequencies (typically 20–50 Hz or higher). Large VLA models, such as OpenVLA (approximately 7B parameters) and π₀ (around 3B parameters), impose substantial computational demands during inference, often resulting in latencies exceeding 100 ms per forward pass on consumer-grade GPUs, which limits their suitability for tasks requiring continuous feedback from dynamic environments. A key trade-off arises between single-model end-to-end architectures, which unify vision, language, and action processing for coherent behavior but incur high per-step latency, and dual-system or hierarchical approaches that separate high-level planning from low-level control to achieve higher-frequency execution. Techniques such as action chunking address this by generating multiple future actions in a single inference pass, enabling asynchronous operation where the robot executes one chunk while the model computes the next, thereby maintaining continuous motion without pauses. Real-Time Chunking (RTC), for instance, supports robust performance under latencies up to 200 ms or more by treating delayed predictions as an inpainting task, preserving action continuity and improving task precision compared to synchronous baselines. Optimizations for efficient serving further mitigate latency. For OpenVLA, variants like OpenVLA-OFT employ parallel decoding and action chunking to achieve up to 26× faster action generation and 3× lower latency relative to the base model, facilitating high-frequency control on hardware such as bimanual robots. 
Specialized inference frameworks, such as those developed for π₀, reduce per-inference latency to as low as 27.3 ms (for two input views) on an RTX 4090 GPU through CUDA graph recording, kernel fusion, and overlapping computation streams, enabling 30 Hz frame rates and up to 480 Hz trajectory frequencies in streaming setups. HyperVLA, using hypernetwork-based activation of only a subset of parameters, accelerates inference by 120× compared to OpenVLA while reducing activated parameters by 90×. Quantization, model compression, and training-free acceleration methods are increasingly adopted to lower memory and compute requirements, allowing deployment on consumer hardware and supporting real-time edge inference in resource-constrained robotic systems. These advances highlight the ongoing shift toward balancing model capacity with practical deployment constraints, though larger models continue to demand significant GPU resources (often 15–18 GB for inference).
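The way action chunking keeps the control loop busy under inference latency can be sketched as a tick-level simulation. The chunk length, latency, and refill policy below are illustrative assumptions:

```python
# Toy control loop: the robot consumes one action per tick from a buffer while
# a background "inference" pass delivers a fresh chunk every LATENCY ticks,
# replacing whatever remained of the previous chunk (as in re-planning).

CHUNK = 8        # actions produced per inference pass (illustrative)
LATENCY = 5      # ticks one inference pass takes; note LATENCY <= CHUNK
TICKS = 100

def run_loop():
    buffer = list(range(CHUNK))          # first chunk computed before motion
    inference_done_at = LATENCY          # next chunk arrives at this tick
    starved = 0
    for tick in range(TICKS):
        if tick == inference_done_at:    # background pass finished: re-plan
            buffer = list(range(CHUNK))
            inference_done_at = tick + LATENCY
        if buffer:
            buffer.pop(0)                # execute one low-level action
        else:
            starved += 1                 # robot would pause here
    return starved

assert run_loop() == 0   # with CHUNK >= LATENCY the robot never idles
```

The invariant is simple: as long as each inference pass finishes before the previous chunk runs out, the executor never sees an empty buffer, so motion stays continuous despite per-pass latency.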

Safety and ethical concerns

The deployment of vision-language-action (VLA) models in embodied AI systems introduces substantial safety risks due to the potential for erroneous or compromised actions to cause physical harm to humans, damage to environments, or injury to the robot itself. These models directly map multimodal inputs to continuous action outputs without explicit safety intermediaries, making them vulnerable to unpredictable failures in real-world settings with unstructured dynamics and long-horizon tasks. For instance, unaligned VLA policies exhibit unsafe behaviors such as navigating into confined corners leading to repeated collisions and entrapment, failing to account for blind spots resulting in impacts with previously observed but currently unseen obstacles, causing collateral damage to nearby fragile items during manipulation, destabilizing precariously positioned objects, or interacting with intrinsically hazardous equipment like active stovetops. Such violations can escalate to severe consequences, including hardware destruction, environmental harm, or direct human injury. Adversarial attacks further amplify these risks by exploiting vulnerabilities in perception and action generation, enabling malicious perturbations to sensory inputs that translate directly into unsafe physical behaviors. Research has demonstrated that targeted attacks can induce critical safety violations, such as a robot holding a knife being manipulated to point toward and approach a nearby human in real-world experiments, violating standards for human-robot separation and velocity. These attacks achieve high success rates across safety categories and underscore the direct pathway from model compromise to embodied harm, particularly in human-centric environments. 
VLA models also inherit biases from their training datasets, which often reflect human prejudices present in large-scale vision-language-action data, potentially leading to unfair, discriminatory, or unreliable decisions in applications involving human interaction or resource allocation. The black-box nature of these large-scale models exacerbates concerns around interpretability, making it difficult to predict, debug, or explain why a particular action was selected in safety-critical scenarios. This lack of transparency hinders accountability and trust in autonomous systems. Broader ethical implications arise from the increasing autonomy of VLA-driven robots, including questions of alignment with human values, responsibility for unintended harms, and the potential for misuse in sensitive domains. Ensuring these models prioritize safety and fairness during deployment has become critical as they advance toward real-world integration in humanoid and other robotic applications.

Future directions

Scaling laws and efficiency

Research indicates that performance in robotic tasks addressed by vision-language-action (VLA) models follows power-law scaling relationships, with improvements in success rates as model size, data volume, and compute increase. These scaling laws in robotics exhibit faster rates of improvement compared to those observed in language modeling, suggesting substantial potential for enhanced generalization and new capabilities through further increases in scale. Specific work has shown that integrating world models into VLA architectures amplifies data scaling effects by providing dense self-supervised signals, leading to accelerated performance gains as training datasets grow larger in domains such as autonomous driving. This points toward promising future gains from scaling beyond current typical sizes (around 7 billion parameters in many existing VLAs) through larger models trained on more extensive multimodal datasets. Parallel to raw scaling, efforts emphasize efficiency to democratize VLA deployment. Compact models such as SmolVLA demonstrate that reduced-parameter architectures can achieve performance comparable to models ten times larger while enabling training on single GPUs and inference on consumer-grade hardware. These designs prioritize affordability and deployability in resource-constrained settings. Advances in efficient VLA architectures include model compression techniques such as quantization and knowledge distillation, alongside optimized training pipelines and efficient fine-tuning strategies that lower computational demands without substantial performance loss. Such innovations support broader adoption by making powerful VLA capabilities viable for diverse robotic platforms beyond high-end compute clusters.
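A power-law scaling relationship is typically identified by fitting a straight line in log-log space. The sketch below uses synthetic data with a known exponent, so the fit can be verified; the exponent and constant are illustrative assumptions, not published robotics values:

```python
import numpy as np

# Synthetic scaling data: task error falling as a power law of dataset size,
# error = C * N^(-alpha). alpha_true and C are illustrative assumptions.
alpha_true, C = 0.3, 2.0
N = np.array([1e3, 1e4, 1e5, 1e6, 1e7])   # dataset sizes (episodes)
error = C * N ** (-alpha_true)

# A power law is a straight line in log-log space; the slope is the exponent.
slope, intercept = np.polyfit(np.log(N), np.log(error), 1)
alpha_hat = -slope

assert abs(alpha_hat - alpha_true) < 1e-6
assert abs(np.exp(intercept) - C) < 1e-6
```

In practice the measured points scatter around the line, and the fitted exponent summarizes how quickly performance improves per order of magnitude of additional data or parameters.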

Advanced multimodal integration

Future advancements in vision-language-action (VLA) models are focusing on expanding multimodal integration beyond core vision and language inputs to incorporate additional sensory modalities such as touch (tactile and force feedback) and proprioception, enabling more robust embodied interaction in contact-rich environments. For instance, frameworks like Tactile-VLA fuse tactile sensing with visual, linguistic, proprioceptive, and action data, allowing robots to follow tactile-aware instructions (e.g., adjusting force for "soft" or "hard" interactions) and to leverage pretrained commonsense knowledge for zero-shot generalization in physical tasks. Similarly, force-distilled approaches such as FD-VLA inject predicted force tokens into VLA architectures without requiring physical force sensors during inference, improving cross-modal alignment and performance in manipulation scenarios where contact forces are critical. Proprioception is frequently integrated as a foundational modality for state awareness, with ongoing efforts to enrich fusion mechanisms for more precise control and generalization.

Improved reasoning and long-horizon planning represent key targets for advanced integration, with techniques like chain-of-thought prompting over multimodal inputs enabling adaptive failure recovery and strategy adjustment during extended tasks. Research roadmaps emphasize unifying perception, language, and action within single architectures to support more sophisticated agentic behaviors, including cross-embodiment planning and socially aligned reasoning for general-purpose embodied agents. Integration with reinforcement learning (RL) and simulation environments is emerging as a powerful strategy to scale VLA capabilities and address long-horizon challenges.
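The modality fusion described above can be sketched schematically. In the following toy example, the module names, token counts, and dimensions are all invented for illustration (they are not taken from Tactile-VLA or FD-VLA); it shows the common pattern of concatenating per-modality token embeddings into the single sequence a shared transformer backbone would consume.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-modality encoders: each maps raw input to tokens in a
# shared d_model-dimensional latent space. Dimensions are made up; real
# encoders would be learned networks rather than random projections.
d_model = 64

def encode_vision(image):          # image -> 16 visual tokens
    return rng.standard_normal((16, d_model))

def encode_language(instruction):  # text -> one token per word (toy tokenizer)
    return rng.standard_normal((len(instruction.split()), d_model))

def encode_tactile(forces):        # force readings -> 4 tactile tokens
    return rng.standard_normal((4, d_model))

def encode_proprio(joint_state):   # joint angles -> 1 state token
    return rng.standard_normal((1, d_model))

def fuse(image, instruction, forces, joint_state):
    """Concatenate modality token streams into one sequence, as a
    tactile-aware VLA might feed to its transformer backbone."""
    return np.concatenate([
        encode_vision(image),
        encode_language(instruction),
        encode_tactile(forces),
        encode_proprio(joint_state),
    ], axis=0)

seq = fuse(None, "grasp the soft cup gently", None, None)
print(seq.shape)  # (16 + 5 + 4 + 1, 64) -> (26, 64)
```

Because every modality lands in the same token space, adding a new sensor amounts to adding one more encoder and extending the sequence, which is why tokenized fusion scales well to extra modalities.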
RL-based fine-tuning methods, such as those in SimpleVLA-RL, improve training efficiency and achieve substantial gains on long-horizon benchmarks by leveraging trajectory sampling and reward signals, reducing dependence on massive static datasets while improving policy robustness. These approaches pave the way for more capable, generalist VLA systems.

Real-world deployment trends have accelerated since 2024, with VLA models transitioning from research prototypes to practical applications in industrial and embodied settings, driven by advances in on-device inference, humanoid platforms, and open-source frameworks. On-device, low-latency versions have emerged to enable deployment on resource-constrained robots. Google DeepMind's Gemini Robotics On-Device is a key example, optimized for local execution with low-latency inference, and the first such VLA model made available to robotics developers. Similarly, SmolVLA, a compact 450M-parameter open-source model, supports efficient operation on consumer hardware, addressing power and speed constraints for edge deployment.

Humanoid robot integration has progressed through commercial pilots. Figure AI's Helix VLA model provides generalist control for humanoid robots, enabling full upper-body coordination at 200 Hz and collaborative manipulation of novel objects via natural-language prompts. Separately, Figure AI has deployed Figure 02 humanoid robots in real-world industrial settings, including an 11-month pilot at BMW Group Plant Spartanburg that contributed to the production of over 30,000 vehicles.

Open-source ecosystems have accelerated broader adoption by providing accessible models for customization across robot embodiments. OpenVLA, a 7B-parameter model pretrained on 970k robot episodes, supports fine-tuning for diverse platforms and demonstrates zero-shot performance on household-like manipulation tasks such as table wiping, pot flipping, and object stacking.
These efforts, along with repositories like WholebodyVLA for humanoid-focused research, enable community-driven deployment in varied settings including potential household robotics.
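The sampling-and-reward idea behind RL-based fine-tuning can be reduced to a minimal REINFORCE-style sketch. The toy problem below, with three discrete action chunks and hand-picked rewards, is purely illustrative and takes nothing from the SimpleVLA-RL codebase; it shows the core loop of sampling actions from the policy, scoring them with a reward, and reinforcing the log-probability of highly rewarded actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy over 3 discrete action chunks, parameterized by logits. A real
# VLA policy is a transformer over multimodal tokens, but the update rule is
# the same idea: sample, score with a reward, reinforce high-reward actions.
logits = np.zeros(3)
reward_per_action = np.array([0.1, 1.0, 0.2])  # action 1 completes the task

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr, batch = 0.5, 16
for step in range(100):
    probs = softmax(logits)
    grad = np.zeros(3)
    for _ in range(batch):          # trajectory sampling
        a = rng.choice(3, p=probs)
        r = reward_per_action[a]    # sparse task reward
        g = -probs.copy()
        g[a] += 1.0                 # d log pi(a) / d logits
        grad += r * g               # REINFORCE gradient estimate
    logits += lr * grad / batch

print(int(np.argmax(softmax(logits))))  # policy concentrates on action 1
```

Because the update needs only sampled trajectories and a scalar reward, it reduces dependence on large static demonstration datasets, which is the property the fine-tuning methods above exploit at scale.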
