BeingBeyond's latest achievement: Being-H0, the first VLA trained on human operation trajectories

Date: 2026.01.20



Enabling robots to understand human intentions in the real world and act on them has long been a core challenge in embodied intelligence. However, the development of vision-language-action (VLA) models faces an urgent problem: a severe shortage of real-robot operation data.

The robotics industry has invested heavily in data-collection platforms, but the result is still a drop in the bucket. The huge gap between the data available today and the hundreds of millions of samples demanded by model scaling laws has become a key factor restricting technological breakthroughs.

Against this backdrop, the BeingBeyond research team has proposed an innovative solution: extracting hand movement trajectories from massive amounts of human operation videos and constructing a high-quality training dataset with a scale of hundreds of millions.

More importantly, the team proposed a new training framework called "physical instruction fine-tuning," which accurately maps human hand movements into the robot's action space, closing the loop from visual understanding to action generation.


Being-H0 paper cover

Based on this breakthrough, the team trained Being-H0, the first large-scale VLA model pre-trained on human hand video data, and completed system verification on a real robot platform. The research revealed several key findings: the human hand can be regarded as a universal template for various end-effectors; pre-training on large-scale videos of human hand operation effectively alleviates data scarcity; and the method significantly improves robot task success rates while greatly reducing dependence on real-robot data.


Being-H0: The first large-scale VLA model trained using human operation trajectories

Being-H0 is built on an important assumption: human hand movements represent the most versatile operational paradigm, and all existing types of robotic end-effectors can be regarded as subsets of it. Whether it is a complex five-fingered dexterous hand or a simple two-fingered gripper, they can all benefit from knowledge of human hand movements. By learning human operation trajectories for pre-training, a base model with wide adaptability can be constructed.

In addition, such video data is extremely easy to obtain in the era of short videos, and naturally avoids the differences between simulated data and real scenes.


Being-H0 Physical Instruction Fine-Tuning Framework

Drawing on the successful approach of visual instruction fine-tuning, the research team designed a complete physical instruction fine-tuning framework, which is specifically optimized for the heterogeneity problem between 2D visual-language data and 3D action space. This framework consists of three key components:

Pre-training phase: Learning from millions of human hand videos

Traditional multimodal models are often limited by modality differences when transferred to embodied tasks. To address this, the team built a unified multimodal autoregressive architecture that integrates visual, language, and action representations through massive hand-trajectory data. In the encoding design, dedicated encoders are used for the wrist and fingers respectively, and actions are discretized with a grouped residual quantized variational autoencoder, keeping pose reconstruction error at the millimeter level.
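To illustrate how grouped residual quantization turns a continuous hand pose into discrete tokens, the minimal sketch below splits a pose vector into wrist and finger groups and quantizes each with a stack of residual codebooks. The dimensions, group split, and random codebooks are illustrative assumptions only; the paper's tokenizer uses learned codebooks, not random ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pose layout: 6-D wrist (translation + rotation) + 45-D fingers.
GROUPS = {"wrist": slice(0, 6), "fingers": slice(6, 51)}
N_LEVELS = 3          # residual quantization depth per group
CODEBOOK_SIZE = 256

# One codebook per (group, level); in practice these would be learned.
codebooks = {
    name: [rng.normal(size=(CODEBOOK_SIZE, sl.stop - sl.start))
           for _ in range(N_LEVELS)]
    for name, sl in GROUPS.items()
}

def quantize(pose):
    """Return per-group discrete token ids and the reconstructed pose."""
    tokens, recon = {}, np.zeros_like(pose)
    for name, sl in GROUPS.items():
        residual = pose[sl].copy()
        ids, approx = [], np.zeros_like(residual)
        for level in range(N_LEVELS):
            cb = codebooks[name][level]
            # Pick the codeword nearest to the current residual.
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
            ids.append(idx)
            approx += cb[idx]          # accumulate the reconstruction
            residual -= cb[idx]        # quantize what remains at the next level
        tokens[name] = ids
        recon[sl] = approx
    return tokens, recon

pose = rng.normal(size=51)
tokens, recon = quantize(pose)
print(tokens["wrist"], tokens["fingers"])
```

The resulting token ids are what an autoregressive model can predict alongside text tokens; the residual levels let coarse pose structure and fine corrections occupy separate codebooks.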

Physical space alignment: achieving the mapping from 2D video to 3D space

Through a unified coordinate-system transformation, differences in camera parameters, viewpoints, and other factors across video sources are eliminated, ensuring that the spatial and motion representations the model learns are consistent and transferable.
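The alignment idea can be sketched as follows: given each clip's camera extrinsics, 3-D hand keypoints are mapped into a shared world frame and re-centered on the wrist, so the same motion observed by different cameras yields the same canonical trajectory. The camera poses and keypoints below are made up for illustration.

```python
import numpy as np

def to_canonical(keypoints_cam, R_wc, t_wc):
    """Map keypoints from a camera frame to world, then re-center on the wrist.

    keypoints_cam: (N, 3) points in the camera frame; row 0 is the wrist.
    R_wc, t_wc: camera-to-world rotation (3x3) and camera position (3,).
    """
    world = keypoints_cam @ R_wc.T + t_wc   # camera -> world
    return world - world[0]                 # wrist-centered canonical frame

# A toy hand: wrist at the origin plus two keypoints, in world coordinates.
hand_world = np.array([[0.0, 0.0, 0.0],
                       [0.1, 0.0, 0.0],
                       [0.0, 0.1, 0.05]])

def make_view(R_wc, t_wc):
    # Project world points into that camera's frame: p_cam = R_wc^T (p - t_wc).
    return (hand_world - t_wc) @ R_wc

theta = np.pi / 4
R1 = np.eye(3)
R2 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
t1 = np.array([0.0, 0.0, 1.0])
t2 = np.array([0.5, -0.2, 0.8])

a = to_canonical(make_view(R1, t1), R1, t1)
b = to_canonical(make_view(R2, t2), R2, t2)
print(np.allclose(a, b))   # True: both cameras yield the same canonical pose
```

Wrist-centering removes residual translation, so clips from arbitrarily placed cameras become directly comparable training samples.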

Post-training phase: Transfer from the pre-trained model to the real robot

An efficient conversion mechanism from human actions to robot operations is established to ensure accurate and generalizable skill transfer.
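One simple form such a conversion can take, for a two-finger gripper, is collapsing a human pinch onto a gripper-width command while reusing the wrist pose as the end-effector target. The gripper stroke and the choice of thumb/index fingertips below are hypothetical, not the paper's retargeting scheme.

```python
import numpy as np

GRIPPER_MAX_OPEN = 0.08   # meters; hypothetical gripper stroke

def retarget_to_gripper(thumb_tip, index_tip, wrist_pose):
    """Map a human pinch to an (end-effector pose, gripper width) command.

    thumb_tip, index_tip: (3,) fingertip positions in the robot base frame.
    wrist_pose: (4, 4) homogeneous wrist pose, reused as the EE target.
    """
    pinch = np.linalg.norm(thumb_tip - index_tip)
    # Clamp the pinch distance to the gripper's physical range.
    width = float(np.clip(pinch, 0.0, GRIPPER_MAX_OPEN))
    return wrist_pose, width

wrist = np.eye(4)
pose, width = retarget_to_gripper(np.array([0.0, 0.0, 0.0]),
                                  np.array([0.03, 0.0, 0.0]),
                                  wrist)
print(width)   # 0.03
```

A five-fingered dexterous hand would instead need per-joint retargeting, but the same pattern applies: project the human hand state onto whatever degrees of freedom the target end-effector exposes.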


Hundred-million-scale UniHand Dataset

To support training under the physical instruction tuning framework, the research team built a complete data collection and processing pipeline, integrating multimodal information from 11 open-source data sources spanning motion capture, virtual reality devices, and conventional RGB videos. The resulting UniHand dataset contains 150 million human hand motion instruction samples, covering tasks such as gesture generation, motion semantic understanding, and context-aware motion prediction.
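A sketch of what folding such heterogeneous sources into one instruction-style schema might look like is shown below. The field names, task labels, and prompt templates are invented for illustration and are not UniHand's actual schema.

```python
from dataclasses import dataclass

@dataclass
class HandSample:
    source: str          # e.g. "mocap", "vr", "rgb_video"
    instruction: str     # language prompt for the task
    frames: list         # placeholder for visual observations
    hand_tokens: list    # discretized hand-motion tokens

# Hypothetical prompt templates for the three task families named above.
PROMPTS = {
    "gesture_generation": "Generate the hand motion for: ",
    "motion_understanding": "Describe the hand motion shown: ",
    "motion_prediction": "Predict the next hand motion given: ",
}

def make_sample(source, task, frames, hand_tokens):
    """Normalize one raw clip from any source into the shared schema."""
    return HandSample(source, PROMPTS[task], frames, hand_tokens)

s = make_sample("rgb_video", "motion_prediction", ["frame_0"], [17, 4, 99])
print(s.source, s.instruction)
```

Once every source emits the same (instruction, frames, motion tokens) triple, mocap, VR, and RGB clips can be mixed freely in one pre-training stream.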


UniHand dataset

It is worth noting that even when trained on only 2.5 million of these samples, the model has shown significant advantages in multiple evaluations.


Verification through real-robot experiments

In addition to regular task evaluations, the research team also conducted systematic real robot experiments. The experimental results show that under the same downstream task training conditions, the Being-H0 model optimized based on physical instructions significantly outperforms the base model InternVL3 and also surpasses the VLA model GR00T N1.5 open-sourced by NVIDIA during the same period.


It should be specially noted that although GR00T N1.5 used a larger scale of human video data in training, Being-H0 performed better in terms of action learning efficiency and task success rate due to its highly structured data construction strategy.

This comparison result strongly confirms the effectiveness of the data construction strategy in this study: by explicitly constructing pre-training data that is highly aligned with the structure of downstream tasks, the model's ability to learn human action knowledge from video data can be significantly improved, thereby enhancing the success rate of downstream tasks.

To further verify the robustness of the method, the research team conducted a comparative analysis of the performance between Being-H0 and the untrained basic model under different training data scales. The experiment set training data sampling ratios ranging from 25% to 100%. The results showed that under the condition of the same amount of data, the Being-H0 model consistently demonstrated stable performance advantages.


Pick-Place-Toy task

In addition, at the same success rate, Being-H0 requires far less real-robot data than other models. For example, on the Pick-Place-Toy task, the model matches the success rate that other models achieve with all of the data while using only 25% of the real-robot data, significantly improving data utilization efficiency.

This series of experiments not only verifies the effectiveness of the physical instruction tuning framework, but also confirms that the method can significantly reduce the amount of real-robot data required.

Article link: https://arxiv.org/pdf/2507.15597

Project official website: https://beingbeyond.github.io/Being-H0/


About BeingBeyond: Intelligence Without Boundaries

BeingBeyond has pioneered the paradigm of training general embodied models using large-scale human data. Both the dexterous hand operation models (Being-H Series) and the humanoid robot mobile operation models (Being-M Series) demonstrate industry-leading performance and can be deployed across different embodiments. Additionally, the team has built the world's largest first-person hand dataset and full-body posture dataset, which are becoming the core foundation for driving model iterations. BeingBeyond focuses on the research and development of general humanoid robot models, providing robot embodiment manufacturers and scenario-based customers with highly generalized and deployable embodied models.