Recently, research teams from Peking University and Renmin University of China have made a breakthrough in motion generation for general-purpose humanoid robots. They propose Being-M0, the first general motion generation framework to exhibit collaborative data-model scaling (a scaling law), and build around it a complete "data-model-transfer" technical system, opening a new path toward flexible and diverse motion control for humanoid robots.
Being-M0 paper cover
To achieve this goal, the research team first automatically extracted, annotated, and cleaned human motion data from a vast amount of internet video, and built MotionLib, the industry's first million-scale motion generation dataset. At more than 15 times the size of the largest existing public dataset, it provides a solid data foundation for training high-performance motion generation models.
Building on this foundation, the team developed an end-to-end text-driven motion generation model, verifying the technical feasibility of "big data + large models" in the field of motion generation, and efficiently transferred the generated human motions to multiple humanoid robot platforms, closing the loop from text instructions to robot motions.
Million-scale motion dataset MotionLib: breaking the data-scale bottleneck
In AI, data scale is often the key to a leap in model performance. To build a high-quality, large-scale motion data resource, the research team collected more than 20 million human motion videos from public datasets and online video platforms, and developed a fully automatic data processing pipeline. The pipeline uses pre-trained models to estimate and filter 2D human keypoints, then lifts them to high-precision 3D keypoints with models trained on large-scale 3D data.
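A key step in a pipeline like this is discarding clips whose 2D keypoint detections are too noisy to be worth lifting to 3D. The sketch below illustrates that filtering stage; the data layout, confidence thresholds, and ratios are illustrative assumptions, not MotionLib's actual criteria.

```python
# Hedged sketch of the quality-filtering stage of an automatic motion-data
# pipeline: keep only clips whose 2D keypoint confidences are high enough.
# Thresholds and the clip structure are assumptions for illustration.

def filter_clips(clips, min_conf=0.5, min_ratio=0.8):
    """Keep clips where at least `min_ratio` of frames have an average
    keypoint confidence of at least `min_conf`."""
    kept = []
    for clip in clips:
        good_frames = sum(
            1 for frame in clip["kp_conf"]
            if sum(frame) / len(frame) >= min_conf
        )
        if good_frames / len(clip["kp_conf"]) >= min_ratio:
            kept.append(clip)
    return kept

clips = [
    {"id": "a", "kp_conf": [[0.9, 0.8], [0.7, 0.9]]},  # confident detections
    {"id": "b", "kp_conf": [[0.2, 0.1], [0.3, 0.2]]},  # too noisy to lift
]
print([c["id"] for c in filter_clips(clips)])  # -> ['a']
```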
MotionLib dataset example
To address the coarse granularity of earlier motion annotations, the team proposed a hierarchical fine-grained annotation scheme. Large language models generate a structured description for each video segment that covers not only the overall action semantics but also records local movement details, such as those of the arms and legs, providing important support for fine-grained motion generation.
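A hierarchical annotation of this kind might look like the following. The field names and part breakdown are hypothetical, meant only to show the "overall semantics plus per-part detail" structure described above, not MotionLib's actual schema.

```python
# Hypothetical shape of a hierarchical, fine-grained motion annotation.
# Field names and granularity are illustrative assumptions.
annotation = {
    "clip_id": "000123",
    "whole_body": "A person walks forward while waving with the right hand.",
    "body_parts": {
        "right_arm": "raised to shoulder height, waving side to side",
        "left_arm": "swinging naturally with the gait",
        "legs": "steady forward walking with a moderate stride",
    },
}

def flatten_caption(ann):
    """Join the overall and per-part descriptions into one training caption."""
    parts = "; ".join(f"{k}: {v}" for k, v in ann["body_parts"].items())
    return f'{ann["whole_body"]} ({parts})'

print(flatten_caption(annotation))
```

Flattening the hierarchy into one caption is one simple way to feed such annotations to a text-conditioned model; keeping the levels separate would instead allow part-level conditioning.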
Large-scale motion generation model, enabling accurate mapping from language to motion
With MotionLib delivering an order-of-magnitude jump in data scale, the key question becomes how to fully exploit that data for model training.
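A common design for large text-to-motion models, and one consistent with the "big data + large models" framing here, is to quantize continuous motion into discrete tokens so that a language-model-style generator can be trained on them. The toy sketch below shows only the codebook quantize/detokenize step; the codebook size, feature dimension, and overall design are assumptions for illustration, not Being-M0's documented architecture.

```python
import numpy as np

# Toy sketch of the "motion as discrete tokens" idea: each frame's pose
# feature vector is snapped to its nearest codebook entry, yielding token
# ids an autoregressive model could generate from text. All shapes and
# values here are illustrative assumptions.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))    # 512 motion codes, 64-d each

def quantize(pose_features):
    """Map each frame's feature vector to its nearest codebook entry id."""
    dists = ((pose_features[:, None, :] - codebook[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)          # one token id per frame

def detokenize(token_ids):
    """Recover approximate pose features from token ids."""
    return codebook[token_ids]

poses = rng.normal(size=(8, 64))         # 8 frames of motion features
tokens = quantize(poses)
recon = detokenize(tokens)
print(tokens.shape, recon.shape)         # (8,) (8, 64)
```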
Efficient motion retargeting from humans to humanoid robots
Transferring the generated human motions to a physical robot is the final step in text-driven humanoid robot motion generation. Closing the loop from text to robot movement requires solving the core challenge of cross-embodiment motion transfer.
Because humanoid robots differ significantly in degree-of-freedom configuration, link dimensions, and other respects, traditional methods based on inverse kinematics or direct joint-angle mapping often produce distorted, or even dynamically infeasible, motions when retargeting human movements to robots.
The overall Being-M0 framework is divided into two stages.
To solve this problem, the Being-M0 team proposed a two-stage solution of "optimization + learning":
During the training-data construction stage, motion sequences satisfying the robot's kinematic constraints are generated by multi-objective optimization. The optimization considers not only basic constraints such as joint limits but also the smoothness and stability of the motion trajectories. Although this multi-objective optimization is computationally expensive, it guarantees high-quality data, laying a solid foundation for the subsequent learning stage.
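The combination of objectives described above can be sketched as a small projected-gradient problem on a single joint: track the human reference angles, penalize jerky changes, and project onto the joint limits each step. The weights, solver, and one-DoF setup are toy assumptions, not the paper's actual formulation.

```python
import numpy as np

# Toy version of the multi-objective retargeting cost: track reference
# joint angles, stay smooth, and respect joint limits. Weights and the
# projected-gradient solver are illustrative assumptions.

def retarget_cost(q, q_ref, w_track=1.0, w_smooth=0.1):
    track = ((q - q_ref) ** 2).sum()          # follow the human motion
    smooth = ((q[1:] - q[:-1]) ** 2).sum()    # penalize jerky trajectories
    return w_track * track + w_smooth * smooth

def solve(q_ref, q_min, q_max, lr=0.1, steps=200):
    q = np.clip(q_ref, q_min, q_max)          # feasible starting point
    for _ in range(steps):
        grad = 2.0 * (q - q_ref)              # gradient of tracking term
        diff = q[1:] - q[:-1]
        grad[1:] += 0.1 * 2.0 * diff          # gradient of smoothness term
        grad[:-1] -= 0.1 * 2.0 * diff
        q = np.clip(q - lr * grad, q_min, q_max)  # project onto joint limits
    return q

q_ref = np.array([0.0, 0.5, 1.2, 2.0])   # human angles (rad) over 4 frames
q = solve(q_ref, q_min=-1.5, q_max=1.5)  # robot joint limited to +/-1.5 rad
print(q.round(2))  # tracks q_ref but never exceeds the 1.5 rad limit
```

The projection step is what makes every iterate feasible, mirroring the point in the text that feasibility is enforced during data construction rather than left to the learned model.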
In the motion mapping stage, a lightweight MLP network learns the mapping from human motions to humanoid robot motions. Through a carefully designed network structure, this method efficiently supports multiple robot platforms, including H1, H1-2, and G1.
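In its simplest form, such a retargeting network is a per-frame forward pass from a human pose vector to robot joint targets. The sketch below uses random weights and assumed dimensions (a 66-d human pose, 19 robot DoFs, one hidden layer); in practice the weights would be trained on the optimization-generated pairs from the previous stage.

```python
import numpy as np

# Minimal sketch of the learned retargeting stage: a small MLP maps a
# human pose vector to robot joint targets. Dimensions and the two-layer
# design are assumptions for illustration; real weights come from
# training on optimization-generated human/robot motion pairs.

rng = np.random.default_rng(42)
W1, b1 = rng.normal(scale=0.1, size=(66, 256)), np.zeros(256)
W2, b2 = rng.normal(scale=0.1, size=(256, 19)), np.zeros(19)

def retarget(human_pose):
    h = np.maximum(human_pose @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2                          # robot joint targets

pose = rng.normal(size=66)                      # one frame of human pose
robot_q = retarget(pose)
print(robot_q.shape)  # (19,)
```

Because inference is just two matrix multiplies per frame, this kind of network is cheap enough to run online, which is the real-time advantage over solving the optimization directly.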
Compared with direct optimization, the neural-network-based method significantly improves the system's real-time performance while maintaining the accuracy of motion transfer.
Project address: https://beingbeyond.github.io/Being-M0/
Paper link: https://arxiv.org/abs/2410.03311
About BeingBeyond: Intelligence Without Boundaries
BeingBeyond pioneered the paradigm of training general embodied models on large-scale human data. Both its dexterous-hand manipulation models (Being-H series) and humanoid mobile-manipulation models (Being-M series) demonstrate industry-leading performance and can be deployed across different embodiments. Meanwhile, the team's first-person hand dataset and full-body posture dataset, the world's largest, are becoming the core foundation driving model iteration. BeingBeyond focuses on the research and development of general humanoid robot models, providing robot manufacturers and scenario-focused customers with highly generalizable, deployable embodied models.