Recently, research teams from Peking University and Renmin University of China have made a breakthrough in motion generation for general-purpose humanoid robots. They propose Being-M0, the first general motion generation framework to exhibit collaborative data-model scaling (a scaling law), and build around it a complete "data-model-transfer" technical system, opening a new path toward flexible and diverse motion control for humanoid robots.
Being-M0 paper cover
To achieve this goal, the research team first automatically extracted, annotated, and cleaned human motion data from a vast amount of internet video, and built MotionLib, the industry's first million-scale motion generation dataset. At more than 15 times the size of the largest existing public dataset, it provides a solid data foundation for training high-performance motion generation models.
Building on this foundation, the team developed an end-to-end text-driven motion generation model, verifying the technical feasibility of "big data + large models" in the field of motion generation, and efficiently transferred the generated human motions to multiple humanoid robot platforms, closing the loop from text instructions to robot motions.
Million-scale motion dataset MotionLib: breaking the data-scale bottleneck
In AI, data scale is often the key to a leap in model performance. To build a high-quality, large-scale motion data resource, the research team collected more than 20 million human motion videos from public datasets and online video platforms, and developed a fully automatic data processing pipeline. The pipeline uses pre-trained models to estimate and filter 2D human keypoints, then lifts them to high-precision 3D keypoints with models trained on large-scale 3D data.
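A key step in a pipeline like this is discarding clips whose 2D keypoint detections are too noisy to be worth lifting to 3D. The sketch below illustrates that filtering stage; the data layout, confidence thresholds, and ratios are illustrative assumptions, not MotionLib's actual criteria.

```python
# Hedged sketch of the quality-filtering stage of an automatic motion-data
# pipeline: keep only clips whose 2D keypoint confidences are high enough.
# Thresholds and the clip structure are assumptions for illustration.

def filter_clips(clips, min_conf=0.5, min_ratio=0.8):
    """Keep clips where at least `min_ratio` of frames have an average
    keypoint confidence of at least `min_conf`."""
    kept = []
    for clip in clips:
        good_frames = sum(
            1 for frame in clip["kp_conf"]
            if sum(frame) / len(frame) >= min_conf
        )
        if good_frames / len(clip["kp_conf"]) >= min_ratio:
            kept.append(clip)
    return kept

clips = [
    {"id": "a", "kp_conf": [[0.9, 0.8], [0.7, 0.9]]},  # confident detections
    {"id": "b", "kp_conf": [[0.2, 0.1], [0.3, 0.2]]},  # too noisy to lift
]
print([c["id"] for c in filter_clips(clips)])  # -> ['a']
```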
MotionLib dataset example
To address the coarse granularity of earlier motion annotations, the team proposed a hierarchical fine-grained annotation scheme. Large language models generate a structured description for each video segment that covers not only the overall action semantics but also records local movement details, such as those of the arms and legs, providing important support for fine-grained motion generation.
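A hierarchical annotation of this kind might look like the following. The field names and part breakdown are hypothetical, meant only to show the "overall semantics plus per-part detail" structure described above, not MotionLib's actual schema.

```python
# Hypothetical shape of a hierarchical, fine-grained motion annotation.
# Field names and granularity are illustrative assumptions.
annotation = {
    "clip_id": "000123",
    "whole_body": "A person walks forward while waving with the right hand.",
    "body_parts": {
        "right_arm": "raised to shoulder height, waving side to side",
        "left_arm": "swinging naturally with the gait",
        "legs": "steady forward walking with a moderate stride",
    },
}

def flatten_caption(ann):
    """Join the overall and per-part descriptions into one training caption."""
    parts = "; ".join(f"{k}: {v}" for k, v in ann["body_parts"].items())
    return f'{ann["whole_body"]} ({parts})'

print(flatten_caption(annotation))
```

Flattening the hierarchy into one caption is one simple way to feed such annotations to a text-conditioned model; keeping the levels separate would instead allow part-level conditioning.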
Large-scale motion generation model, enabling accurate mapping from language to motion
With MotionLib delivering an order-of-magnitude jump in data scale, the key question becomes how to fully exploit that data for model training.
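A common design for large text-to-motion models, and one consistent with the "big data + large models" framing here, is to quantize continuous motion into discrete tokens so that a language-model-style generator can be trained on them. The toy sketch below shows only the codebook quantize/detokenize step; the codebook size, feature dimension, and overall design are assumptions for illustration, not Being-M0's documented architecture.

```python
import numpy as np

# Toy sketch of the "motion as discrete tokens" idea: each frame's pose
# feature vector is snapped to its nearest codebook entry, yielding token
# ids an autoregressive model could generate from text. All shapes and
# values here are illustrative assumptions.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))    # 512 motion codes, 64-d each

def quantize(pose_features):
    """Map each frame's feature vector to its nearest codebook entry id."""
    dists = ((pose_features[:, None, :] - codebook[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)          # one token id per frame

def detokenize(token_ids):
    """Recover approximate pose features from token ids."""
    return codebook[token_ids]

poses = rng.normal(size=(8, 64))         # 8 frames of motion features
tokens = quantize(poses)
recon = detokenize(tokens)
print(tokens.shape, recon.shape)         # (8,) (8, 64)
```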
Efficient motion retargeting from humans to humanoid robots
Transferring the generated human motions to a physical robot is the final step in text-driven humanoid robot motion generation. Closing the loop from text to robot movement requires solving the core challenge of cross-embodiment motion transfer.
Because humanoid robots differ significantly in degree-of-freedom configuration, link dimensions, and other respects, traditional methods based on inverse kinematics or direct joint-angle mapping often produce distorted, or even dynamically infeasible, motions when retargeting human movements to robots.
The overall Being-M0 framework is divided into two stages.
To solve this problem, the Being-M0 team proposed a two-stage solution of "optimization + learning":
During the training-data construction stage, motion sequences satisfying the robot's kinematic constraints are generated by multi-objective optimization. The optimization considers not only basic constraints such as joint limits but also the smoothness and stability of the motion trajectories. Although this multi-objective optimization is computationally expensive, it guarantees high-quality data, laying a solid foundation for the subsequent learning stage.
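The combination of objectives described above can be sketched as a small projected-gradient problem on a single joint: track the human reference angles, penalize jerky changes, and project onto the joint limits each step. The weights, solver, and one-DoF setup are toy assumptions, not the paper's actual formulation.

```python
import numpy as np

# Toy version of the multi-objective retargeting cost: track reference
# joint angles, stay smooth, and respect joint limits. Weights and the
# projected-gradient solver are illustrative assumptions.

def retarget_cost(q, q_ref, w_track=1.0, w_smooth=0.1):
    track = ((q - q_ref) ** 2).sum()          # follow the human motion
    smooth = ((q[1:] - q[:-1]) ** 2).sum()    # penalize jerky trajectories
    return w_track * track + w_smooth * smooth

def solve(q_ref, q_min, q_max, lr=0.1, steps=200):
    q = np.clip(q_ref, q_min, q_max)          # feasible starting point
    for _ in range(steps):
        grad = 2.0 * (q - q_ref)              # gradient of tracking term
        diff = q[1:] - q[:-1]
        grad[1:] += 0.1 * 2.0 * diff          # gradient of smoothness term
        grad[:-1] -= 0.1 * 2.0 * diff
        q = np.clip(q - lr * grad, q_min, q_max)  # project onto joint limits
    return q

q_ref = np.array([0.0, 0.5, 1.2, 2.0])   # human angles (rad) over 4 frames
q = solve(q_ref, q_min=-1.5, q_max=1.5)  # robot joint limited to +/-1.5 rad
print(q.round(2))  # tracks q_ref but never exceeds the 1.5 rad limit
```

The projection step is what makes every iterate feasible, mirroring the point in the text that feasibility is enforced during data construction rather than left to the learned model.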
In the motion mapping stage, a lightweight MLP network learns the mapping from human motions to humanoid robot motions. Through a carefully designed network structure, this method efficiently supports multiple robot platforms, including H1, H1-2, and G1.
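In its simplest form, such a retargeting network is a per-frame forward pass from a human pose vector to robot joint targets. The sketch below uses random weights and assumed dimensions (a 66-d human pose, 19 robot DoFs, one hidden layer); in practice the weights would be trained on the optimization-generated pairs from the previous stage.

```python
import numpy as np

# Minimal sketch of the learned retargeting stage: a small MLP maps a
# human pose vector to robot joint targets. Dimensions and the two-layer
# design are assumptions for illustration; real weights come from
# training on optimization-generated human/robot motion pairs.

rng = np.random.default_rng(42)
W1, b1 = rng.normal(scale=0.1, size=(66, 256)), np.zeros(256)
W2, b2 = rng.normal(scale=0.1, size=(256, 19)), np.zeros(19)

def retarget(human_pose):
    h = np.maximum(human_pose @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2                          # robot joint targets

pose = rng.normal(size=66)                      # one frame of human pose
robot_q = retarget(pose)
print(robot_q.shape)  # (19,)
```

Because inference is just two matrix multiplies per frame, this kind of network is cheap enough to run online, which is the real-time advantage over solving the optimization directly.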
Compared with direct optimization, the neural-network-based method significantly improves the system's real-time performance while maintaining the accuracy of motion transfer.
Project address: https://beingbeyond.github.io/Being-M0/
Paper link: https://arxiv.org/abs/2410.03311
About BeingBeyond: Intelligence Without Boundaries
BeingBeyond pioneered the paradigm of training general embodied models on large-scale human data. Both its dexterous-hand manipulation models (Being-H series) and humanoid mobile-manipulation models (Being-M series) demonstrate industry-leading performance and can be deployed across different embodiments. Meanwhile, the team's first-person hand dataset and full-body posture dataset, the world's largest, are becoming the core foundation driving model iteration. BeingBeyond focuses on the research and development of general humanoid robot models, providing robot manufacturers and scenario-focused customers with highly generalizable, deployable embodied models.