Being-VL
A next-generation, general-purpose cross-modal foundation model for robots
Efficiently integrates vision and language to achieve a more comprehensive semantic understanding of the physical world
Endows robots with the ability to perceive spatial structures and object relationships
Translates perception results smoothly into clear, actionable motion choices
Deeply aligns language with three-dimensional spatial representations, enabling robots to accurately understand and execute instructions