
Recently, BeingBeyond and Peking University jointly launched Being-VL-0.5, the industry's first multimodal large model that systematically introduces the Byte Pair Encoding (BPE) mechanism into the visual modality.
Trained on millions of quantized visual samples, the model verifies the technical feasibility of structured prior injection for cross-modal alignment, laying a theoretical foundation for multimodal understanding and generation under a unified discrete representation framework.
As a technical breakthrough in the vision-language field, Being-VL-0.5 matches the performance of continuous-embedding methods on standard benchmarks while significantly reducing reliance on training data. Its curriculum-driven training strategy and progressive parameter-unfreezing mechanism address the lack of high-level semantics in visual tokens, advancing multimodal models from local feature extraction toward global semantic understanding.
(Figure: paper cover)
BeingBeyond will continue to deepen research on unified modal representation, explore the extension of BPE encoding to more modalities such as audio and 3D point clouds, and strive to build a general multimodal generation framework for embodied intelligence, ultimately achieving a breakthrough improvement in the autonomy and general capabilities of humanoid robots.
The paper has been accepted by ICCV 2025.
Core theoretical innovation
Traditional multimodal models often process visual information less effectively than text, since their visual representations lack high-level semantic structure. In preliminary work at ICLR 2025, "From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities," BeingBeyond and the Peking University research team showed that the unigram statistical limitations of Transformer models on raw sequence learning can be overcome through structured tokenization.
Based on this conclusion, the research team further verified in this study: "Tokenization is not only a means of data compression but also a key mechanism for injecting structured priors into the model."
Core technical architecture
1. Unified foundation of discrete representation:
VQ-GAN quantizes visual images into discrete token sequences, but the team found that these initial tokens encode only local visual information and lack high-level semantic structure.
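The quantization step can be sketched as a nearest-neighbour lookup against a codebook. This is a toy stand-in for a trained VQ-GAN encoder: the `quantize` helper and the codebook are illustrative assumptions, not the model's actual components.

```python
def quantize(patches, codebook):
    """Map each patch embedding to the index of its nearest codebook
    vector (squared Euclidean distance), yielding a 2D grid of
    discrete visual tokens. Toy sketch; Being-VL-0.5 uses a trained
    VQ-GAN, not this nearest-neighbour step."""
    def nearest(vec):
        return min(range(len(codebook)),
                   key=lambda k: sum((v - c) ** 2
                                     for v, c in zip(vec, codebook[k])))
    return [[nearest(p) for p in row] for row in patches]
```

The resulting token grid, rather than the raw pixels, is what the visual BPE step below operates on.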
2. Priority-guided visual BPE encoding:
Co-occurrence frequency: Count the occurrence frequency of spatially adjacent VQ tokens in the data to construct a hierarchical visual vocabulary.
Spatial consistency: Evaluate the degree to which token pairs maintain stable spatial relationships in different images.
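A minimal sketch of how these two criteria could be combined into a merge priority. The exact scoring used in Being-VL-0.5 is not detailed here, so the `alpha` weighting and the orientation-based consistency measure below are assumptions:

```python
from collections import Counter

def merge_priorities(grids, alpha=0.5):
    """Score candidate merges of adjacent visual tokens (hypothetical).

    Co-occurrence frequency: how often an ordered token pair appears
    adjacent (right or down neighbours) across the token grids.
    Spatial consistency: the fraction of a pair's occurrences that
    share one dominant orientation."""
    freq = Counter()    # (a, b) -> total adjacent occurrences
    by_dir = Counter()  # (a, b, direction) -> occurrences
    for grid in grids:
        h, w = len(grid), len(grid[0])
        for i in range(h):
            for j in range(w):
                if j + 1 < w:  # horizontal neighbour
                    pair = (grid[i][j], grid[i][j + 1])
                    freq[pair] += 1
                    by_dir[pair + ("h",)] += 1
                if i + 1 < h:  # vertical neighbour
                    pair = (grid[i][j], grid[i + 1][j])
                    freq[pair] += 1
                    by_dir[pair + ("v",)] += 1
    scores = {}
    for pair, f in freq.items():
        consistency = max(by_dir[pair + ("h",)], by_dir[pair + ("v",)]) / f
        scores[pair] = alpha * f + (1 - alpha) * consistency * f
    return scores
```

As in textual BPE, the highest-scoring pair would be merged into a new vocabulary entry and the process repeated, building the hierarchical visual vocabulary bottom-up.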
3. Curriculum-driven data composition strategy: based on hierarchical visual patterns (from simple to complex), the team designed a four-stage learning curriculum:
Basic stage: establish basic vision-language alignment
Perception stage: learn detailed visual attributes
Reasoning stage: develop complex visual reasoning abilities
Instruction stage: optimize task-execution capabilities
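The curriculum can be expressed as an ordered configuration. The stage names follow the list above; the data sources and mixing ratios are purely illustrative assumptions:

```python
# Hypothetical curriculum configuration. Stage names follow the
# article; data categories and proportions are illustrative only.
CURRICULUM = [
    {"stage": "basic", "focus": "basic vision-language alignment",
     "data_mix": {"caption_pairs": 1.0}},
    {"stage": "perception", "focus": "detailed visual attributes",
     "data_mix": {"caption_pairs": 0.5, "attribute_qa": 0.5}},
    {"stage": "reasoning", "focus": "complex visual reasoning",
     "data_mix": {"attribute_qa": 0.4, "reasoning_qa": 0.6}},
    {"stage": "instruction", "focus": "task execution",
     "data_mix": {"instruction_following": 1.0}},
]

def stage_order():
    """Return the stage names in training order."""
    return [s["stage"] for s in CURRICULUM]
```

Ordering the data from simple patterns to complex compositions is what lets the model build global semantics on top of the local tokens produced by quantization.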
4. Progressive parameter unfreezing: a three-stage strategy ensures that visual representations integrate effectively with the language model:
Stage 1: train only the extended visual token embeddings
Stage 2: selectively unfreeze early Transformer layers
Stage 3: full-parameter fine-tuning
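The unfreezing schedule can be sketched as a stage-conditioned selection of trainable parameter groups. The parameter-group names and the early-layer cutoff below are hypothetical, not taken from the paper:

```python
def trainable_params(model, stage):
    """Select which parameter groups train at each stage (sketch).

    `model` is a dict of named parameter groups. Stage 1 trains only
    the new visual token embeddings; stage 2 additionally unfreezes
    early Transformer layers (cutoff of 4 is an illustrative choice);
    stage 3 unfreezes everything."""
    names = []
    if stage >= 1:
        names.append("visual_token_embeddings")
    if stage >= 2:
        names += [n for n in model if n.startswith("transformer.layer.")
                  and int(n.split(".")[-1]) < 4]
    if stage >= 3:
        names = list(model)
    return sorted(set(names))
```

In a real framework the same schedule would be applied by toggling each group's gradient flag before building the optimizer for that stage.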
Experimental verification
The experimental results strongly support the team's theoretical predictions. Across multiple standard benchmarks, Being-VL-0.5 achieves performance comparable to mainstream continuous-embedding methods while retaining the advantages of a unified token representation. Notably, the model remains competitive even when trained on far less data than traditional CLIP-encoder approaches, demonstrating the method's efficiency.
In addition, visualization analysis showed that models using BPE exhibit more reasonable token-activation patterns, supporting the paper's hypothesis about structured prior injection.
This study is the first to systematically apply BPE's structured prior injection mechanism to the visual modality, with a complete training pipeline matched to the characteristics of visual BPE. Its deeper significance lies in providing a theoretical foundation for multimodal understanding and generation under unified discrete representations.
The research team stated that future work will explore the possibility of unified representations for more modalities (such as audio and 3D point clouds) to promote the development of multimodal generation tasks.
Full text of the paper: https://arxiv.org/abs/2506.23639
Project homepage: https://beingbeyond.github.io/Being-VL-0.5
About BeingBeyond: Wisdom Without Boundaries
BeingBeyond has pioneered the paradigm of training general embodied models on large-scale human data. Both its dexterous-hand manipulation model (Being-H series) and its humanoid-robot mobile manipulation model (Being-M series) deliver industry-leading performance and can be deployed across different robot embodiments. Meanwhile, the team has built the world's largest egocentric hand dataset and full-body posture dataset, which are becoming the core foundation driving model iteration. BeingBeyond focuses on developing general humanoid-robot models, providing robot hardware manufacturers and application-scenario customers with highly generalizable, deployable embodied models.