meituan-longcat/LongCat-Next

@@ -21,7 +21,7 @@ tags:
     <a href="https://longcat.chat/longcat-next/intro" target="_blank" style="margin: 2px;">
         <img alt="Blog" src="https://img.shields.io/badge/Blog-LongCatNext-white?logo=safari&logoColor=white&color=purple" style="display: inline-block; vertical-align: middle;"/>
     </a>
-    <a href="https://huggingface.co/meituan-longcat" target="_blank" style="margin: 2px;">
+    <a href="https://huggingface.co/meituan-longcat/LongCat-Next" target="_blank" style="margin: 2px;">
         <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCatNext-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
     </a>
     <a href="https://github.com/meituan-longcat/LongCat-Next" target="_blank" style="margin: 2px;">
@@ -59,7 +59,7 @@ tags:
 ## Model Introduction
-![evaluation](./assets/overview.png)
+![evaluation](./assets/overview.jpg)
 We develop **LongCat-Next**, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal inductive bias beyond the language paradigm. As an industrial-strength foundation model with A3B model size, it excels at seeing, creating, and talking, achieving strong performance across a wide range of multimodal benchmarks. In particular, leveraging semantically complete discrete representations, it surpasses the long-standing performance ceiling of discrete vision modeling on understanding tasks, and provides a unified solution for visual understanding and generation. This success demonstrates that discrete tokens can universally represent multimodal signals and be deeply internalized within a single discrete embedding space. We further provide extensive experiments to analyze this unified discrete training paradigm and uncover several interesting findings.

assets/{overview.png → overview.jpg} RENAMED

File without changes