Locke committed on
Commit 522f202 · 1 Parent(s): 341db92
README.md CHANGED
@@ -21,7 +21,7 @@ tags:
  <a href="https://longcat.chat/longcat-next/intro" target="_blank" style="margin: 2px;">
  <img alt="Blog" src="https://img.shields.io/badge/Blog-LongCatNext-white?logo=safari&logoColor=white&color=purple" style="display: inline-block; vertical-align: middle;"/>
  </a>
- <a href="https://huggingface.co/meituan-longcat" target="_blank" style="margin: 2px;">
+ <a href="https://huggingface.co/meituan-longcat/LongCat-Next" target="_blank" style="margin: 2px;">
  <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCatNext-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://github.com/meituan-longcat/LongCat-Next" target="_blank" style="margin: 2px;">
@@ -59,7 +59,7 @@ tags:
 
  ## Model Introduction
 
- ![evaluation](./assets/overview.png)
+ ![evaluation](./assets/overview.jpg)
 
 
  We develop **LongCat-Next**, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal inductive bias beyond the language paradigm. As an industrial-strength foundation model with A3B model size, it excels at seeing, creating, and talking, achieving strong performance across a wide range of multimodal benchmarks. In particular, leveraging semantically complete discrete representations, it surpasses the long-standing performance ceiling of discrete vision modeling on understanding tasks, and provides a unified solution for visual understanding and generation. This success demonstrates that discrete tokens can universally represent multimodal signals and be deeply internalized within a single discrete embedding space. We further provide extensive experiments to analyze this unified discrete training paradigm and uncover several interesting findings.
assets/{overview.png → overview.jpg} RENAMED
File without changes