arxiv:2512.13687

Towards Scalable Pre-training of Visual Tokenizers for Generation

Published on Dec 15 · Submitted by Jingfeng Yao on Dec 16

Abstract

VTP, a unified visual tokenizer pre-training framework, improves generative performance by jointly optimizing image-text contrastive, self-supervised, and reconstruction losses, yielding better scaling properties, higher zero-shot accuracy, and faster convergence.

AI-generated summary

The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generative performance. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework that pioneers the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) the framework exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPs in pre-training VTP achieves a 65.8% FID improvement in downstream generation, whereas the conventional autoencoder stagnates very early, at 1/10 of the FLOPs. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
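
To make the training objective concrete, below is a minimal PyTorch sketch of what jointly optimizing image-text contrastive, self-supervised, and reconstruction losses on a single tokenizer latent space can look like. The toy encoder/decoder, the loss weights, and the specific self-supervised term (alignment to a frozen teacher's features) are illustrative assumptions for exposition, not the released VTP implementation.

```python
# Minimal sketch (not the paper's code) of jointly optimizing the three
# objectives the abstract describes: image-text contrastive, self-supervised,
# and reconstruction losses on a shared tokenizer latent space.
# Module design, loss weights, and the choice of self-supervised term are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVisualTokenizer(nn.Module):
    def __init__(self, latent_dim=256, embed_dim=512):
        super().__init__()
        # Encoder maps images to a latent map; decoder reconstructs pixels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, latent_dim, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )
        # Heads projecting pooled latents for the contrastive and SSL objectives.
        self.img_proj = nn.Linear(latent_dim, embed_dim)
        self.ssl_proj = nn.Linear(latent_dim, embed_dim)

    def forward(self, images):
        z = self.encoder(images)        # (B, C, H', W') latent map
        recon = self.decoder(z)         # pixel reconstruction
        pooled = z.mean(dim=(2, 3))     # global latent descriptor
        return z, recon, pooled


def joint_vtp_style_loss(model, images, text_emb, teacher_emb,
                         w_clip=1.0, w_ssl=1.0, w_rec=1.0, temperature=0.07):
    """Combine the three losses on one latent space (weights are assumptions)."""
    _, recon, pooled = model(images)

    # 1) Reconstruction: preserve low-level fidelity.
    loss_rec = F.mse_loss(recon, images)

    # 2) Image-text contrastive (CLIP-style InfoNCE over the batch).
    img_emb = F.normalize(model.img_proj(pooled), dim=-1)
    txt_emb = F.normalize(text_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(images.size(0), device=images.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, targets)
                       + F.cross_entropy(logits.t(), targets))

    # 3) Self-supervised term: align latents with a frozen teacher's features
    #    (a stand-in for whichever SSL objective the paper actually uses).
    ssl_emb = F.normalize(model.ssl_proj(pooled), dim=-1)
    loss_ssl = 1.0 - F.cosine_similarity(
        ssl_emb, F.normalize(teacher_emb, dim=-1)).mean()

    return w_clip * loss_clip + w_ssl * loss_ssl + w_rec * loss_rec


if __name__ == "__main__":
    model = ToyVisualTokenizer()
    imgs = torch.randn(4, 3, 64, 64)
    text = torch.randn(4, 512)      # placeholder text-encoder outputs
    teacher = torch.randn(4, 512)   # placeholder SSL teacher features
    loss = joint_vtp_style_loss(model, imgs, text, teacher)
    loss.backward()
    print(float(loss))
```

The intent of the joint objective, as the abstract frames it, is that the contrastive and self-supervised terms keep the latent space semantically organized while the reconstruction term retains pixel fidelity; how the three terms are balanced in practice is specified in the paper and code, not here.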

Community

Paper author · Paper submitter · edited 1 day ago

GitHub code: https://github.com/MiniMax-AI/VTP
Hugging Face weights: https://huggingface.co/collections/MiniMaxAI/vtp
In collaboration with the HUST Vision Lab: https://github.com/hustvl


This gets at a core bottleneck in generative vision: tokenizers optimized for pixels don’t scale cognition. VTP’s shift toward semantic-first latent spaces mirrors what we’ve already learned in language — understanding must precede generation. The fact that generation quality now scales with tokenizer pretraining FLOPs is the real breakthrough here. This feels like a necessary correction to the “just reconstruct better” era of VAEs and a strong signal that vision models are finally being trained to think, not just compress.


Models citing this paper 3

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 2