Directional Textual Inversion for Personalized Text-to-Image Generation
Abstract
Directional Textual Inversion (DTI) improves text-to-image personalization by fixing learned tokens to an in-distribution magnitude and optimizing only their direction, enhancing prompt conditioning and enabling smooth interpolation between concepts.
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as maximum a posteriori (MAP) estimation with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and its variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
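For intuition, the direction-only update is easy to sketch: project the Euclidean gradient onto the tangent space of the unit hypersphere, take a step, then retract by renormalizing. The PyTorch snippet below is a minimal sketch; the variable names, the simple renormalization retraction, and the placement of the vMF prior term are illustrative assumptions, not the authors' implementation.

```python
import torch

def dti_step(v, grad, lr, mu=None, kappa=0.0):
    """One Riemannian SGD step on the unit hypersphere (illustrative sketch).

    v:     current unit-norm direction of the learned token, shape (d,)
    grad:  Euclidean gradient of the diffusion loss w.r.t. v
    mu:    unit-norm mean direction of a von Mises-Fisher prior (optional)
    kappa: vMF concentration; the prior contributes a constant-direction
           gradient -kappa * mu to the MAP objective, as the abstract describes
    """
    g = grad if mu is None else grad - kappa * mu
    tangent = g - torch.dot(g, v) * v      # project onto the tangent space at v
    v_new = v - lr * tangent               # step in the tangent space ...
    return v_new / v_new.norm()            # ... then retract to the sphere

# The embedding fed to the text encoder is e = rho * v, where rho is a fixed
# in-distribution magnitude and only the direction v is optimized.
```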
Community
Hi everyone! 👋
We investigated why Textual Inversion (TI) often ignores prompt context and traced the issue to embedding norm inflation. We found that standard TI learns tokens with massive magnitudes (often >20) compared to the model's native vocabulary (≈0.4), which we show theoretically breaks representation updates in pre-norm Transformers.
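If you want to reproduce the norm gap, a quick check along these lines works; the checkpoint path and the {token: tensor} format follow the common diffusers TI convention and are only illustrative:

```python
import torch
from transformers import CLIPTextModel

# Norms of the native vocabulary of Stable Diffusion's CLIP text encoder.
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vocab = text_encoder.get_input_embeddings().weight            # (vocab_size, 768)
print(f"native vocab norm: {vocab.norm(dim=-1).mean():.3f}")  # roughly 0.4

# Norm of a learned TI token (path is illustrative).
learned = torch.load("learned_embeds.bin", map_location="cpu")
for token, emb in learned.items():
    print(f"{token}: norm {emb.norm():.2f}")  # standard TI often inflates past 20
```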
Our solution, Directional Textual Inversion (DTI), fixes the magnitude to an in-distribution scale and optimizes only the direction on the hypersphere using Riemannian SGD. This simple change significantly improves prompt fidelity and enables smooth spherical interpolation (slerp) between concepts.
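Since the learned concepts live on the unit sphere, interpolating two of them is plain slerp; here is a minimal, self-contained sketch (the function is textbook slerp rather than code lifted from the repo):

```python
import torch

def slerp(u: torch.Tensor, v: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between two unit-norm concept directions."""
    u, v = u / u.norm(), v / v.norm()
    cos = torch.dot(u, v).clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cos)
    return (torch.sin((1 - t) * theta) * u + torch.sin(t * theta) * v) / torch.sin(theta)

# t=0 returns the first concept and t=1 the second; intermediate values blend
# them while staying on the sphere (rescale by the fixed magnitude before use).
```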
We’d love for you to try it out! Code is available here: https://github.com/kunheek/dti
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Finetuning-Free Personalization of Text to Image Generation via Hypernetworks (2025)
- Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models (2025)
- Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization (2025)
- Infinite-Story: A Training-Free Consistent Text-to-Image Generation (2025)
- Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion (2025)
- FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models (2025)
- Exploring MLLM-Diffusion Information Transfer with MetaCanvas (2025)