arxiv:2512.13672

Directional Textual Inversion for Personalized Text-to-Image Generation

Published on Dec 15 · Submitted by Kunhee Kim on Dec 16
Abstract

Directional Textual Inversion (DTI) improves text-to-image personalization by fixing learned tokens to an in-distribution magnitude and optimizing only their direction, which strengthens prompt conditioning and enables smooth interpolation between concepts.

AI-generated summary

Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show that semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only the direction on the unit hypersphere via Riemannian SGD. We cast direction learning as maximum a posteriori (MAP) estimation with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
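
As a rough illustration of the optimization described in the abstract, here is a minimal PyTorch sketch of a direction-only update: the token embedding is a fixed in-distribution magnitude times a unit direction, the Euclidean gradient is projected onto the tangent space and retracted back to the sphere (Riemannian SGD), and the von Mises-Fisher MAP prior contributes a constant-direction gradient term. The names, the loss stub, and the specific scale are illustrative assumptions, not the paper's implementation:

```python
import torch

def riemannian_sgd_step(d, euclid_grad, lr):
    """One Riemannian SGD step on the unit hypersphere."""
    # Project the Euclidean gradient onto the tangent space at d ...
    tangent_grad = euclid_grad - (euclid_grad @ d) * d
    # ... take a gradient step, then retract back onto the sphere.
    d_new = d - lr * tangent_grad
    return d_new / d_new.norm()

# Illustrative setup (dimension, scale, and prior strength are assumptions).
D = 768                                   # CLIP token-embedding dimension
sigma = 0.4                               # fixed, in-distribution embedding magnitude
mu = torch.randn(D); mu = mu / mu.norm()  # vMF prior mean direction (e.g. a coarse class token)
kappa = 1.0                               # vMF concentration (prior strength)

d = torch.randn(D)
d = d / d.norm()                          # learnable unit direction

def personalization_loss(embedding):
    # Placeholder: the real objective is the diffusion denoising loss for
    # prompts that contain this token embedding.
    return (embedding - torch.ones_like(embedding)).pow(2).sum()

for step in range(200):
    d.requires_grad_(True)
    loss = personalization_loss(sigma * d)     # embedding = fixed norm * direction
    (grad,) = torch.autograd.grad(loss, d)
    grad = grad - kappa * mu                   # constant-direction vMF prior gradient (MAP)
    with torch.no_grad():
        d = riemannian_sgd_step(d.detach(), grad, lr=1e-2)
```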

Community

Paper author and submitter

Hi everyone! 👋

We investigated why Textual Inversion (TI) often ignores prompt context and traced the issue to embedding norm inflation. Standard TI learns tokens with massive magnitudes (often >20) compared to the model's native vocabulary (≈0.4), which we show theoretically breaks the residual representation updates in pre-norm Transformers.
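
For a quick look at that scale gap, a sketch like the following (assuming the Stable Diffusion v1.x text encoder loaded via Hugging Face transformers; the checkpoint name is our assumption) prints the norm statistics of the native vocabulary embeddings:

```python
import torch
from transformers import CLIPTextModel

# Assumed checkpoint: the CLIP text encoder used by Stable Diffusion v1.x.
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vocab = text_encoder.text_model.embeddings.token_embedding.weight.detach()  # (vocab_size, hidden_dim)

norms = vocab.norm(dim=-1)
print(f"native vocab norms: mean={norms.mean().item():.3f}, "
      f"min={norms.min().item():.3f}, max={norms.max().item():.3f}")
# Unconstrained TI tokens can drift far above this scale (the comment cites >20),
# which is the norm inflation that DTI avoids by fixing the magnitude.
```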

Our solution, Directional Textual Inversion (DTI), fixes the magnitude to an in-distribution scale and optimizes only the direction on the hypersphere using Riemannian SGD. This simple change significantly improves prompt fidelity and enables smooth spherical interpolation (slerp) between concepts.
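
Since the learned tokens are parameterized by direction, interpolating two concepts is standard spherical linear interpolation; a minimal sketch (function and tensor names are ours, not the repo's):

```python
import torch

def slerp(d0, d1, t):
    """Spherical linear interpolation between two unit-norm directions."""
    d0 = d0 / d0.norm()
    d1 = d1 / d1.norm()
    omega = torch.arccos(torch.clamp(d0 @ d1, -1.0, 1.0))  # angle between the directions
    if omega.abs() < 1e-6:                                   # nearly parallel: fall back to lerp
        return (1 - t) * d0 + t * d1
    return (torch.sin((1 - t) * omega) * d0 + torch.sin(t * omega) * d1) / torch.sin(omega)

# Interpolate between two learned concept directions and rescale to the
# fixed in-distribution magnitude before injecting into the prompt.
sigma = 0.4
concept_a = torch.randn(768); concept_a = concept_a / concept_a.norm()
concept_b = torch.randn(768); concept_b = concept_b / concept_b.norm()
mixed_embedding = sigma * slerp(concept_a, concept_b, t=0.5)
```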

We’d love for you to try it out! Code is available here: https://github.com/kunheek/dti

