arxiv:2401.09047

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Published on Jan 17, 2024

Submitted by AK on Jan 18, 2024

Authors: Haoxin Chen, Xiaodong Cun, Menghan Xia, Xintao Wang, Ying Shan

AI-generated summary

The proposed method obtains a high-quality video model from low-quality videos and synthesized high-quality images by fine-tuning the spatial modules of video models extended from Stable Diffusion.

Abstract

Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.
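The second stage described in the abstract (finetuning only the spatial modules on high-quality images while the temporal modules stay fixed) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that temporal parameters can be recognized by a substring in their names. The keyword `temporal`, the function `freeze_temporal_modules`, and the toy parameter names are all hypothetical. The function accepts any iterable of `(name, param)` pairs whose `param` has a `requires_grad` attribute, which is the shape returned by `model.named_parameters()` in PyTorch.

```python
from types import SimpleNamespace


def freeze_temporal_modules(named_params, temporal_keyword="temporal"):
    """Leave spatial parameters trainable and freeze temporal ones.

    Assumption: a parameter belongs to a temporal module if its name
    contains `temporal_keyword`. Real codebases may use other naming.
    Returns the names of the parameters that remain trainable.
    """
    trainable = []
    for name, param in named_params:
        is_temporal = temporal_keyword in name
        # Only spatial parameters receive gradients in this stage.
        param.requires_grad = not is_temporal
        if not is_temporal:
            trainable.append(name)
    return trainable


# Toy stand-in for the parameters of a fully trained video model.
# The names are hypothetical and exist only for this illustration.
toy_params = [
    ("blocks.0.spatial_attn.weight", SimpleNamespace(requires_grad=True)),
    ("blocks.0.temporal_attn.weight", SimpleNamespace(requires_grad=True)),
    ("blocks.1.spatial_conv.weight", SimpleNamespace(requires_grad=True)),
    ("blocks.1.temporal_conv.weight", SimpleNamespace(requires_grad=True)),
]

trainable_names = freeze_temporal_modules(toy_params)
print(trainable_names)
```

With a real model, only the parameters left trainable would be passed to the optimizer, so the image finetuning step updates the spatial modules alone.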

Community

Xintao

Paper author Jan 18, 2024

Demo: https://huggingface.co/spaces/VideoCrafter/VideoCrafter2
Project Page: https://ailab-cvc.github.io/videocrafter2
GitHub: https://github.com/AILab-CVC/VideoCrafter
Join Discord: https://discord.com/invite/rrayYqZ4tf
