
Demo of Remade.Ai LoRA WAN Model Shows Impressive Animation Possibilities from Static Images

​Wan2.1 is an advanced open-source suite of video foundation models developed by the Wan team, aiming to push the boundaries of video generation. The project is hosted on GitHub and offers several notable features:

  • State-of-the-Art Performance: Wan2.1 consistently outperforms existing open-source models and commercial solutions across multiple benchmarks. ​
  • Consumer-Grade GPU Support: The T2V-1.3B model requires only 8.19 GB of VRAM, making it compatible with most consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes without optimization techniques like quantization (a minimal usage sketch follows this list).
  • Multifaceted Capabilities: Wan2.1 excels in various tasks, including Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio generation. Notably, it is the first video model capable of generating both Chinese and English text within videos, enhancing its practical applications. ​
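
For readers who want to try the open weights, below is a minimal text-to-video sketch using Hugging Face diffusers with the 1.3B checkpoint. The repository id, frame count, and sampling settings are illustrative assumptions rather than values taken from the paper.

```python
# Minimal text-to-video sketch for the Wan2.1 1.3B checkpoint via diffusers.
# The model id and generation settings below are assumptions for illustration.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"  # assumed Hub repo id
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A cat walking through a neon-lit alley at night, cinematic lighting"
result = pipe(
    prompt=prompt,
    num_frames=81,            # ~5 seconds at 16 fps (assumed)
    num_inference_steps=50,   # the paper cites roughly 50 sampling steps
    guidance_scale=5.0,       # assumed CFG scale
)
export_to_video(result.frames[0], "wan_t2v_demo.mp4", fps=16)
```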

Full Paper | GitHub

HuggingFace LoRA Model by Remade.ai

Full Summary prepared by Adobe Acrobat AI

Wan2.1 is an open suite of advanced video generative models developed by the Wan team at Alibaba Group; the report highlights their innovations, performance, and applications in video generation technology.

Advancements in Video Generative Models

The report discusses the Wan video generative models, highlighting their performance, efficiency, and open-source availability. ​

  • Wan is a suite of video foundation models built on a diffusion transformer paradigm, featuring a 14B model trained on billions of images and videos. ​
  • The 1.3B model requires only 8.19 GB VRAM, making it efficient for consumer-grade GPUs while outperforming larger open-source models. ​
  • Wan supports multiple applications, including image-to-video and instruction-guided video editing, and can generate visual text in both Chinese and English.
  • The model consistently outperforms existing solutions, achieving superior results in benchmarks and human evaluations, with a win rate of 0.73 against competitors.
  • All models and source code are open-sourced to foster community growth and advance video generation technology. ​

Advancements in Video Generation Technology

This section discusses the introduction of the Wan model, addressing challenges in video generation technology and enhancing open-source capabilities. ​

  • The Wan model, introduced by the Wan team at Alibaba, aims to set a new benchmark in video generation technology.
  • It incorporates 14 billion parameters and has been trained on billions of images and videos, totaling on the order of a trillion tokens.
  • Wan addresses challenges in open-source video generation, including suboptimal performance, limited capabilities, and insufficient efficiency. ​
  • A 1.3B model is also introduced, requiring only 8.19 GB of VRAM, making it accessible for consumer-grade GPUs.
  • The model supports various tasks, including text-to-video, image-to-video, and real-time video generation, enhancing overall usability.
  • Comprehensive training processes and design details will be publicly shared to empower the community in developing specialized models.
  • The advancements are expected to significantly accelerate the growth and innovation within the video generation technology field. ​

Advances in Video Generation Models

This section reviews the evolution of video generation models, highlighting closed-source and open-source contributions. ​

  • The landscape of large-scale video models has evolved significantly, particularly in diffusion-based frameworks. ​
  • Closed-source models, such as OpenAI's Sora and Meta's Movie Gen, have been released, showcasing high-quality video generation capabilities. ​
  • Wan, an open-source model, demonstrates competitive or superior performance against commercial models in various benchmarks. ​
  • Key components of diffusion-based video generation include autoencoders, text encoders, and neural networks optimized via diffusion techniques. ​
  • Recent advancements in open-source models have led to the emergence of promising video generation technologies, including HunyuanVideo and LTX-Video. ​
  • Downstream tasks in video generation, such as editing and controllable generation, have been explored, enhancing user interaction.
  • The review emphasizes the importance of integrating effective modules and optimization techniques for high-quality video synthesis. ​
  • The competitive landscape highlights the intense global competition in the video generation sector, with numerous models being developed and tested. ​

Data Construction Pipeline for Wan

This section details the data construction pipeline for training the Wan model, emphasizing quality, diversity, and scale. ​

  • The dataset for Wan comprises billions of videos and images, focusing on high quality, diversity, and substantial scale. ​
  • A four-step data cleaning process was implemented to filter out unsuitable data based on fundamental dimensions, visual quality, and motion quality. ​
  • Approximately 50% of the initial dataset was eliminated during preprocessing, retaining high-quality data for further refinement. ​
  • The visual quality assessment involved clustering data into 100 subsets and scoring samples to ensure alignment with natural data distribution. ​
  • Motion quality assessment categorized videos into six tiers, with optimal motion videos prioritized for training. ​
  • The visual text data processing included synthesizing millions of text-containing images and collecting real-world image-text pairs. ​
  • Post-training data processing aimed to enhance visual fidelity and motion dynamics, with millions of curated images and videos collected. ​
  • An internal caption model was developed to generate dense captions for images and videos, improving descriptive quality. ​
  • The model design utilized a LLaVA-style architecture, incorporating a ViT encoder and a Qwen LLM for effective caption generation. ​
  • The Wan-VAE architecture was introduced for video generation, achieving a spatio-temporal compression of 4×8×8 (see the latent-shape sketch after this list).
  • Wan-VAE was trained in three stages, starting with a 2D image VAE and transitioning to a 3D causal VAE. ​
  • The feature cache mechanism was implemented to optimize memory usage and maintain temporal coherence during video processing. ​
  • Wan-VAE demonstrated competitive performance with a PSNR comparable to state-of-the-art models while being 2.5 times faster in reconstruction speed. ​
  • Qualitative results showed superior performance in reconstructing textures, faces, text, and high-motion scenes compared to existing models. ​
  • The overall advancements in the data construction pipeline and model design provide valuable insights for future generative model development. ​
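
To make the 4×8×8 compression figure concrete, the following is a small sketch of the latent-shape arithmetic, assuming the common causal-VAE convention in which frame counts take the form 4k+1 so the first frame is encoded on its own; the latent channel count is a placeholder, not a value from the paper.

```python
# Latent-shape arithmetic for a 4x8x8 spatio-temporal compression (sketch).
# Assumes the causal convention where T = 4k + 1 frames map to k + 1 latent frames;
# the latent channel count is a placeholder, not taken from the paper.

def latent_shape(frames: int, height: int, width: int, latent_channels: int = 16):
    assert (frames - 1) % 4 == 0, "expects frame counts of the form 4k + 1"
    t = 1 + (frames - 1) // 4        # temporal compression x4 (first frame kept causal)
    h, w = height // 8, width // 8   # spatial compression x8 in each dimension
    return (latent_channels, t, h, w)

# Example: a 5-second, 16 fps, 480P clip (81 frames at 832x480, assumed layout)
print(latent_shape(81, 480, 832))  # -> (16, 21, 60, 104)
```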

Architecture Design for Text-to-Video Model

This section details the architecture, pre-training, and post-training phases of the foundational video model for text-to-video tasks. ​

  • The foundational video model, Wan, is based on the DiT architecture and consists of Wan-VAE, a diffusion transformer, and a text encoder. ​
  • The diffusion transformer includes a patchifying module, transformer blocks, and an unpatchifying module, focusing on spatio-temporal relationships and text conditions. ​
  • The model reduces parameter count by approximately 25% while improving performance through a shared MLP across transformer blocks. ​
  • Pre-training involves a flow matching framework, starting with low-resolution images and progressing to high-resolution images and videos. ​
  • The training objective uses mean squared error on the predicted velocity, with the ODE formulation of flow matching supporting stable training (a generic sketch of this loss follows the list).
  • Joint training progresses through three stages, starting with 256 px images and 192 px videos, and culminating in 720 px resolutions. ​
  • Post-training maintains the same architecture and optimizer, focusing on joint training at 480 px and 720 px resolutions. ​
  • The model employs bf16-mixed precision and the AdamW optimizer, with an initial learning rate of 1e−4 and weight decay of 1e−3. ​
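
The flow-matching objective summarized above reduces to a few lines of code. The sketch below is a generic rectified-flow, velocity-prediction training step in PyTorch rather than the authors' implementation; the model interface and latent shapes are illustrative assumptions, while the AdamW settings echo the hyperparameters quoted in the list.

```python
# Generic flow-matching (rectified flow) training step with velocity prediction.
# The model interface and latent shapes are illustrative; only the optimizer
# hyperparameters (AdamW, lr 1e-4, weight decay 1e-3, bf16) come from the summary above.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, latents, text_emb):
    # latents: (B, C, T, H, W) video latents from the VAE; text_emb: text-encoder output
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)   # t ~ U(0, 1)
    t_ = t.view(-1, 1, 1, 1, 1)
    x_t = (1.0 - t_) * latents + t_ * noise                   # linear interpolation path
    target_velocity = noise - latents                         # d x_t / d t along this path

    with torch.autocast("cuda", dtype=torch.bfloat16):
        pred_velocity = model(x_t, t, text_emb)               # assumed model interface
        loss = F.mse_loss(pred_velocity.float(), target_velocity.float())

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-3)
```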

Computational Costs and Optimization Strategies

The section discusses the computational costs, parallelism strategies, memory optimization, and cluster reliability in the Wan model. ​

  • The DiT model accounts for over 85% of the overall computation during training, with attention costs increasing quadratically with sequence length. ​
  • In scenarios with a sequence length of 1 million tokens, attention computation can take up to 95% of training time. ​
  • The GPU memory usage for the DiT model can exceed 8 TB when processing 1 million tokens with a micro-batch size of 1. ​
  • The model employs a combination of Data Parallelism (DP) and Fully Sharded Data Parallel (FSDP) to manage memory and computational workload (a generic PyTorch sketch follows this list).
  • A two-dimensional Context Parallelism (CP) strategy reduces communication overhead from over 10% to below 1% in specific configurations. ​
  • Activation offloading is prioritized to reduce GPU memory usage, allowing for computation overlap and improved performance. ​
  • The training cluster benefits from Alibaba Cloud’s intelligent scheduling and self-healing capabilities, ensuring high reliability and performance. ​
  • Overall, the section emphasizes the importance of optimizing computational costs and memory usage in large-scale model training.
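
As a rough illustration of the memory strategies above, the sketch below wraps a transformer with PyTorch FSDP, CPU parameter offloading, and activation checkpointing (standing in for the activation offloading described in the paper). It is a generic PyTorch pattern, not the Wan training code; the model and block class are placeholders.

```python
# Generic PyTorch FSDP setup with CPU parameter offload and activation checkpointing
# (sketch). It mirrors the strategies summarized above but is not the Wan training
# code; the transformer block class passed in is a placeholder, and the default
# process group is assumed to be initialized (torch.distributed.init_process_group).
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

def setup_fsdp(model: torch.nn.Module, block_cls: type) -> torch.nn.Module:
    # Shard parameters, gradients, and optimizer state across ranks, wrapping each
    # transformer block separately so only one block is gathered at a time.
    fsdp_model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({block_cls}),
        cpu_offload=CPUOffload(offload_params=True),  # keep sharded params on CPU
        device_id=torch.cuda.current_device(),
    )
    # Recompute block activations in the backward pass instead of storing them,
    # standing in here for the activation offloading described in the paper.
    apply_activation_checkpointing(
        fsdp_model, check_fn=lambda module: isinstance(module, block_cls)
    )
    return fsdp_model
```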

Inference Optimization Techniques for Video Generation

This section discusses various strategies for optimizing inference in video generation, including parallel processing, caching, and quantization. ​

  • The primary goal of inference optimization is to minimize video generation latency, typically involving around 50 sampling steps. ​
  • Techniques such as quantization and distributed computing are employed to reduce individual step time and overall computational load. ​
  • The Context Parallel strategy and model sharding are utilized to achieve nearly linear speedup on the Wan 14B model. ​
  • Diffusion caching leverages attention and CFG similarities, improving inference performance by 1.62 times for the Wan 14B model (a caching sketch follows this list).
  • FP8 quantization for GEMM operations results in a 1.13 times speedup in the DiT module, while 8-bit FlashAttention enhances efficiency by over 1.27 times. ​
  • Mixed 8-Bit optimization and FP32 accumulation techniques are applied to maintain accuracy and performance during quantization. ​
  • The optimizations lead to significant improvements in both computational efficiency and numerical stability in video generation tasks.
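
Diffusion caching exploits redundancy between adjacent sampling steps. The sketch below shows one common form of the idea, reusing the classifier-free-guidance unconditional branch on alternating steps; it is a generic illustration rather than the specific caching scheme used for Wan, and the denoiser interface and update rule are assumed.

```python
# Generic illustration of caching the CFG unconditional branch across sampling steps.
# This is not the exact caching scheme used for Wan; the denoiser interface and the
# Euler-style update below are placeholders for illustration.
import torch

def sample_with_cfg_cache(denoiser, x, timesteps, cond, uncond,
                          guidance_scale=5.0, refresh_every=2):
    cached_uncond_out = None
    for i, t in enumerate(timesteps):
        cond_out = denoiser(x, t, cond)
        # Recompute the unconditional branch only every `refresh_every` steps and
        # reuse the cached result in between, trading a little accuracy for speed.
        if cached_uncond_out is None or i % refresh_every == 0:
            cached_uncond_out = denoiser(x, t, uncond)
        guided = cached_uncond_out + guidance_scale * (cond_out - cached_uncond_out)
        x = x - guided * (1.0 / len(timesteps))   # placeholder Euler-style update
    return x
```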

Enhancing Video Generation with Prompt Alignment

This section discusses strategies for improving video generation by aligning user prompts with training caption distributions. ​

  • The prompt alignment process involves augmenting videos with diverse captions to cover various styles and lengths. ​
  • A distribution mismatch exists between user prompts, which are often concise, and the longer training captions, affecting video quality. ​
  • To address this, user prompts are rewritten using a large language model (LLM) to enhance detail and align with training captions (see the sketch after this list).
  • The LLM is guided to maintain original meanings while adding details and structuring prompts similarly to post-training captions. ​
  • The effectiveness of LLM-assisted prompt rewriting is evaluated, with Qwen2.5-Plus chosen for balancing speed and performance. ​
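
The rewriting step can be driven by any instruction-following LLM (the authors chose Qwen2.5-Plus). The sketch below shows one way to frame the instruction; `call_llm` is a hypothetical placeholder for whatever chat-completion client is available, and the instruction text is illustrative rather than the prompt used by the authors.

```python
# Prompt rewriting with an LLM to match the longer, denser training-caption style.
# `call_llm` is a hypothetical placeholder for any chat-completion client; the
# instruction text is illustrative, not the prompt used by the authors.

REWRITE_INSTRUCTION = (
    "Rewrite the user's video prompt into a detailed caption of 3-5 sentences. "
    "Preserve the original meaning exactly; add concrete details about subjects, "
    "motion, camera, lighting, and style, matching the tone of dense video captions."
)

def rewrite_prompt(user_prompt: str, call_llm) -> str:
    """Expand a short user prompt so it better matches the training-caption distribution."""
    return call_llm(system=REWRITE_INSTRUCTION, user=user_prompt)

# Example (hypothetical output):
# rewrite_prompt("a dog on a beach", call_llm)
# -> "A golden retriever sprints along a sunlit beach at low tide, ..."
```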

Evaluating Video Generation Models with Wan-Bench

Wan-Bench introduces a comprehensive evaluation framework for video generation models, aligning with human perception through multiple metrics. ​

  • Wan-Bench evaluates video generation models across three dimensions: dynamic quality, image quality, and instruction following, using 14 metrics.
  • Dynamic quality metrics include large motion generation, human artifacts detection, physical plausibility, smoothness, pixel-level stability, and ID consistency. ​
  • The large motion generation score for Wan 1.3B is 0.468, while the human artifacts score is 0.707.
  • Image quality is assessed through comprehensive image quality, scene generation quality, and stylization, with a comprehensive image quality score of 0.596 for Wan 1.3B. ​
  • Instruction following metrics include single object accuracy (0.930) and action instruction following (0.844) for Wan 1.3B.
  • The overall weighted score for Wan 1.3B is 0.689, indicating its competitive performance among evaluated models.
  • The evaluation framework utilizes human feedback to weight dimensions based on user preferences, enhancing alignment with human perception (a small aggregation sketch follows this list).
  • The results demonstrate the effectiveness of Wan-Bench in providing a nuanced assessment of video generation capabilities.
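
The weighted scoring reduces to a simple weighted average over per-metric results. The sketch below shows that aggregation; the weights here are equal placeholders rather than the human-preference-derived weights used by Wan-Bench, and only the example scores echo the numbers quoted above.

```python
# Weighted aggregation of per-metric Wan-Bench scores (sketch).
# The weights below are illustrative placeholders; in Wan-Bench they are derived
# from human preference data. Only the example scores echo the summary above.

def wan_bench_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-metric scores; weights are normalized to sum to 1."""
    total_weight = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight

scores = {  # a few Wan 1.3B metrics quoted in the list above
    "large_motion": 0.468,
    "human_artifacts": 0.707,
    "image_quality": 0.596,
    "single_object": 0.930,
    "action_following": 0.844,
}
weights = {m: 1.0 for m in scores}  # placeholder: equal weights, not human-derived ones
print(round(wan_bench_score(scores, weights), 3))
```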

Metrics and Results of Wan Model

The section evaluates the performance of the Wan model against competitors using various metrics and qualitative assessments. ​

  • Wan model outperforms commercial and open-source models in video generation, achieving a total score of 86.22%. ​
  • The Wan 14B model scored 86.67% in visual quality and 84.44% in semantic consistency, leading the benchmark. ​
  • Human evaluations showed Wan 14B excelled in alignment, image quality, dynamic quality, and overall quality across 700 tasks.
  • Wan demonstrated state-of-the-art performance on the VBench leaderboard, excelling in 16 human-aligned dimensions. ​
  • The ablation study revealed that shared adaptive normalization layers improved performance while reducing parameter count. ​
  • Among text encoders, umT5 outperformed others in generating text embeddings, achieving a FID score of 43.01. ​
  • The VAE model consistently achieved lower FID scores than the VAE-D variant, indicating better performance in image generation. ​
  • The efficient Wan 1.3B variant scored 83.96%, surpassing several commercial models, showcasing its competitive edge. ​

Advancements in Image-to-Video Generation

This section discusses the development and evaluation of the Wan-I2V model for image-to-video generation and related tasks. ​

  • The image-to-video (I2V) generation task synthesizes dynamic video sequences from static images guided by textual prompts. ​
  • Wan-I2V model utilizes a binary mask to control which frames are generated, enhancing video synthesis capabilities. ​
  • The model design includes a condition image concatenated with zero-filled frames, processed through a diffusion model (a tensor sketch follows this list).
  • Joint training incorporates multiple tasks, including image-to-video generation, video continuation, and first-last frame transformation.
  • The dataset for I2V generation emphasizes similarity between the first frame and video content to improve training stability. ​
  • The model achieved a visual quality win rate of 81.6% in pairwise comparisons against other models.
  • The unified framework for video editing integrates various tasks, enhancing flexibility and usability in video generation. ​
  • The Video Condition Unit (VCU) unifies diverse input conditions, allowing for effective editing and generation tasks. ​
  • The model supports resolutions up to 720p and incorporates advanced techniques for video personalization. ​
  • The camera motion control module utilizes extrinsic and intrinsic parameters to match video motion accurately. ​
  • Real-time video generation is achieved through a streaming pipeline, allowing for continuous video creation without length constraints. ​
  • The integration of Streamer and Consistency Models accelerates the video generation process, achieving 8-16 FPS. ​
  • Quantization techniques optimize the model for consumer-level devices, balancing efficiency and generation quality. ​
  • The audio generation framework produces synchronized soundtracks for videos, enhancing the overall multimedia experience. ​
  • The training dataset for audio generation consists of on the order of a thousand hours of filtered videos, ensuring high-quality soundtracks.
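
The masked-frame conditioning described at the start of this list (a condition image concatenated with zero-filled frames, plus a binary mask marking which frames to generate) amounts to simple tensor bookkeeping. The shapes and channel layout in the sketch below are illustrative assumptions, not the exact Wan-I2V layout.

```python
# Building an image-to-video conditioning input (sketch): the reference frame is kept,
# the remaining frames are zero-filled, and a binary mask marks frames to be generated.
# Shapes and channel layout are illustrative, not the exact Wan-I2V layout.
import torch

def build_i2v_condition(first_frame_latent: torch.Tensor, num_latent_frames: int):
    """first_frame_latent: (C, H, W) latent of the conditioning image."""
    c, h, w = first_frame_latent.shape
    cond = torch.zeros(c, num_latent_frames, h, w)
    cond[:, 0] = first_frame_latent                  # keep the given first frame
    mask = torch.ones(1, num_latent_frames, h, w)    # 1 = frame must be generated
    mask[:, 0] = 0.0                                 # 0 = frame is provided as condition
    # The diffusion model sees noisy latents concatenated with the condition and mask.
    return torch.cat([cond, mask], dim=0)            # (C + 1, T, H, W)
```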

Comparative Analysis of Audio Generation Models

Our V2A model outperforms MMAudio in audio generation, showcasing enhanced long-term consistency and cleaner outputs, particularly in scenarios like ‘pouring coffee’ and ‘typing’. ​ However, it struggles with human vocal sounds due to the exclusion of speech-related data in training. ​ Future work aims to incorporate speech generation capabilities. ​

Advancements and Limitations of Wan Model

The section discusses the Wan video generation model's achievements, limitations, and future directions for improvement. ​

  • Wan is a foundational video generation model that has shown significant improvements in motion amplitude and instruction-following capabilities. ​
  • Challenges remain in preserving fine-grained details during large motion scenarios, requiring further research for fidelity enhancement. ​
  • The 14B model's inference time is approximately 30 minutes on a single GPU, highlighting the need for efficiency and scalability. ​
  • The model's performance in specific domains like education and medicine is currently insufficient, prompting plans for open-sourcing and community development.
  • A smaller 1.3B model achieves competitive performance and enables inference on consumer-grade GPUs, enhancing accessibility for content creators. ​


Joshua Berkowitz March 27, 2025

JoshAI is my A.I. writing and research assistant, trained on my copy and an extensive instruction set for creating research reviews from primary sources. I use a multi-model, multi-agent workflow to ingest, analyze, understand, and generate suggested article content in a predefined structure. Under the hood it uses fine-tuned Mistral AI and ChatGPT Assistants with a custom set of tools for document processing. This AI is an assistant and relies on me to put together the completed article. Want to learn more? Contact me!
