What Is WAN 2.2? The Open-Source AI Video Model Explained
WAN 2.2 is the open-source AI model for video generation with audio that ZSky AI uses to power its video generator. Developed by Alibaba's AI research team, WAN 2.2 represents a significant leap in what open-source video models can produce: high-resolution video with coherent motion, strong text conditioning, and quality that approaches proprietary commercial models like Runway and Pika — all with publicly available weights.
This article explains what WAN 2.2 is, how it works, what makes it technically impressive, and why ZSky AI chose it as the model of record for video generation with audio. If you want to understand the technology behind the videos you create on ZSky AI, this is the complete picture.
Who Made WAN 2.2?
WAN 2.2 was developed by Alibaba's AI research group, part of the broader Tongyi AI initiative. Alibaba has invested heavily in both language and multimodal AI research, and the WAN series of video models represents their contribution to the open-source video generation with audio ecosystem. "Wan" (万) means "ten thousand" or "versatile" in Chinese, reflecting the model's broad applicability across diverse video-generation tasks.
The WAN series has been released under open weights licenses, making it available for researchers, developers, and platforms like ZSky AI to build on. This open approach distinguishes WAN from models like Runway Gen-4 and Sora, which are proprietary and accessible only through the companies' own APIs.
WAN 2.2 Model Variants
WAN 2.2 is released in multiple sizes to accommodate different hardware constraints and quality requirements:
WAN 2.2 1.3B
The smaller variant with 1.3 billion parameters. This model is designed to run on hardware with more modest GPU memory requirements — around 8–12GB VRAM. It produces good quality video and is suitable for experimentation, rapid iteration, and deployment in resource-constrained environments. Generation speed is significantly faster than the 14B model.
WAN 2.2 14B
The full-scale model with 14 billion parameters. This is the version that ZSky AI runs on its RTX 5090 GPU cluster. The 14B model produces substantially higher quality output than the 1.3B version: more realistic motion, better human anatomy, improved lighting and shadows, and stronger adherence to complex text prompts. It requires 16–32GB of GPU VRAM for efficient inference.
Technical Architecture
WAN 2.2 uses a video diffusion transformer (DiT) architecture. Like FLUX for images, it departs from the UNet-based architecture of earlier video models in favor of a pure transformer design that scales more efficiently with model size and training data.
3D Video VAE
WAN 2.2 uses a purpose-built 3D Variational Autoencoder that handles compression in both spatial and temporal dimensions simultaneously. Rather than compressing each frame independently (as a 2D image VAE would), the 3D VAE encodes temporal relationships between frames directly into the latent representation. This means the latent space preserves motion information, not just individual frame appearance.
The temporal compression ratio in WAN's VAE is 4x, meaning 4 video frames are compressed into a single latent temporal step. Spatial compression is 8x in each dimension. A 480p clip at 16fps is compressed to a latent that is roughly 256x smaller than the original pixel data, enabling the diffusion model to work efficiently on full video sequences.
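As a rough sanity check on those ratios, the snippet below works out the latent shape for a 5-second clip at 16 fps. The frame count, frame dimensions, and the decision to ignore channel counts are assumptions made purely for illustration, not WAN's exact internals.

```python
# Illustrative arithmetic only: the latent shape implied by the compression
# ratios above (4x temporal, 8x spatial per axis). Channel counts are ignored.

frames, height, width = 80, 480, 832      # assumed 5 s at 16 fps, 480p-class frame
t_ratio, s_ratio = 4, 8                   # ratios stated in the text

latent_frames = frames // t_ratio         # 80 / 4  -> 20 latent time steps
latent_height = height // s_ratio         # 480 / 8 -> 60
latent_width = width // s_ratio           # 832 / 8 -> 104

reduction = (frames * height * width) / (latent_frames * latent_height * latent_width)
print(latent_frames, latent_height, latent_width, reduction)   # 20 60 104 256.0
```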
Flow Matching
Like FLUX on the image side, WAN 2.2 uses flow matching rather than DDPM-style diffusion. Flow matching learns straight-line transport paths between noise and data distributions, which enables high-quality generation with fewer denoising steps compared to standard diffusion models. This translates directly to faster inference times for the same quality level.
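To make the idea concrete, here is a minimal, generic flow-matching training step in PyTorch. It is an illustration of the technique in general, not WAN 2.2's actual training code: sample a point on the straight line between noise and data, and train the network to predict that line's constant velocity.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x_data):
    """x_data: clean latents of shape (batch, ...); model maps (x_t, t) -> velocity."""
    noise = torch.randn_like(x_data)
    # Sample a time t in [0, 1] per example, shaped for broadcasting.
    t = torch.rand(x_data.shape[0], *([1] * (x_data.dim() - 1)), device=x_data.device)
    # Point on the straight-line path from noise (t=0) to data (t=1).
    x_t = (1.0 - t) * noise + t * x_data
    # The path's velocity is constant: data minus noise.
    target_velocity = x_data - noise
    predicted_velocity = model(x_t, t)
    return F.mse_loss(predicted_velocity, target_velocity)
```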
Multimodal Text Conditioning
WAN 2.2 uses a multilingual text encoder capable of processing prompts in both English and Chinese. Text conditioning is applied through cross-attention layers distributed throughout the transformer blocks, with the video latent tokens attending to the text representation at every layer. This deep, pervasive conditioning is one reason WAN 2.2 follows complex, multi-element prompts more reliably than earlier video models.
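The sketch below shows this conditioning pattern in a generic DiT-style block in PyTorch. The dimensions, layer structure, and normalization placement are illustrative assumptions rather than WAN 2.2's actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """One transformer block: self-attention over video tokens, then
    cross-attention where video tokens query the text embeddings."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (batch, num_video_tokens, dim)
        # text_tokens:  (batch, num_text_tokens, dim), assumed projected to the same dim
        h = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(h, h, h)[0]
        # Cross-attention: video tokens are the queries, text tokens the keys/values.
        h = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(h, text_tokens, text_tokens)[0]
        return video_tokens + self.mlp(self.norm3(video_tokens))
```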
Causal and Bidirectional Attention
The temporal attention in WAN 2.2 uses a combination of causal (past-to-present) and bidirectional (past and future) attention, depending on the generation mode. For text-to-video generation with audio, full bidirectional temporal attention is used, allowing the model to plan the entire video sequence at once and produce more globally coherent motion. For streaming or real-time modes, causal attention can be used to generate frames sequentially.
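The difference between the two modes comes down to the attention mask over temporal positions. The toy example below builds both masks with PyTorch; it illustrates the general mechanism, not WAN 2.2's internal code.

```python
import torch

num_frames = 6

# Bidirectional mode: nothing is masked, every frame can attend to every other frame.
bidirectional_mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)

# Causal mode: True marks blocked positions (future frames), the convention
# PyTorch attention layers use for boolean masks.
causal_mask = torch.triu(torch.ones(num_frames, num_frames, dtype=torch.bool), diagonal=1)

print(causal_mask.int())   # row i can attend only to frames 0..i
```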
What WAN 2.2 Can Generate
Text-to-Video
Given a text description, WAN 2.2 generates a video clip from scratch. The model handles a wide range of subjects and scenes: natural environments, architecture, human subjects, animals, abstract concepts, and stylized aesthetics. Prompt adherence is strong for the 14B model, with multiple objects, spatial relationships, and specified motion directions all rendered with good fidelity.
Image-to-Video
WAN 2.2 can also animate a still image. You provide a reference image as the first frame, optionally with a text prompt describing how you want the scene to evolve, and the model generates the subsequent frames. This mode gives creators significant control over the visual starting point while letting the model handle motion synthesis. ZSky AI's video generator supports this mode directly.
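Conceptually, the two modes differ only in whether a reference first frame is supplied. The sketch below is purely hypothetical: wan_generate, its parameters, and its defaults are illustrative placeholders, not a real WAN 2.2 or ZSky AI interface.

```python
def wan_generate(prompt, first_frame_path=None, num_frames=80, resolution=(1280, 720)):
    """Assemble a generation request. A real deployment would run the model here;
    this stub only returns the request so the two modes can be compared."""
    mode = "image-to-video" if first_frame_path else "text-to-video"
    return {"mode": mode, "prompt": prompt, "first_frame": first_frame_path,
            "num_frames": num_frames, "resolution": resolution}

# Text-to-video: the model invents the whole scene from the prompt.
print(wan_generate("A lighthouse on a rocky coast at dusk, waves rolling in, slow zoom in"))

# Image-to-video: a reference image fixes the first frame; the prompt describes the motion.
print(wan_generate("The camera slowly pans right as fog rolls in",
                   first_frame_path="lighthouse.jpg"))
```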
Motion Types Handled Well
WAN 2.2 excels at several categories of motion that are notoriously difficult for AI video models:
- Camera motion. Panning, dolly shots, orbital movement, and zoom are all handled smoothly with consistent perspective throughout the clip.
- Natural environments. Wind in grass and trees, flowing water, clouds, fire, and smoke are rendered with physically plausible dynamics.
- Human locomotion. Walking, running, and simple gestures are anatomically coherent across the full clip duration.
- Object interaction. Objects that contact each other — a hand picking something up, liquid filling a glass — are handled with reasonable physical accuracy.
WAN 2.2 vs. Other Video Models
| Model | Open Source | Max Resolution | Max Duration | Text Conditioning | Image-to-Video |
|---|---|---|---|---|---|
| WAN 2.2 14B | Yes | 1080p (upscaled) | ~10s | Excellent | Yes |
| Stable Video Diffusion | Yes | 576p | 4s | Limited | Yes |
| CogVideoX | Yes | 720p | 6s | Good | Yes |
| Runway Gen-4 | No | 4K | 16s | Excellent | Yes |
| Pika 2.0 | No | 1080p | 8s | Very Good | Yes |
| OpenAI Sora | No | 1080p | 60s | Excellent | Limited |
Among open-source video models, WAN 2.2 14B is the clear quality leader as of early 2026. It is the only open model that approaches the output quality of commercial platforms, making it the natural choice for a service like ZSky AI that wants to offer high-quality generation without relying on a third-party proprietary API.
Why ZSky AI Chose WAN 2.2
ZSky AI made a deliberate choice to build on open-source models rather than routing generations through commercial APIs like Runway or Pika. The reasons come down to three factors:
Control and Privacy
When ZSky AI runs WAN 2.2 on its own GPU hardware, user prompts and generated content never leave ZSky AI's infrastructure. Nothing is sent to Alibaba's servers or any other third party. This is a fundamental privacy guarantee that would be impossible to offer if generation were handled by a commercial API.
Cost and Pricing
Running generation on owned hardware eliminates the per-generation API cost charged by commercial video platforms. This is what allows ZSky AI to offer a generous free tier with no video watermarks, and to price paid plans significantly below what competitors charge for equivalent output quality.
Quality at the Frontier
WAN 2.2 14B on RTX 5090 hardware produces output quality that competes directly with commercial platforms. Users are not making a trade-off on quality in exchange for lower cost. The combination of state-of-the-art open model weights and dedicated high-VRAM GPU hardware delivers results comparable to the best available options.
Seven RTX 5090 GPUs, each with 32GB GDDR7 memory, handle WAN 2.2 14B inference without quantization, memory offloading, or other compromises that degrade output quality.
Hardware Requirements for WAN 2.2
For those interested in running WAN 2.2 locally, here are the practical requirements:
- WAN 2.2 1.3B: 8GB VRAM minimum (RTX 3070 or better). Generation time on RTX 3080: approximately 60–120 seconds for a 5-second clip at 480p.
- WAN 2.2 14B (FP16): 24GB+ VRAM strongly recommended. RTX 3090 or RTX 4090 minimum for practical use. Generation time on RTX 4090: approximately 3–8 minutes for a 5-second clip at 720p.
- WAN 2.2 14B (quantized): INT8 or INT4 quantized versions can reduce VRAM requirements to 16GB or 12GB respectively, with some quality degradation.
ZSky AI's RTX 5090 cluster runs WAN 2.2 14B at full FP16 precision without quantization, generating 5-second 720p clips in approximately 45–90 seconds depending on the specific generation parameters.
Writing Good Prompts for WAN 2.2
WAN 2.2 responds well to detailed, descriptive prompts. A few principles consistently improve output quality, with a small example after the list showing one way to assemble the pieces:
- Describe the camera, not just the subject. "The camera slowly pulls back from a close-up of a campfire to reveal a forest at night" gives the model explicit direction on both content and camera motion.
- Specify motion explicitly. WAN 2.2 follows motion descriptions well. "A leaf drifting slowly downward in still air" or "a car accelerating through an intersection" will produce the described motion more reliably than generic scene descriptions.
- Set the lighting and atmosphere. "Golden hour sunlight, long shadows" or "overcast diffuse light, muted colors" help establish the visual mood and influence color grading throughout the clip.
- Keep subjects and actions specific. The more precisely you describe what is happening, the more the model can lean on strong pattern matches from training data. "A golden retriever running through a sprinkler in a suburban backyard" will outperform "a dog playing outside."
- Avoid over-specifying contradictory elements. Trying to describe too many simultaneous actions or scene changes in a 5–10 second clip tends to produce incoherent results. For complex sequences, generate multiple shorter clips and edit them together.
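Here is a small, purely illustrative way to keep those elements organized before joining them into one descriptive prompt. The field names and example values are assumptions, not a required format.

```python
# Illustrative only: one way to keep the elements above organized before joining
# them into a single descriptive prompt. Field names and values are assumptions.

prompt_parts = {
    "camera":   "the camera slowly pulls back from a close-up",
    "subject":  "a golden retriever running through a sprinkler",
    "setting":  "a suburban backyard in summer",
    "motion":   "water droplets arcing through the air",
    "lighting": "golden hour sunlight, long shadows",
}

prompt = ", ".join(prompt_parts.values())
print(prompt)
```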
Generate Video with WAN 2.2 on ZSky AI
WAN 2.2 14B on dedicated RTX 5090 GPUs. Text-to-video and image-to-video. 200 free credits at signup + 100 daily when logged in.
Try Video Generator →
Frequently Asked Questions
What is WAN 2.2?
WAN 2.2 is an open-source AI video generation with audio model developed by Alibaba's research team. It is capable of generating high-quality video from text prompts or image inputs. WAN 2.2 comes in multiple sizes, with the 14B parameter version offering cinematic quality on par with leading proprietary models. ZSky AI runs WAN 2.2 14B on dedicated NVIDIA RTX 5090 GPUs to power its video generation with audio tool.
Who made WAN 2.2?
WAN 2.2 was developed by Alibaba's AI research group, part of the Tongyi family of AI models. The WAN series is Alibaba's contribution to the open-source video generation with audio ecosystem.
How does WAN 2.2 compare to Stable Video Diffusion?
WAN 2.2 significantly outperforms Stable Video Diffusion in every measurable dimension: output resolution, clip duration, motion quality, and prompt adherence. WAN 2.2 14B can generate up to 10 seconds of 1080p video with coherent motion, while SVD tops out at 4 seconds at 576p with limited text conditioning. WAN 2.2 represents the current state of the art for open video generation with audio.
What resolution and duration can WAN 2.2 generate?
WAN 2.2 supports resolutions up to 1280x720 (720p) natively, with upscaling to 1080p available post-generation. It can generate clips up to approximately 10 seconds at 16 frames per second. The 14B parameter model produces higher quality output than the smaller 1.3B variant, especially for complex scenes and realistic human motion.
Is WAN 2.2 available for free?
The WAN 2.2 model weights are open source and free to download. However, running WAN 2.2 14B requires significant GPU hardware. ZSky AI makes WAN 2.2 accessible to everyone through a browser-based interface with free daily generation credits, removing the need to own or manage GPU hardware.