How AI Image Generation Actually Works: Diffusion Models Explained
Type a sentence, click generate, and a fully formed image appears in seconds. It can be photorealistic, painterly, cinematic, or abstract — anything your words describe. Behind this seemingly effortless experience lies a sophisticated mathematical process called diffusion, and understanding it will fundamentally change how you write prompts, choose parameters, and evaluate AI image tools.
This article breaks down every stage of the diffusion pipeline in detail: how noise becomes structure, how text steers that structure, what latent space actually means, and why parameters like CFG scale and step count change your results the way they do. If you have ever wondered how Stable Diffusion, FLUX, or DALL-E actually work under the hood, this is the comprehensive explanation.
The Forward Process: Destroying Images with Noise
Every diffusion model begins its training by learning what happens when you systematically destroy images. Take a photograph — a landscape, a portrait, a still life. Add a small amount of random Gaussian noise. The image looks slightly grainy but still recognizable. Add more noise. Then more. After enough steps, the original image is completely gone, replaced by pure static indistinguishable from random television snow.
This is the forward diffusion process, and it is entirely mechanical. No neural network is involved. The mathematics are straightforward: at each timestep t, a small amount of noise sampled from a Gaussian distribution is added to the image according to a predefined noise schedule. The schedule determines how quickly the image is destroyed — linear schedules add noise at a constant rate, while cosine schedules preserve more image structure in early steps and destroy it faster at the end.
The forward process can be written concisely. Given a clean image x0, the noisy version at timestep t is:
x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * epsilon
Here, alpha_t (usually written ᾱ_t, "alpha-bar," in the diffusion literature because it is the cumulative product of per-step retention factors) decreases from 1 to near 0 as t increases, and epsilon is random Gaussian noise. At t=0, you get the original image. At the final timestep, alpha_t is nearly zero and the result is almost entirely noise.
The crucial insight is that because this forward process is fully defined mathematically, we know exactly what noise was added at each step. This gives us training labels: we can train a neural network to predict that noise, and if it predicts correctly, it can reverse the process.
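This jump-to-any-timestep property can be sketched in a few lines of NumPy. Everything here is a toy stand-in: the "image" is a random array and alpha_bar is an illustrative cosine-style schedule, not any real model's schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a random "image" and a cosine-style cumulative schedule.
x0 = rng.standard_normal((64, 64))
T = 1000
t_all = np.arange(T)
alpha_bar = np.cos((t_all / T) * np.pi / 2) ** 2  # decays from 1 toward 0

def forward_noise(x0, t):
    """Jump directly to timestep t with the closed-form forward process."""
    eps = rng.standard_normal(x0.shape)  # this noise is the training label
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return xt, eps

x_early, _ = forward_noise(x0, 10)    # still almost entirely the original image
x_late, _ = forward_noise(x0, 990)    # almost pure noise

def corr(a, b):
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

print(corr(x0, x_early))  # close to 1: the image survives early steps
print(corr(x0, x_late))   # near 0: the image is essentially gone
```

Because the formula lets us noise any image to any timestep in one step, training can sample timesteps at random instead of simulating the whole chain.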
The Reverse Process: Learning to Denoise
The neural network at the heart of a diffusion model — historically a UNet, increasingly a transformer in modern architectures — is trained to do one thing: given a noisy image at a particular noise level, predict the noise that was added.
Training works by repeatedly sampling images from a large dataset, adding noise at random timesteps, and asking the network to predict the noise. The difference between the predicted noise and the actual noise (measured as mean squared error) is the training loss. Over millions of training steps on hundreds of millions of images, the network learns to predict noise with high accuracy across all noise levels.
Why does noise prediction lead to image generation? Because if you can accurately predict the noise in a noisy image, you can subtract it to recover a slightly cleaner version. Repeat this across many steps, starting from pure noise, and coherent image structure emerges progressively. The model effectively learns the statistical structure of all images in its training data — because understanding what "real images look like" is exactly what you need to know to distinguish signal from noise.
During generation (inference), the process unfolds in reverse:
1. Start with a tensor of pure random Gaussian noise
2. Feed it to the denoising network along with the current timestep
3. The network predicts the noise component
4. Subtract the predicted noise (scaled by the sampler algorithm) to get a slightly cleaner image
5. Repeat from step 2 with the updated image and the next timestep
6. After all steps complete, the result is a clean generated image
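The loop above can be sketched in NumPy. To keep the sketch self-contained, a hypothetical oracle predictor stands in for the trained network (it "knows" the clean image), and the update is a deterministic DDIM-style step:

```python
import numpy as np

rng = np.random.default_rng(2)

target = rng.standard_normal(64)             # the "clean image" the oracle knows
alpha_bar = np.linspace(0.9999, 0.0001, 50)  # 50-step schedule, noisiest last

def predict_noise(xt, t):
    """Oracle stand-in for the trained network: recover eps assuming x0 == target."""
    return (xt - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1 - alpha_bar[t])

x = rng.standard_normal(64)                  # start from pure Gaussian noise
for t in range(49, 0, -1):                   # walk the timesteps high -> low
    eps = predict_noise(x, t)                # the network's noise prediction
    # DDIM-style deterministic update: estimate the clean image, then
    # re-noise that estimate to the next (lower) noise level.
    x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    x = np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1 - alpha_bar[t - 1]) * eps

print(np.max(np.abs(x - target)))  # with an oracle, the loop recovers the target
```

A real denoiser only approximates the noise, so its predictions improve as the latent gets cleaner, which is why the loop needs many small steps rather than one big one.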
The number of steps, the sampler algorithm, and the noise schedule all influence the final quality. More steps generally mean higher quality because each step removes a smaller amount of noise, allowing the network to make more precise predictions. But there are diminishing returns — the difference between 20 steps and 30 steps is usually more visible than the difference between 30 and 50.
Latent Diffusion: The Compression Breakthrough
Running diffusion directly on pixel values is computationally brutal. A 1024×1024 RGB image contains over 3 million values, and the denoising network must process all of them at every single step. This made early pixel-space diffusion models like Imagen and DALL-E 2 extremely expensive to train and slow to run.
The breakthrough that made consumer-grade AI image generation possible was latent diffusion, introduced in the 2022 paper "High-Resolution Image Synthesis with Latent Diffusion Models" by Robin Rombach and colleagues at CompVis (LMU Munich). This is the architecture behind Stable Diffusion and, with modifications, FLUX.
The idea is to separate the problem into two parts:
- Compression: Train a Variational Autoencoder (VAE) to compress images into a much smaller latent representation and decompress them back to pixels
- Generation: Run the entire diffusion process in this compressed latent space
A typical VAE compresses a 512×512×3 pixel image (786,432 values) into a 64×64×4 latent tensor (16,384 values) — a 48x reduction. For 1024×1024 images (as in SDXL), the latent is 128×128×4, still a massive compression from the 3.1 million pixel values.
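The compression arithmetic is easy to check:

```python
# Compression arithmetic from the text: pixel values vs. latent values.
pixels_512 = 512 * 512 * 3        # 786,432 values
latent_512 = 64 * 64 * 4          # 16,384 values
pixels_1024 = 1024 * 1024 * 3     # 3,145,728 values
latent_1024 = 128 * 128 * 4      # 65,536 values

print(pixels_512 // latent_512)    # 48x reduction at 512x512
print(pixels_1024 // latent_1024)  # same 48x ratio at SDXL resolution
```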
The VAE is trained separately from the diffusion model. It learns to preserve perceptually important information — edges, textures, colors, spatial relationships — while discarding imperceptible details. Because human visual perception is insensitive to the discarded information, decoded images look virtually identical to originals despite the heavy compression.
Latent diffusion reduced compute requirements by roughly 48x compared to pixel-space diffusion, making it possible to generate high-quality images on a single consumer GPU in seconds rather than minutes on a cluster.
The full generation pipeline in latent diffusion is: encode text → generate noise in latent space → denoise in latent space over N steps → decode final latent to pixels via VAE. The diffusion model never touches pixels directly.
CLIP and Text Encoding: Turning Words into Numbers
Diffusion without text conditioning produces random images sampled from the training distribution. To generate specific content, the model needs to understand your prompt. This is where text encoders come in.
The most influential text encoder in image generation is CLIP (Contrastive Language-Image Pre-training), developed by OpenAI. CLIP was trained on approximately 400 million image-text pairs scraped from the internet. During training, CLIP learned to map images and their text descriptions into a shared embedding space where semantically related images and text are close together.
When you write a prompt like "a red fox sitting in snow under northern lights," CLIP's text encoder converts this into a sequence of token embeddings — high-dimensional vectors (typically 768 or 1024 dimensions per token) that numerically represent the meaning of each token and its contextual relationships with other tokens in the prompt.
These embeddings capture rich semantic information:
- Object identity: "fox" maps to a region of embedding space near images of foxes
- Attributes: "red" shifts the embedding toward red-colored variants
- Spatial relationships: "sitting in snow" encodes both the action and the environment
- Style and atmosphere: "northern lights" pulls toward specific lighting conditions and color palettes
Modern models use multiple text encoders for improved understanding. SDXL uses two CLIP encoders (ViT-L/14 and ViT-bigG). FLUX uses CLIP alongside T5-XXL, a text-only transformer encoder that provides much richer natural language understanding, particularly for complex multi-clause prompts, spatial descriptions, and abstract concepts.
The dual-encoder approach in FLUX is significant: CLIP excels at visual-semantic alignment (it knows what things look like), while T5 excels at language comprehension (it understands complex sentences). Together, they give the model both better visual knowledge and better prompt interpretation. This is a major reason FLUX handles long, detailed prompts more faithfully than SDXL.
Cross-Attention: Where Text Meets Image
Text embeddings on their own do not generate images. They must be injected into the denoising network so that every denoising step is influenced by the prompt. This injection happens through cross-attention layers.
In a standard attention mechanism, a neural network computes queries, keys, and values from the same input and attends to itself (self-attention). In cross-attention, the queries come from the image representation and the keys and values come from the text embeddings. This allows the image representation to "look at" the text at every layer and decide which parts of the prompt are relevant for each spatial region of the image.
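A minimal single-head cross-attention can be written directly from this description. This is a toy sketch: random projection matrices stand in for learned weights, and real models use many heads and higher dimensions.

```python
import numpy as np

rng = np.random.default_rng(3)

def cross_attention(image_tokens, text_tokens, d_k=32):
    """Single-head cross-attention: queries come from the image,
    keys and values come from the text embeddings."""
    d_img, d_txt = image_tokens.shape[1], text_tokens.shape[1]
    Wq = rng.standard_normal((d_img, d_k)) / np.sqrt(d_img)  # toy learned weights
    Wk = rng.standard_normal((d_txt, d_k)) / np.sqrt(d_txt)
    Wv = rng.standard_normal((d_txt, d_k)) / np.sqrt(d_txt)

    Q = image_tokens @ Wq                      # (n_img, d_k)
    K = text_tokens @ Wk                       # (n_txt, d_k)
    V = text_tokens @ Wv                       # (n_txt, d_k)

    scores = Q @ K.T / np.sqrt(d_k)            # each image token scores each text token
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over text tokens
    return weights @ V                         # text information blended per image token

img = rng.standard_normal((16, 64))   # 16 spatial tokens, e.g. a 4x4 latent patch grid
txt = rng.standard_normal((7, 96))    # 7 prompt token embeddings
out = cross_attention(img, txt)
print(out.shape)  # one output vector per image token
```

The key point is visible in the shapes: the output has one row per image token, but each row is a weighted mix of text values, which is how the prompt reaches every spatial region.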
In Stable Diffusion's UNet architecture, cross-attention layers are interspersed throughout the network at multiple resolution levels. This means text influence operates at both coarse scales (overall composition, subject placement) and fine scales (texture details, local color).
FLUX takes this further with joint attention blocks (also called MMDiT — Multi-Modal Diffusion Transformer). Instead of separate self-attention for the image and cross-attention for the text, FLUX concatenates image and text tokens into a single sequence and runs full self-attention over both. This allows image tokens to attend to text tokens and vice versa in a single operation, creating deeper bidirectional interaction between the two modalities.
This architectural difference partially explains why FLUX renders text within images much better than SDXL — the joint attention mechanism allows the model to precisely coordinate letter shapes with text token meanings at every layer.
Classifier-Free Guidance: Controlling Prompt Adherence
Even with cross-attention, the model might produce images that loosely relate to the prompt but do not closely follow it. Classifier-free guidance (CFG) is the mechanism that controls how tightly generation follows the text.
The concept is elegant. At each denoising step, the model runs twice:
- Conditioned pass: Denoise using your text prompt as conditioning
- Unconditioned pass: Denoise with no text conditioning (or with an empty prompt)
The actual denoising direction used is then computed as:
output = unconditioned + cfg_scale * (conditioned - unconditioned)
The term (conditioned - unconditioned) represents the "direction toward the prompt" — what the text is adding beyond what the model would generate on its own. The CFG scale amplifies this direction.
At cfg_scale = 1.0, the output equals the conditioned prediction with no amplification. At cfg_scale = 7.5 (a common default for SDXL), the text influence is amplified 7.5x beyond baseline. This makes the model follow the prompt much more strongly, at the cost of some naturalness and diversity.
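The two passes and the blend can be sketched in a few lines, with toy vectors standing in for the model's noise predictions:

```python
import numpy as np

def cfg_combine(cond, uncond, cfg_scale):
    """Classifier-free guidance: amplify the direction the prompt adds."""
    return uncond + cfg_scale * (cond - uncond)

cond = np.array([1.0, 2.0])     # toy conditioned noise prediction
uncond = np.array([0.5, 1.0])   # toy unconditioned noise prediction

print(cfg_combine(cond, uncond, 1.0))   # scale 1.0 returns the conditioned prediction
print(cfg_combine(cond, uncond, 7.5))   # scale 7.5 pushes 7.5x along the prompt direction
```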
| CFG Scale Range | Effect | Best For |
|---|---|---|
| 1.0 – 3.0 | Very loose prompt following, high diversity and naturalness | Abstract art, creative exploration |
| 3.5 – 7.0 | Balanced prompt adherence and image quality (FLUX sweet spot) | Most FLUX generations |
| 7.0 – 12.0 | Strong prompt following, may oversaturate (SDXL sweet spot) | SDXL, detailed scene matching |
| 12.0 – 20.0+ | Very strict adherence, often produces artifacts and harsh colors | Rarely recommended |
The computational cost of CFG is significant: it doubles the number of network forward passes per step, since you must run both the conditioned and unconditioned passes. Some newer architectures use distillation techniques to approximate CFG effects without the extra forward pass, improving generation speed. For a deeper look at prompt strategies that work with different CFG values, see our Prompt Engineering Masterclass.
Sampling Methods: The Math of Step-by-Step Denoising
The denoising loop requires a sampler (also called a scheduler or solver) that determines exactly how to update the noisy image at each step. Different samplers trade off speed, quality, and determinism.
DDPM (Denoising Diffusion Probabilistic Models) is the original approach. It models denoising as a stochastic process, adding a small amount of fresh noise at each step alongside the denoising. This produces high-quality results but requires many steps (often 1000) for convergence.
DDIM (Denoising Diffusion Implicit Models) reformulates the process as deterministic. Given the same seed and prompt, DDIM produces identical outputs every time, and it yields meaningful results in 20–50 steps, a massive speedup over DDPM. Determinism also makes it possible to interpolate between images by interpolating their initial noise tensors.
DPM-Solver and DPM-Solver++ treat the diffusion reverse process as an ordinary differential equation (ODE) and apply numerical ODE solvers to it. This mathematical reformulation achieves excellent quality in 15–25 steps by taking more intelligent step sizes. DPM-Solver++ is among the most popular samplers for SDXL.
Euler and Euler Ancestral are simple first-order solvers. Euler is deterministic; Euler Ancestral adds stochastic noise at each step, producing more varied outputs at the cost of some consistency. Both are fast and produce acceptable quality.
Flow Matching (used by FLUX) is a fundamentally different approach. Rather than learning to reverse a noise-adding process, flow matching learns a velocity field that directly transports noise to data along straight paths. This makes the generation process more efficient — straight paths require fewer steps to traverse than the curved paths of standard diffusion. FLUX typically produces excellent results in 20–28 steps.
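A toy sketch shows why straight paths help. With a hypothetical oracle velocity field whose paths are perfectly straight, plain Euler integration lands exactly on the target in a handful of steps; real learned velocity fields are only approximately straight, which is why FLUX still needs around 20 steps.

```python
import numpy as np

rng = np.random.default_rng(4)

target = rng.standard_normal(64)   # stand-in "clean image" the model would produce

def velocity(x, t):
    """Oracle stand-in for the learned velocity field: on the straight path
    x_t = (1 - t) * noise + t * target, the velocity points at the target."""
    return (target - x) / (1.0 - t)

# Euler-integrate the ODE from pure noise (t=0) toward data (t=1).
steps = 8
x = rng.standard_normal(64)        # start at pure noise
for i in range(steps):
    t = i / steps
    x = x + velocity(x, t) / steps

print(np.max(np.abs(x - target)))  # straight paths need very few steps
```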
Distillation methods like LCM (Latent Consistency Models) and SDXL Turbo train a student model to approximate the output of many denoising steps in a single or few steps. These can generate usable images in 4–8 steps at some quality cost, enabling near-real-time generation for interactive applications.
The VAE Decoder: Latents to Pixels
After the diffusion process finishes, you have a clean latent tensor — a compressed mathematical representation of your image. The final pipeline stage is decoding this latent back into visible pixels through the VAE decoder.
The VAE decoder is a convolutional neural network that performs the inverse of the encoder's compression. It takes the 64×64×4 (or 128×128×4 for SDXL-resolution models) latent and progressively upsamples it through convolutional layers, eventually producing a full-resolution RGB image.
The quality of the VAE matters more than many users realize. A good VAE preserves fine details like skin texture, thin lines, text characters, and subtle color gradients. A poor VAE introduces blur, color shifts, or artifacts during decoding. FLUX's VAE produces notably sharper decodes than SDXL's original VAE, contributing to FLUX's reputation for higher image sharpness.
Some workflows add post-processing after VAE decoding: face restoration with specialized models like GFPGAN or CodeFormer, upscaling with Real-ESRGAN or similar super-resolution models, or color correction. ZSky AI generates at native model resolution to preserve authentic model output characteristics, with optional upscaling available for users who need higher-resolution output for print or large displays. For a deep dive on resolution and upscaling, see our AI Image Resolution Guide.
Negative Prompts: What Not to Generate
Negative prompts are a feature of models that use CFG (particularly SDXL). Instead of running the unconditioned pass with a completely empty prompt, you can substitute a negative prompt that describes what you want to avoid.
The CFG formula then becomes:
output = negative_conditioned + cfg_scale * (positive_conditioned - negative_conditioned)
By replacing the unconditioned baseline with a negative prompt, you push generation away from whatever the negative prompt describes. Common negative prompts include terms like "blurry, low quality, distorted, extra fingers" to discourage common artifacts.
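The substitution is a one-line change to the CFG blend, shown here with toy vectors standing in for the two conditioned predictions:

```python
import numpy as np

def guided(positive, negative, cfg_scale):
    """CFG with a negative prompt: the baseline is the negative-conditioned
    prediction, so generation is pushed away from what it describes."""
    return negative + cfg_scale * (positive - negative)

pos = np.array([1.0, 0.0])   # toy prediction conditioned on the positive prompt
neg = np.array([0.0, 1.0])   # toy prediction conditioned on the negative prompt
out = guided(pos, neg, 7.5)
print(out)  # moves toward the positive direction and away from the negative
```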
FLUX uses negative prompts less heavily than SDXL because its improved architecture and training naturally avoid many common artifacts. However, negative prompts can still be useful for fine-tuning specific aspects of the output.
Seeds, Reproducibility, and Controlled Variation
The seed parameter is a random number that initializes the starting noise tensor. With deterministic samplers, the same seed + same prompt + same parameters = identical output every time. This is essential for reproducible workflows.
Seeds enable several useful techniques:
- Iteration: Find a seed that produces a good composition, then refine the prompt while keeping the seed fixed to preserve the overall layout
- Batch exploration: Generate the same prompt with sequential seeds to quickly explore the output space
- Sharing: Share a seed alongside your prompt so others can reproduce your exact result
- Interpolation: Blend between two seeds (using spherical linear interpolation on the noise tensors) to create smooth transitions between images
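Spherical linear interpolation is preferred over plain linear interpolation because it keeps the blended tensor's magnitude close to that of a real Gaussian sample (a straight average of two high-dimensional noise tensors would shrink the norm). A minimal NumPy version:

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between two noise tensors."""
    a_flat, b_flat = a.ravel(), b.ravel()
    cos_omega = np.dot(a_flat, b_flat) / (
        np.linalg.norm(a_flat) * np.linalg.norm(b_flat))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))  # angle between tensors
    so = np.sin(omega)
    return (np.sin((1 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

rng = np.random.default_rng(5)
n1 = rng.standard_normal((64, 64, 4))   # initial latent noise from seed A
n2 = rng.standard_normal((64, 64, 4))   # initial latent noise from seed B

mid = slerp(n1, n2, 0.5)   # halfway blend of the two starting points
print(np.linalg.norm(mid) / np.linalg.norm(n1))  # norm stays near the endpoints'
```

Feeding a sequence of such blends (t from 0 to 1) through the same prompt and sampler produces a smooth morph between the two seeds' images.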
Different seeds activate different initial noise patterns, which lead the denoising process down different "paths" through the model's learned distribution. Some seeds reliably produce better compositions than others for specific types of content, which is why experienced users often save and reuse seeds.
The Complete Pipeline: From Prompt to Image
Putting it all together, here is every step that happens when you generate an image on a platform like ZSky AI:
- Text encoding: Your prompt is tokenized and passed through CLIP (and T5 for FLUX) to produce text embeddings
- Noise initialization: A random noise tensor is generated in latent space using the seed value
- Iterative denoising: For each timestep (e.g., 28 steps for FLUX):
  - The denoising network receives the current noisy latent, the timestep, and the text embeddings
  - It predicts the noise (or the velocity, in flow matching) via self-attention, cross-attention, and feedforward layers
  - For CFG: a second unconditioned pass is run and the results are blended according to the guidance scale
  - The sampler computes the update step and produces the next, slightly cleaner latent
- VAE decoding: The final clean latent is passed through the VAE decoder to reconstruct full-resolution pixels
- Post-processing (optional): Upscaling, face restoration, or other enhancements
- Output: The finished image is saved and displayed
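The pipeline above can be sketched as a skeleton, with small stand-in functions in place of the real text encoder, denoiser, and VAE. The shapes follow the latent-diffusion convention described earlier (64×64×4 latent, 8x spatial upsampling); everything else is a toy.

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-ins for the real components (each is a large network in practice).
def encode_text(prompt):
    """CLIP/T5 stand-in: one random embedding vector per whitespace token."""
    return rng.standard_normal((max(len(prompt.split()), 1), 96))

def denoiser(latent, t, text_emb):
    """UNet/transformer stand-in: a fake noise prediction."""
    return rng.standard_normal(latent.shape)

def vae_decode(latent):
    """VAE-decoder stand-in: upsample 8x spatially, map 4 channels to RGB."""
    up = latent.repeat(8, axis=0).repeat(8, axis=1)
    return up[:, :, :3]

def generate(prompt, seed, steps=28, cfg_scale=3.5):
    g = np.random.default_rng(seed)
    text_emb = encode_text(prompt)               # text encoding
    latent = g.standard_normal((64, 64, 4))      # seeded noise in latent space
    for t in reversed(range(steps)):             # iterative denoising
        cond = denoiser(latent, t, text_emb)
        uncond = denoiser(latent, t, None)
        eps = uncond + cfg_scale * (cond - uncond)   # CFG blend
        latent = latent - eps / steps                # simplified sampler update
    return vae_decode(latent)                    # decode latent to pixels

img = generate("a red fox sitting in snow", seed=42)
print(img.shape)
```

The structure, not the output, is the point here: the denoising loop never touches pixels, and only the final decode produces the full-resolution image.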
On ZSky AI's dedicated RTX 5090 GPUs, this entire pipeline completes in 3–8 seconds for a standard FLUX generation at 1024×1024 resolution with 28 steps. The GPU performs billions of matrix multiplications across those steps, but the latent diffusion compression makes it tractable on a single card.
Why This Matters for Your Prompts
Understanding the pipeline changes how you approach prompt writing. Knowing that CLIP encodes semantic meaning means you should describe concepts clearly rather than relying on obscure syntax. Knowing that cross-attention operates at multiple scales means detailed prompts can influence both composition and fine details. Knowing how CFG works means you can adjust guidance scale to balance creativity against prompt fidelity rather than blindly using defaults.
For a practical application of these concepts, read our Prompt Engineering Masterclass: 50 Tips for Better AI Images, which translates the technical foundations covered here into actionable prompt-writing strategies.
See Diffusion in Action on ZSky AI
Generate images with advanced AI on dedicated RTX 5090 GPUs. 200 free credits at signup + 100 daily when logged in, no credit card required, no video watermark.
Try ZSky AI Free →
Frequently Asked Questions
How do diffusion models generate images from text?
Diffusion models generate images by starting from pure random noise and gradually removing that noise over many steps, guided by your text prompt. The text is first converted into numerical embeddings by a CLIP or T5 encoder. These embeddings are injected into the denoising neural network via cross-attention layers at each step, steering the noise removal toward an image that matches your description. After all denoising steps complete, a VAE decoder converts the compressed latent representation into a full-resolution pixel image.
What is latent diffusion and why does it matter?
Latent diffusion runs the noise-removal process in a compressed mathematical space rather than directly on image pixels. A Variational Autoencoder (VAE) compresses images by roughly 48x before diffusion begins. This dramatically reduces computation requirements — making it possible to generate high-quality images on consumer GPUs rather than requiring data center hardware — without meaningfully degrading output quality.
What does CFG scale (guidance scale) do in image generation?
CFG (Classifier-Free Guidance) scale controls how strictly the model follows your text prompt. At each denoising step, the model makes both a conditioned prediction (using your prompt) and an unconditioned prediction (ignoring the prompt). The CFG scale amplifies the difference between these two. Higher values (7–12 for SDXL, 3.5–7 for FLUX) produce images that match the prompt more closely but may look oversaturated. Lower values produce more natural-looking but less prompt-faithful results.
How many steps does it take to generate an AI image?
The number of denoising steps varies by model and sampler. SDXL typically uses 20–30 steps. FLUX produces excellent results in 20–28 steps. Distilled models like SDXL Turbo or LCM can generate acceptable images in 4–8 steps. More steps generally improve quality up to a point, after which returns diminish significantly.
What is the difference between Stable Diffusion and FLUX?
Both are open-weight diffusion models, but they differ in architecture. Stable Diffusion (SDXL) uses a UNet backbone with DDPM-style diffusion and CLIP text encoding. FLUX replaces the UNet with a transformer, uses rectified flow matching for straighter noise-to-image paths, and adds T5 alongside CLIP for better text understanding. FLUX generally produces higher quality images with better text rendering and anatomical accuracy.
Why do AI-generated images sometimes look distorted?
Distortions occur because diffusion models learn statistical patterns rather than explicit rules about anatomy or physics. The model predicts pixels based on training data patterns, not from understanding that hands have five fingers. Areas with high variance in training data are harder for the model to get right. Newer models like FLUX have substantially reduced these artifacts through larger datasets and improved architectures.