
ControlNet Guide: Take Full Control of AI Image Generation

By Cemhan Biricik · 2026-01-20 · 20 min read

Text-to-image diffusion models are powerful, but they have a fundamental limitation: you describe what you want in words, and the model interprets those words however it sees fit. You can ask for "a woman standing in a doorway," but you cannot control which way she faces, how the doorway is shaped, or where in the frame she stands. The model decides. ControlNet changes this entirely.

ControlNet is a neural network architecture introduced by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala in 2023 that adds precise spatial conditioning to pretrained diffusion models. Instead of relying solely on text prompts, you provide a structural reference image — an edge map, a depth map, a human pose skeleton, a rough scribble — and ControlNet constrains the generated image to follow that structure. You gain compositional control comparable to traditional digital art tools while retaining the generative power of diffusion models.

This guide covers everything you need to know to use ControlNet effectively: how it works under the hood, detailed walkthroughs of every major preprocessor type, parameter tuning, multi-ControlNet stacking, and practical workflows for real production use. Whether you are using ZSky AI, ComfyUI, or Automatic1111, these techniques apply universally.

How ControlNet Works: The Architecture

To understand ControlNet properly, you need to understand what it does to the diffusion model's architecture. A standard Stable Diffusion model denoises in latent space (a VAE compresses images into this space and decodes them back) using a U-Net: encoder blocks that progressively downsample the noisy latents, a middle block that processes them, and decoder blocks that upsample them back, with skip connections between matching encoder and decoder levels. During generation, the model progressively denoises random noise into an image, guided by text embeddings from CLIP (or T5 in newer models such as FLUX).

ControlNet creates a trainable copy of the encoder blocks of the pretrained model. This copy receives both the standard noisy latent input and an additional conditioning image (your edge map, depth map, etc.). The copied encoder processes these combined inputs and produces feature maps that are injected back into the original model's decoder through zero-convolution layers — convolutional layers initialized with zero weights so they start with no influence and gradually learn the appropriate conditioning strength during training.

This architecture is elegant for several reasons. First, the pretrained model's weights are never modified, so image quality is preserved. Second, the zero-convolution initialization means ControlNet starts as a no-op and learns conditioning gradually, preventing destructive interference during training. Third, because ControlNet operates on the encoder side, it influences the spatial structure of generation without overriding the model's learned understanding of textures, lighting, and style.
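
The zero-convolution trick is easy to demonstrate in miniature. Below is a toy, framework-free sketch (not the real implementation, which uses 1×1 convolution layers in PyTorch): a "zero conv" with zero-initialized weight and bias maps any control signal to zero, so summing it into the decoder features changes nothing until training moves the weights away from zero.

```python
def zero_conv(features, weight=0.0, bias=0.0):
    """Toy 1x1 'convolution': scales each feature value and adds a bias.
    Zero-initialized (weight=0, bias=0), it maps any input to all zeros."""
    return [weight * f + bias for f in features]

def inject(decoder_features, control_features, weight=0.0, bias=0.0):
    """ControlNet-style injection: decoder features plus the zero-conv'd
    control features, added elementwise."""
    projected = zero_conv(control_features, weight, bias)
    return [d + p for d, p in zip(decoder_features, projected)]

decoder = [0.5, -1.2, 3.0]
control = [9.9, 9.9, 9.9]   # whatever the control branch produced

# At initialization the injection is a no-op -- the pretrained model
# behaves exactly as before ControlNet was attached:
assert inject(decoder, control) == decoder

# Once training nudges the weight off zero, conditioning starts to flow:
print(inject(decoder, control, weight=0.1))  # decoder shifted by 0.1 * control
```

This is the whole reason training is non-destructive: the conditioning strength is learned from zero rather than imposed from the start.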

In practice, this means you can combine a Canny edge ControlNet with any text prompt. The edges define where structural boundaries appear; the text prompt defines what those boundaries represent, what colors and textures fill them, and what style the final image takes. The two conditioning signals are complementary, not competing.

Canny Edge Detection: Preserving Structure and Detail

Canny edge detection is the most widely used ControlNet preprocessor, and for good reason: it captures the structural boundaries of an image with precision that directly maps to how diffusion models construct compositions. The Canny algorithm works by detecting intensity gradients in the image, applying non-maximum suppression to thin edges to single-pixel width, and using hysteresis thresholding with two threshold values to classify edges as strong, weak, or non-edges.
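
The double-threshold step can be sketched in a few lines of framework-free Python (a simplification: the real algorithm first computes gradients and applies non-maximum suppression, and the hysteresis step then keeps weak edges only where they connect to strong ones):

```python
def classify_edges(gradients, low, high):
    """Canny double-threshold classification of gradient magnitudes.
    'strong' edges are always kept, 'weak' edges survive only if connected
    to a strong edge (hysteresis, omitted here), 'none' is discarded."""
    labels = []
    for g in gradients:
        if g >= high:
            labels.append("strong")
        elif g >= low:
            labels.append("weak")
        else:
            labels.append("none")
    return labels

# Lower thresholds admit more edges (more detail in the control map);
# higher thresholds keep only prominent structural boundaries.
grads = [30, 90, 140, 210]
print(classify_edges(grads, low=100, high=200))  # ['none', 'none', 'weak', 'strong']
print(classify_edges(grads, low=50, high=150))   # ['none', 'weak', 'weak', 'strong']
```

The same gradient values produce a sparser or denser edge map purely depending on where you set the two thresholds, which is why threshold tuning matters so much in the next section.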

Adjusting Canny Thresholds

The two threshold parameters — low threshold and high threshold — control how much detail the edge map captures. This is the single most important tuning decision for Canny ControlNet. Lower thresholds admit more edges, pulling fine texture, hair, and fabric detail into the control map, but also noise. Higher thresholds keep only strong structural boundaries, giving the model more freedom to invent detail within them. Medium settings (roughly 80–120 for the low threshold, 150–200 for the high) work well for most subjects; lower both for intricate subjects you want preserved, raise both for clean architectural references.

Best Practices for Canny ControlNet

Set the ControlNet weight between 0.5 and 0.85 for most use cases. A weight of 1.0 forces the model to follow edges very strictly, which can produce images that look traced rather than naturally generated. At 0.6–0.7, the model follows the major structural elements while having enough freedom to add natural variation in textures and fine details.

Use the guidance end parameter to control when ControlNet stops influencing generation. Setting guidance end to 0.7 means ControlNet is active for the first 70% of denoising steps (which establish composition and major structures) but inactive for the final 30% (which refine textures and fine details). This produces more natural-looking results than leaving ControlNet active for the entire generation.
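
In step terms, guidance start and end simply gate which denoising steps receive ControlNet's injected features. A sketch of that bookkeeping (the function and parameter names are illustrative, not any particular UI's API):

```python
def controlnet_active_steps(num_steps, guidance_start=0.0, guidance_end=1.0):
    """Return the denoising step indices during which ControlNet is applied.
    guidance_start and guidance_end are fractions of the total schedule."""
    first = round(num_steps * guidance_start)
    last = round(num_steps * guidance_end)   # exclusive
    return list(range(first, last))

# 30 steps with guidance_end=0.7: ControlNet shapes the first 21 steps
# (composition and structure), then the model refines textures freely
# for the remaining 9 steps.
steps = controlnet_active_steps(30, guidance_end=0.7)
print(len(steps), steps[0], steps[-1])  # 21 0 20
```

Guidance start works the same way from the other side: a start of 0.2 leaves the earliest composition-forming steps unconstrained before ControlNet kicks in.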

Pre-process your reference images before extracting edges. Remove backgrounds if they are not relevant, crop to your desired composition, and ensure the reference is clean and high-contrast. Garbage in, garbage out applies doubly to edge maps — a noisy reference produces a noisy edge map that confuses the model.

Depth Maps: Controlling Spatial Relationships

Depth ControlNet uses a depth estimation model (typically MiDaS or Zoe Depth) to create a grayscale map where brightness represents distance from the camera. White areas are close, black areas are far, and gray gradients represent the spatial relationship between objects. This gives ControlNet information about the three-dimensional structure of the scene without specifying exact edges or textures.
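
The brightness convention is worth internalizing, because hand-edited depth maps must follow it. A minimal sketch of turning per-pixel camera distances into ControlNet-style grayscale values (white = near, black = far; a linear mapping for illustration — real estimators produce relative, not metric, values unless you use a metric model like Zoe):

```python
def depth_to_grayscale(distances):
    """Map camera distances to 0-255 grayscale: the nearest point becomes
    255 (white), the farthest 0 (black), linear in between."""
    near, far = min(distances), max(distances)
    if near == far:
        return [255 for _ in distances]
    return [round((far - d) * 255 / (far - near)) for d in distances]

# Three objects at 1 m, 4 m, and 10 m from the camera:
print(depth_to_grayscale([1.0, 4.0, 10.0]))  # [255, 170, 0]
```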

Depth maps are particularly powerful for several use cases that other preprocessors handle poorly:

  - Architectural and interior scenes, where camera perspective and spatial proportions must be preserved without locking in every surface edge.
  - Layered landscapes, where foreground, midground, and background need to sit on distinct depth planes.
  - Product and object shots, where placement and camera distance matter more than exact outlines.

Choosing a Depth Estimator

| Estimator | Strengths | Best For |
| --- | --- | --- |
| MiDaS (DPT-Large) | Good general accuracy, fast inference | General-purpose scenes, landscapes |
| Zoe Depth | Superior metric accuracy, fine detail | Indoor scenes, architecture, product shots |
| Depth Anything v2 | Best zero-shot generalization, robust | Diverse subjects, mixed scenes, edge cases |
| Marigold | Highest detail preservation, sharp edges | Complex scenes with fine depth boundaries |

For most users, Depth Anything v2 provides the best balance of accuracy and robustness. It handles edge cases (reflective surfaces, transparent objects, unusual perspectives) better than older alternatives and produces clean, smooth depth maps with well-defined boundaries.

Depth Map Weight and Parameters

Depth ControlNet is generally more forgiving with weight settings than Canny. Because depth maps encode spatial relationships rather than exact pixel boundaries, the model has more room to interpret the conditioning naturally. Start with a weight of 0.6–0.8 and adjust based on how strictly you need spatial relationships preserved. For architectural interiors where perspective must be exact, use 0.8–1.0. For landscapes where you want depth layers but creative freedom within each layer, use 0.4–0.6.

OpenPose: Controlling Human Body Positioning

OpenPose detects human body keypoints — joints, facial landmarks, and hand positions — and produces a skeleton overlay that ControlNet uses to position figures in the generated image. This solves one of the most frustrating limitations of text-to-image generation: getting people to stand, sit, gesture, and interact in specific ways.

The OpenPose preprocessor detects 18 body keypoints (nose, neck, shoulders, elbows, wrists, hips, knees, ankles), 21 hand keypoints per hand (when the hand model is enabled), and 70 facial keypoints (when the face model is enabled). Each keypoint is connected by colored lines to form an intuitive skeleton representation.
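
The 18-point body skeleton is a fixed, ordered format, which is why hand-built skeletons from external editors are interchangeable with detected ones. A sketch of the body layout in the COCO-style ordering OpenPose uses (treat the exact ordering and limb connectivity below as illustrative — verify against your tool's format):

```python
# The 18 body keypoints in OpenPose's COCO-style ordering.
BODY_KEYPOINTS = [
    "nose", "neck",
    "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_hip", "right_knee", "right_ankle",
    "left_hip", "left_knee", "left_ankle",
    "right_eye", "left_eye", "right_ear", "left_ear",
]

# Limbs drawn between keypoints to form the colored skeleton
# (index pairs into BODY_KEYPOINTS).
LIMBS = [
    (1, 2), (2, 3), (3, 4),       # right arm
    (1, 5), (5, 6), (6, 7),       # left arm
    (1, 8), (8, 9), (9, 10),      # right leg
    (1, 11), (11, 12), (12, 13),  # left leg
    (0, 1),                       # head to neck
    (0, 14), (14, 16),            # right eye and ear
    (0, 15), (15, 17),            # left eye and ear
]

def limb_names(limbs=LIMBS, names=BODY_KEYPOINTS):
    """Resolve limb index pairs to readable joint-name pairs."""
    return [(names[a], names[b]) for a, b in limbs]

print(len(BODY_KEYPOINTS))  # 18
print(limb_names()[0])      # ('neck', 'right_shoulder')
```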

OpenPose Variants

Most implementations ship several preprocessor variants that trade detail for speed. Names vary slightly between tools, but the common options are:

  - openpose: body keypoints only — the fastest variant, sufficient for most full-body compositions.
  - openpose_hand: body plus hand keypoints, for controlling gestures and finger positions.
  - openpose_face: body plus facial keypoints, for controlling head orientation and expression.
  - openpose_full: body, hands, and face together — the most detailed skeleton, at the cost of slower preprocessing.

Creating Custom Poses

You do not need a reference photograph to use OpenPose ControlNet. Several tools allow you to create pose skeletons manually by dragging keypoints into the desired positions. In ComfyUI, the OpenPose Editor extension provides a 3D mannequin interface where you can position a figure freely. Third-party tools like Posemy.art and Magic Poser offer web-based pose creation with exportable OpenPose-format skeletons. This allows you to specify exact poses that may not exist in any reference photograph — unusual action poses, dance positions, or interaction between multiple figures.

When creating custom poses, pay attention to anatomical plausibility. The diffusion model will attempt to render whatever skeleton you provide, but implausible joint angles (elbows bending backward, legs at impossible angles) produce distorted, uncanny results. Use your body or a reference to verify that the pose is physically achievable.

Scribble and Lineart: From Rough Sketches to Finished Art

Scribble and lineart ControlNets are designed for a fundamentally different workflow than Canny or depth: they take rough, imprecise inputs and turn them into polished images. This makes them ideal for artists who want to use AI as an assisted rendering tool — sketch the composition quickly by hand, then let the diffusion model handle rendering, texturing, and lighting.

Scribble Mode

Scribble ControlNet accepts very rough inputs — literally scribbled lines that approximate shapes and compositions. A few curved lines suggesting a face, some boxes for buildings, rough circles for trees. The model interprets these scribbles as compositional guides and fills in the detail. This is the most forgiving ControlNet mode and the lowest barrier to entry for compositional control.

For scribble mode, use a lower ControlNet weight (0.4–0.6) to give the model maximum interpretive freedom. Your scribble is a suggestion, not a blueprint. Higher weights force the model to follow your rough lines more closely, which often produces awkward results because scribbled lines are imprecise by nature.

Lineart Mode

Lineart ControlNet expects cleaner input — traced outlines, digital line drawings, or extracted lineart from existing images. The lineart preprocessor can extract clean outlines from photographs, removing shading and texture to produce a pure line drawing that the model can then re-render in any style. This is particularly powerful for style transfer: extract the lineart from a photograph, then generate with prompts for "watercolor illustration" or "oil painting" to produce an artistic rendering that preserves the exact composition of the original.

Practical Workflow: Sketch to Final Image

  1. Sketch your composition roughly on paper or in any drawing application. Do not worry about quality — focus on placement, proportions, and the overall compositional idea.
  2. Photograph or export your sketch. Ensure good contrast (dark lines on a light background works best).
  3. Load the sketch as a scribble or lineart ControlNet input. If your sketch is very rough, use scribble mode. If it has clean lines, use lineart mode.
  4. Write a detailed text prompt describing the desired output: subject details, lighting, style, medium, and quality.
  5. Set ControlNet weight to 0.5–0.7 and generate. Review the output and adjust weight up if composition is not followed closely enough, or down if the image looks constrained.
  6. Iterate on the prompt while keeping the same sketch. Change styles, lighting, and details while maintaining your intended composition.

Advanced ControlNet Techniques

Multi-ControlNet Stacking

One of ControlNet's most powerful capabilities is stacking multiple control types simultaneously. You can combine OpenPose for body positioning with depth for spatial layout, Canny for structural detail, and a color reference for palette control. Each ControlNet operates through its own set of injected features, and their influences combine additively in the model's decoder.

When stacking, reduce the weight of each individual ControlNet to prevent over-constraining the generation. A single ControlNet at weight 0.7 works well alone, but two ControlNets each at 0.7 effectively doubles the conditioning strength, which can produce rigid, artifact-heavy outputs. A good starting point for dual ControlNet setups is 0.4–0.5 per unit. For triple stacks, reduce to 0.3–0.4 each.
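
The weight-reduction rule of thumb can be captured as a tiny helper. The numbers come straight from the guidance above; the function itself is an illustrative heuristic, not a standard API:

```python
def stacked_weights(num_controlnets, solo_weight=0.7):
    """Suggest a per-unit weight for stacked ControlNets. Roughly matches
    the guidance: ~0.7 solo, 0.4-0.5 per unit for two, 0.3-0.4 for three."""
    if num_controlnets <= 1:
        return [solo_weight]
    # Divide the solo conditioning budget across units, plus a small bonus
    # because each control usually constrains a different aspect of the image.
    per_unit = round(solo_weight / num_controlnets + 0.1, 2)
    return [per_unit] * num_controlnets

print(stacked_weights(1))  # [0.7]
print(stacked_weights(2))  # [0.45, 0.45]
print(stacked_weights(3))  # [0.33, 0.33, 0.33]
```

Treat the output as a starting point and adjust per unit: a pose skeleton usually tolerates more weight than a dense edge map in the same stack.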

ControlNet with Inpainting

ControlNet and inpainting combine powerfully. Mask a region of an existing image, provide a ControlNet reference for that region (a pose skeleton for a new figure, an edge map for an architectural element), and generate only within the masked area with both inpainting and ControlNet conditioning. This enables precise, structurally controlled edits to existing images — replacing a figure's pose without changing anything else, adding architecturally consistent elements to a scene, or inserting objects with proper depth integration.

Preprocessor Resolution

The resolution at which the preprocessor analyzes your reference image affects the detail captured in the control map. Higher preprocessor resolution captures finer detail but increases processing time and memory usage. For Canny edges, a preprocessor resolution of 512–768 is usually sufficient. For depth maps, match the preprocessor resolution to your output resolution for maximum spatial accuracy. For OpenPose, lower resolutions (256–512) work fine because the skeleton is sparse by nature.

Control Mode: Balanced, Prompt, and Control

Many implementations offer three control modes that change how the text prompt and ControlNet reference interact:

  - Balanced: the prompt and the control image are weighted equally — the default, and appropriate for most generations.
  - My prompt is more important: ControlNet's influence is scaled down so the prompt takes precedence; use this when the control map should only loosely guide composition.
  - ControlNet is more important: the control image takes precedence over the prompt; use this when structural fidelity matters more than prompt details.

ControlNet for FLUX vs SDXL

ControlNet implementations differ between FLUX and SDXL due to their fundamentally different architectures. SDXL uses a U-Net architecture with residual connections, and ControlNet was originally designed for this structure. FLUX uses a Diffusion Transformer (DiT) architecture, which required adapting the ControlNet approach.

| Feature | SDXL ControlNet | FLUX ControlNet |
| --- | --- | --- |
| Architecture | U-Net encoder copy | DiT block adaptation |
| Available models | Extensive (Canny, depth, pose, scribble, segmentation, tile, IP-Adapter, etc.) | Growing (Canny, depth, pose, with more being trained) |
| Multi-ControlNet | Well-supported, stable | Supported but fewer tested combinations |
| Weight sensitivity | Moderate | Higher (reduce weights by ~20% compared to SDXL) |
| Output quality ceiling | Excellent | Superior (benefits from FLUX's base quality) |
| Community models | Thousands available | Hundreds and growing rapidly |

For FLUX ControlNet, start with lower weights (0.4–0.6) than you would use for SDXL. FLUX's stronger text understanding means the prompt already provides substantial compositional guidance, so ControlNet conditioning needs less weight to be effective. Overly strong ControlNet weights with FLUX can produce images that feel rigid or over-processed.

Segmentation Maps: Object-Level Control

Segmentation ControlNet uses semantic segmentation maps — images where each region is colored according to its object class (sky, building, road, person, vegetation, etc.) — to control what types of objects appear where in the generated image. Unlike edge maps that control boundaries or depth maps that control spatial relationships, segmentation maps control object identity and placement at a semantic level.

This is particularly valuable for scene design. You can paint a segmentation map with specific regions for sky, buildings, roads, trees, and water, and the model will generate a scene matching that layout. Change the text prompt from "photorealistic cityscape" to "watercolor painting of a medieval town" and the same segmentation map produces a completely different aesthetic while maintaining the same scene composition.

Segmentation maps can be created manually using any painting tool with a color palette matching the ADE20K class definitions (150 object classes with assigned colors). This gives you pixel-level control over scene composition without needing any reference photograph — you can design entirely imaginary scenes by painting regions in the appropriate class colors.
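
Painting a segmentation map programmatically is just filling regions with class colors. A toy sketch — note the RGB values below are placeholders, not the real ADE20K palette, so substitute the actual class colors your segmentation ControlNet was trained on:

```python
# Placeholder class colors -- substitute the real ADE20K palette values.
PALETTE = {
    "sky":      (70, 130, 180),
    "building": (128, 64, 0),
    "road":     (90, 90, 90),
    "tree":     (30, 140, 50),
}

def blank_map(width, height, color=(0, 0, 0)):
    """A height x width grid of RGB triples."""
    return [[color] * width for _ in range(height)]

def paint_region(seg_map, x0, y0, x1, y1, class_name):
    """Fill the rectangle [x0, x1) x [y0, y1) with the class color."""
    color = PALETTE[class_name]
    for y in range(y0, y1):
        for x in range(x0, x1):
            seg_map[y][x] = color
    return seg_map

# Design a simple scene: sky across the top, a building on the left,
# a road along the bottom edge.
seg = blank_map(8, 6)
paint_region(seg, 0, 0, 8, 2, "sky")
paint_region(seg, 0, 2, 3, 5, "building")
paint_region(seg, 0, 5, 8, 6, "road")
print(seg[0][0])  # (70, 130, 180) -- sky
print(seg[3][1])  # (128, 64, 0)   -- building
```

Scale the same idea up to your output resolution and save the grid as a PNG, and you have a hand-designed scene layout with no reference photograph involved.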

Tile ControlNet: Upscaling and Detail Enhancement

Tile ControlNet takes a different approach from other ControlNet types. Instead of providing structural conditioning for new generations, it conditions the model on the existing content of an image at a tile level, enabling high-quality upscaling and detail enhancement. The input image is divided into tiles, each tile provides local conditioning, and the model generates enhanced detail within each tile while maintaining global coherence.

This is one of the most practical ControlNet applications for production work. Take a 512×512 generation, feed it back through the model with Tile ControlNet at 2x or 4x resolution, and the model adds genuine detail — not the blurred interpolation of traditional upscaling, but new, coherent detail that matches the existing content. Combined with a text prompt that describes the desired level of detail ("highly detailed, sharp focus, fine textures"), Tile ControlNet produces upscaled images that rival native high-resolution generation. For more on upscaling techniques, see our AI upscaling comparison.
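
The tiling itself is simple bookkeeping. A sketch of computing overlapping tile coordinates for an upscale pass (tile size and overlap values are illustrative; overlap exists so adjacent tiles blend without visible seams):

```python
def tile_coords(size, tile, overlap):
    """1-D tile start positions covering `size` pixels with `tile`-sized
    windows that overlap by `overlap` pixels."""
    if tile >= size:
        return [0]
    step = tile - overlap
    starts = list(range(0, size - tile + 1, step))
    if starts[-1] + tile < size:     # ensure the far edge is covered
        starts.append(size - tile)
    return starts

def tiles(width, height, tile=512, overlap=64):
    """(x, y) top-left corners of all tiles covering a width x height image."""
    return [(x, y)
            for y in tile_coords(height, tile, overlap)
            for x in tile_coords(width, tile, overlap)]

# A 512x512 generation upscaled 2x -> 1024x1024, processed in 512px tiles:
grid = tiles(1024, 1024, tile=512, overlap=64)
print(len(grid))  # 9 tiles (a 3 x 3 grid with overlap)
```

Each tile then goes through the model with Tile ControlNet conditioning, and the overlapping borders are blended back into the full-resolution canvas.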

Generate with ControlNet on ZSky AI

Use ControlNet with FLUX and SDXL on dedicated RTX 5090 GPUs. Upload your reference images and generate with precise compositional control.

Try ZSky AI Free →

Frequently Asked Questions

What is ControlNet and how does it work?

ControlNet is a neural network architecture that adds spatial conditioning to diffusion models like Stable Diffusion and FLUX. It creates a trainable copy of the model's encoder blocks and injects spatial control signals — such as edge maps, depth maps, or pose skeletons — into the denoising process through zero-convolution layers. This allows you to control the exact composition, pose, and structure of generated images while the diffusion model handles textures, colors, and style.

Which ControlNet preprocessor should I use?

The best preprocessor depends on what you want to control. Use Canny edge detection for preserving hard edges and architectural detail. Use depth maps (MiDaS, Zoe, or Depth Anything) for maintaining spatial relationships and 3D structure. Use OpenPose for controlling human body poses. Use scribble or lineart for turning rough sketches into finished images. Use segmentation maps for controlling which objects appear where in the scene.

Can I use multiple ControlNets at the same time?

Yes, stacking multiple ControlNet models simultaneously is one of the most powerful techniques available. Common combinations include OpenPose for body position plus depth for spatial layout, or Canny edges for structure plus a color reference for palette. When using multiple ControlNets, reduce the weight of each unit (typically 0.4–0.5 per unit for a dual stack) to prevent over-constraining the generation.

What is the difference between ControlNet weight and guidance start/end?

Weight (0.0–2.0) controls how strongly the spatial condition influences the output — higher values follow the control image more strictly. Guidance start and end (0.0–1.0) control when during the denoising process ControlNet is active. Setting guidance end to 0.7 means ControlNet influences only the first 70% of steps, allowing the model more creative freedom for fine details in later steps.

Does ControlNet work with FLUX?

Yes. FLUX-specific ControlNet models have been developed that support Canny edges, depth maps, pose conditioning, and other types. Because FLUX uses a DiT architecture rather than a U-Net, you need FLUX-specific ControlNet models — SDXL ControlNets are not compatible. FLUX ControlNets generally require lower weights (0.4–0.6) than their SDXL equivalents due to FLUX's stronger base conditioning.

How do I get the best results with ControlNet Canny edge detection?

Adjust the low and high thresholds to capture the right level of detail for your use case. Medium thresholds (80–120 low, 150–200 high) work best for most subjects. Set ControlNet weight between 0.5 and 0.8 for a balance between structural fidelity and creative freedom. Use guidance end around 0.7–0.8 to let the model refine fine details independently. Always clean your reference image before extracting edges to avoid noise in the control map.