Image-to-Image Guide: Transform Photos with AI
Text-to-image generates from noise. Image-to-image generates from your image. That distinction changes everything about what is possible. Instead of describing what you want from scratch, you provide a starting point — a photograph, a sketch, a screenshot, a painting — and the diffusion model transforms it according to your text prompt while preserving as much or as little of the original as you choose.
This is not a filter. Filters apply fixed mathematical transformations to pixel values. Image-to-image (img2img) runs your image through a diffusion model that understands what it is looking at — it recognizes faces, objects, spatial relationships, lighting, and composition — and regenerates the image with those understood elements restyled, enhanced, or transformed according to your instructions. The results are fundamentally different from anything achievable with traditional image processing.
This guide covers the mechanics of img2img, the critical denoising strength parameter, practical workflows for style transfer, sketch-to-image conversion, photo enhancement, and advanced techniques for getting exactly the transformation you want. These techniques work across FLUX, SDXL, and most modern diffusion models through tools like ZSky AI, ComfyUI, and Automatic1111.
How Image-to-Image Generation Works
Understanding the mechanics helps you predict and control results. In standard text-to-image, the model starts with pure random noise and progressively removes noise over many steps, guided by the text prompt, until a coherent image emerges. In img2img, the starting point is not random noise — it is your input image with a controlled amount of noise added to it.
The process works as follows:
- Your input image is encoded into the model's latent space using the VAE encoder, producing a latent representation of the image.
- Gaussian noise is added to this latent, proportional to the denoising strength parameter. A denoising strength of 0.5 means enough noise is added to correspond to 50% of the total diffusion process. A strength of 1.0 adds enough noise to completely obscure the original image.
- The denoising process begins not from the first step but from the step corresponding to the noise level. At strength 0.5 with 20 total steps, denoising starts at step 10. The model runs the last 10 steps of its normal denoising process.
- During each denoising step, the model predicts and removes noise guided by the text prompt, exactly as in text-to-image. But because the starting point contains information from your input image (not just random noise), the output retains structural elements of the original.
- The denoised latent is decoded back to pixel space by the VAE decoder, producing the final transformed image.
This mechanism explains why denoising strength is so powerful: at low values, very little noise is added and very few denoising steps run, so the output closely resembles the input with minor modifications. At high values, heavy noise almost completely obscures the original, and many denoising steps run, allowing the model to generate substantially new content using the original as only a vague structural guide.
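The strength-to-steps arithmetic can be sketched in a few lines. This is a hypothetical helper (`img2img_schedule` is not a library function), shown only to make the relationship between strength, noise level, and steps concrete; real pipelines derive the same quantities internally:

```python
def img2img_schedule(strength: float, total_steps: int) -> dict:
    """Map denoising strength to the effective img2img schedule.

    Strength controls both how much noise is added to the input latent
    and where in the schedule denoising begins: only the final
    `strength * total_steps` steps are actually executed.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0.0, 1.0]")
    steps_run = int(round(strength * total_steps))  # denoising steps executed
    start_step = total_steps - steps_run            # schedule position to start from
    return {"start_step": start_step, "steps_run": steps_run}

print(img2img_schedule(0.5, 20))   # {'start_step': 10, 'steps_run': 10}
print(img2img_schedule(0.25, 20))  # {'start_step': 15, 'steps_run': 5}
print(img2img_schedule(1.0, 20))   # {'start_step': 0, 'steps_run': 20}
```

At strength 1.0 the full schedule runs, which is why that setting behaves almost like text-to-image.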
Mastering Denoising Strength
Denoising strength is the single most important parameter in img2img. It is a continuous scale from 0.0 (no change) to 1.0 (essentially text-to-image with a vague compositional bias), and understanding where to set it for different tasks is the key skill.
| Range | Transformation Level | Use Cases |
|---|---|---|
| 0.1–0.3 | Minimal — subtle refinement | Noise reduction, slight color correction, texture enhancement, minor quality improvement |
| 0.3–0.5 | Moderate — recognizable changes | Gentle style transfer, lighting adjustment, color palette shift, detail enhancement |
| 0.5–0.7 | Significant — clear transformation | Full style transfer (photo to painting), environment changes, substantial aesthetic transformation |
| 0.7–0.85 | Major — loose reference | Dramatic reimagining, sketch-to-finished-art, concept exploration from rough references |
| 0.85–1.0 | Near-complete — structural echo only | Using input as compositional inspiration only, generating "variations" of a concept |
Finding the Sweet Spot
Start at 0.5 and generate. If the result is too similar to the input, increase by 0.1. If the result has lost too much of the original's structure, decrease by 0.1. After 2–3 adjustments, you will find the exact strength that gives the right balance of transformation and preservation for your specific input and prompt combination.
The sweet spot changes based on the input image. Photographs with strong, clear compositions tolerate higher denoising before losing structure (their structure is so strong that it persists through more noise). Sketches and abstract inputs need lower denoising to preserve their compositional intent. Complex scenes with many small elements need lower denoising because fine details are the first things destroyed by noise.
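The adjust-by-0.1 procedure above can be written as a tiny search loop. `tune_strength` and its `judge` callback are illustrative stand-ins for you eyeballing each generation, not a real API:

```python
def tune_strength(judge, start=0.5, step=0.1, max_tries=4):
    """Dial in denoising strength by the rule of thumb above.

    `judge(strength)` stands in for inspecting a generation at that
    strength; it returns "too_similar", "too_different", or "good".
    """
    strength = start
    for _ in range(max_tries):
        verdict = judge(strength)
        if verdict == "good":
            break
        strength += step if verdict == "too_similar" else -step
        strength = min(1.0, max(0.0, strength))  # clamp to valid range
    return round(strength, 2)

# Example: suppose anything below 0.6 looks too similar for this input.
print(tune_strength(lambda s: "good" if s >= 0.6 else "too_similar"))  # 0.6
```

In practice you are the judge; the point is that two or three bounded adjustments converge on a working value.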
Style Transfer: Photo to Art
Style transfer is img2img's most popular application: take a photograph and transform it into an oil painting, watercolor, anime illustration, pixel art, or any other visual style. The photo provides composition and content; the prompt provides the target style.
Effective Style Transfer Prompts
The style prompt should be specific about the target medium and aesthetic:
# Instead of this:
"painting of a landscape"
# Write this:
"oil painting on canvas, thick impasto brushstrokes,
warm color palette, impressionist style, soft edges,
visible paint texture, gallery lighting"
Describe the target medium (oil painting, watercolor, charcoal drawing), the technique (impasto, wet-on-wet, cross-hatching), the aesthetic movement (impressionist, art nouveau, expressionist), and quality markers (museum quality, gallery exhibition, masterwork). The more specific your style description, the more convincingly the model transforms the image.
Style Transfer Settings by Target Style
| Target Style | Denoising | CFG Scale | Key Prompt Terms |
|---|---|---|---|
| Oil painting | 0.55–0.7 | 7–9 | oil on canvas, brushstrokes, impasto, gallery lighting |
| Watercolor | 0.5–0.65 | 6–8 | watercolor wash, transparent layers, wet edges, paper texture |
| Anime/manga | 0.6–0.75 | 7–10 | anime style, cel shaded, clean lines, vibrant colors |
| Pencil sketch | 0.5–0.65 | 6–8 | graphite pencil drawing, cross-hatching, white paper |
| Cyberpunk | 0.55–0.7 | 8–10 | neon lighting, rain-slicked surfaces, holographic, dystopian |
| Vintage photograph | 0.3–0.5 | 5–7 | faded film, grain, 1970s Polaroid, warm cast, soft focus |
Using LoRAs for Style Transfer
For the most convincing style transfers, combine img2img with a style LoRA. A LoRA trained on a specific artist's work or a particular visual style will produce more authentic results than prompt engineering alone. Load the style LoRA at weight 0.6–0.8, write a style-matching prompt, and set denoising to 0.5–0.7. The LoRA handles the stylistic nuance that words cannot fully capture.
Sketch-to-Image: From Rough to Refined
Img2img transforms rough sketches into finished, rendered images. This workflow is popular among concept artists, game designers, and illustrators who want to iterate quickly on compositions before committing to full rendering.
Preparing Your Sketch
The sketch does not need to be polished. Even rough gesture drawings and basic shape compositions work as img2img inputs. However, a few preparation steps improve results:
- High contrast: Dark lines on a white or light background. The model reads structure from contrast, and faint gray lines may not register as structural elements.
- Clean background: Erase smudges and accidental marks. The model treats everything in the image as intentional content.
- Correct proportions: The model will follow the proportions in your sketch. If the head is too large relative to the body, the output will have the same issue. Get proportions roughly right even if detail is minimal.
- Add basic shading if possible: Even rough value indications (dark areas, light areas) help the model understand depth and lighting in the scene, producing more three-dimensional results.
Sketch-to-Image Settings
Use denoising strength 0.7–0.9 for sketches. The model needs significant freedom to transform rough lines into rendered content. Lower values keep too much of the sketch's rough quality. The prompt should describe the finished result in detail — surface materials, lighting, atmosphere, and style — because the sketch provides only structure.
For more precise control over which elements of the sketch are followed, consider using ControlNet in scribble or lineart mode instead of standard img2img. ControlNet provides structural conditioning without the noise-addition mechanism of img2img, giving you more independent control over structure and style.
Photo Enhancement and Restoration
Img2img can enhance photographs in ways that go beyond traditional editing tools. Because the model understands the content of the image, it can add genuine detail, correct lighting, and improve composition — not just sharpen pixels.
Quality Enhancement
For quality enhancement without changing the image content, use low denoising strength (0.2–0.35) with a quality-focused prompt:
professional photograph, high resolution, sharp focus,
clean detail, natural lighting, DSLR quality,
well-exposed, accurate colors
The model refines textures, reduces noise, and enhances detail while keeping the image recognizably the same. This is particularly effective for improving smartphone photos, compressed images, and older digital photographs. The results exceed traditional sharpening because the model synthesizes plausible new detail rather than amplifying existing pixels; keep in mind, though, that this detail is generated, not recovered, so it may not match what was actually in the scene.
Lighting Correction
Describe the desired lighting in your prompt while using moderate denoising (0.3–0.5): "natural golden hour lighting, warm tones, soft shadows" transforms a harshly lit photo into a warmly lit one. "Studio three-point lighting, professional portrait" transforms casual portrait lighting into studio-quality lighting. The model re-renders the lighting of the scene while preserving subject identity and composition.
Background Enhancement
For images where the subject is good but the background is distracting, use a moderate denoising (0.4–0.6) with a prompt that describes the desired background: "professional portrait, clean blurred background, shallow depth of field, bokeh." The model typically preserves the subject while replacing or enhancing the background, though for precise control, consider inpainting the background specifically.
Advanced Img2Img Techniques
Progressive Refinement
Instead of attempting one perfect transformation, apply img2img iteratively with low denoising at each step. Start with the original, apply img2img at 0.3 denoising with your style prompt. Take the output, feed it back as the input, and apply img2img again at 0.3. Each iteration nudges the image closer to the target style without the jarring artifacts that can occur with a single high-denoising pass. Three to five iterations of gentle transformation often produce more coherent results than one aggressive transformation.
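This feedback loop is simple to automate. `img2img` here is a placeholder callback for whatever backend you use (diffusers, a ComfyUI API call, etc.); the string stub below only demonstrates the data flow of feeding each output back in as the next input:

```python
def progressive_refine(image, prompt, img2img, strength=0.3, passes=4):
    """Apply several gentle img2img passes instead of one aggressive one.

    `img2img(image, prompt, strength)` is a stand-in for a real backend;
    each pass uses the previous pass's output as its input.
    """
    for _ in range(passes):
        image = img2img(image, prompt, strength=strength)
    return image

# Stub backend for illustration: tags each pass onto the "image".
result = progressive_refine("photo", "oil painting",
                            lambda img, p, strength: img + "+pass")
print(result)  # photo+pass+pass+pass+pass
```

With a real backend, each pass at strength 0.3 moves the image a modest step toward the prompt while staying anchored to the previous result.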
Multi-Model Pipeline
Different models have different strengths. A powerful workflow for maximum quality:
- Generate the base image with FLUX (best prompt adherence and composition)
- Feed the FLUX output into SDXL via img2img for stylistic treatment (some styles are better represented in SDXL's training data)
- Upscale the result with Real-ESRGAN
- Refine the upscaled image with a final img2img pass at low denoising (0.2–0.3) to add fine detail
This pipeline extracts the best qualities from each model: FLUX for composition, SDXL for style, Real-ESRGAN for resolution, and a final pass for detail refinement.
Seed Control for Variation Exploration
Keep the same input image and prompt but change the seed to explore variations. Each seed produces a different interpretation of the transformation. Generate 8–16 variations, select favorites, and then fine-tune those with additional img2img passes or inpainting. This is faster than tweaking the prompt word by word because the visual differences between seeds are immediate and dramatic.
CFG Scale Interaction with Img2Img
CFG (Classifier-Free Guidance) scale interacts differently with img2img than with text-to-image. In text-to-image, higher CFG pushes the model toward the prompt more aggressively. In img2img, high CFG combined with low denoising can produce over-saturated or artifact-heavy results because the model is trying to push the slightly-noised image strongly toward the prompt with very few steps.
For img2img, use lower CFG than you would for text-to-image. If you normally generate at CFG 7–8, try 5–7 for img2img. At very low denoising (0.2–0.3), reducing CFG to 3–5 often produces the most natural results. The input image already provides strong structural guidance, so less prompt pressure is needed.
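These rules of thumb can be captured in a small heuristic. The thresholds below are assumptions drawn from the ranges in this section, not values from any library:

```python
def suggest_cfg(denoise: float, base_cfg: float = 7.5) -> float:
    """Heuristic: lower CFG for img2img, lowest at low denoising strength.

    At low denoise the input image already provides strong guidance,
    so less prompt pressure is needed; at high denoise behaviour
    approaches text-to-image and the usual CFG applies.
    """
    if denoise <= 0.3:
        return max(3.0, base_cfg - 3.5)  # subtle passes: CFG ~3-5
    if denoise <= 0.6:
        return base_cfg - 1.5            # moderate transforms: CFG ~5-7
    return base_cfg                      # near txt2img: normal CFG

print(suggest_cfg(0.25))  # 4.0
print(suggest_cfg(0.5))   # 6.0
print(suggest_cfg(0.8))   # 7.5
```

Treat the output as a starting point to adjust from, not a fixed rule; different models and samplers shift the comfortable CFG range.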
Img2Img vs ControlNet: When to Use Which
Img2img and ControlNet both use reference images, but they work fundamentally differently and excel at different tasks:
| Criterion | Img2Img | ControlNet |
|---|---|---|
| How reference is used | Noise added to reference, then denoised | Structural info extracted from reference, used as conditioning |
| Content preservation | Preserves actual colors, textures, content from the original | Preserves only the specific structure type (edges, depth, pose) |
| Best for style transfer | Yes — directly transforms content | Only if combined with IP-Adapter for style reference |
| Best for structural control | Moderate — structure degrades with high denoising | Excellent — structure is maintained independently of content |
| Best for photo enhancement | Yes — preserves and enhances photo content | Only via Tile ControlNet |
| Best for sketch-to-image | Good with high denoising | Better — ControlNet scribble/lineart designed for this |
| Combining with prompts | Prompt and image compete (denoising balances them) | Prompt and control are complementary (independent signals) |
Use img2img when you want to transform existing visual content — style transfer, photo enhancement, quality improvement, and iterative refinement. Use ControlNet when you want structural guidance from a reference while generating entirely new visual content — pose-guided generation, composition control, and edge-guided rendering. For maximum control, combine both: use ControlNet for structural guidance and img2img's noise-addition for content-level transformation.
Batch Processing and Workflow Automation
For professional workflows that require transforming multiple images consistently — converting a product photo set to illustration style, enhancing a batch of event photographs, or generating variations for A/B testing — batch processing is essential.
In ComfyUI, batch img2img is built into the workflow system. Load a directory of images, apply the same prompt and settings to each, and output to a results directory. The consistency of the transformation depends on using identical settings for every image in the batch.
For the most consistent batch results, lock the seed across all images in the batch. While this means every image gets the same random influence, it removes one source of variation, making the transformation more uniform across the set. If you need variation, use sequential seeds (seed, seed+1, seed+2) rather than random seeds to maintain some consistency.
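A minimal sketch of the two seeding strategies, fixed seed for maximum uniformity and sequential seeds for controlled variation (`batch_seeds` is an illustrative helper, not a ComfyUI API):

```python
def batch_seeds(base_seed: int, n: int, sequential: bool = False):
    """Seeds for a batch of n images.

    sequential=False locks every image to the same seed (most uniform);
    sequential=True uses base_seed, base_seed+1, ... for controlled variation.
    """
    if sequential:
        return [base_seed + i for i in range(n)]
    return [base_seed] * n

print(batch_seeds(1234, 4))              # [1234, 1234, 1234, 1234]
print(batch_seeds(1234, 4, sequential=True))  # [1234, 1235, 1236, 1237]
```

The seed list would then be zipped with the input images, with all other settings held identical across the batch.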
Transform Your Images with AI
Upload any photo and transform it with img2img on ZSky AI's dedicated RTX 5090 GPUs. Style transfer, enhancement, and creative transformation in seconds.
Try ZSky AI Free →
Frequently Asked Questions
What is image-to-image (img2img) in AI generation?
Img2img is an AI generation mode where you provide an existing image as a starting point along with a text prompt. The diffusion model adds noise to the input image (controlled by denoising strength), then denoises it guided by your prompt. The result inherits the composition and structure of the input while being transformed according to the prompt. Low denoising preserves the original closely; high denoising produces more dramatic transformations.
What denoising strength should I use for img2img?
Start at 0.5 and adjust. Use 0.2–0.4 for subtle refinements like quality enhancement. Use 0.4–0.6 for moderate style transfer. Use 0.6–0.8 for significant transformations like photo-to-painting. Use 0.8–1.0 when you want only a loose structural reference from the original.
How do I do style transfer with AI?
Upload your source image in img2img mode and write a detailed prompt describing the target style. Set denoising strength to 0.5–0.7. For stronger style adherence, combine with a style LoRA. Describe the specific medium, technique, and aesthetic movement for the most convincing results.
Can I convert a rough sketch into a finished image?
Yes. Upload your sketch as the img2img input, write a detailed prompt describing the finished result, and set denoising strength to 0.7–0.9. The model uses your sketch as a compositional guide while generating fully rendered content. For more precise structural control, ControlNet's scribble mode is even more effective.
What is the difference between img2img and ControlNet?
Img2img directly transforms your image by adding noise and denoising with a prompt. ControlNet extracts structural information from a reference and uses it as conditioning for generation from noise. Img2img transforms content; ControlNet provides structural guidance. Use img2img for style transfer and enhancement, ControlNet for compositional control.
How do I enhance photo quality with img2img?
Use low denoising (0.2–0.4) with a quality-focused prompt: "high resolution, sharp focus, professional lighting, clean detail." The model enhances textures and detail while preserving content. For best results, combine with upscaling (Real-ESRGAN) and a final low-denoising refinement pass.