LoRA Training Guide: Create Custom AI Models for Your Style
Every AI image generator produces images in the styles it was trained on. Want a specific person's face, a particular product, a unique art style, or a proprietary brand aesthetic? The base model does not know these concepts. You need to teach it. LoRA — Low-Rank Adaptation — is the most efficient and widely used method for doing exactly that.
LoRA was originally developed for large language models by Edward Hu and colleagues at Microsoft Research, but it was quickly adopted by the image generation community because it solves a critical problem: how do you customize a multi-gigabyte model without retraining the entire thing? A LoRA fine-tune modifies only a tiny fraction of the model's parameters by training small, low-rank matrices that are injected alongside the original weights. The result is a file that is typically 10–200 MB, trains in minutes to hours rather than days, and can be loaded and unloaded from the base model instantly.
This guide walks through the entire LoRA training process from dataset preparation to deployment. Whether you are training for FLUX or SDXL, the principles are the same — the implementation details differ, and we cover both.
Understanding LoRA: How Low-Rank Adaptation Works
To understand why LoRA is so effective, you need to understand what happens during fine-tuning. A diffusion model like SDXL has approximately 3.5 billion parameters organized in layers of weight matrices. During standard fine-tuning, you update all of these weights through gradient descent, which requires storing gradients and optimizer states for every parameter — enormous memory requirements and the risk of catastrophically forgetting the model's original capabilities.
LoRA takes a different approach. For each weight matrix W in the model, instead of updating W directly, LoRA freezes W and trains two small matrices A and B such that the modified weight becomes W + BA. The key insight is that A and B have a much lower rank than W — if W is a 4096×4096 matrix (16.7 million parameters), B might be 4096×16 and A might be 16×4096, so BA is again 4096×4096 while totaling only 131,072 trainable parameters for that layer. Across the entire model, this typically reduces trainable parameters by 99% or more while still being able to learn meaningful adaptations.
The rank parameter (commonly called rank or dim) controls how many parameters the LoRA has. Higher rank means more expressive capacity but larger file size and higher risk of overfitting. Lower rank produces smaller, more efficient LoRAs but may not capture complex concepts. For most use cases, a rank of 16–64 is sufficient. Character likeness LoRAs often work well at rank 32. Complex style LoRAs may benefit from rank 64–128.
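The parameter savings described above can be checked with quick arithmetic. A minimal sketch, using the 4096×4096 example from this section (the matrix size is illustrative):

```python
# Trainable-parameter comparison for a single 4096x4096 weight matrix,
# adapted via the low-rank factors B (d x rank) and A (rank x d).
d = 4096

def lora_params(rank: int) -> int:
    # B contributes d * rank parameters, A contributes rank * d
    return d * rank + rank * d

full = d * d  # parameters updated if this matrix were fully fine-tuned
for rank in (16, 32, 64):
    p = lora_params(rank)
    print(f"rank {rank}: {p:,} trainable params ({100 * p / full:.2f}% of full)")
```

At rank 16 this reproduces the 131,072 figure from the text, under 1% of the full matrix.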
LoRA vs Full Fine-Tuning vs Textual Inversion
| Method | File Size | Training Time | VRAM Needed | Expressiveness |
|---|---|---|---|---|
| Textual Inversion | ~10 KB | 1–4 hours | 8 GB+ | Limited (token-level only) |
| LoRA (rank 32) | 30–100 MB | 30 min–3 hours | 12 GB+ | High (modifies attention layers) |
| LoRA (rank 128) | 100–300 MB | 1–6 hours | 16 GB+ | Very high |
| Full Fine-Tune | 2–7 GB | 12–48 hours | 24 GB+ | Maximum |
Dataset Preparation: The Foundation of Quality
The quality of your LoRA is determined primarily by the quality of your training dataset. No amount of hyperparameter tuning will compensate for a poorly curated dataset. This section covers how to build a dataset that produces excellent results.
How Many Images Do You Need?
- Character/face LoRA: 15–30 images. Show the subject in varied poses (front, three-quarter, profile), lighting conditions (indoor, outdoor, studio), expressions (neutral, smiling, serious), and optionally different clothing. Variety is critical — if all images show the same angle, the LoRA will only work from that angle.
- Style LoRA: 50–200 images. The images must consistently demonstrate the target style across different subjects. If training a "watercolor landscape" style, include watercolor paintings of mountains, oceans, forests, cities, and abstract scenes. The model needs to learn the style independently of any particular subject.
- Object/product LoRA: 20–50 images. Show the object from multiple angles, in different lighting, and ideally in different contexts. Include close-ups of distinctive details and full-product shots.
- Concept LoRA: 30–100 images. Abstract concepts (a particular color grading, a lighting mood, a compositional tendency) require more images because the concept is less concrete than a face or object.
Image Quality Requirements
Every image in your dataset should meet these criteria:
- Resolution: At least 1024×1024 pixels for SDXL training, or the training resolution you intend to use. Images will be resized and cropped during training, but starting with higher resolution preserves more detail. Do not upscale low-resolution images to meet this requirement — you are training on fake detail.
- Sharpness: No motion blur, no out-of-focus images unless blur is part of the concept you are training. The model learns from every pixel, and blurry training data produces blurry outputs.
- Consistency: For character LoRAs, the subject should be clearly identifiable in every image. Remove images where the face is occluded, heavily shadowed, or at extreme angles that obscure identifying features.
- No watermarks or text overlays: The model will learn to reproduce watermarks if they appear in training data. Clean your images before training.
- Varied backgrounds: If all training images have white backgrounds, the LoRA may associate the concept with white backgrounds. Use varied backgrounds to help the model isolate the concept from the context.
Captioning Your Dataset
Every training image needs a text caption that accurately describes its contents. This is how the model learns to associate your visual concept with specific text tokens. There are two captioning approaches:
Trigger word captioning: Use a unique trigger word (e.g., sks person, ohwx style) followed by a natural description of the image. The trigger word becomes the activation token — include it in your generation prompt to activate the LoRA. Example: sks woman, portrait, studio lighting, neutral expression, dark background.
Natural language captioning: Describe the image in full natural language without a trigger word. The LoRA learns to associate the visual concept with the descriptive patterns in your captions. This approach can produce more flexible LoRAs but requires more careful captioning. Example: A portrait photograph of a woman with auburn hair and green eyes, studio lighting with soft shadows, neutral expression, wearing a white blouse, dark background.
For FLUX LoRAs, natural language captioning typically works better because FLUX's T5 encoder processes full sentences more effectively than keyword clusters. For SDXL, trigger word captioning is the established standard.
Automated Captioning Tools
Manually captioning 50–200 images is tedious. Several tools can auto-generate captions that you then review and correct:
- BLIP-2 / CogVLM: Vision-language models that produce natural language descriptions. Good starting point but may miss specific details important to your concept.
- WD14 Tagger: Produces Danbooru-style tags. Fast and detailed for anime/illustration styles but less useful for photographic subjects.
- Florence-2: Microsoft's vision model with strong captioning capabilities. Produces accurate, detailed descriptions.
- GPT-4 Vision / Claude Vision: The most accurate option for complex subjects. Upload images and ask for detailed descriptions matching your desired caption format.
Regardless of the tool, always review auto-generated captions. Incorrect captions teach the model incorrect associations. Add your trigger word to every caption if using the trigger word approach, and ensure the descriptions accurately reflect what is visible in each image.
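Enforcing the trigger-word rule by hand is error-prone across a large dataset. A minimal sketch of an automated check, assuming the trigger word and one-caption-per-`.txt`-file layout described above (the directory path and trigger word are placeholders):

```python
# Verify every caption file starts with the trigger word, and prepend it
# where missing. Returns the filenames that were fixed.
from pathlib import Path

def ensure_trigger(caption_dir: str, trigger: str = "sks person") -> list[str]:
    fixed = []
    for txt in sorted(Path(caption_dir).glob("*.txt")):
        caption = txt.read_text(encoding="utf-8").strip()
        if not caption.startswith(trigger):
            txt.write_text(f"{trigger}, {caption}\n", encoding="utf-8")
            fixed.append(txt.name)
    return fixed
```

Run it once after auto-captioning, then still review the descriptions themselves for accuracy; the script only guarantees the trigger word is present.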
Training Parameters: Getting the Settings Right
Learning Rate
The learning rate controls how aggressively the model updates weights during training. Too high and the LoRA overtrains quickly, producing distorted or oversaturated images. Too low and training takes forever or the LoRA has no visible effect.
- SDXL LoRA: Start at `1e-4` (0.0001). This is the most widely tested and reliable starting point.
- FLUX LoRA: Start at `5e-5` to `1e-4`. FLUX's larger parameter count benefits from slightly conservative learning rates.
- Text encoder learning rate: If training the text encoder alongside the U-Net/DiT, use a lower learning rate for the text encoder — typically 50% of the U-Net rate (e.g., `5e-5` if the U-Net is at `1e-4`).
Training Steps and Epochs
An epoch is one complete pass through your training dataset. The total number of training steps is: epochs × (dataset_size / batch_size). For a 30-image dataset with batch size 1 and 50 epochs, that is 1,500 steps.
General guidelines for total training steps:
- Character LoRA (15–30 images): 1,000–3,000 steps. Start checking outputs at 1,000 steps.
- Style LoRA (50–200 images): 2,000–8,000 steps. Styles require more steps to generalize properly.
- Object LoRA (20–50 images): 1,500–4,000 steps.
Overtraining is the most common mistake. An overtrained LoRA produces images that look exactly like the training data regardless of the prompt — the model has memorized rather than learned. Save checkpoints every 200–500 steps and test each one. The best checkpoint is usually well before the final one.
Rank and Alpha
The rank (dim) determines the LoRA's capacity. The alpha determines the scaling factor applied to the LoRA's contribution: `effective_weight = (alpha / rank) * lora_weight`. A common convention is to set alpha equal to rank (so the scaling factor is 1.0), but some trainers prefer alpha = rank/2 for more conservative initial influence.
| Use Case | Recommended Rank | File Size (SDXL) |
|---|---|---|
| Simple concept (trigger word for an object) | 8–16 | 10–30 MB |
| Character likeness | 16–32 | 30–80 MB |
| Art style | 32–64 | 80–150 MB |
| Complex multi-concept | 64–128 | 150–300 MB |
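The alpha/rank convention above is a common point of confusion, so a two-line sketch to make it concrete (the values are the ones discussed in this section):

```python
# Scaling factor applied to the LoRA's contribution: alpha / rank.
def lora_scale(alpha: float, rank: int) -> float:
    return alpha / rank

print(lora_scale(32, 32))  # 1.0 — alpha == rank, full-strength contribution
print(lora_scale(16, 32))  # 0.5 — alpha == rank / 2, more conservative
```

Note that this is why the same trained weights behave differently if a trainer silently defaults alpha to 1: the effective contribution is divided by the rank.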
Network Modules to Train
You can select which layers of the model the LoRA modifies. The standard target modules are the attention layers (Q, K, V, and output projection) in either the U-Net (SDXL) or the DiT (FLUX). Some trainers also include feed-forward network layers for additional expressiveness at the cost of larger file sizes.
For character LoRAs, training only the attention layers is usually sufficient. For complex style LoRAs that need to modify how the model processes textures and colors at a fundamental level, including feed-forward layers can improve results.
The Training Process: Step by Step
Setting Up Your Environment
The two most popular LoRA training tools are kohya_ss (sd-scripts) and ai-toolkit by Ostris. Both support SDXL and FLUX training with full parameter control. kohya_ss has a web UI (via bmaltais GUI) that simplifies configuration, while ai-toolkit uses YAML configuration files for more precise control.
- Install your chosen training tool following its documentation. Ensure your GPU drivers and CUDA are up to date.
- Organize your dataset in a directory structure: `training_data/[num]_[concept_name]/`, where `[num]` is the number of repeats per image per epoch. For a 20-image character dataset, `10_sks_person` means each image is seen 10 times per epoch.
- Place caption files (.txt) alongside each image with the same filename: `image_001.png` and `image_001.txt`.
- Configure your training parameters (learning rate, steps, rank, etc.) in the UI or config file.
- Set a regularization dataset (optional but recommended) — images of the base concept without your specific subject, to prevent the model from associating your trigger word with generic features of the class.
- Start training and monitor the loss curve. Loss should decrease steadily and plateau. If it drops to near zero, you are probably overtraining.
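A failed run traced back to a typo in a folder name or a missing caption file is a common waste of GPU hours, so a pre-flight check helps. A minimal sketch assuming the kohya-style `[num]_[concept_name]` layout above (folder and file names are illustrative):

```python
# Parse the repeats prefix from the dataset folder name and confirm every
# image has a matching .txt caption file.
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def check_dataset(folder: str) -> tuple[int, list[str]]:
    path = Path(folder)
    repeats = int(path.name.split("_", 1)[0])  # e.g. "10_sks_person" -> 10
    missing = []
    for img in sorted(path.iterdir()):
        if img.suffix.lower() in IMAGE_EXTS:
            if not img.with_suffix(".txt").exists():
                missing.append(img.name)
    return repeats, missing
```

If `missing` is non-empty, caption those images before starting the run; uncaptioned images are either skipped or trained with an empty prompt depending on the trainer's settings.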
Regularization Images
Regularization (also called "class images" or "prior preservation") prevents a phenomenon called language drift, where the LoRA causes the model to associate generic class words with your specific concept. Without regularization, training a LoRA on "sks woman" might cause the model to generate your specific woman whenever anyone prompts "woman" without the trigger word.
To create regularization images, generate 200–500 images from the base model using the class prompt (e.g., "a woman, portrait photograph") without the LoRA. These images represent what the model normally generates for the class, and during training they anchor the model's understanding of the generic class while allowing it to learn your specific concept under the trigger word.
Testing and Deployment
Evaluating Your LoRA
Test your LoRA checkpoints systematically. Generate images with several prompts that vary in specificity and context:
- Direct activation: "sks person, portrait photograph, studio lighting" — tests basic concept activation.
- Context variation: "sks person walking through a forest" — tests whether the concept holds in different environments.
- Style variation: "sks person, oil painting style" — tests whether the concept can be combined with different aesthetics.
- Weight sensitivity: Generate at LoRA weights 0.5, 0.7, 0.9, and 1.0 to find the sweet spot. Most LoRAs produce the best results at 0.6–0.8 rather than full weight.
- Without trigger word: Generate "a woman, portrait photograph" without the trigger word. The LoRA should have minimal influence. If it still strongly shapes the output, you may need more regularization images or fewer training steps.
Deploying Your LoRA
LoRA files are portable and platform-independent. A LoRA trained with kohya_ss works in ComfyUI, Automatic1111, Forge, and any other tool that supports the safetensors format. To use your LoRA:
- Place the `.safetensors` file in your tool's LoRA directory (typically `models/loras/`).
- Load it alongside the base model it was trained on. An SDXL LoRA works only with SDXL checkpoints; a FLUX LoRA works only with FLUX checkpoints.
- Set the LoRA weight (start at 0.7) and include the trigger word in your prompt.
- Generate and adjust weight as needed.
On ZSky AI, you can upload custom LoRAs and use them with our RTX 5090 inference infrastructure for fast generation with your custom models.
Common Training Problems and Solutions
Overtrained / Fried LoRA
Symptoms: images look exactly like training data regardless of prompt, colors are oversaturated, faces look waxy or plastic. Solution: use an earlier checkpoint, reduce learning rate by 50%, reduce training steps, or increase rank (which distributes learning across more parameters, reducing per-parameter overfitting).
Undertrained LoRA
Symptoms: trigger word has no visible effect, or effect is extremely subtle even at weight 1.0. Solution: increase training steps, increase learning rate by 50%, verify that captions include the trigger word in every file, and check that images are being loaded correctly (correct directory structure, correct file format).
Style Bleeding
Symptoms: the LoRA changes overall image style even when the trigger word is not used. Solution: add more regularization images, reduce training steps, ensure captions describe image content beyond just the trigger word (so the model does not attribute everything in the image to the trigger word), and test with lower LoRA weights.
Poor Generalization
Symptoms: LoRA only works well in poses/angles/contexts similar to the training data. Solution: increase dataset variety. Add more images showing the concept in diverse situations. For character LoRAs, add profile shots, three-quarter views, full-body shots, and different lighting conditions. For style LoRAs, add more subject variety within the style.
Advanced Techniques
LyCORIS: Beyond Standard LoRA
LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion) extends LoRA with alternative matrix decomposition methods. LoHa (Hadamard product) and LoKr (Kronecker product) offer different capacity-efficiency tradeoffs that can capture certain concepts more effectively than standard LoRA. If standard LoRA struggles with your concept at reasonable ranks, try LyCORIS methods — they sometimes produce better results for complex style adaptations.
Pivotal Tuning / DreamBooth + LoRA
Combining DreamBooth's approach (learning a new text token) with LoRA's efficient weight modification produces particularly strong character LoRAs. The model learns both a new embedding for your subject and modified attention patterns, resulting in stronger likeness capture with better prompt adherence than either method alone.
Multi-Concept LoRA Training
You can train a single LoRA on multiple concepts by using different trigger words for each concept and organizing your dataset accordingly. A single LoRA file could contain both a character and a style, each activated by their own trigger word. This is more memory-efficient during inference than loading multiple separate LoRAs.
Use Your Custom LoRAs on ZSky AI
Train your LoRA, upload it to ZSky AI, and generate with dedicated RTX 5090 GPU power. Custom model support with no queue times.
Try ZSky AI Free →
Frequently Asked Questions
What is a LoRA in AI image generation?
LoRA (Low-Rank Adaptation) is a fine-tuning technique that trains small adapter weights to modify a base model's behavior without changing the original weights. A LoRA file is typically 10–200 MB compared to the base model's multi-gigabyte size. LoRAs teach models new concepts like specific people, art styles, objects, or aesthetic preferences, and can be loaded and unloaded instantly during generation.
How many images do I need to train a LoRA?
For a character or face LoRA, 15–30 high-quality images with varied poses, lighting, and expressions typically produce good results. For a style LoRA, 50–200 images consistently demonstrating the target style are recommended. For an object LoRA, 20–50 images from different angles and contexts work well. Quality always matters more than quantity.
What GPU do I need to train a LoRA?
For SDXL LoRAs, a GPU with at least 12 GB VRAM (like an RTX 3060 12GB or RTX 4070) is sufficient. For FLUX LoRAs, 16–24 GB VRAM is recommended. Training time ranges from 30 minutes to several hours depending on dataset size and parameters. Cloud GPU services are an option if local hardware is insufficient.
What is the best learning rate for LoRA training?
For SDXL, 1e-4 (0.0001) is the most reliable starting point. For FLUX, 5e-5 to 1e-4 works well. Use a cosine scheduler for stable training. If outputs are oversaturated or distorted, lower the learning rate. If the LoRA has no effect at weight 1.0, increase it or add more training steps.
Can I combine multiple LoRAs at once?
Yes. A common workflow combines a character LoRA with a style LoRA to place a specific person in a specific artistic style. When stacking, reduce each LoRA's weight to 0.5–0.8 to prevent over-conditioning. More than 3–4 simultaneous LoRAs typically degrades quality as their modifications may conflict.
What is the difference between LoRA, LyCORIS, and full fine-tuning?
LoRA trains low-rank matrices modifying specific layers, producing small files with efficient training. LyCORIS extends LoRA with additional decomposition methods (LoHa, LoKr) for more complex adaptations. Full fine-tuning modifies every weight, producing maximum quality but requiring the most resources. LoRA offers the best balance for most users.
Ready to put your training knowledge to work? Try ZSky AI free with 200 free credits at signup + 100 daily when logged in. Power users who need more can compare plans here.