Lora-Script

Abstract

This post primarily records issues encountered during model fine-tuning and some parameter configurations. The graphics cards used were rented from a cloud provider.

  • Multi-GPU training mainly uses a modified kohya_ss framework with DeepSpeed ZeRO Stage 3.
  • Single-GPU training uses the aki lora-scripts toolkit.

Training Config (TOML)

[model]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "./sd-models/realismEngineSDXL_v30VAE.safetensors"
vae = "./sd-models/sdxl_vae.safetensors"

[dataset]
train_data_dir = "./train/001"
reg_data_dir = ""
prior_loss_weight = 1
cache_latents = true
shuffle_caption = true
enable_bucket = true

[additional_network]
network_dim = 32
network_alpha = 16
network_train_unet_only = false
network_train_text_encoder_only = false
network_module = "networks.lora"
network_args = []

[optimizer]
unet_lr = 1e-4
text_encoder_lr = 1e-5
optimizer_type = "AdamW8bit"
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 0
lr_restart_cycles = 1

[training]
resolution = "512,512"
train_batch_size = 1
max_train_epochs = 10
noise_offset = 0.0
keep_tokens = 0
xformers = true
lowram = false
clip_skip = 2
mixed_precision = "fp16"
save_precision = "fp16"

[sample_prompt]
sample_sampler = "euler_a"
sample_every_n_epochs = 1

[saving]
output_name = "xtgz-centos-sdxl"
save_every_n_epochs = 2
save_n_epoch_ratio = 0
save_last_n_epochs = 499
save_state = false
save_model_as = "safetensors"
output_dir = "./output"
logging_dir = "./logs"
log_prefix = "output_name"

[others]
min_bucket_reso = 256
max_bucket_reso = 1024
caption_extension = ".txt"
max_token_length = 225
seed = 1337
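
A minimal single-GPU launch sketch for the config above, assuming it is saved as train_config.toml inside the lora-scripts directory and that sd-scripts' train_network.py reads it via --config_file (paths and file names are illustrative):

# Activate the project venv, then pass the TOML to sd-scripts
source venv/bin/activate
accelerate launch --num_cpu_threads_per_process 8 \
  sd-scripts/train_network.py \
  --config_file "train_config.toml"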

Common Issues/Troubleshooting

Mirror Site Address

  • export HF_ENDPOINT=https://hf-mirror.com
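
To make the mirror persistent and verify it works, something like the following can be used (assumes huggingface_hub is installed; the repo name is only an example):

echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc
source ~/.bashrc
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./clip_test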

Missing Modules

  • ModuleNotFoundError: No module named 'bitsandbytes'
    • Solution: pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/ (quote the version spec so the shell does not treat >= as a redirect)

Pip Permission Issues

  • WARNING: Running pip as the 'root' user can result in broken permissions...
    • Solution: Using a virtual environment is recommended; for a quick fix in a root environment you can run the same install anyway and ignore the warning: pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/
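
A minimal venv-based install sketch, which avoids the root-user warning entirely (paths are illustrative):

python3 -m venv ~/venvs/lora
source ~/venvs/lora/bin/activate
pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/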

Folder Permission Issues

  • WARNING: Ignoring invalid distribution -orch ...
    • Solution: Delete the leftover ~orch folder (a broken partial copy of torch) from the site-packages directory, along with any other similarly named ~ folders.
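
A sketch for locating the leftover folder before deleting it (confirm it really is a broken ~ directory first):

# Resolve the current interpreter's site-packages directory, list leftover "~" folders, then remove them
SITE_PKGS=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
ls -d "$SITE_PKGS"/~* 2>/dev/null
rm -rf "$SITE_PKGS"/~orch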

xformers Issues

  • ModuleNotFoundError: No module named 'xformers' (with CUDA 12.8):
    • Solution: pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128

torchvision Issues

  • Errors about missing libpng/libjpeg, or a message that torchvision needs to be rebuilt before ***
    • Solution: pip3 install torchvision --index-url https://download.pytorch.org/whl/cu128

CUDA 12.8 Installation

  • Download: wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
  • Silent Install: sh cuda_12.8.0_570.86.10_linux.run --toolkit --toolkitpath=/root/autodl-tmp/cuda-12.8 --silent
  • Modify Environment Variables:
    • echo 'export PATH=/root/autodl-tmp/cuda-12.8/bin:$PATH' >> ~/.bashrc
    • echo 'export LD_LIBRARY_PATH=/root/autodl-tmp/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
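
After reloading the shell, a quick check that the new toolkit is the one being picked up (paths as installed above):

source ~/.bashrc
which nvcc        # should point to /root/autodl-tmp/cuda-12.8/bin/nvcc
nvcc --version    # should report release 12.8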

xformers Error: No Kernel Image

  • Error: CUDA error ... no kernel image is available for execution on the device
  1. Fix: First check your CUDA version:
    nvcc --version -> compare with nvidia-smi CUDA version.
  2. If inconsistent, use the steps above to install the correct new version of CUDA.
  3. Check versions: conda list. Ensure PyTorch version matches xformers version. If you installed 2.7.1 but xformers only supports up to 2.7.0, downgrade first:
    pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
  4. If it still errors, it is indeed an xformers issue:
    1. Run python -m xformers.info
    2. Check the build.env section. The prebuilt wheel may only support compute capabilities up to 9.0, which does not cover 12.x GPUs.
      • build.env.TORCH_CUDA_ARCH_LIST: 6.0+PTX 7.0 7.5 8.0+PTX 9.0a
    3. Confirm current GPU compute capability: nvidia-smi --query-gpu=compute_cap --format=csv (e.g., 12.0)
    4. Modify environment variable for compute capability: export TORCH_CUDA_ARCH_LIST="12.0" (single session)
    5. Download source and compile (use mirror for acceleration):
      pip install -v --no-build-isolation -U git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers
  5. After installation, run python -m xformers.info again; if the listed arch list now includes your GPU's compute capability, the build is correct.
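
A consolidated sketch of the check-and-rebuild sequence above (it assumes python -m xformers.info prints a build.env.TORCH_CUDA_ARCH_LIST line, as shown in step 4):

nvidia-smi --query-gpu=compute_cap --format=csv,noheader    # e.g. 12.0
python -m xformers.info | grep -i TORCH_CUDA_ARCH_LIST      # arch list of the installed build
export TORCH_CUDA_ARCH_LIST="12.0"                          # set to your GPU's compute capability
pip install -v --no-build-isolation -U \
  "git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers"
python -m xformers.info | grep -i TORCH_CUDA_ARCH_LIST      # verify the rebuilt arch list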

xformers Error: Incompatibility & Slow Download

  • If install-cn.ps1 fails to install the virtual environment, try switching to install.ps1. Network issues mainly affect Torch installation (domestic mirrors may not help much).
  • Manual Installation:
    • If network is too slow, reinstall Torch manually:
    • Open install.ps1, verify commands.
    • python.exe -m venv venv (Create venv)
    • .\venv\Scripts\activate (Activate venv)
    • Use nvidia-smi to find CUDA version, go to PyTorch Website to download the corresponding .whl file.
    • Go to xformers repo/site, find the installation command for your CUDA version.
    • pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128
    • Place the two downloaded packages in the lora-scripts folder.
    • Manually install:
      • pip install .\torch-2.7.0+cu128-cp310-cp310-win_amd64.whl
      • pip install .\xformers-0.0.30-cp310-cp310-win_amd64.whl
    • Finally, update the environment in the PS script.

Flux Training: google/t5-xxl Download Error

Multi-GPU Training Issues Summary

Complete Training Parameters

  1. train_flux.sh
#!/bin/bash

# =================================================================
# Configuration Area (Modify paths to match your machine)
# =================================================================

# 1. Disable P2P direct access (Core fix for Illegal memory access)
# export NCCL_P2P_DISABLE=1

# 2. Disable InfiniBand (Prevent crash from trying to use server-grade networking)
export NCCL_IB_DISABLE=1

# 3. Force blocking mode (If error occurs, shows the exact line)
export CUDA_LAUNCH_BLOCKING=1

# Explicitly specify GPUs (0,1,2,3)
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Optimize memory allocation to prevent fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# FLUX.1 Model Path
MODEL_PATH="train/sd-models/flux1-dev.safetensors"

# CLIP-L Model Path
CLIP_PATH="train/sd-models/clip_l.safetensors"

# T5-XXL Model Path
T5_PATH="train/sd-models/t5xxl_fp16.safetensors"

# AE Model Path
AE_PATH="train/sd-models/flux-ae.safetensors"

# Output Directory
OUTPUT_DIR="./output"

# =================================================================
# Run Command
# =================================================================

accelerate launch \
  --deepspeed_config_file "ds_config.json" \
  --use_deepspeed \
  --num_cpu_threads_per_process 8 \
  --gpu_ids 0,1,2,3 \
  --mixed_precision bf16 \
  --num_processes 4 \
  --num_machines 1 \
  --offload_optimizer_device cpu \
  --offload_param_device cpu \
  "sd-scripts/flux_train.py" \
  --config_file "dreambooth_flux_config.toml" \
  --optimizer_type="adafactor" \
  --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
  --cache_text_encoder_outputs \
  --cache_latents \
  --full_bf16 \
  --lowram \
  --gradient_checkpointing \
  --max_data_loader_n_workers 0 \
  --learning_rate 1e-5 \
  --cache_latents_to_disk \
  --cache_text_encoder_outputs_to_disk

ds_config.json

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/root/autodl-tmp/ds_cache",
      "pin_memory": true,
      "buffer_count": 5,
      "fast_init": false
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/root/autodl-tmp/ds_cache",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e7,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_clipping": 1.0,
  "bf16": {
    "enabled": true
  }
}
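
NVMe offload also requires the cache directory to exist and DeepSpeed's async I/O op to be usable; a quick pre-flight check (the path must match nvme_path above, and libaio must be installed for async_io):

mkdir -p /root/autodl-tmp/ds_cache
ds_report | grep -i async_io    # async_io should be reported as compatible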

dreambooth_flux_config.toml

ae = "train/sd-models/flux-ae.safetensors"
blocks_to_swap = 0
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_l = "/root/autodl-tmp/kohya_ss/train/sd-models/clip_l.safetensors"
cpu_offload_checkpointing = true
discrete_flow_shift = 3.1582
double_blocks_to_swap = 0
dynamo_backend = "no"
epoch = 50
fp8_base = true
full_bf16 = false
# gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 1
huber_c = 0.1
huber_scale = 1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 4e-6
learning_rate_te = 0
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 1024
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 500
max_token_length = 75
max_train_steps = 250
min_bucket_reso = 256
mixed_precision = "bf16"
model_prediction_type = "sigma_scaled"
multires_noise_discount = 0.3
no_token_padding = true
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
# optimizer_args = [ ]
optimizer_type = "Adafactor"
# optimizer_type = "AdamW8bit"
output_dir = "outputs"
output_name = "Quality_1"
persistent_data_loader_workers = 0
pretrained_model_name_or_path = "train/sd-models/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
# sample_prompts = "/outputs/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
seed = 1
single_blocks_to_swap = 0
t5xxl = "train/sd-models/t5xxl_fp16.safetensors"
t5xxl_max_token_length = 225
timestep_sampling = "sigmoid"
train_batch_size = 1
train_blocks = "all"
train_data_dir = "train/images"
wandb_run_name = "Quality_1"

1. Gradio Port Error

Traceback (most recent call last): OSError: Cannot find empty port in range: 28001-28001…

  • Solution:
    • netstat -ano | findstr :28001
    • taskkill /PID <PID> /F
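
The commands above are for Windows; on a Linux cloud instance the rough equivalent is:

lsof -i :28001    # find the PID holding the port
kill -9 <PID>     # replace <PID> with the number reported above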

2. CUDA Out of Memory

torch.OutOfMemoryError: CUDA out of memory.

  • Note: This is sometimes reported as a VRAM error when the real cause is system-RAM exhaustion; reduce train_batch_size if so.
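
To tell VRAM pressure apart from system-RAM pressure while training, something like this can be left running in another terminal:

watch -n 2 'nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv; free -h'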

3. use_libuv = 0

  • Reference: Introduction to Libuv TCPStore Backend
  • Issue: If use_libuv = 0 is set via an environment variable but the code explicitly sets it to True (approach 3 in the reference above), the value in code takes precedence. I set use_libuv to False in all the relevant files.

4. DistributedDataParallel Error

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules…

  • Solution: Add a parameter to disable CPU offloading/block swapping so each rank keeps the whole model on its GPU: --blocks_to_swap=0 (equivalently blocks_to_swap = 0 in the TOML above).

5. NotImplementedError: Cannot copy out of meta tensor; no data!

  • Cause: Model tensors not initialized properly.
  • Fix: Modify /root/autodl-tmp/kohya_ss/sd-scripts/library/flux_utils.py

6. DeepSpeed OOM

Cause: DeepSpeed loads the Flux model into both VRAM and RAM, causing OOM.

Model Training Config:

  • Full Fine-tuning Flux1-dev
  • Environment: PyTorch 2.7.0, Python 3.12 (Ubuntu 22.04), CUDA 12.8
  • Hardware: RTX 5090 (32GB) * 4, 360GB RAM
  • Storage: 50GB System, 440GB Data

Solution: Offload optimizer and params to NVMe using DeepSpeed to move data pressure to disk.
(See ds_config.json configuration above, specifically offload_optimizer and offload_param set to nvme).

7. TypeError: adam_update(): incompatible function arguments.

DeepSpeed ZeRO Stage 3 passes an eps value of the wrong type (a tuple rather than a float), causing this error.
Fix Code:

beta1, beta2 = group['betas']

# ================= FIX START: Handle eps tuple conflict =================
# 1. Force handle eps: if it is a tuple from Adafactor, fall back to the Adam default 1e-8
eps_val = group['eps']

if isinstance(eps_val, (tuple, list)):
    # Adafactor's eps is (1e-30, 1e-3), which causes division by zero or instability in Adam,
    # so when a tuple is passed, use the standard Adam default 1e-8
    eps_val = 1e-8
else:
    eps_val = float(eps_val)

# 2. Force handle step (prevent a Tensor being passed)
step_val = state['step']
if hasattr(step_val, 'item'):
    step_val = int(step_val.item())
else:
    step_val = int(step_val)

# 3. Force handle bias_correction (prevent an int being passed)
bias_correction_val = bool(group['bias_correction'])

# ================= DEBUG START =================
print("\n" + "=" * 30 + " DEBUG ADAM UPDATE " + "=" * 30)
try:
    # Extract variables for checking
    arg_list = [
        ("0. opt_id (int)", self.opt_id),
        ("1. step (int)", state['step']),
        ("2. lr (float)", group['lr']),
        ("3. beta1 (float)", beta1),
        ("4. beta2 (float)", beta2),
        ("5. eps (float)", group['eps']),
        ("6. weight_decay (float)", group['weight_decay']),
        ("7. bias_correction (bool)", group['bias_correction']),
        ("8. param (Tensor)", p.data),
        ("9. grad (Tensor)", p.grad.data),
        ("10. exp_avg (Tensor)", state['exp_avg']),
        ("11. exp_avg_sq (Tensor)", state['exp_avg_sq']),
    ]

    for name, val in arg_list:
        if hasattr(val, 'shape'):  # Tensor
            print(f"[{name}]: Type={type(val)}, Dtype={val.dtype}, Device={val.device}, Shape={val.shape}")
        else:  # Scalar
            print(f"[{name}]: Type={type(val)}, Value={val}")

except Exception as e:
    print(f"DEBUG ERROR: {e}")
print("=" * 80 + "\n")
# ================= DEBUG END =================

# Pass the sanitized scalar values (step_val, eps_val, bias_correction_val) to the C++ kernel
self.ds_opt_adam.adam_update(self.opt_id, step_val, group['lr'], beta1, beta2, eps_val,
                             group['weight_decay'], bias_correction_val, p.data, p.grad.data,
                             state['exp_avg'], state['exp_avg_sq'])
return loss

Problem 8: No Data! (Meta Tensor Issue)

  • File: /root/miniconda3/envs/kohyass/lib/python3.11/site-packages/transformers/modeling_utils.py
  • Line 2031: Add enabled parameter.
  • init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config(), enabled=False), set_zero3_state()]
  • Reason: When the model is loaded locally under DeepSpeed ZeRO-3, transformers initializes weights as meta tensors; loading the checkpoint then encounters empty meta tensors and fails. Disabling DeepSpeed's default zero.Init initialization avoids this.
  • Reference: https://github.com/zai-org/ChatGLM-6B/issues/530

Problem 9: mat1 and mat2 must have the same dtype (dtype mismatch)

  • File: kohya_ss/sd-scripts/library/flux_models.py
  • Line 1068: Add text and img tensor type conversion.
  • Reason: The CLIP and T5 outputs are float32 while the rest of the model runs in bfloat16, causing the dtype mismatch.
def forward(
    self,
    img: Tensor,
    img_ids: Tensor,
    txt: Tensor,
    txt_ids: Tensor,
    timesteps: Tensor,
    y: Tensor,
    block_controlnet_hidden_states=None,
    block_controlnet_single_hidden_states=None,
    guidance: Tensor | None = None,
    txt_attention_mask: Tensor | None = None,
) -> Tensor:

    target_dtype = self.img_in.weight.dtype  # Use this layer's weight dtype as the reference

    if img.dtype != target_dtype:
        img = img.to(target_dtype)

    if txt.dtype != target_dtype:
        txt = txt.to(target_dtype)

    if timesteps.dtype != target_dtype:
        timesteps = timesteps.to(target_dtype)

    if guidance is not None and guidance.dtype != target_dtype:
        guidance = guidance.to(target_dtype)

    if y is not None and y.dtype != target_dtype:
        y = y.to(target_dtype)

    if img.ndim != 3 or txt.ndim != 3:
        raise ValueError("Input img and txt tensors must have 3 dimensions.")
    # ==========================
    # Next Code (the rest of the original forward() follows)

DeepSpeed AttributeError: ‘DeepSpeedZeRoOffload’ object has no attribute ‘backward’

  • Reason: DeepSpeed not initialized.
  • Debug: Check /root/autodl-tmp/kohya_ss/sd-scripts/library/deepspeed_utils.py Line 87.
  • Note: Line 64 in deepspeed_utils.py might return None if DeepSpeed is not set.

NCCL enqueue.cc:1556 NCCL WARN Cuda failure 700 ‘an illegal memory access was encountered’

  • pip install "nvidia-nccl-cu12>2.26.2" (quote the version spec so the shell does not treat > as a redirect)
  • This error may occur on RTX 5090 but does not affect training.
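
To confirm which NCCL versions are actually in use (the pip package and the one bundled with PyTorch), a quick check:

pip show nvidia-nccl-cu12 | grep -i version
python -c "import torch; print(torch.cuda.nccl.version())"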