Lora-Script

Abstract

This post primarily records issues encountered during model fine-tuning and some parameter configurations. The graphics cards used were rented from a cloud provider.

  • Multi-GPU training mainly uses a modified kohya_ss framework with DeepSpeed ZeRO Stage 3.
  • Single-GPU training uses the aki lora-scripts toolkit.

Training Config (TOML)

[model]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "./sd-models/realismEngineSDXL_v30VAE.safetensors"
vae = "./sd-models/sdxl_vae.safetensors"

[dataset]
train_data_dir = "./train/001"
reg_data_dir = ""
prior_loss_weight = 1
cache_latents = true
shuffle_caption = true
enable_bucket = true

[additional_network]
network_dim = 32
network_alpha = 16
network_train_unet_only = false
network_train_text_encoder_only = false
network_module = "networks.lora"
network_args = []

[optimizer]
unet_lr = 1e-4
text_encoder_lr = 1e-5
optimizer_type = "AdamW8bit"
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 0
lr_restart_cycles = 1

[training]
resolution = "512,512"
train_batch_size = 1
max_train_epochs = 10
noise_offset = 0.0
keep_tokens = 0
xformers = true
lowram = false
clip_skip = 2
mixed_precision = "fp16"
save_precision = "fp16"

[sample_prompt]
sample_sampler = "euler_a"
sample_every_n_epochs = 1

[saving]
output_name = "xtgz-centos-sdxl"
save_every_n_epochs = 2
save_n_epoch_ratio = 0
save_last_n_epochs = 499
save_state = false
save_model_as = "safetensors"
output_dir = "./output"
logging_dir = "./logs"
log_prefix = "output_name"

[others]
min_bucket_reso = 256
max_bucket_reso = 1024
caption_extension = ".txt"
max_token_length = 225
seed = 1337
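
A minimal single-GPU launch sketch for the config above, assuming it is saved as train_config.toml inside the lora-scripts directory and that sd-scripts' train_network.py reads it via --config_file (paths and file names are illustrative):

# Activate the project venv, then pass the TOML to sd-scripts
source venv/bin/activate
accelerate launch --num_cpu_threads_per_process 8 \
  sd-scripts/train_network.py \
  --config_file "train_config.toml"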

Common Issues/Troubleshooting

Mirror Site Address

  • export HF_ENDPOINT=https://hf-mirror.com
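
To make the mirror persistent and verify it works, something like the following can be used (assumes huggingface_hub is installed; the repo name is only an example):

echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc
source ~/.bashrc
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./clip_test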

Missing Modules

  • ModuleNotFoundError: No module named 'bitsandbytes'
    • Solution: pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/ (quote the version spec so the shell does not treat >= as a redirect)

Pip Permission Issues

  • WARNING: Running pip as the 'root' user can result in broken permissions...
    • Solution: Using a virtual environment is recommended; for a quick fix in a root environment you can run the same install anyway and ignore the warning: pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/
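
A minimal venv-based install sketch, which avoids the root-user warning entirely (paths are illustrative):

python3 -m venv ~/venvs/lora
source ~/venvs/lora/bin/activate
pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/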

Folder Permission Issues

  • WARNING: Ignoring invalid distribution -orch ...
    • Solution: Delete the leftover ~orch folder (a broken partial copy of torch) from the site-packages directory, along with any other similarly named ~ folders.
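
A sketch for locating the leftover folder before deleting it (confirm it really is a broken ~ directory first):

# Resolve the current interpreter's site-packages directory, list leftover "~" folders, then remove them
SITE_PKGS=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
ls -d "$SITE_PKGS"/~* 2>/dev/null
rm -rf "$SITE_PKGS"/~orch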

xformers Issues

  • ModuleNotFoundError: No module named 'xformers' (with CUDA 12.8):
    • Solution: pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128

torchvision Issues

  • Errors about missing libpng/libjpeg, or a message that torchvision needs to be rebuilt before ***
    • Solution: pip3 install torchvision --index-url https://download.pytorch.org/whl/cu128

CUDA 12.8 Installation

  • Download: wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
  • Silent Install: sh cuda_12.8.0_570.86.10_linux.run --toolkit --toolkitpath=/root/autodl-tmp/cuda-12.8 --silent
  • Modify Environment Variables:
    • echo 'export PATH=/root/autodl-tmp/cuda-12.8/bin:$PATH' >> ~/.bashrc
    • echo 'export LD_LIBRARY_PATH=/root/autodl-tmp/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
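
After reloading the shell, a quick check that the new toolkit is the one being picked up (paths as installed above):

source ~/.bashrc
which nvcc        # should point to /root/autodl-tmp/cuda-12.8/bin/nvcc
nvcc --version    # should report release 12.8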

xformers Error: No Kernel Image

  • Error: CUDA error ... no kernel image is available for execution on the device
  1. Fix: First check your CUDA version:
    nvcc --version -> compare with nvidia-smi CUDA version.
  2. If inconsistent, use the steps above to install the correct new version of CUDA.
  3. Check versions: conda list. Ensure PyTorch version matches xformers version. If you installed 2.7.1 but xformers only supports up to 2.7.0, downgrade first:
    pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
  4. If it still errors, it is indeed an xformers issue:
    1. Run python -m xformers.info
    2. Check the build.env section. The prebuilt wheel may only support compute capabilities up to 9.0, which does not cover 12.x GPUs.
      • build.env.TORCH_CUDA_ARCH_LIST: 6.0+PTX 7.0 7.5 8.0+PTX 9.0a
    3. Confirm current GPU compute capability: nvidia-smi --query-gpu=compute_cap --format=csv (e.g., 12.0)
    4. Modify environment variable for compute capability: export TORCH_CUDA_ARCH_LIST="12.0" (single session)
    5. Download source and compile (use mirror for acceleration):
      pip install -v --no-build-isolation -U git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers
  5. After installation, run python -m xformers.info again; if the listed arch list now includes your GPU's compute capability, the build is correct.
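
A consolidated sketch of the check-and-rebuild sequence above (it assumes python -m xformers.info prints a build.env.TORCH_CUDA_ARCH_LIST line, as shown in step 4):

nvidia-smi --query-gpu=compute_cap --format=csv,noheader    # e.g. 12.0
python -m xformers.info | grep -i TORCH_CUDA_ARCH_LIST      # arch list of the installed build
export TORCH_CUDA_ARCH_LIST="12.0"                          # set to your GPU's compute capability
pip install -v --no-build-isolation -U \
  "git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers"
python -m xformers.info | grep -i TORCH_CUDA_ARCH_LIST      # verify the rebuilt arch list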

xformers Error: Incompatibility & Slow Download

  • If install-cn.ps1 fails to install the virtual environment, try switching to install.ps1. Network issues mainly affect Torch installation (domestic mirrors may not help much).
  • Manual Installation:
    • If network is too slow, reinstall Torch manually:
    • Open install.ps1, verify commands.
    • python.exe -m venv venv (Create venv)
    • .\venv\Scripts\activate (Activate venv)
    • Use nvidia-smi to find CUDA version, go to PyTorch Website to download the corresponding .whl file.
    • Go to xformers repo/site, find the installation command for your CUDA version.
    • pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128
    • Place the two downloaded packages in the lora-scripts folder.
    • Manually install:
      • pip install .\torch-2.7.0+cu128-cp310-cp310-win_amd64.whl
      • pip install .\xformers-0.0.30-cp310-cp310-win_amd64.whl
    • Finally, update the environment in the PS script.

Flux Training: google/t5-xxl Download Error

Multi-GPU Training Issues Summary

Complete Training Parameters

  1. train_flux.sh
#!/bin/bash

# =================================================================
# Configuration Area (Modify paths to match your machine)
# =================================================================

# 1. Disable P2P direct access (Core fix for Illegal memory access)
# export NCCL_P2P_DISABLE=1

# 2. Disable InfiniBand (Prevent crash from trying to use server-grade networking)
export NCCL_IB_DISABLE=1

# 3. Force blocking mode (If error occurs, shows the exact line)
export CUDA_LAUNCH_BLOCKING=1

# Explicitly specify GPUs (0,1,2,3)
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Optimize memory allocation to prevent fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# FLUX.1 Model Path
MODEL_PATH="train/sd-models/flux1-dev.safetensors"

# CLIP-L Model Path
CLIP_PATH="train/sd-models/clip_l.safetensors"

# T5-XXL Model Path
T5_PATH="train/sd-models/t5xxl_fp16.safetensors"

# AE Model Path
AE_PATH="train/sd-models/flux-ae.safetensors"

# Output Directory
OUTPUT_DIR="./output"

# =================================================================
# Run Command
# =================================================================

accelerate launch \
  --deepspeed_config_file "ds_config.json" \
  --use_deepspeed \
  --num_cpu_threads_per_process 8 \
  --gpu_ids 0,1,2,3 \
  --mixed_precision bf16 \
  --num_processes 4 \
  --num_machines 1 \
  --offload_optimizer_device cpu \
  --offload_param_device cpu \
  "sd-scripts/flux_train.py" \
  --config_file "dreambooth_flux_config.toml" \
  --optimizer_type="adafactor" \
  --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
  --cache_text_encoder_outputs \
  --cache_latents \
  --full_bf16 \
  --lowram \
  --gradient_checkpointing \
  --max_data_loader_n_workers 0 \
  --learning_rate 1e-5 \
  --cache_latents_to_disk \
  --cache_text_encoder_outputs_to_disk

ds_config.json

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/root/autodl-tmp/ds_cache",
      "pin_memory": true,
      "buffer_count": 5,
      "fast_init": false
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/root/autodl-tmp/ds_cache",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e7,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_clipping": 1.0,
  "bf16": {
    "enabled": true
  }
}
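
NVMe offload also requires the cache directory to exist and DeepSpeed's async I/O op to be usable; a quick pre-flight check (the path must match nvme_path above, and libaio must be installed for async_io):

mkdir -p /root/autodl-tmp/ds_cache
ds_report | grep -i async_io    # async_io should be reported as compatible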

dreambooth_flux_config.toml

ae = "train/sd-models/flux-ae.safetensors"
blocks_to_swap = 0
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_l = "/root/autodl-tmp/kohya_ss/train/sd-models/clip_l.safetensors"
cpu_offload_checkpointing = true
discrete_flow_shift = 3.1582
double_blocks_to_swap = 0
dynamo_backend = "no"
epoch = 50
fp8_base = true
full_bf16 = false
# gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 1
huber_c = 0.1
huber_scale = 1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 4e-6
learning_rate_te = 0
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 1024
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 500
max_token_length = 75
max_train_steps = 250
min_bucket_reso = 256
mixed_precision = "bf16"
model_prediction_type = "sigma_scaled"
multires_noise_discount = 0.3
no_token_padding = true
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
# optimizer_args = [ ]
optimizer_type = "Adafactor"
# optimizer_type = "AdamW8bit"
output_dir = "outputs"
output_name = "Quality_1"
persistent_data_loader_workers = 0
pretrained_model_name_or_path = "train/sd-models/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
# sample_prompts = "/outputs/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
seed = 1
single_blocks_to_swap = 0
t5xxl = "train/sd-models/t5xxl_fp16.safetensors"
t5xxl_max_token_length = 225
timestep_sampling = "sigmoid"
train_batch_size = 1
train_blocks = "all"
train_data_dir = "train/images"
wandb_run_name = "Quality_1"

1. Gradio Port Error

Traceback (most recent call last): OSError: Cannot find empty port in range: 28001-28001…

  • Solution:
    • netstat -ano | findstr :28001
    • taskkill /PID <PID> /F
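
The commands above are for Windows; on a Linux cloud instance the rough equivalent is:

lsof -i :28001    # find the PID holding the port
kill -9 <PID>     # replace <PID> with the number reported above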

2. CUDA Out of Memory

torch.OutOfMemoryError: CUDA out of memory.

  • Note: This is sometimes reported as a VRAM error when the real cause is system-RAM exhaustion; reduce train_batch_size if so.
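
To tell VRAM pressure apart from system-RAM pressure while training, something like this can be left running in another terminal:

watch -n 2 'nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv; free -h'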

3. use_libuv = 0

  • Reference: Introduction to Libuv TCPStore Backend
  • Issue: If use_libuv = 0 is set via an environment variable but the code explicitly sets it to True (approach 3 in the reference above), the value in code takes precedence. I set use_libuv to False in all the relevant files.

4. DistributedDataParallel Error

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules…

  • Solution: Add a parameter to disable CPU offloading/block swapping so each rank keeps the whole model on its GPU: --blocks_to_swap=0 (equivalently blocks_to_swap = 0 in the TOML above).

5. NotImplementedError: Cannot copy out of meta tensor; no data!

  • Cause: Model tensors not initialized properly.
  • Fix: Modify /root/autodl-tmp/kohya_ss/sd-scripts/library/flux_utils.py

6. DeepSpeed OOM

Cause: DeepSpeed loads the Flux model into both VRAM and RAM, causing OOM.

Model Training Config:

  • Full Fine-tuning Flux1-dev
  • Environment: PyTorch 2.7.0, Python 3.12 (Ubuntu 22.04), CUDA 12.8
  • Hardware: RTX 5090 (32GB) * 4, 360GB RAM
  • Storage: 50GB System, 440GB Data

Solution: Offload optimizer and params to NVMe using DeepSpeed to move data pressure to disk.
(See ds_config.json configuration above, specifically offload_optimizer and offload_param set to nvme).

7. TypeError: adam_update(): incompatible function arguments.

DeepSpeed ZeRO Stage 3 passes an eps value of the wrong type (a tuple rather than a float), causing this error.
Fix Code:

beta1, beta2 = group['betas']

# ================= FIX START: Handle eps tuple conflict =================
# 1. Force handle eps: if it is a tuple from Adafactor, fall back to the Adam default 1e-8
eps_val = group['eps']

if isinstance(eps_val, (tuple, list)):
    # Adafactor's eps is (1e-30, 1e-3), which causes division by zero or instability in Adam,
    # so when a tuple is passed, use the standard Adam default 1e-8
    eps_val = 1e-8
else:
    eps_val = float(eps_val)

# 2. Force handle step (prevent a Tensor being passed)
step_val = state['step']
if hasattr(step_val, 'item'):
    step_val = int(step_val.item())
else:
    step_val = int(step_val)

# 3. Force handle bias_correction (prevent an int being passed)
bias_correction_val = bool(group['bias_correction'])

# ================= DEBUG START =================
print("\n" + "=" * 30 + " DEBUG ADAM UPDATE " + "=" * 30)
try:
    # Extract variables for checking
    arg_list = [
        ("0. opt_id (int)", self.opt_id),
        ("1. step (int)", state['step']),
        ("2. lr (float)", group['lr']),
        ("3. beta1 (float)", beta1),
        ("4. beta2 (float)", beta2),
        ("5. eps (float)", group['eps']),
        ("6. weight_decay (float)", group['weight_decay']),
        ("7. bias_correction (bool)", group['bias_correction']),
        ("8. param (Tensor)", p.data),
        ("9. grad (Tensor)", p.grad.data),
        ("10. exp_avg (Tensor)", state['exp_avg']),
        ("11. exp_avg_sq (Tensor)", state['exp_avg_sq']),
    ]

    for name, val in arg_list:
        if hasattr(val, 'shape'):  # Tensor
            print(f"[{name}]: Type={type(val)}, Dtype={val.dtype}, Device={val.device}, Shape={val.shape}")
        else:  # Scalar
            print(f"[{name}]: Type={type(val)}, Value={val}")

except Exception as e:
    print(f"DEBUG ERROR: {e}")
print("=" * 80 + "\n")
# ================= DEBUG END =================

# Pass the sanitized scalar values (step_val, eps_val, bias_correction_val) to the C++ kernel
self.ds_opt_adam.adam_update(self.opt_id, step_val, group['lr'], beta1, beta2, eps_val,
                             group['weight_decay'], bias_correction_val, p.data, p.grad.data,
                             state['exp_avg'], state['exp_avg_sq'])
return loss

Problem 8: No Data! (Meta Tensor Issue)

  • File: /root/miniconda3/envs/kohyass/lib/python3.11/site-packages/transformers/modeling_utils.py
  • Line 2031: Add enabled parameter.
  • init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config(), enabled=False), set_zero3_state()]
  • Reason: When the model is loaded locally under DeepSpeed ZeRO-3, transformers initializes weights as meta tensors; loading the checkpoint then encounters empty meta tensors and fails. Disabling DeepSpeed's default zero.Init initialization avoids this.
  • Reference: https://github.com/zai-org/ChatGLM-6B/issues/530

Problem 9: mat1 and mat2 must have the same dtype (dtype mismatch)

  • File: kohya_ss/sd-scripts/library/flux_models.py
  • Line 1068: Add text and img tensor type conversion.
  • Reason: The CLIP and T5 outputs are float32 while the rest of the model runs in bfloat16, causing the dtype mismatch.
def forward(
    self,
    img: Tensor,
    img_ids: Tensor,
    txt: Tensor,
    txt_ids: Tensor,
    timesteps: Tensor,
    y: Tensor,
    block_controlnet_hidden_states=None,
    block_controlnet_single_hidden_states=None,
    guidance: Tensor | None = None,
    txt_attention_mask: Tensor | None = None,
) -> Tensor:

    target_dtype = self.img_in.weight.dtype  # Use this layer's weight dtype as the reference

    if img.dtype != target_dtype:
        img = img.to(target_dtype)

    if txt.dtype != target_dtype:
        txt = txt.to(target_dtype)

    if timesteps.dtype != target_dtype:
        timesteps = timesteps.to(target_dtype)

    if guidance is not None and guidance.dtype != target_dtype:
        guidance = guidance.to(target_dtype)

    if y is not None and y.dtype != target_dtype:
        y = y.to(target_dtype)

    if img.ndim != 3 or txt.ndim != 3:
        raise ValueError("Input img and txt tensors must have 3 dimensions.")
    # ==========================
    # Next Code (the rest of the original forward() follows)

DeepSpeed AttributeError: ‘DeepSpeedZeRoOffload’ object has no attribute ‘backward’

  • Reason: DeepSpeed not initialized.
  • Debug: Check /root/autodl-tmp/kohya_ss/sd-scripts/library/deepspeed_utils.py Line 87.
  • Note: Line 64 in deepspeed_utils.py might return None if DeepSpeed is not set.

NCCL enqueue.cc:1556 NCCL WARN Cuda failure 700 ‘an illegal memory access was encountered’

  • pip install "nvidia-nccl-cu12>2.26.2" (quote the version spec so the shell does not treat > as a redirect)
  • This error may occur on RTX 5090 but does not affect training.
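
To confirm which NCCL versions are actually in use (the pip package and the one bundled with PyTorch), a quick check:

pip show nvidia-nccl-cu12 | grep -i version
python -c "import torch; print(torch.cuda.nccl.version())"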