LoRA Model Training Notes | Graptolite Blog

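Summary

These notes record the problems I ran into while fine-tuning models, together with some of the parameter settings I used. The GPUs were rented from a cloud provider.

Multi-GPU training uses a modified kohya_ss framework with DeepSpeed ZeRO stage 3.
Single-GPU training uses the lora-scripts training package from Aki's toolkit.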

TOML config

[model]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "./sd-models/realismEngineSDXL_v30VAE.safetensors"
vae = "./sd-models/sdxl_vae.safetensors"

[dataset]
train_data_dir = "./train/001"
reg_data_dir = ""
prior_loss_weight = 1
cache_latents = true
shuffle_caption = true
enable_bucket = true

[additional_network]
network_dim = 32
network_alpha = 16
network_train_unet_only = false
network_train_text_encoder_only = false
network_module = "networks.lora"
network_args = []

[optimizer]
unet_lr = 1e-4
text_encoder_lr = 1e-5
optimizer_type = "AdamW8bit"
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 0
lr_restart_cycles = 1

[training]
resolution = "512,512"
train_batch_size = 1
max_train_epochs = 10
noise_offset = 0.0
keep_tokens = 0
xformers = true
lowram = false
clip_skip = 2
mixed_precision = "fp16"
save_precision = "fp16"

[sample_prompt]
sample_sampler = "euler_a"
sample_every_n_epochs = 1

[saving]
output_name = "xtgz-centos-sdxl"
save_every_n_epochs = 2
save_n_epoch_ratio = 0
save_last_n_epochs = 499
save_state = false
save_model_as = "safetensors"
output_dir = "./output"
logging_dir = "./logs"
log_prefix = "output_name"

[others]
min_bucket_reso = 256
max_bucket_reso = 1024
caption_extension = ".txt"
max_token_length = 225
seed = 1337

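Common problems

Mirror site address

export HF_ENDPOINT=https://hf-mirror.com

Missing module

ModuleNotFoundError: No module named 'bitsandbytes'

Fix: `pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/` (the quotes stop the shell from treating `>` as an output redirection)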
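
Before starting a long run it can help to confirm that the packages the training script imports are actually installed. The following is a minimal Python sketch of such a check. The helper name `missing_modules` and the list of package names are illustrative assumptions, not part of lora-scripts or kohya_ss.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list: adjust it to whatever your training script imports.
    required = ["bitsandbytes", "xformers", "torchvision"]
    for name in missing_modules(required):
        print(f"missing: {name} (install it with pip before training)")
```

An empty result means every listed module can be located. It does not prove that the installed version matches your CUDA build.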

pip permission warning

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

It appears when running `pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/` as root.

Folder permission problem

WARNING: Ignoring invalid distribution -orch (/root/miniconda3/lib/python3.10/site-packages)

Fix: delete the leftover `~orch` folder (or any other folder named in the same style) from that site-packages directory.

xformers problem

No module named 'xformers' with CUDA 12.8: `pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128`

torchvision problem

libpng / libjpeg not found, or a message that torchvision must be built first: `pip3 install torchvision --index-url https://download.pytorch.org/whl/cu128`

Installing CUDA 12.8

- Download: `wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run`
- Silent install: `sh cuda_12.8.0_570.86.10_linux.run --toolkit --toolkitpath=/root/autodl-tmp/cuda-12.8 --silent`
- Update the environment variables
	- `echo 'export PATH=/root/autodl-tmp/cuda-12.8/bin:$PATH' >> ~/.bashrc`
	- `echo 'export LD_LIBRARY_PATH=/root/autodl-tmp/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc`
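
After the install, the toolkit version reported by `nvcc --version` can be compared with the CUDA version printed in the `nvidia-smi` header. The following is a small Python sketch that extracts both numbers from the command output. The function names are assumptions made for illustration, and the regular expressions assume the usual `release 12.8` and `CUDA Version: 12.8` wording of those tools.

```python
import re
import shutil
import subprocess


def nvcc_release(nvcc_output):
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def driver_cuda_version(smi_output):
    """Extract the 'CUDA Version' field from the header printed by `nvidia-smi`."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Only run the real commands when both tools are on PATH.
    if shutil.which("nvcc") and shutil.which("nvidia-smi"):
        nvcc_text = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
        smi_text = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
        print("toolkit:", nvcc_release(nvcc_text), "driver:", driver_cuda_version(smi_text))
```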

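xformers error

CUDA error (/__w/xformers/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device

To fix this, first check CUDA: the version reported by `nvcc --version` should correspond to the CUDA version shown by `nvidia-smi`.
If they differ, install the new version with the CUDA steps above.
Use `conda list` to check that the PyTorch and xformers versions are compatible. If torch 2.7.1 is installed but xformers supports at most 2.7.0, first downgrade with `pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128`
If the error persists, the problem really is in xformers:
1. Run `python -m xformers.info`
2. Check the architectures listed in the build.env entries. On a GPU with compute capability 12.0, the downloaded package may only cover compute capabilities up to 9.0
  1. build.env.TORCH_CUDA_ARCH_LIST: 6.0+PTX 7.0 7.5 8.0+PTX 9.0a
3. Confirm the compute capability of the current GPU with `nvidia-smi --query-gpu=compute_cap --format=csv` (here it reports 12.0)
4. Set the compute capability in the environment: `export TORCH_CUDA_ARCH_LIST="12.0"` (this only applies to the current shell)
5. Download the source and compile it: `pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers`. A mirror can speed up the download: `pip install -v --no-build-isolation -U git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers`
After installation, check the xformers build information again. It is correct if it matches your GPU's compute capability.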
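
The comparison in steps 2 and 3 can be sketched in Python as follows. This is a simplified illustration: the helper name `has_native_kernels` is an assumption, and it only looks for an exact entry in the arch list, ignoring any forward compatibility that a `+PTX` entry may provide.

```python
def has_native_kernels(arch_list, compute_cap):
    """Check whether `compute_cap` (e.g. '12.0') has an exact entry in a
    TORCH_CUDA_ARCH_LIST-style string (e.g. '6.0+PTX 7.0 9.0a')."""
    wanted = compute_cap.strip()
    # Entries may be separated by spaces or semicolons.
    for entry in arch_list.replace(";", " ").split():
        arch = entry.split("+")[0]  # drop a '+PTX' suffix
        arch = arch.rstrip("a")     # treat '9.0a' as compute capability 9.0
        if arch == wanted:
            return True
    return False


if __name__ == "__main__":
    build_archs = "6.0+PTX 7.0 7.5 8.0+PTX 9.0a"  # value shown by xformers.info above
    print(has_native_kernels(build_archs, "12.0"))
```

For the arch list above and a 12.0 GPU the check fails, which matches the situation where rebuilding with `TORCH_CUDA_ARCH_LIST="12.0"` was needed.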

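xformers errors: the new version is incompatible with the old one, and downloads are slow on a local machine

- Running install-cn.ps1 directly can fail to create the virtual environment. Switch to install.ps1 instead. The network problem mainly affects the torch download, and the China-specific script does not noticeably improve it
- Installation
	- If the network is too slow and torch has to be reinstalled, follow the steps below
	- Open install.ps1 and copy the commands below by hand
	- `python.exe -m venv venv` creates the virtual environment
	- Activate the virtual environment with `.\venv\Scripts\activate`
	- Use `nvidia-smi` to find your CUDA version, then download the matching `.whl` file manually from the [PyTorch website](https://pytorch.org/get-started/locally/)
	- ![[Pasted image 20251117091728.png]]
	- Go to the xformers website, find the install command, and paste its URL into the browser to download the package manually. I am on CUDA 12.8, so I used the cu128 command
	- `pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128`
	- Put the two downloaded packages into the lora-scripts folder ![[Pasted image 20251117092155.png]]
	- Install the two packages manually, torch first and then xformers
	- `pip install .\torch-2.7.0+cu128-cp310-cp310-win_amd64.whl`
	- `pip install .\xformers-0.0.30-cp310-cp310-win_amd64.whl`
	- Finally, update the environment file in the ps script, after which everything works ![[Pasted image 20251117092327.png]]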

Flux training fails with an error asking to download google/t5-xxl

- [https://blog.csdn.net/sinat_29957455/article/details/142782264](https://blog.csdn.net/sinat_29957455/article/details/142782264)

Multi-GPU training issues

Full training parameters

tran_flux.sh

    #!/bin/bash

    # =================================================================
    # Configuration (change these to the actual paths on your machine)
    # =================================================================

    # 1. Disable direct P2P access (the key fix for "Illegal memory access")
    # export NCCL_P2P_DISABLE=1

    # 2. Disable InfiniBand (avoids crashes from trying to use server-grade networking)
    export NCCL_IB_DISABLE=1

    # 3. Force blocking launches (if it fails again, the traceback shows the exact line)
    export CUDA_LAUNCH_BLOCKING=1

    # Explicitly select the GPUs to use (0,1,2,3)
    export CUDA_VISIBLE_DEVICES=0,1,2,3

    # Tune memory allocation to reduce fragmentation
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

    # FLUX.1 model path (e.g. /root/autodl-tmp/flux1-dev.safetensors)
    MODEL_PATH="train/sd-models/flux1-dev.safetensors"

    # CLIP-L model path
    CLIP_PATH="train/sd-models/clip_l.safetensors"

    # T5-XXL model path
    T5_PATH="train/sd-models/t5xxl_fp16.safetensors"

    # AE model path
    AE_PATH="train/sd-models/flux-ae.safetensors"

    # Output folder
    OUTPUT_DIR="./output"

    # =================================================================
    # Launch command (arguments below are unchanged and reference the variables above)
    # =================================================================

    # Duplicate flags from the original listing are removed; where a flag was
    # given twice, the later value (the one that took effect) is kept.
    accelerate launch \
      --deepspeed_config_file "ds_config.json" \
      --use_deepspeed \
      --gpu_ids 0,1,2,3 \
      --mixed_precision bf16 \
      --num_processes 4 \
      --num_machines 1 \
      --num_cpu_threads_per_process 1 \
      --offload_optimizer_device cpu \
      --offload_param_device cpu \
      "sd-scripts/flux_train.py" \
      --config_file "dreambooth_flux_config.toml" \
      --optimizer_type="adafactor" \
      --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
      --cache_text_encoder_outputs \
      --cache_latents \
      --full_bf16 \
      --lowram \
      --gradient_checkpointing \
      --max_data_loader_n_workers 0 \
      --learning_rate 1e-5 \
      --cache_latents_to_disk \
      --cache_text_encoder_outputs_to_disk

ds_config.json

    {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": "auto",
        "steps_per_print": 1,
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": "/root/autodl-tmp/ds_cache",
                "pin_memory": true,
                "buffer_count": 5,
                "fast_init": false
            },
            "offload_param": {
                "device": "nvme",
                "nvme_path": "/root/autodl-tmp/ds_cache",
                "pin_memory": true,
                "buffer_count": 5,
                "buffer_size": 1e8,
                "max_in_cpu": 1e9
            },
            "overlap_comm": true,
            "contiguous_gradients": true,
            "sub_group_size": 1e9,
            "reduce_bucket_size": 5e7,
            "stage3_prefetch_bucket_size": 5e7,
            "stage3_param_persistence_threshold": 1e4,
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_16bit_weights_on_model_save": true
        },
        "gradient_clipping": 1.0,
        "bf16": {
            "enabled": true
        }
    }
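
In this config `train_batch_size` and `gradient_accumulation_steps` are left as `"auto"`, so they are filled in at launch time. DeepSpeed expects the global batch size to equal the per-GPU micro batch size times the gradient accumulation steps times the number of GPUs. The following Python sketch shows the arithmetic, assuming one accumulation step (the TOML below leaves `gradient_accumulation_steps` commented out) and the four GPUs used in this post. The function name is illustrative.

```python
def effective_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Global batch size implied by the three DeepSpeed batch settings."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


if __name__ == "__main__":
    # 1 sample per GPU, 1 accumulation step (assumed), 4 GPUs.
    print(effective_batch_size(1, 1, 4))  # 4
```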

dreambooth_flux_config.toml

ae = "train/sd-models/flux-ae.safetensors"
blocks_to_swap = 0
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_l = "/root/autodl-tmp/kohya_ss/train/sd-models/clip_l.safetensors"
cpu_offload_checkpointing = true
discrete_flow_shift = 3.1582
double_blocks_to_swap = 0
dynamo_backend = "no"
epoch = 50
fp8_base = true
full_bf16 = false
# gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 1
huber_c = 0.1
huber_scale = 1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 4e-6
learning_rate_te = 0
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 1024
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 500
max_token_length = 75
max_train_steps = 250
min_bucket_reso = 256
mixed_precision = "bf16"
model_prediction_type = "sigma_scaled"
multires_noise_discount = 0.3
no_token_padding = true
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
# optimizer_args = [ ]
optimizer_type = "Adafactor"
# optimizer_type = "AdamW8bit"
output_dir = "outputs"
output_name = "Quality_1"
persistent_data_loader_workers = 0
pretrained_model_name_or_path = "train/sd-models/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
# sample_prompts = "/outputs/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
seed = 1
single_blocks_to_swap = 0
t5xxl = "train/sd-models/t5xxl_fp16.safetensors"
t5xxl_max_token_length = 225
timestep_sampling = "sigmoid"
train_batch_size = 1
train_blocks = "all"
train_data_dir = "train/images"
wandb_run_name = "Quality_1"

1. OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`

  ```
  Traceback (most recent call last):
  File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\launch.py", line 102, in <module>
  interface.main()
  File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\interface.py", line 218, in main
  app, _, _ = interface.launch(
  File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\blocks.py", line 1907, in launch
  ) = networking.start_server(
  File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\networking.py", line 207, in start_server
  raise OSError
  OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`.
  ```
 - Port 28001 is already in use. Run the commands below to find the process holding it and kill that process, replacing 12345 with the PID printed by netstat
  - `netstat -ano | findstr :28001`
  - `taskkill /PID 12345 /F`
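
As an alternative to killing the process, the error message itself suggests choosing another port through `GRADIO_SERVER_PORT`. The following Python sketch looks for a free port with the standard `socket` module. The helper names and the scanned range are assumptions for illustration, not part of lora-scripts.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if `port` can currently be bound on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start, end, host="127.0.0.1"):
    """Return the first bindable port in [start, end], or None if all are taken."""
    for port in range(start, end + 1):
        if is_port_free(port, host):
            return port
    return None


if __name__ == "__main__":
    # Hypothetical range: scan a few ports after the default 28001.
    port = first_free_port(28001, 28010)
    print("free port:", port)
```

The value found this way could then be exported as `GRADIO_SERVER_PORT` before starting the tag editor.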

2. torch.OutOfMemoryError: CUDA out of memory.

The message blames GPU memory, but the real cause may be system RAM running out. Lower batch_size and try again.

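3. use_libuv = 0

Reference: Introduction to Libuv TCPStore Backend
Route 3 in that article notes that if use_libuv = 0 is set through the environment variable but the code assigns True, the value in the code still wins. So I hard-coded use_libuv to False in every file that raised the error.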
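
Besides editing the library files, the same article describes opting out in code by adding `use_libuv=0` to a `tcp://` init method. The following Python sketch builds such a URL. The helper name `disable_libuv` is hypothetical, and the address is a placeholder. The resulting string would be passed as `init_method` when the process group is initialized.

```python
def disable_libuv(init_method):
    """Append use_libuv=0 to a tcp:// init_method URL unless it is already set."""
    if "use_libuv=" in init_method:
        return init_method
    separator = "&" if "?" in init_method else "?"
    return f"{init_method}{separator}use_libuv=0"


if __name__ == "__main__":
    # Placeholder address and port, for illustration only.
    print(disable_libuv("tcp://127.0.0.1:29500"))
```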

4. `ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [1], output_device 1, and module parameters {device(type='cpu')}`

Add a parameter that disables swapping blocks to the CPU: `--blocks_to_swap 0` (or `blocks_to_swap = 0` in the TOML config). Paths used in this run:

- E:/lora training/lora-scripts-v1.8.5/sd-models/flux-ae.safetensors
- E:/lora training/lora-scripts-v1.8.5/sd-models/clip_l.safetensors
- E:/lora training/lora-scripts-v1.8.5/sd-models/t5xxl_fp16.safetensors
- E:/Kohya_FLUX_DreamBooth_v18/kohya_ss/train
- E:/lora training/lora-scripts-v1.8.5/sd-models/flux1-dev.safetensors

5. NotImplementedError: Cannot copy out of meta tensor; no data!

This happens because the model tensors were not initialized. The fix goes in the following file: /root/autodl-tmp/kohya_ss/sd-scripts/library/flux_utils.py

6. deepspeed OOM

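The OOM occurs because DeepSpeed loads data into both VRAM and system RAM when training Flux. Task: full fine-tuning of the Flux1-dev model. Machine configuration:

- PyTorch: 2.7.0
- Python: 3.12 (ubuntu22.04)
- CUDA: 12.8
- GPU: RTX 5090 (32GB) * 4
- CPU: 64 vCPU Intel(R) Xeon(R) Gold 6459C
- RAM: 360GB
- Disk: system disk 30 GB; data disk 50GB SSD (free) plus 440GB (paid)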

Solution: when VRAM or RAM is insufficient, use DeepSpeed's NVMe offload so that the data is stored on the data disk, which moves the memory pressure onto the disk. The configuration is the ds_config.json listed earlier under the full training parameters, where both `offload_optimizer` and `offload_param` use `"device": "nvme"` with `"nvme_path": "/root/autodl-tmp/ds_cache"`.
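
Because NVMe offload moves the pressure to disk, it is worth confirming that the offload directory exists and has free space before launching. The following Python sketch reads the paths out of a DeepSpeed config dictionary. The helper names are assumptions for illustration, and how much space is actually required depends on the model and optimizer.

```python
import shutil
from pathlib import Path


def nvme_offload_paths(ds_config):
    """Collect the nvme_path of every ZeRO offload section that targets NVMe."""
    zero = ds_config.get("zero_optimization", {})
    paths = set()
    for section in ("offload_optimizer", "offload_param"):
        cfg = zero.get(section, {})
        if cfg.get("device") == "nvme" and "nvme_path" in cfg:
            paths.add(cfg["nvme_path"])
    return paths


def free_gib(path):
    """Create `path` if needed and return the free space on its filesystem in GiB."""
    Path(path).mkdir(parents=True, exist_ok=True)
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    import json

    # Assumes ds_config.json sits in the current working directory.
    with open("ds_config.json") as fh:
        config = json.load(fh)
    for path in nvme_offload_paths(config):
        print(f"{path}: {free_gib(path):.1f} GiB free")
```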

7. `TypeError: adam_update(): incompatible function arguments.`

With DeepSpeed ZeRO stage 3, this function is called with an invalid eps value, which causes the error.

Modified code:

                beta1, beta2 = group['betas']

                # ===== Fix start: handle the eps tuple conflict =====
                # 1. Coerce eps: if Adafactor passed a tuple, fall back to Adam's default 1e-8
                eps_val = group['eps']
                if isinstance(eps_val, (tuple, list)):
                    # Adafactor's eps is (1e-30, 1e-3), which would make Adam divide by
                    # nearly zero or become unstable, so use Adam's standard default
                    eps_val = 1e-8
                else:
                    eps_val = float(eps_val)

                # 2. Coerce step to int (it may be a Tensor)
                step_val = state['step']
                if hasattr(step_val, 'item'):
                    step_val = int(step_val.item())
                else:
                    step_val = int(step_val)

                # 3. Coerce bias_correction to bool (it may be an int)
                bias_correction_val = bool(group['bias_correction'])

                # ===== DEBUG START: print the raw arguments before coercion =====
                print("\n" + "="*30 + " DEBUG ADAM UPDATE " + "="*30)
                try:
                    # Collect the arguments so they are easy to inspect
                    arg_list = [
                        ("0. opt_id (int)", self.opt_id),
                        ("1. step (int)", state['step']),
                        ("2. lr (float)", group['lr']),
                        ("3. beta1 (float)", beta1),
                        ("4. beta2 (float)", beta2),
                        ("5. eps (float)", group['eps']),
                        ("6. weight_decay (float)", group['weight_decay']),
                        ("7. bias_correction (bool)", group['bias_correction']),
                        ("8. param (Tensor)", p.data),
                        ("9. grad (Tensor)", p.grad.data),
                        ("10. exp_avg (Tensor)", state['exp_avg']),
                        ("11. exp_avg_sq (Tensor)", state['exp_avg_sq'])
                    ]

                    for name, val in arg_list:
                        if hasattr(val, 'shape'):  # Tensor
                            print(f"[{name}]: Type={type(val)}, Dtype={val.dtype}, Device={val.device}, Shape={val.shape}")
                        else:  # scalar
                            print(f"[{name}]: Type={type(val)}, Value={val}")

                except Exception as e:
                    print(f"DEBUG ERROR: {e}")
                print("="*80 + "\n")
                # ===== DEBUG END =====

                # Pass the coerced step_val, eps_val and bias_correction_val
                self.ds_opt_adam.adam_update(self.opt_id, step_val, group['lr'], beta1, beta2, eps_val,
                                             group['weight_decay'], bias_correction_val, p.data, p.grad.data,
                                             state['exp_avg'], state['exp_avg_sq'])

        return loss
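
The three coercions in the patch can be tried in isolation without DeepSpeed or PyTorch. The following standalone Python sketch mirrors them. The function `coerce_adam_scalars` and the `FakeTensor` stand-in are hypothetical names used only for this illustration.

```python
ADAM_DEFAULT_EPS = 1e-8


def coerce_adam_scalars(eps, step, bias_correction):
    """Turn optimizer state into the plain (float, int, bool) scalars Adam expects."""
    if isinstance(eps, (tuple, list)):
        # An Adafactor-style eps pair cannot be used by Adam; use Adam's default.
        eps = ADAM_DEFAULT_EPS
    else:
        eps = float(eps)
    if hasattr(step, "item"):
        # Tensor-like step counters expose .item() to get the Python number.
        step = step.item()
    return eps, int(step), bool(bias_correction)


class FakeTensor:
    """Stand-in for a 0-dim tensor, used only to exercise the .item() branch."""

    def __init__(self, value):
        self.value = value

    def item(self):
        return self.value


if __name__ == "__main__":
    print(coerce_adam_scalars((1e-30, 1e-3), FakeTensor(7), 1))
```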

Problem 0: ******, No Data!

/root/miniconda3/envs/kohyass/lib/python3.11/site-packages/transformers/modeling_utils.py

Line 2031: add the `enabled` parameter

    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config(), enabled=False), set_zero3_state()]

Reason: when the model is loaded from a local download, transformers creates meta tensors, but loading the checkpoint then finds empty meta tensors and raises the error. Passing `enabled=False` stops DeepSpeed from doing its default initialization.

https://github.com/zai-org/ChatGLM-6B/issues/530

mat1 and mat2 not equal …

kohya_ss/sd-scripts/library/flux_models.py

Line 1068: cast txt, img and the other inputs to one dtype. The error occurs because CLIP and T5 produce float32 while the rest of the model is set to bfloat16.

    def forward(
        self,
        img: Tensor,
        img_ids: Tensor,
        txt: Tensor,
        txt_ids: Tensor,
        timesteps: Tensor,
        y: Tensor,
        block_controlnet_hidden_states=None,
        block_controlnet_single_hidden_states=None,
        guidance: Tensor | None = None,
        txt_attention_mask: Tensor | None = None,
    ) -> Tensor:
        
        target_dtype = self.img_in.weight.dtype  # use this layer's weight dtype as the reference
        
        if img.dtype != target_dtype:
            img = img.to(target_dtype)
            
        if txt.dtype != target_dtype:
            txt = txt.to(target_dtype)
            
        if timesteps.dtype != target_dtype:
            timesteps = timesteps.to(target_dtype)
            
        if guidance is not None and guidance.dtype != target_dtype:
            guidance = guidance.to(target_dtype)
            
        if y is not None and y.dtype != target_dtype:
            y = y.to(target_dtype)

            
        if img.ndim != 3 or txt.ndim != 3:
            raise ValueError("Input img and txt tensors must have 3 dimensions.")
        # ==========================
        # ... the rest of the original forward() continues from here

deepspeed AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'backward'

This happens because DeepSpeed was not initialized. Add a print at the location below to check:

/root/autodl-tmp/kohya_ss/sd-scripts/library/deepspeed_utils.py Line 87

At kohya_ss/sd-scripts/library/deepspeed_utils.py Line 64, if DeepSpeed is not configured the function skips straight through and returns None

NCCL enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'

With `pip install "nvidia-nccl-cu12>2.26.2"`, this error appears on the RTX 5090. It does not affect training.
