LoRA Model Training Notes
lora-scripts
Summary
Notes on problems I ran into while fine-tuning models, plus the parameter configurations used. All GPUs were rented cloud instances.
- Multi-GPU training uses a modified kohya_ss framework with DeepSpeed ZeRO stage 3.
- Single-GPU training uses aki's lora-scripts toolkit.
Bash config
- `[model]` …
Common issues
Mirror site
- `export HF_ENDPOINT=https://hf-mirror.com`
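With the variable set, anything built on `huggingface_hub` downloads through the mirror. A minimal sketch (the repo id is just an illustration):

```bash
# Route all Hugging Face downloads through the mirror
export HF_ENDPOINT=https://hf-mirror.com

# Example download (any repo id works; this one is illustrative)
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./clip_l
```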
Missing module
- `ModuleNotFoundError: No module named 'bitsandbytes'` → `pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/` (quote the requirement so the shell does not treat `>=` as a redirect)
pip permission warning
- `WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv` — shown when running `pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/` as root; it is only a warning, not an error.
Folder permission / invalid distribution
- `WARNING: Ignoring invalid distribution -orch (/root/miniconda3/lib/python3.10/site-packages)` — delete the leftover folder named like `~orch` (or any similarly named one) from site-packages; a sketch follows below.
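A sketch of locating and removing the stray directory, using the site-packages path from the warning above:

```bash
# Leftover dirs from an interrupted pip upgrade start with "~"
ls -d /root/miniconda3/lib/python3.10/site-packages/~*

# Remove the half-deleted install that pip flags as "-orch"
rm -rf "/root/miniconda3/lib/python3.10/site-packages/~orch"
```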
xformers missing
- `No module named 'xformers'` under CUDA 12.8: `pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128`
torchvision problems
- Missing libpng/libjpeg, or "need to build torchvision before ***": `pip3 install torchvision --index-url https://download.pytorch.org/whl/cu128`
CUDA 12.8 installation
- Download: `wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run`
- Silent install: `sh cuda_12.8.0_570.86.10_linux.run --toolkit --toolkitpath=/root/autodl-tmp/cuda-12.8 --silent`
- Update the environment variables (a verification sketch follows the list):
- `echo 'export PATH=/root/autodl-tmp/cuda-12.8/bin:$PATH' >> ~/.bashrc`
- `echo 'export LD_LIBRARY_PATH=/root/autodl-tmp/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc`
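After appending the two lines, reload the shell config and confirm the toolkit being found is the new one:

```bash
# Apply the new PATH/LD_LIBRARY_PATH in the current shell
source ~/.bashrc

# Both should point at /root/autodl-tmp/cuda-12.8
which nvcc
nvcc --version
```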
xformers runtime error
`CUDA error (/__w/xformers/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device`
- First check CUDA: the version from `nvcc --version` must match the CUDA version shown in `nvidia-smi`; if they differ, install a matching toolkit as described above.
- Run `conda list` and check that the PyTorch and xformers versions agree. If torch 2.7.1 is installed but xformers supports at most 2.7.0, downgrade first: `pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128`. If the error persists after that, the problem really is xformers.
- Run `python -m xformers.info` and look at the `build.env` entries: on a compute-capability-12 card, the downloaded wheel may only be built for architectures up to 9.0:
  `build.env.TORCH_CUDA_ARCH_LIST: 6.0+PTX 7.0 7.5 8.0+PTX 9.0a`
- Confirm the GPU's compute capability with `nvidia-smi --query-gpu=compute_cap --format=csv` (it prints `12.0` here), then set it for the current shell only: `export TORCH_CUDA_ARCH_LIST="12.0"`
- Download the source and build it; a mirror site speeds up the clone:
  `pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers`
  or, via the ghfast.top mirror:
  `pip install -v --no-build-isolation -U git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers`
- After installation, run `python -m xformers.info` again and check the build's arch list; if it matches your card's compute capability, the fix is in place. The full flow is collected in the sketch below.
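The whole source-build flow in one place; a sketch assuming a compute-capability-12.0 card and the cu128 toolchain from above:

```bash
# 1. Confirm toolkit and driver agree on the CUDA version
nvcc --version
nvidia-smi

# 2. Query the card's compute capability (prints e.g. 12.0)
nvidia-smi --query-gpu=compute_cap --format=csv

# 3. Target that architecture for this shell session only
export TORCH_CUDA_ARCH_LIST="12.0"

# 4. Build xformers from source (ghfast.top mirror variant shown)
pip install -v --no-build-isolation -U \
  "git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers"

# 5. Verify the build now lists the right arch
python -m xformers.info | grep TORCH_CUDA_ARCH_LIST
```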
xformers error: the new version is incompatible with the old one, and downloads are painfully slow
- Double-clicking install-cn.ps1 fails to create the virtual environment; switch to install.ps1 instead. The network bottleneck is mainly the torch download, and the CN variant does not noticeably improve it.
- Installation
- If the network is too slow and torch needs to be reinstalled, follow these steps:
- Open install.ps1 and run its commands by hand:
- `python.exe -m venv venv` to create the virtual environment
- Activate it: `.\venv\Scripts\activate`
- Find your CUDA version with `nvidia-smi`, then manually download the matching `.whl` from the [torch site](https://pytorch.org/get-started/locally/)
- ![[Pasted image 20251117091728.png]]
- On the xformers page, paste the install command's URL into the browser and download manually; my card is CUDA 12.8, so the cu128 command applies:
- `pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128`
- Put the two downloaded wheels into the lora-scripts folder ![[Pasted image 20251117092155.png]]
- Install the two wheels by hand, torch first, then xformers:
- `pip install .\torch-2.7.0+cu128-cp310-cp310-win_amd64.whl`
- `pip install .\xformers-0.0.30-cp310-cp310-win_amd64.whl`
- Finally, update the environment file referenced in the .ps1 script and everything runs ![[Pasted image 20251117092327.png]]
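A quick sanity check after the manual install (the same commands work in PowerShell):

```bash
# torch should report 2.7.0+cu128 and see the GPU
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

# xformers should import cleanly and list its build info
python -m xformers.info
```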
Flux training errors out asking to download google/t5-xxl
- [https://blog.csdn.net/sinat_29957455/article/details/142782264](https://blog.csdn.net/sinat_29957455/article/details/142782264)
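The linked article walks through the fix. If the machine cannot reach huggingface.co at all, pre-fetching through the hf-mirror endpoint from earlier should also work; a sketch, assuming the tokenizer repo being requested is `google/t5-v1_1-xxl`:

```bash
export HF_ENDPOINT=https://hf-mirror.com

# Pre-download the T5 tokenizer files into the local HF cache
huggingface-cli download google/t5-v1_1-xxl --include "*.json" "*.model"
```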
Multi-GPU training issues
Full training parameters
- train_flux.sh
```bash
#!/bin/bash
# =================================================================
# Configuration (edit these to the actual paths on your machine)
# =================================================================
# 1. Disable P2P direct access (the key fix for "Illegal memory access")
# export NCCL_P2P_DISABLE=1
# 2. Disable InfiniBand (prevents crashes from trying to use server-grade networking)
export NCCL_IB_DISABLE=1
# 3. Force blocking launches (if it crashes again, the trace shows the exact line)
export CUDA_LAUNCH_BLOCKING=1
# Explicitly select the GPUs to use (0,1,2,3)
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Optimize memory allocation to avoid fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# FLUX.1 model path (e.g. /root/autodl-tmp/flux1-dev.safetensors)
MODEL_PATH="train/sd-models/flux1-dev.safetensors"
# CLIP-L model path
CLIP_PATH="train/sd-models/clip_l.safetensors"
# T5-XXL model path
T5_PATH="train/sd-models/t5xxl_fp16.safetensors"
# AE model path
AE_PATH="train/sd-models/flux-ae.safetensors"
# Output directory
OUTPUT_DIR="./output"
# =================================================================
# Launch command (parameters below reference the variables above)
# =================================================================
accelerate launch \
  --deepspeed_config_file "ds_config.json" \
  --use_deepspeed \
  --gpu_ids 0,1,2,3 \
  --mixed_precision bf16 \
  --num_processes 4 \
  --num_machines 1 \
  --num_cpu_threads_per_process 1 \
  --offload_optimizer_device cpu \
  --offload_param_device cpu \
  "sd-scripts/flux_train.py" \
  --config_file "dreambooth_flux_config.toml" \
  --optimizer_type="adafactor" \
  --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
  --cache_text_encoder_outputs \
  --cache_latents \
  --full_bf16 \
  --lowram \
  --gradient_checkpointing \
  --max_data_loader_n_workers 0 \
  --learning_rate 1e-5 \
  --cache_latents_to_disk \
  --cache_text_encoder_outputs_to_disk
```
- ds_config.json
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/root/autodl-tmp/ds_cache",
      "pin_memory": true,
      "buffer_count": 5,
      "fast_init": false
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/root/autodl-tmp/ds_cache",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e7,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_clipping": 1.0,
  "bf16": {
    "enabled": true
  }
}
```
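`"device": "nvme"` offload depends on DeepSpeed's async I/O extension (which in turn needs the system libaio package); worth confirming before launch:

```bash
# async_io must report as compatible for NVMe offload to work
ds_report | grep -i async_io
```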
### dreambooth_flux_config.toml

```toml
ae = "train/sd-models/flux-ae.safetensors"
blocks_to_swap = 0
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_l = "/root/autodl-tmp/kohya_ss/train/sd-models/clip_l.safetensors"
cpu_offload_checkpointing = true
discrete_flow_shift = 3.1582
double_blocks_to_swap = 0
dynamo_backend = "no"
epoch = 50
fp8_base = true
full_bf16 = false
# gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 1
huber_c = 0.1
huber_scale = 1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 4e-6
learning_rate_te = 0
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 1024
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 500
max_token_length = 75
max_train_steps = 250
min_bucket_reso = 256
mixed_precision = "bf16"
model_prediction_type = "sigma_scaled"
multires_noise_discount = 0.3
no_token_padding = true
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
# optimizer_args = [ ]
optimizer_type = "Adafactor"
# optimizer_type = "AdamW8bit"
output_dir = "outputs"
output_name = "Quality_1"
persistent_data_loader_workers = 0
pretrained_model_name_or_path = "train/sd-models/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
# sample_prompts = "/outputs/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
seed = 1
single_blocks_to_swap = 0
t5xxl = "train/sd-models/t5xxl_fp16.safetensors"
t5xxl_max_token_length = 225
timestep_sampling = "sigmoid"
train_batch_size = 1
train_blocks = "all"
train_data_dir = "train/images"
wandb_run_name = "Quality_1"
```

1. `OSError: Cannot find empty port in range: 28001-28001`
```
Traceback (most recent call last):
  File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\launch.py", line 102, in <module>
    interface.main()
  File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\interface.py", line 218, in main
    app, _, _ = interface.launch(
  File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\blocks.py", line 1907, in launch
    ) = networking.start_server(
  File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\networking.py", line 207, in start_server
    raise OSError
OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`.
```
- Fix: find and kill the process occupying the port (replace `12345` with the PID reported by `netstat`):
- `netstat -ano | findstr :28001`
- `taskkill /PID 12345 /F`
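Alternatively, as the error text itself suggests, point Gradio at a free port before launching; in bash (for PowerShell, use `$env:GRADIO_SERVER_PORT = "28002"`):

```bash
# Pick any free port instead of the default 28001
export GRADIO_SERVER_PORT=28002
```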
2. `torch.OutOfMemoryError: CUDA out of memory.`
- The error blames GPU memory, but the real cause can be system RAM overflowing; reduce `train_batch_size` and retry.
3. `use_libuv = 0`
- Reference: Introduction to the Libuv TCPStore Backend
- Route 3 there points out that if the environment sets `USE_LIBUV=0` but the code passes `use_libuv=True`, the code value wins; so I hard-coded `use_libuv=False` in every file that raised the error.
4. `ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [1], output_device 1, and module parameters {device(type='cpu')}`
- Add a parameter to stop swapping blocks to the CPU:
- `blocks_to_swap = 0`
Paths used on the Windows machine:
- `E:/lora training/lora-scripts-v1.8.5/sd-models/flux-ae.safetensors`
- `E:/lora training/lora-scripts-v1.8.5/sd-models/clip_l.safetensors`
- `E:/lora training/lora-scripts-v1.8.5/sd-models/t5xxl_fp16.safetensors`
- `E:/Kohya_FLUX_DreamBooth_v18/kohya_ss/train`
- `E:/lora training/lora-scripts-v1.8.5/sd-models/flux1-dev.safetensors`
5. `NotImplementedError: Cannot copy out of meta tensor; no data!`
- Caused by model tensors never being initialized; fix it by editing `/root/autodl-tmp/kohya_ss/sd-scripts/library/flux_utils.py`.
6. DeepSpeed OOM
- Cause: when training FLUX, DeepSpeed stages data in both VRAM and system RAM, and running out of either triggers the OOM.
- Training run: full fine-tune of the Flux1-dev model
- Environment: PyTorch 2.7.0 …
- Fix: with VRAM and RAM both short, use DeepSpeed's NVMe offload to store the data on the data disk, shifting the memory pressure onto the drive; the `offload_optimizer` / `offload_param` blocks with `"device": "nvme"` in the `ds_config.json` above show the settings.
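DeepSpeed writes its swap files into `nvme_path`, so the directory from `ds_config.json` must exist on the data disk before launch:

```bash
# Create the NVMe offload cache dir used by ds_config.json
mkdir -p /root/autodl-tmp/ds_cache
```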
7. `TypeError: adam_update(): incompatible function arguments.`
- Caused by DeepSpeed ZeRO stage 3 passing an `eps` value of the wrong type into this function.
- Patch the code near:
- `beta1, beta2 = group['betas']` …
Problem 0: ****** , No Data!
- `/root/miniconda3/envs/kohyass/lib/python3.11/site-packages/transformers/modeling_utils.py`, line 2031: add the `enabled` parameter:
- `init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config(), enabled=False), set_zero3_state()]`
- Reason: with a locally downloaded model, transformers uses meta tensors; loading the checkpoint then hits empty meta tensors and fails. Disabling DeepSpeed's default initialization avoids this.
`mat1 and mat2 must have the same dtype` …
- `kohya_ss/sd-scripts/library/flux_models.py`, line 1068: add explicit dtypes for the `txt` and `img` tensors. CLIP and T5 produce float32 while the rest of the pipeline is set to bfloat16, which causes the mismatch.
- `def forward(` …
DeepSpeed `AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'backward'`
- DeepSpeed was never initialized; add a `print` at `/root/autodl-tmp/kohya_ss/sd-scripts/library/deepspeed_utils.py`, line 87 to check.
- In `kohya_ss/sd-scripts/library/deepspeed_utils.py`, line 64, when DeepSpeed is not configured the function skips ahead and simply returns None.
`NCCL enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'`
- Appears on the 5090; install a newer NCCL with `pip install "nvidia-nccl-cu12>2.26.2"` (quote it so the shell does not redirect). The warning does not affect training.
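To confirm which NCCL version torch actually reports after the upgrade, a quick probe:

```bash
# Print the NCCL version torch loads at runtime
python -c "import torch; print(torch.cuda.nccl.version())"
```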