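Summary
These notes record the problems I ran into while fine-tuning models, along with some of the parameter configurations I used. The GPUs were rented in the cloud.
- Multi-GPU training uses a modified kohya_ss framework, with DeepSpeed ZeRO stage 3 for the multi-GPU part.
- Single-GPU training uses aki's toolkit (lora-scripts).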
bash config
[model]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "./sd-models/realismEngineSDXL_v30VAE.safetensors"
vae = "./sd-models/sdxl_vae.safetensors"
[dataset]
train_data_dir = "./train/001"
reg_data_dir = ""
prior_loss_weight = 1
cache_latents = true
shuffle_caption = true
enable_bucket = true
[additional_network]
network_dim = 32
network_alpha = 16
network_train_unet_only = false
network_train_text_encoder_only = false
network_module = "networks.lora"
network_args = []
[optimizer]
unet_lr = 1e-4
text_encoder_lr = 1e-5
optimizer_type = "AdamW8bit"
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 0
lr_restart_cycles = 1
[training]
resolution = "512,512"
train_batch_size = 1
max_train_epochs = 10
noise_offset = 0.0
keep_tokens = 0
xformers = true
lowram = false
clip_skip = 2
mixed_precision = "fp16"
save_precision = "fp16"
[sample_prompt]
sample_sampler = "euler_a"
sample_every_n_epochs = 1
[saving]
output_name = "xtgz-centos-sdxl"
save_every_n_epochs = 2
save_n_epoch_ratio = 0
save_last_n_epochs = 499
save_state = false
save_model_as = "safetensors"
output_dir = "./output"
logging_dir = "./logs"
log_prefix = "output_name"
[others]
min_bucket_reso = 256
max_bucket_reso = 1024
caption_extension = ".txt"
max_token_length = 225
seed = 1337
Common problems
Mirror site address
- export HF_ENDPOINT=https://hf-mirror.com
Missing module
- ModuleNotFoundError: No module named 'bitsandbytes': `pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/` (the quotes keep the shell from treating `>=` as a redirect)
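To confirm afterwards that the installed version really satisfies the `>=0.43.0` requirement, a small check like the following can help. This is only a sketch: `version_tuple`, `at_least`, and `meets_minimum` are hypothetical helpers written for this note, and the comparison looks at the leading numeric part of the version only.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '0.43.0' or '2.7.0+cu128' into a tuple of its leading numeric parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a numeric version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def meets_minimum(package: str, minimum: str):
    """Return True/False for an installed package, or None if it is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return at_least(installed, minimum)
```

Calling `meets_minimum("bitsandbytes", "0.43.0")` inside the training environment returns `None` when the package is still missing, which is the situation that produces the `ModuleNotFoundError` above.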
pip permission problem
- WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv `pip install "bitsandbytes>=0.43.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/`
Folder permission problem
- WARNING: Ignoring invalid distribution -orch (/root/miniconda3/lib/python3.10/site-packages): delete the leftover `~orch` folder (or any other folder named in the same style) from that site-packages directory.
xformers problem
- No module named 'xformers' with CUDA 12.8: `pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128`
torchvision problem
- libpng / libjpeg not found, or a message that torchvision needs to be built before ***: `pip3 install torchvision --index-url https://download.pytorch.org/whl/cu128`
CUDA 12.8 installation
- Download: `wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run`
- Silent install: `sh cuda_12.8.0_570.86.10_linux.run --toolkit --toolkitpath=/root/autodl-tmp/cuda-12.8 --silent`
- Update the environment variables
- `echo 'export PATH=/root/autodl-tmp/cuda-12.8/bin:$PATH' >> ~/.bashrc`
- `echo 'export LD_LIBRARY_PATH=/root/autodl-tmp/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc`
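After opening a new shell (or running `source ~/.bashrc`), it is worth confirming that the `nvcc` found on `PATH` is the newly installed one. The sketch below does that from Python. The function names are made up for this note, and the parser assumes the `nvcc --version` output contains a `release X.Y` fragment.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output: str):
    """Extract (major, minor) from nvcc --version output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_nvcc_release():
    """Run the nvcc found on PATH and return its release, or None if nvcc is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)


# Illustrative line in the style of nvcc's output (not captured from a real run).
sample = "Cuda compilation tools, release 12.8, V12.8.61"
print(parse_nvcc_release(sample))
```

If `installed_nvcc_release()` returns `None` or an older release, the two `export` lines have probably not taken effect in the current shell yet.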
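xformers error
CUDA error (/__w/xformers/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device Traceback (most recent call last):
- To fix this, first check CUDA:
- The version reported by `nvcc --version` should match the CUDA version shown by `nvidia-smi`.
- If they differ, install the new version using the CUDA installation steps above.
- Use `conda list` to check that the PyTorch and xformers versions match. If torch 2.7.1 is installed but xformers supports at most 2.7.0, downgrade first with `pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128`
- If the error persists, the problem really is in xformers.
- Run `python -m xformers.info` and look at the build.env entries. A GPU with compute capability 12 may have downloaded a package whose supported compute capability tops out at 9.0:
- `build.env.TORCH_CUDA_ARCH_LIST: 6.0+PTX 7.0 7.5 8.0+PTX 9.0a`
- Confirm the compute capability of the current GPU: `nvidia-smi --query-gpu=compute_cap --format=csv` (it reported 12.0 here).
- Set the compute capability in the environment (this only applies to the current shell session): `export TORCH_CUDA_ARCH_LIST="12.0"`
- Download the source and compile it. A mirror site can speed up the download: `pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers` -> `pip install -v --no-build-isolation -U git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers`
- After the installation finishes, run `python -m xformers.info` again and check the xformers build version. If it matches your compute capability, the build is correct.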
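The comparison between the build's arch list and the GPU's compute capability can be written down as a tiny check. This is a rough sketch with made-up helper names. It only looks for an exact native match and deliberately ignores the fact that `+PTX` entries can sometimes be JIT-compiled for newer GPUs.

```python
def native_archs(arch_list: str) -> set[str]:
    """Parse e.g. '6.0+PTX 7.0 9.0a' into {'6.0', '7.0', '9.0'}."""
    archs = set()
    for entry in arch_list.replace(";", " ").split():
        base = entry.split("+")[0]   # drop a trailing '+PTX'
        archs.add(base.rstrip("a"))  # drop the arch-specific 'a' suffix, as in 9.0a
    return archs


def has_native_kernels(arch_list: str, compute_cap: str) -> bool:
    """True when the build lists exactly this compute capability."""
    return compute_cap in native_archs(arch_list)


prebuilt = "6.0+PTX 7.0 7.5 8.0+PTX 9.0a"   # value reported by xformers.info above
print(has_native_kernels(prebuilt, "12.0"))  # the prebuilt wheel vs. a 12.0 GPU
print(has_native_kernels("12.0", "12.0"))    # after rebuilding with TORCH_CUDA_ARCH_LIST="12.0"
```

The first call returning `False` corresponds to the "no kernel image is available" situation. After rebuilding from source, the arch list reported by `python -m xformers.info` should contain the GPU's own value.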
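xformers error: the new version is incompatible with the old one, and downloads are slow on the local machine
- Running install-cn.ps1 directly can fail to create the virtual environment. Switching to install.ps1 works. The network problem shows up mainly during the torch installation, and the China-specific script gives no noticeable improvement.
- Installation
- If the network is too slow and torch has to be reinstalled, follow the steps below.
- Open install.ps1 and copy the following commands by hand
- `python.exe -m venv venv` creates the virtual environment
- Activate the virtual environment: `.\venv\Scripts\activate`
- Use `nvidia-smi` to find your CUDA version, then go to the [torch website](https://pytorch.org/get-started/locally/) and download the matching `.whl` file manually
- ![[Pasted image 20251117091728.png]]
- Go to the xformers website, find the install command, and paste the package URL into a browser to download it manually. I am on CUDA 12.8, so I used the cu128 install command
- `pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128`
- Put the two downloaded packages in the lora-scripts folder ![[Pasted image 20251117092155.png]]
- Install the two packages manually, torch first and then xformers
- `pip install .\torch-2.7.0+cu128-cp310-cp310-win_amd64.whl`
- `pip install .\xformers-0.0.30-cp310-cp310-win_amd64.whl`
- Finally, update the environment file in the ps script, and everything works ![[Pasted image 20251117092327.png]]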
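When downloading wheels by hand it is easy to grab one built for the wrong Python version or platform. The sketch below splits a wheel filename into its standard tags so they can be compared with the interpreter inside the venv. `WheelTags`, `parse_wheel_filename`, and `current_python_tag` are hypothetical names for this note.

```python
import sys
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split '<name>-<version>[-<build>]-<python>-<abi>-<platform>.whl' into its parts."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 parts when an optional build tag is present
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(parts[0], parts[1], python_tag, abi_tag, platform_tag)


def current_python_tag() -> str:
    """Return the CPython tag of the running interpreter, for example 'cp310'."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


tags = parse_wheel_filename("torch-2.7.0+cu128-cp310-cp310-win_amd64.whl")
print(tags.version, tags.python_tag, tags.platform_tag)
print("matches this interpreter:", tags.python_tag == current_python_tag())
```

Both wheels used above carry the `cp310` tag, so they only install into a Python 3.10 virtual environment, and the `+cu128` part of the torch version has to agree with the CUDA build chosen for xformers.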
Flux training fails with an error saying google/t5-xxl needs to be downloaded
- [https://blog.csdn.net/sinat_29957455/article/details/142782264](https://blog.csdn.net/sinat_29957455/article/details/142782264)
Summary of multi-GPU training problems
Complete training parameters
tran_flux.sh
#!/bin/bash
# =================================================================
# Configuration (change these to the actual paths on your machine)
# =================================================================
# 1. Disable direct P2P access (the key fix for "Illegal memory access")
# export NCCL_P2P_DISABLE=1
# 2. Disable InfiniBand (avoids crashes from trying to use server-grade networking)
export NCCL_IB_DISABLE=1
# 3. Force blocking kernel launches (if the error recurs, the traceback shows exactly which line failed)
export CUDA_LAUNCH_BLOCKING=1
# Explicitly select the GPUs to use (0,1,2,3)
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Tune memory allocation to reduce fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# FLUX.1 model path (for example /root/autodl-tmp/flux1-dev.safetensors)
MODEL_PATH="train/sd-models/flux1-dev.safetensors"
# CLIP-L model path
CLIP_PATH="train/sd-models/clip_l.safetensors"
# T5-XXL model path
T5_PATH="train/sd-models/t5xxl_fp16.safetensors"
# AE model path
AE_PATH="train/sd-models/flux-ae.safetensors"
# Output directory
OUTPUT_DIR="./output"
# =================================================================
# Run command (note: as written, the command below does not reference the
# variables above; the model paths are taken from the TOML config file)
# =================================================================
accelerate launch \
--deepspeed_config_file "ds_config.json" \
--use_deepspeed \
--num_cpu_threads_per_process 8 \
--gpu_ids 0,1,2,3 \
--mixed_precision bf16 \
--num_processes 4 \
--num_machines 1 \
--num_cpu_threads_per_process 1 \
--offload_optimizer_device cpu \
--offload_param_device cpu \
"sd-scripts/flux_train.py" \
--config_file "dreambooth_flux_config.toml" \
--optimizer_type="adafactor" \
--optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
--cache_text_encoder_outputs \
--cache_latents \
--full_bf16 \
--lowram \
--gradient_checkpointing \
--max_data_loader_n_workers 0 \
--learning_rate 1e-5 \
--cache_latents_to_disk \
--cache_text_encoder_outputs_to_disk
ds_config.json
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": "auto",
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/root/autodl-tmp/ds_cache",
"pin_memory": true,
"buffer_count": 5,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/root/autodl-tmp/ds_cache",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8,
"max_in_cpu": 1e9
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": 5e7,
"stage3_prefetch_bucket_size": 5e7,
"stage3_param_persistence_threshold": 1e4,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_clipping": 1.0,
"bf16": {
"enabled": true
}
}
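The `"auto"` entries are filled in by the launcher, but the values still have to satisfy DeepSpeed's rule that the global batch size equals micro batch per GPU times gradient accumulation steps times the number of GPUs. A one-line sketch of that arithmetic for this setup (the function name is made up for this note):

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Global batch size implied by a DeepSpeed configuration."""
    return micro_batch_per_gpu * gradient_accumulation_steps * num_gpus


# train_micro_batch_size_per_gpu = 1, no gradient accumulation, 4 GPUs
print(effective_batch_size(1, 1, 4))
```

So with four GPUs each optimizer step already sees four samples, even though `train_batch_size = 1` is set per device.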
dreambooth_flux_config.toml
ae = "train/sd-models/flux-ae.safetensors"
blocks_to_swap = 0
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_l = "/root/autodl-tmp/kohya_ss/train/sd-models/clip_l.safetensors"
cpu_offload_checkpointing = true
discrete_flow_shift = 3.1582
double_blocks_to_swap = 0
dynamo_backend = "no"
epoch = 50
fp8_base = true
full_bf16 = false
# gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 1
huber_c = 0.1
huber_scale = 1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 4e-6
learning_rate_te = 0
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 1024
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 500
max_token_length = 75
max_train_steps = 250
min_bucket_reso = 256
mixed_precision = "bf16"
model_prediction_type = "sigma_scaled"
multires_noise_discount = 0.3
no_token_padding = true
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
# optimizer_args = [ ]
optimizer_type = "Adafactor"
# optimizer_type = "AdamW8bit"
output_dir = "outputs"
output_name = "Quality_1"
persistent_data_loader_workers = 0
pretrained_model_name_or_path = "train/sd-models/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
# sample_prompts = "/outputs/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
seed = 1
single_blocks_to_swap = 0
t5xxl = "train/sd-models/t5xxl_fp16.safetensors"
t5xxl_max_token_length = 225
timestep_sampling = "sigmoid"
train_batch_size = 1
train_blocks = "all"
train_data_dir = "train/images"
wandb_run_name = "Quality_1"
1. OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the server_port parameter to launch()
```
File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\launch.py", line 102, in <module>
interface.main()
File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\interface.py", line 218, in main
app, _, _ = interface.launch(
File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\blocks.py", line 1907, in launch
) = networking.start_server(
File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\networking.py", line 207, in start_server
raise OSError
OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`.
```
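- Run the following commands to find the process occupying the port and kill it (replace 12345 with the PID shown in the last column of the netstat output)
- `netstat -ano | findstr :28001`
- `taskkill /PID 12345 /F`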
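Before killing anything, the port can also be probed from Python. This is a minimal sketch using only the standard library; `is_port_free` is a name made up for this note, and it only tests the loopback address.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; a failure means another process already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


print("port 28001 free:", is_port_free(28001))
```

If the port stays busy and the owning process should not be killed, the error message itself points at the alternative: set the `GRADIO_SERVER_PORT` environment variable to a different port.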
2. torch.OutOfMemoryError: CUDA out of memory.
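- The error reports GPU memory, but the real cause may be system RAM running out. Lower batch_size and try again.
3. use_libuv = 0
- Reference article: Introduction to Libuv TCPStore Backend
- Route 3 in that article notes that if use_libuv = 0 is set in the environment but the code assigns True, the value in the code still takes effect. So I later hard-coded use_libuv to False in every file that raised the error.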
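The precedence described here can be pictured with a toy resolver. This is only an illustration of the rule, not PyTorch's actual code: `resolve_use_libuv` is a made-up function, and the `USE_LIBUV` variable name and the default of `True` are assumptions based on my reading of the referenced article.

```python
import os


def resolve_use_libuv(explicit=None, environ=None) -> bool:
    """An explicit value in code wins; otherwise fall back to the environment."""
    if explicit is not None:
        return explicit
    env = os.environ if environ is None else environ
    return env.get("USE_LIBUV", "1") != "0"


# The environment says 0, but the code passes True: the code wins.
print(resolve_use_libuv(True, {"USE_LIBUV": "0"}))
# Nothing is passed in code: the environment value is used.
print(resolve_use_libuv(None, {"USE_LIBUV": "0"}))
```

This is why setting the environment variable alone did not help, and why pinning the value to `False` at each call site did.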
4. ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [1], output_device 1, and module parameters {device(type='cpu')}
- Add a parameter so that blocks are no longer placed on the CPU: `--blocks_to_swap 0` (or `blocks_to_swap = 0` in the TOML config)
E:/lora training/lora-scripts-v1.8.5/sd-models/flux-ae.safetensors
E:/lora training/lora-scripts-v1.8.5/sd-models/clip_l.safetensors
E:/lora training/lora-scripts-v1.8.5/sd-models/t5xxl_fp16.safetensors
E:/Kohya_FLUX_DreamBooth_v18/kohya_ss/train
E:/lora training/lora-scripts-v1.8.5/sd-models/flux1-dev.safetensors
5. NotImplementedError: Cannot copy out of meta tensor; no data!
This happens because the model tensors have not been initialized. The fix goes in the file at this path: /root/autodl-tmp/kohya_ss/sd-scripts/library/flux_utils.py
6. deepspeed OOM
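Cause: when training Flux, DeepSpeed loads data into both GPU memory and system RAM, which leads to OOM.
Model being trained: full fine-tuning of the Flux1-dev model.
Configuration:
PyTorch 2.7.0
Python 3.12 (ubuntu22.04)
CUDA 12.8
GPU: RTX 5090 (32GB) * 4
CPU: 64 vCPU Intel(R) Xeon(R) Gold 6459C
Memory: 360GB
Disk:
System disk: 30 GB
Data disk: free 50GB SSD, paid 440GB
Solution: when GPU memory or system RAM is insufficient, use DeepSpeed's NVMe offload to keep the data on the data disk, moving the memory pressure onto disk.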
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": "auto",
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/root/autodl-tmp/ds_cache",
"pin_memory": true,
"buffer_count": 5,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/root/autodl-tmp/ds_cache",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8,
"max_in_cpu": 1e9
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": 5e7,
"stage3_prefetch_bucket_size": 5e7,
"stage3_param_persistence_threshold": 1e4,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_clipping": 1.0,
"bf16": {
"enabled": true
}
}
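To get a feel for why offloading is needed at all, here is a back-of-the-envelope estimate. It is a rough sketch under stated assumptions: about 12 billion parameters for Flux1-dev, bf16 weights and gradients, and fp32 master weights plus two Adam moments (16 bytes per parameter in total). Activations, text encoders, and buffers are ignored, and the helper names are made up for this note.

```python
import shutil

# Assumed bytes per parameter for mixed-precision Adam-style training.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_grads": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def training_state_gb(num_params: int) -> float:
    """Approximate size of weights, gradients, and optimizer state in GB (1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


def free_disk_gb(path: str) -> float:
    """Free space in GB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


print(training_state_gb(12_000_000_000))  # assumed parameter count for Flux1-dev
print(free_disk_gb("."))                  # point this at nvme_path on the real machine
```

Under these assumptions the training state alone is far larger than the 4 x 32GB of GPU memory, so it has to spill to RAM or disk. Checking free space on `nvme_path` before launching avoids finding out mid-run that the data disk is too small.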
7. TypeError: adam_update(): incompatible function arguments.
The error is caused by a wrong eps value being passed to this function under DeepSpeed ZeRO stage 3.
Code change:
beta1, beta2 = group['betas']
# ================= Fix start: handle the eps tuple conflict =================
# 1. Normalize eps: if Adafactor passed a tuple, replace it with Adam's default 1e-8
eps_val = group['eps']
if isinstance(eps_val, (tuple, list)):
    # Adafactor's eps is (1e-30, 1e-3), which would make Adam divide by nearly zero or become unstable,
    # so when a tuple is found, use Adam's standard default 1e-8
    eps_val = 1e-8
else:
    eps_val = float(eps_val)
# 2. Normalize step (guard against a Tensor)
step_val = state['step']
if hasattr(step_val, 'item'):
    step_val = int(step_val.item())
else:
    step_val = int(step_val)
# 3. Normalize bias_correction (guard against an int)
bias_correction_val = bool(group['bias_correction'])
# ================= DEBUG START =================
print("\n" + "="*30 + " DEBUG ADAM UPDATE " + "="*30)
try:
    # Collect the raw arguments so they are easy to inspect
    arg_list = [
        ("0. opt_id (int)", self.opt_id),
        ("1. step (int)", state['step']),
        ("2. lr (float)", group['lr']),
        ("3. beta1 (float)", beta1),
        ("4. beta2 (float)", beta2),
        ("5. eps (float)", group['eps']),
        ("6. weight_decay (float)", group['weight_decay']),
        ("7. bias_correction (bool)", group['bias_correction']),
        ("8. param (Tensor)", p.data),
        ("9. grad (Tensor)", p.grad.data),
        ("10. exp_avg (Tensor)", state['exp_avg']),
        ("11. exp_avg_sq (Tensor)", state['exp_avg_sq'])
    ]
    for name, val in arg_list:
        if hasattr(val, 'shape'):  # Tensor
            print(f"[{name}]: Type={type(val)}, Dtype={val.dtype}, Device={val.device}, Shape={val.shape}")
        else:  # scalar
            print(f"[{name}]: Type={type(val)}, Value={val}")
except Exception as e:
    print(f"DEBUG ERROR: {e}")
print("="*80 + "\n")
# ================= DEBUG END =================
# Pass the normalized values (step_val, eps_val, bias_correction_val) to the C++ kernel
self.ds_opt_adam.adam_update(self.opt_id, step_val, group['lr'], beta1, beta2, eps_val,
                             group['weight_decay'], bias_correction_val, p.data, p.grad.data,
                             state['exp_avg'], state['exp_avg_sq'])
return loss
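The three coercions in the patch above can be pulled out into a small standalone helper, which makes them easy to test without DeepSpeed installed. This is a sketch, not DeepSpeed code: `normalize_adam_args` and `ADAM_DEFAULT_EPS` are names made up for this note.

```python
ADAM_DEFAULT_EPS = 1e-8


def normalize_adam_args(group: dict, state: dict) -> tuple:
    """Return (step, eps, bias_correction) as plain int, float, and bool."""
    eps = group["eps"]
    # Adafactor stores eps as a pair such as (1e-30, 1e-3); Adam expects a single float.
    eps = ADAM_DEFAULT_EPS if isinstance(eps, (tuple, list)) else float(eps)

    step = state["step"]
    # A tensor-like step exposes .item(); a plain number does not.
    step = int(step.item()) if hasattr(step, "item") else int(step)

    bias_correction = bool(group["bias_correction"])
    return step, eps, bias_correction


# Param group as Adafactor would set it up: eps is a tuple, bias_correction an int.
print(normalize_adam_args({"eps": (1e-30, 1e-3), "bias_correction": 1}, {"step": 3}))
```

The pybind signature behind `adam_update()` rejects the tuple outright, so the mismatch surfaces as a `TypeError` rather than as a numerical problem.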
Problem 0: ******, no data!
/root/miniconda3/envs/kohyass/lib/python3.11/site-packages/transformers/modeling_utils.py
Line 2031: add the enabled parameter
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config(), enabled=False), set_zero3_state()]
Reason: when the model is loaded from a local download, transformers uses meta tensors, but importing the checkpoint then finds empty meta tensors and raises an error. Passing enabled=False stops DeepSpeed from running its default initialization.
mat1 and mat2 not equal …
kohya_ss/sd-scripts/library/flux_models.py
Line 1068: cast txt, img, and the other inputs to a common dtype. The error occurs because CLIP and T5 produce float32 while the other tensors are set to bfloat16.
def forward(
    self,
    img: Tensor,
    img_ids: Tensor,
    txt: Tensor,
    txt_ids: Tensor,
    timesteps: Tensor,
    y: Tensor,
    block_controlnet_hidden_states=None,
    block_controlnet_single_hidden_states=None,
    guidance: Tensor | None = None,
    txt_attention_mask: Tensor | None = None,
) -> Tensor:
    target_dtype = self.img_in.weight.dtype  # use this layer's weight dtype as the reference
    if img.dtype != target_dtype:
        img = img.to(target_dtype)
    if txt.dtype != target_dtype:
        txt = txt.to(target_dtype)
    if timesteps.dtype != target_dtype:
        timesteps = timesteps.to(target_dtype)
    if guidance is not None and guidance.dtype != target_dtype:
        guidance = guidance.to(target_dtype)
    if y is not None and y.dtype != target_dtype:
        y = y.to(target_dtype)
    if img.ndim != 3 or txt.ndim != 3:
        raise ValueError("Input img and txt tensors must have 3 dimensions.")
    # ==========================
    # The rest of the original forward() code follows
deepspeed AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'backward'
This happens because DeepSpeed has not been initialized. Add a print at the location below to check:
/root/autodl-tmp/kohya_ss/sd-scripts/library/deepspeed_utils.py Line 87
kohya_ss/sd-scripts/library/deepspeed_utils.py Line 64: when deepspeed is not set, the function skips everything and returns None
NCCL enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
`pip install "nvidia-nccl-cu12>2.26.2"`: this error appears on the 5090 and does not affect training.
</ExampleJSONOutput>