Lora-Script

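Summary

These notes record the problems I ran into while fine-tuning models, along with some parameter configurations. The GPUs were rented in the cloud.

  • Multi-GPU training uses a modified kohya_ss framework, with DeepSpeed ZeRO stage 3 for the multi-GPU runs.
  • Single-GPU training uses aki's toolkit (lora-scripts) for LoRA training.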

bash config

[model]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "./sd-models/realismEngineSDXL_v30VAE.safetensors"
vae = "./sd-models/sdxl_vae.safetensors"

[dataset]
train_data_dir = "./train/001"
reg_data_dir = ""
prior_loss_weight = 1
cache_latents = true
shuffle_caption = true
enable_bucket = true

[additional_network]
network_dim = 32
network_alpha = 16
network_train_unet_only = false
network_train_text_encoder_only = false
network_module = "networks.lora"
network_args = []

[optimizer]
unet_lr = 1e-4
text_encoder_lr = 1e-5
optimizer_type = "AdamW8bit"
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 0
lr_restart_cycles = 1

[training]
resolution = "512,512"
train_batch_size = 1
max_train_epochs = 10
noise_offset = 0.0
keep_tokens = 0
xformers = true
lowram = false
clip_skip = 2
mixed_precision = "fp16"
save_precision = "fp16"

[sample_prompt]
sample_sampler = "euler_a"
sample_every_n_epochs = 1

[saving]
output_name = "xtgz-centos-sdxl"
save_every_n_epochs = 2
save_n_epoch_ratio = 0
save_last_n_epochs = 499
save_state = false
save_model_as = "safetensors"
output_dir = "./output"
logging_dir = "./logs"
log_prefix = "output_name"

[others]
min_bucket_reso = 256
max_bucket_reso = 1024
caption_extension = ".txt"
max_token_length = 225
seed = 1337
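
As a small aside on `network_dim` and `network_alpha`: in kohya-style LoRA modules the learned update is multiplied by `network_alpha / network_dim`, so the values above give a factor of 0.5. The sketch below only reproduces that arithmetic; the helper name `lora_scale` is made up for illustration and is not part of any library.

```python
def lora_scale(network_dim: int, network_alpha: float) -> float:
    """Scaling factor applied to the LoRA update (alpha divided by rank)."""
    if network_dim <= 0:
        raise ValueError("network_dim must be positive")
    return network_alpha / network_dim


# Values taken from the [additional_network] section above.
print(lora_scale(32, 16))  # 0.5
```

Halving `network_alpha` relative to `network_dim` therefore behaves roughly like halving the effective update size, which is worth keeping in mind when changing `unet_lr`.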


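Common problems

Mirror site addresses

Missing models

pip permission problems

Folder permission problems

  • WARNING: Ignoring invalid distribution -orch (/root/miniconda3/lib/python3.10/site-packages): delete the leftover folder named like `~orch` in that site-packages directory, and any other folders in the same style.

xformers problems

torchvision problems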

Installing CUDA 12.8

- Download: `wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run`
- Silent install: `sh cuda_12.8.0_570.86.10_linux.run --toolkit --toolkitpath=/root/autodl-tmp/cuda-12.8 --silent`
- Update the environment variables:
	- `echo 'export PATH=/root/autodl-tmp/cuda-12.8/bin:$PATH' >> ~/.bashrc`
	- `echo 'export LD_LIBRARY_PATH=/root/autodl-tmp/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc`
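
After the install, it can help to confirm which `nvcc` the shell actually picks up. The following is a minimal sketch, assuming `nvcc --version` prints a line containing `release X.Y`; the helper names and the sample line (including its build number) are illustrative, not copied from a real run.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", output)
    return match.group(1) if match else None


def installed_nvcc_release() -> Optional[str]:
    """Return the release of the nvcc found on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


# Illustrative line in the format nvcc prints; the build number is made up.
sample = "Cuda compilation tools, release 12.8, V12.8.61"
print(parse_nvcc_release(sample))  # 12.8
```

Calling `installed_nvcc_release()` in a new shell should return `12.8` if the `PATH` change took effect; `None` means no `nvcc` was found on `PATH`.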

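xformers error

  • CUDA error (/__w/xformers/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device Traceback (most recent call last):
  1. To fix this, first check CUDA:
    the version printed by nvcc --version should match the CUDA version shown by nvidia-smi
  2. If they do not match, install the newer CUDA version using the steps above
  3. Use conda list to check that the PyTorch and xformers versions are compatible. If torch 2.7.1 is installed but xformers only supports up to 2.7.0, downgrade first:
    pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
  4. If the error persists, the problem really is xformers itself
    1. Run python -m xformers.info
    2. Look at the build.env entries. For a GPU with compute capability 12.0, the downloaded package may only have been built for compute capabilities up to 9.0:
      1. build.env.TORCH_CUDA_ARCH_LIST: 6.0+PTX 7.0 7.5 8.0+PTX 9.0a
    3. Confirm the compute capability of the current GPU: nvidia-smi --query-gpu=compute_cap --format=csv (12.0 in my case)
    4. Set the compute capability in the environment: export TORCH_CUDA_ARCH_LIST="12.0" (this only applies to the current shell session)
    5. Download the source and compile it. A mirror site can speed up the download: replace pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers with pip install -v --no-build-isolation -U git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers
  5. After installation, check the xformers build info again; if it matches your GPU's compute capability, the install is correct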
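
The comparison in step 4 can be sketched in a few lines. This is a simplified check under stated assumptions: it only asks whether the build list names the GPU's compute capability and deliberately ignores PTX forward compatibility. The function names `native_archs` and `has_native_arch` are hypothetical.

```python
import re


def native_archs(arch_list: str) -> set[str]:
    """Return the compute capabilities named in a TORCH_CUDA_ARCH_LIST string."""
    archs = set()
    for entry in arch_list.replace(";", " ").split():
        # Keep only the leading "major.minor", dropping suffixes such as "+PTX" or "a".
        match = re.match(r"(\d+\.\d+)", entry)
        if match:
            archs.add(match.group(1))
    return archs


def has_native_arch(compute_cap: str, arch_list: str) -> bool:
    """True when the build list explicitly names this compute capability."""
    return compute_cap.strip() in native_archs(arch_list)


build_archs = "6.0+PTX 7.0 7.5 8.0+PTX 9.0a"  # value from xformers.info above
print(has_native_arch("12.0", build_archs))  # False
print(has_native_arch("8.0", build_archs))   # True
```

A `False` result for `12.0` matches the situation described above, where the prebuilt package had no kernels for the GPU and a source build with `TORCH_CUDA_ARCH_LIST="12.0"` was needed.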

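xformers error: the new version is incompatible with the old one, and downloads are slow on this machine

- Running install-cn.ps1 directly fails to create the virtual environment; switch to install.ps1 instead. The network problem mainly shows up when installing torch, and the China-specific script brings no noticeable improvement.
- Installation
	- If the network is too slow and torch has to be reinstalled, follow the steps below.
	- Open install.ps1 and copy the commands below by hand.
	- `python.exe -m venv venv` creates the virtual environment.
	- Activate the virtual environment: `.\venv\Scripts\activate`
	- Use `nvidia-smi` to find your CUDA version, then go to the [torch website](https://pytorch.org/get-started/locally/), find the matching `.whl` file, and download it manually.
	- ![[Pasted image 20251117091728.png]]
	- Go to the xformers website, find the install command, and paste the package link into a browser to download it manually. I am on CUDA 12.8, so I used the cu128 install command:
	- `pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128`
	- Put the two downloaded packages in the lora-scripts folder ![[Pasted image 20251117092155.png]]
	- Install the two packages manually, torch first and then xformers:
	- `pip install .\torch-2.7.0+cu128-cp310-cp310-win_amd64.whl`
	- `pip install .\xformers-0.0.30-cp310-cp310-win_amd64.whl`
	- Finally, update the environment file in the ps script, and everything works ![[Pasted image 20251117092327.png]]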
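
When downloading wheels by hand, the `cp310` part of the filename has to match the Python version of the virtual environment. A minimal sketch of that check, assuming the plain five-field wheel name layout (name, version, Python tag, ABI tag, platform tag, with no build tag); the helper names are illustrative:

```python
import sys
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split name-version-python-abi-platform.whl (build tags are not handled)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file")
    parts = filename[:-4].split("-")
    if len(parts) != 5:
        raise ValueError("unexpected wheel filename layout")
    return WheelTags(*parts)


def matches_interpreter(filename: str) -> bool:
    """True when the wheel's CPython tag equals the running interpreter's."""
    current = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return parse_wheel_filename(filename).python_tag == current


tags = parse_wheel_filename("torch-2.7.0+cu128-cp310-cp310-win_amd64.whl")
print(tags.python_tag, tags.platform_tag)  # cp310 win_amd64
```

Running `matches_interpreter(...)` inside the activated venv shows whether the downloaded file was built for that interpreter before `pip install` is attempted.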

Flux training fails with an error saying google/t5-xxl needs to be downloaded

- [https://blog.csdn.net/sinat_29957455/article/details/142782264](https://blog.csdn.net/sinat_29957455/article/details/142782264)

Multi-GPU training: problem summary

Complete training parameters

  1. tran_flux.sh
    #!/bin/bash

    # =================================================================
    # Configuration (change these to the actual paths on your machine)
    # =================================================================

    # 1. Disable P2P direct access (the core fix for "illegal memory access")
    # export NCCL_P2P_DISABLE=1

    # 2. Disable InfiniBand (avoids crashes from trying to use server-grade networking)
    export NCCL_IB_DISABLE=1

    # 3. Force blocking kernel launches (if it fails again, the failing line of code is visible)
    export CUDA_LAUNCH_BLOCKING=1

    # Explicitly select the GPUs to use (0,1,2,3)
    export CUDA_VISIBLE_DEVICES=0,1,2,3

    # Tune memory allocation to reduce fragmentation
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

    # FLUX.1 model path (for example /root/autodl-tmp/flux1-dev.safetensors)
    MODEL_PATH="train/sd-models/flux1-dev.safetensors"

    # CLIP-L model path
    CLIP_PATH="train/sd-models/clip_l.safetensors"

    # T5-XXL model path
    T5_PATH="train/sd-models/t5xxl_fp16.safetensors"

    # AE model path
    AE_PATH="train/sd-models/flux-ae.safetensors"

    # Output folder path
    OUTPUT_DIR="./output"

    # =================================================================
    # Launch command (arguments left unmodified; the path variables above
    # are not referenced here, the paths come from the TOML config file)
    # =================================================================

    accelerate launch \
      --deepspeed_config_file "ds_config.json" \
      --use_deepspeed \
      --gpu_ids 0,1,2,3 \
      --mixed_precision bf16 \
      --num_processes 4 \
      --num_machines 1 \
      --num_cpu_threads_per_process 1 \
      --offload_optimizer_device cpu \
      --offload_param_device cpu \
      "sd-scripts/flux_train.py" \
      --config_file "dreambooth_flux_config.toml" \
      --optimizer_type="adafactor" \
      --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
      --cache_text_encoder_outputs \
      --cache_latents \
      --full_bf16 \
      --lowram \
      --gradient_checkpointing \
      --max_data_loader_n_workers 0 \
      --learning_rate 1e-5 \
      --cache_latents_to_disk \
      --cache_text_encoder_outputs_to_disk

ds_config.json

```json
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": "auto",
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/root/autodl-tmp/ds_cache",
            "pin_memory": true,
            "buffer_count": 5,
            "fast_init": false
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/root/autodl-tmp/ds_cache",
            "pin_memory": true,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": 5e7,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_clipping": 1.0,
    "bf16": {
        "enabled": true
    }
}
```


### dreambooth_config.toml
```toml
ae = "train/sd-models/flux-ae.safetensors"
blocks_to_swap = 0
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_l = "/root/autodl-tmp/kohya_ss/train/sd-models/clip_l.safetensors"
cpu_offload_checkpointing = true
discrete_flow_shift = 3.1582
double_blocks_to_swap = 0
dynamo_backend = "no"
epoch = 50
fp8_base = true
full_bf16 = false
# gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 1
huber_c = 0.1
huber_scale = 1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 4e-6
learning_rate_te = 0
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 1024
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 500
max_token_length = 75
max_train_steps = 250
min_bucket_reso = 256
mixed_precision = "bf16"
model_prediction_type = "sigma_scaled"
multires_noise_discount = 0.3
no_token_padding = true
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
# optimizer_args = [ ]
optimizer_type = "Adafactor"
# optimizer_type = "AdamW8bit"
output_dir = "outputs"
output_name = "Quality_1"
persistent_data_loader_workers = 0
pretrained_model_name_or_path = "train/sd-models/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
# sample_prompts = "/outputs/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
seed = 1
single_blocks_to_swap = 0
t5xxl = "train/sd-models/t5xxl_fp16.safetensors"
t5xxl_max_token_length = 225
timestep_sampling = "sigmoid"
train_batch_size = 1
train_blocks = "all"
train_data_dir = "train/images"
wandb_run_name = "Quality_1"
```

1. OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the server_port parameter to launch()

    Traceback (most recent call last):
      File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\launch.py", line 102, in <module>
        interface.main()
      File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\interface.py", line 218, in main
        app, _, _ = interface.launch(
      File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\blocks.py", line 1907, in launch
        ) = networking.start_server(
      File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\networking.py", line 207, in start_server
        raise OSError
    OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`.

  - Find the process holding port 28001 and kill it (replace 12345 with the PID printed by the first command):
    - `netstat -ano | findstr :28001`
    - `taskkill /PID 12345 /F`
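
The same check can be done from Python before launching the UI. This is a minimal sketch that simply tries to bind the port on localhost; the helper names `is_port_free` and `first_free_port` are made up for illustration, and whether lora-scripts honours a different port is not something these notes confirm.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if host:port can currently be bound by a new socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int, stop: int) -> int:
    """Return the first bindable port in [start, stop], or raise OSError."""
    for port in range(start, stop + 1):
        if is_port_free(port):
            return port
    raise OSError(f"no free port in range {start}-{stop}")
```

If `is_port_free(28001)` returns `False`, some process is still holding the port, which is the situation the `netstat` and `taskkill` commands above resolve.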

2. torch.OutOfMemoryError: CUDA out of memory.

  1. The message reports a GPU memory error, but the real cause may be system RAM running out; adjust batch_size and try again
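
To tell the two cases apart on a Linux machine, system memory can be checked alongside `nvidia-smi`. A minimal sketch, assuming the standard `/proc/meminfo` text format; the sample string below is invented for illustration and the helper names are hypothetical.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (most fields are in kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def available_gib(text: str) -> float:
    """Convert the MemAvailable field (kB) to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)


# Made-up sample in the /proc/meminfo format, for illustration only.
sample = "MemTotal:       377487360 kB\nMemAvailable:     8388608 kB\n"
print(f"{available_gib(sample):.1f} GiB available")  # 8.0 GiB available
```

On a real machine the input would be the contents of `/proc/meminfo`; a very small `MemAvailable` at the moment of the crash points to host RAM rather than GPU memory.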

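3. use_libuv = 0

  1. Reference article: Introduction to Libuv TCPStore Backend
  2. Route 3 in that article points out that if use_libuv = 0 is set through the environment variable but the code assigns True, the value in the code still takes effect. So I hard-coded use_libuv to False in every file that raised the error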

4. ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [1], output_device 1, and module parameters {device(type='cpu')}

  1. Add a parameter to turn off block swapping to the CPU: --blocks_to_swap 0
    E:/lora training/lora-scripts-v1.8.5/sd-models/flux-ae.safetensors
    E:/lora training/lora-scripts-v1.8.5/sd-models/clip_l.safetensors
    E:/lora training/lora-scripts-v1.8.5/sd-models/t5xxl_fp16.safetensors
    E:/Kohya_FLUX_DreamBooth_v18/kohya_ss/train
    E:/lora training/lora-scripts-v1.8.5/sd-models/flux1-dev.safetensors

5. NotImplementedError: Cannot copy out of meta tensor; no data!

This happens because the model tensors were never initialized. The fix goes in the file at this path: /root/autodl-tmp/kohya_ss/sd-scripts/library/flux_utils.py

6. deepspeed OOM

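Cause: when training Flux, DeepSpeed loads data into both GPU memory and system RAM, which leads to the OOM.
Model being trained:
full fine-tuning of the FLUX.1-dev model
Configuration: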

- PyTorch 2.7.0
- Python 3.12 (ubuntu22.04)
- CUDA 12.8
- GPU: RTX 5090 (32GB) * 4
- CPU: 64 vCPU Intel(R) Xeon(R) Gold 6459C
- Memory: 360GB
- Disk: system disk 30 GB; data disk 50GB SSD (free) plus 440GB (paid)
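
A rough back-of-the-envelope estimate shows why this hardware runs out of memory. The sketch assumes about 12 billion parameters for FLUX.1-dev and the common mixed-precision Adam figure of 16 bytes per parameter (2 for weights, 2 for gradients, 12 for fp32 optimizer state); both numbers are assumptions for illustration, and activations and buffers are ignored. The function name is made up.

```python
def zero3_model_state_gib(num_params: float, num_gpus: int, bytes_per_param: int = 16) -> float:
    """Per-GPU model-state memory under ZeRO-3 sharding, ignoring activations and buffers."""
    return num_params * bytes_per_param / num_gpus / (1024 ** 3)


# Assumed: 12e9 parameters, sharded evenly across the 4 GPUs listed above.
per_gpu = zero3_model_state_gib(12e9, 4)
print(f"{per_gpu:.1f} GiB per GPU")  # 44.7 GiB per GPU
```

Under these assumptions the sharded model states alone exceed the 32GB available on each RTX 5090, which is consistent with having to offload them somewhere else.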

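Solution: when GPU memory or system RAM is insufficient, use DeepSpeed's NVMe offload to store the data on the data disk, moving all of the memory pressure onto the disk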

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": "auto",
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/root/autodl-tmp/ds_cache",
            "pin_memory": true,
            "buffer_count": 5,
            "fast_init": false
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/root/autodl-tmp/ds_cache",
            "pin_memory": true,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": 5e7,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_clipping": 1.0,
    "bf16": {
        "enabled": true
    }
}
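
Since everything now lands on the data disk, it is worth checking that the `nvme_path` directory has room before starting a run. A minimal sketch using only the standard library; the helper names are illustrative and the embedded JSON is a trimmed copy of the config above.

```python
import json
import shutil


def nvme_paths(ds_config: dict) -> set[str]:
    """Collect the nvme_path values used by the ZeRO offload sections."""
    zero = ds_config.get("zero_optimization", {})
    paths = set()
    for section in ("offload_optimizer", "offload_param"):
        cfg = zero.get(section, {})
        if cfg.get("device") == "nvme" and "nvme_path" in cfg:
            paths.add(cfg["nvme_path"])
    return paths


def free_gib(path: str) -> float:
    """Free space in GiB on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / (1024 ** 3)


config = json.loads("""
{"zero_optimization": {"stage": 3,
  "offload_optimizer": {"device": "nvme", "nvme_path": "/root/autodl-tmp/ds_cache"},
  "offload_param": {"device": "nvme", "nvme_path": "/root/autodl-tmp/ds_cache"}}}
""")
print(nvme_paths(config))  # {'/root/autodl-tmp/ds_cache'}
```

On the training machine, calling `free_gib(path)` for each returned path (once the directory exists) gives a quick sanity check against the size of the model states being offloaded.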

7. TypeError: adam_update(): incompatible function arguments.

The problem is caused by DeepSpeed ZeRO stage 3 passing an invalid eps value to this function.

Code change:

                beta1, beta2 = group['betas']

                # ================= Fix start: handle the eps tuple conflict =================
                # 1. Normalize eps: if Adafactor passed a tuple, fall back to Adam's default 1e-8
                eps_val = group['eps']
                if isinstance(eps_val, (tuple, list)):
                    # Adafactor's eps is (1e-30, 1e-3); in Adam that would cause
                    # division-by-zero overflow or instability, so use 1e-8 instead
                    eps_val = 1e-8
                else:
                    eps_val = float(eps_val)

                # 2. Normalize step (it may be a Tensor)
                step_val = state['step']
                if hasattr(step_val, 'item'):
                    step_val = int(step_val.item())
                else:
                    step_val = int(step_val)

                # 3. Normalize bias_correction (it may be an int)
                bias_correction_val = bool(group['bias_correction'])

                # ================= DEBUG START =================
                print("\n" + "="*30 + " DEBUG ADAM UPDATE " + "="*30)
                try:
                    # Collect the raw arguments so they are easy to inspect
                    arg_list = [
                        ("0. opt_id (int)", self.opt_id),
                        ("1. step (int)", state['step']),
                        ("2. lr (float)", group['lr']),
                        ("3. beta1 (float)", beta1),
                        ("4. beta2 (float)", beta2),
                        ("5. eps (float)", group['eps']),
                        ("6. weight_decay (float)", group['weight_decay']),
                        ("7. bias_correction (bool)", group['bias_correction']),
                        ("8. param (Tensor)", p.data),
                        ("9. grad (Tensor)", p.grad.data),
                        ("10. exp_avg (Tensor)", state['exp_avg']),
                        ("11. exp_avg_sq (Tensor)", state['exp_avg_sq'])
                    ]

                    for name, val in arg_list:
                        if hasattr(val, 'shape'):  # Tensor
                            print(f"[{name}]: Type={type(val)}, Dtype={val.dtype}, Device={val.device}, Shape={val.shape}")
                        else:  # scalar
                            print(f"[{name}]: Type={type(val)}, Value={val}")

                except Exception as e:
                    print(f"DEBUG ERROR: {e}")
                print("="*80 + "\n")
                # ================= DEBUG END =================

                # Pass the normalized values (step_val, eps_val, bias_correction_val)
                self.ds_opt_adam.adam_update(self.opt_id, step_val, group['lr'], beta1, beta2, eps_val,
                                             group['weight_decay'], bias_correction_val, p.data, p.grad.data,
                                             state['exp_avg'], state['exp_avg_sq'])

        return loss
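
The normalization logic in the patch can be tried in isolation, without DeepSpeed or PyTorch installed. The sketch below restates it as two small functions; `normalize_eps`, `normalize_step`, and the `FakeTensor` stand-in are hypothetical names used only for this illustration.

```python
def normalize_eps(eps, default: float = 1e-8) -> float:
    """Return a float eps; Adafactor-style tuples or lists fall back to `default`."""
    if isinstance(eps, (tuple, list)):
        return default
    return float(eps)


def normalize_step(step) -> int:
    """Return step as a plain int, unwrapping tensor-like objects via .item()."""
    return int(step.item()) if hasattr(step, "item") else int(step)


class FakeTensor:
    """Stand-in for a 0-dim tensor; it only provides .item()."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


print(normalize_eps((1e-30, 1e-3)))   # 1e-08
print(normalize_step(FakeTensor(7)))  # 7
```

Keeping the conversions in small helpers like these makes it easier to see that the native `adam_update` call only ever receives plain Python scalars.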

Problem 0 :****** , No Data !

/root/miniconda3/envs/kohyass/lib/python3.11/site-packages/transformers/modeling_utils.py

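Line 2031: add the enabled parameter

init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config(), enabled=False), set_zero3_state()]

Reason: when the model is loaded from a local download, transformers uses meta tensors, but importing the checkpoint then finds empty meta tensors and raises the error. Passing enabled=False stops DeepSpeed from doing its default initialization.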

https://github.com/zai-org/ChatGLM-6B/issues/530

mat1 and mat2 not equal ….

kohya_ss/sd-scripts/library/flux_models.py

Around line 1068, cast the txt and img tensors to a common dtype. The error occurs because CLIP and T5 work in float32 while the rest of the model is set to bfloat16.

def forward(
    self,
    img: Tensor,
    img_ids: Tensor,
    txt: Tensor,
    txt_ids: Tensor,
    timesteps: Tensor,
    y: Tensor,
    block_controlnet_hidden_states=None,
    block_controlnet_single_hidden_states=None,
    guidance: Tensor | None = None,
    txt_attention_mask: Tensor | None = None,
) -> Tensor:
    # Use the dtype of this layer's weights as the reference
    target_dtype = self.img_in.weight.dtype

    # Cast every input whose dtype does not already match
    if img.dtype != target_dtype:
        img = img.to(target_dtype)

    if txt.dtype != target_dtype:
        txt = txt.to(target_dtype)

    if timesteps.dtype != target_dtype:
        timesteps = timesteps.to(target_dtype)

    if guidance is not None and guidance.dtype != target_dtype:
        guidance = guidance.to(target_dtype)

    if y is not None and y.dtype != target_dtype:
        y = y.to(target_dtype)

    if img.ndim != 3 or txt.ndim != 3:
        raise ValueError("Input img and txt tensors must have 3 dimensions.")

    # ... the rest of the original forward() continues unchanged

deepspeed AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'backward'

This happens because DeepSpeed was not initialized. Add a print at the location below to check:

/root/autodl-tmp/kohya_ss/sd-scripts/library/deepspeed_utils.py Line 87

At kohya_ss/sd-scripts/library/deepspeed_utils.py Line 64, when DeepSpeed is not configured the code skips the setup and returns None.

NCCL enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'

pip install "nvidia-nccl-cu12>2.26.2": this warning shows up on the RTX 5090 and does not affect training.

# EXAMPLE
<ExampleReadMe>
Summary: This tool is in the file `Processor.cs`. The core logic is handled by the `DataParser` class, which uses the `Autodesk.Revit.DB.Transaction` API.
</ExampleReadMe>
<ExampleJSONOutput>
{{
"target_files": ["Processor.cs"],
"key_classes_and_methods": ["DataParser"],
"mentioned_apis": ["Autodesk.Revit.DB.Transaction"]
}}
</ExampleJSONOutput>