
train_dreambooth_lora.py failed on two machines #3363

Closed

bohong13 opened this issue May 8, 2023 · 8 comments
Labels
bug (Something isn't working), stale (Issues that haven't received updates)

Comments

@bohong13

bohong13 commented May 8, 2023

Describe the bug

I have found two problems.

1. Error when the process saves a checkpoint:
Traceback (most recent call last):
  File "/home/momistest/db/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1112, in <module>
    main(args)
  File "/home/momistest/db/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 991, in main
    LoraLoaderMixin.save_lora_weights(
  File "/home/momistest/db/diffusers/src/diffusers/loaders.py", line 1111, in save_lora_weights
    for module_name, param in unet_lora_layers.state_dict().items()
  File "/home/momistest/anaconda3/envs/hg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/momistest/anaconda3/envs/hg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1820, in state_dict
    hook_result = hook(self, destination, prefix, local_metadata)
  File "/home/momistest/db/diffusers/src/diffusers/loaders.py", line 74, in map_to
    num = int(key.split(".")[1])  # 0 is always "layers"
ValueError: invalid literal for int() with base 10: 'layers'
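
As far as I can tell, the map_to hook in loaders.py assumes every key looks like "layers.<index>....", so the int() call breaks as soon as the keys carry one extra leading segment (for example the "module." prefix that a DDP-wrapped module adds during distributed training). A minimal sketch of that parsing assumption, with made-up key names:

# Hypothetical key names, only to illustrate the parsing assumption in map_to.
good_key = "layers.0.to_q_lora.down.weight"
bad_key = "module.layers.0.to_q_lora.down.weight"  # one extra prefix segment, e.g. from a DDP wrapper

print(int(good_key.split(".")[1]))  # 0
print(int(bad_key.split(".")[1]))   # ValueError: invalid literal for int() with base 10: 'layers'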

Then I tried to work around this error using the method in issue #3284, but I got this error:

Traceback (most recent call last):
  File "/home/momistest/db/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1112, in <module>
    main(args)
  File "/home/momistest/db/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1067, in main
    pipeline.load_lora_weights(args.output_dir)
  File "/home/momistest/db/diffusers/src/diffusers/loaders.py", line 846, in load_lora_weights
    self.unet.load_attn_procs(unet_lora_state_dict)
  File "/home/momistest/db/diffusers/src/diffusers/loaders.py", line 305, in load_attn_procs
    self.set_attn_processor(attn_processors)
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 533, in set_attn_processor
    fn_recursive_attn_processor(name, module, processor)
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  [Previous line repeated 3 more times]
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 527, in fn_recursive_attn_processor
    module.set_processor(processor.pop(f"{name}.processor"))
KeyError: 'down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor'
2. Abnormal model parameter exchange:
   I have two machines on the same local network, but when I monitor the network traffic with iftop, the TX and RX of the model parameter exchange are not the same.
192.168.1.123 => 192.168.1.183     20.2kb
              <=                     416b

TX:23.0MB
RX:108MB
TOTAL 131MB
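
Going back to the first problem, it helps to check which key names actually end up in the saved LoRA file. A small inspection snippet (the path matches my output dir above):

import torch

# Print a few key names from the saved LoRA weights to see whether they carry
# an unexpected prefix such as "unet." or "module.".
state_dict = torch.load(
    "/home/momistest/db/diffusers/examples/dreambooth/lora_output/pytorch_lora_weights.bin",
    map_location="cpu",
)
for key in list(state_dict.keys())[:5]:
    print(key)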

Reproduction

I followed this dog example to run the program on two machines.

I have two laptops with NVIDIA RTX 3080 GPUs.
machine 1 IP is 192.168.1.123
machine 2 IP is 192.168.1.183

The environment and package versions of the two machines are exactly the same.

The `accelerate env` output is:
- `Accelerate` version: 0.18.0
- Platform: Linux-5.13.0-30-generic-x86_64-with-glibc2.31
- Python version: 3.10.9
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: no
	- use_cpu: False
	- num_processes: 2
	- machine_rank: 0
	- num_machines: 2
	- gpu_ids: all
	- main_process_ip: 192.168.1.123
	- main_process_port: 29500
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []

And I run this script on both machines:

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="/home/momistest/db/diffusers/examples/dreambooth/dog"
export OUTPUT_DIR="/home/momistest/db/diffusers/examples/dreambooth/lora_output"

NCCL_DEBUG=INFO accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50 \
  --seed="0" \

Logs

Steps: 100%|██████████████████████████████████| 500/500 [06:52<00:00,  1.71it/s, loss=0.208, lr=0.0001]Model weights saved in /home/momistest/db/diffusers/examples/dreambooth/lora_output/pytorch_lora_weights.bin
{'requires_safety_checker'} was not found in config. Values will be initialized to default values.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
{'prediction_type'} was not found in config. Values will be initialized to default values.
{'scaling_factor'} was not found in config. Values will be initialized to default values.
{'timestep_post_act', 'num_class_embeds', 'resnet_time_scale_shift', 'resnet_skip_time_act', 'addition_embed_type_num_heads', 'conv_in_kernel', 'mid_block_only_cross_attention', 'only_cross_attention', 'time_embedding_act_fn', 'addition_embed_type', 'encoder_hid_dim', 'use_linear_projection', 'conv_out_kernel', 'upcast_attention', 'class_embeddings_concat', 'class_embed_type', 'time_embedding_dim', 'mid_block_type', 'projection_class_embeddings_input_dim', 'dual_cross_attention', 'resnet_out_scale_factor', 'cross_attention_norm', 'time_embedding_type', 'time_cond_proj_dim'} was not found in config. Values will be initialized to default values.
{'sample_max_value', 'thresholding', 'solver_type', 'solver_order', 'dynamic_thresholding_ratio', 'use_karras_sigmas', 'algorithm_type', 'lower_order_final'} was not found in config. Values will be initialized to default values.
Loading unet.
Traceback (most recent call last):
  File "/home/momistest/db/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1112, in <module>
    main(args)
  File "/home/momistest/db/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1067, in main
    pipeline.load_lora_weights(args.output_dir)
  File "/home/momistest/db/diffusers/src/diffusers/loaders.py", line 847, in load_lora_weights
    self.unet.load_attn_procs(unet_lora_state_dict)
  File "/home/momistest/db/diffusers/src/diffusers/loaders.py", line 305, in load_attn_procs
    self.set_attn_processor(attn_processors)
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 533, in set_attn_processor
    fn_recursive_attn_processor(name, module, processor)
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  [Previous line repeated 3 more times]
  File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 527, in fn_recursive_attn_processor
    module.set_processor(processor.pop(f"{name}.processor"))
KeyError: 'down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 
wandb: Run history:
wandb: loss ▁▂▂▁▁▁▃█▂▁▂▂▁▄▁▁▁▁▂▁▁▃▂▁▂▃▁▂▂▁▁▁▄▂▁▂▂▁▁▁
wandb:   lr ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: 
wandb: Run summary:
wandb: loss 0.20755
wandb:   lr 0.0001
wandb: 
wandb: 🚀 View run swift-aardvark-13 at: https://wandb.ai/account/dreambooth-lora/runs/qvn0373n
wandb: Synced 6 W&B file(s), 16 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-xxxxxx-qvn0373n/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 200217) of binary: /home/momistest/anaconda3/envs/hg/bin/python
Traceback (most recent call last):
  File "/home/momistest/anaconda3/envs/hg/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/momistest/anaconda3/envs/hg/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/momistest/anaconda3/envs/hg/lib/python3.10/site-packages/accelerate/commands/launch.py", line 914, in launch_command
    multi_gpu_launcher(args)
  File "/home/momistest/anaconda3/envs/hg/lib/python3.10/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/momistest/anaconda3/envs/hg/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/momistest/anaconda3/envs/hg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/momistest/anaconda3/envs/hg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dreambooth_lora.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-08_02:07:48
  host      : host-192-168-1-123.openstacklocal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 200217)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

System Info

- `diffusers` version: 0.17.0.dev0
- Platform: Linux-5.13.0-30-generic-x86_64-with-glibc2.31
- Python version: 3.10.9
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Huggingface_hub version: 0.14.1
- Transformers version: 4.28.1
- Accelerate version: 0.18.0
- xFormers version: 0.0.19
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
bohong13 added the bug (Something isn't working) label on May 8, 2023
@haowang1013

I have the same problem. The KeyError: 'down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor' is caused by the fact that the state dict generated by the latest training code adds a prefix to all the layers, so down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor is actually called unet.down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor in the state dict. I think this is because the new training code supports text encoder tuning, so the prefix is added to differentiate...
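
If that is the cause, one way around it (until the loaders are fixed) is to strip the prefix before handing the dict to load_attn_procs. A minimal sketch, assuming the weights were saved as pytorch_lora_weights.bin in the output dir and the only mismatch is the "unet." prefix:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Load the raw LoRA state dict written by the training script.
state_dict = torch.load("lora_output/pytorch_lora_weights.bin", map_location="cpu")

# Drop the "unet." prefix so the key names match what load_attn_procs expects.
# (Any text_encoder.* entries are simply ignored here.)
unet_state_dict = {
    k[len("unet."):]: v for k, v in state_dict.items() if k.startswith("unet.")
}

pipe.unet.load_attn_procs(unet_state_dict)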

@bohong13
Author

bohong13 commented May 8, 2023

@haowang1013
But when I train the LoRA using only one machine, everything is fine.

@haowang1013

@haowang1013 But when I train the LoRA using only one machine, everything is fine.

Yeah, I never had any problem with training, probably because I was only using one machine. That key error happens to me when I try to load the LoRA state dict using pipeline.unet.load_attn_procs.

@haowang1013

This fixed the loading error for me, in unet_2d_condition.py

        def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
            if hasattr(module, "set_processor"):
                if not isinstance(processor, dict):
                    module.set_processor(processor)
                else:
                    # was module.set_processor(processor.pop(f"{name}.processor"))
                    processor_name = f"{name}.processor"
                    if processor_name in processor:
                        module.set_processor(processor.pop(processor_name))
                    else:
                        processor_name = f"unet.{processor_name}"
                        module.set_processor(processor.pop(processor_name))

@bohong13
Author

bohong13 commented May 8, 2023

This fixed the loading error for me, in unet_2d_condition.py […]

Thank you! But I still get an error when I use two machines:

Steps: 100%|█████████████████████████████████| 500/500 [06:49<00:00,  1.75it/s, loss=0.0406, lr=0.0001]Model weights saved in /home/momistest/db/diffusers/examples/dreambooth/lora_output/checkpoint-500/pytorch_lora_weights.bin
05/08/2023 16:23:43 - INFO - __main__ - Saved state to /home/momistest/db/diffusers/examples/dreambooth/lora_output/checkpoint-500
Steps: 100%|██████████████████████████████████| 500/500 [06:49<00:00,  1.75it/s, loss=0.208, lr=0.0001]Model weights saved in /home/momistest/db/diffusers/examples/dreambooth/lora_output/pytorch_lora_weights.bin
{'requires_safety_checker'} was not found in config. Values will be initialized to default values.
{'prediction_type'} was not found in config. Values will be initialized to default values.
{'mid_block_only_cross_attention', 'encoder_hid_dim', 'resnet_skip_time_act', 'time_embedding_act_fn', 'time_cond_proj_dim', 'resnet_out_scale_factor', 'resnet_time_scale_shift', 'only_cross_attention', 'class_embed_type', 'projection_class_embeddings_input_dim', 'time_embedding_dim', 'addition_embed_type_num_heads', 'upcast_attention', 'conv_out_kernel', 'cross_attention_norm', 'dual_cross_attention', 'class_embeddings_concat', 'mid_block_type', 'num_class_embeds', 'timestep_post_act', 'time_embedding_type', 'conv_in_kernel', 'use_linear_projection', 'addition_embed_type'} was not found in config. Values will be initialized to default values.
{'scaling_factor'} was not found in config. Values will be initialized to default values.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
{'dynamic_thresholding_ratio', 'lower_order_final', 'thresholding', 'solver_type', 'algorithm_type', 'solver_order', 'use_karras_sigmas', 'sample_max_value'} was not found in config. Values will be initialized to default values.
Loading unet.
Traceback (most recent call last):
 File "/home/momistest/db/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1112, in <module>
   main(args)
 File "/home/momistest/db/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1067, in main
   pipeline.load_lora_weights(args.output_dir)
 File "/home/momistest/db/diffusers/src/diffusers/loaders.py", line 847, in load_lora_weights
   self.unet.load_attn_procs(unet_lora_state_dict)
 File "/home/momistest/db/diffusers/src/diffusers/loaders.py", line 305, in load_attn_procs
   self.set_attn_processor(attn_processors)
 File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 539, in set_attn_processor
   fn_recursive_attn_processor(name, module, processor)
 File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 536, in fn_recursive_attn_processor
   fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
 File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 536, in fn_recursive_attn_processor
   fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
 File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 536, in fn_recursive_attn_processor
   fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
 [Previous line repeated 3 more times]
 File "/home/momistest/db/diffusers/src/diffusers/models/unet_2d_condition.py", line 533, in fn_recursive_attn_processor
   module.set_processor(processor.pop(processor_name))
KeyError: 'unet.down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor'

@haowang1013

Which version are you on? There's a commit with a bunch of LoRA-related fixes that is not included in 0.16.1.

You may have to wait for the next release or install the latest version from GitHub.

@patrickvonplaten
Contributor

This seems to be related to #3353 - trying to fix it asap

@github-actions
Contributor

github-actions bot commented Jun 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label on Jun 7, 2023