Advanced training SD1.5 has an issue when saving checkpoints #8732

josemerinom · 2024-06-29T01:43:25Z

Describe the bug

Today I trained with examples/dreambooth/train_dreambooth_lora.py in Google Colab and everything worked.

I then tried examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py with the original Stable Diffusion 1.5 model (which I cloned to my HF account), but an error is raised when the script tries to save a checkpoint.

dataset = 10 images

checkpointing_steps=10 --> ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>

A different error occurs when I set checkpointing_steps to a value different from the number of images:
checkpointing_steps=20 --> NameError: free variable 'pipeline' referenced before assignment in enclosing scope

validation prompt: None
06/30/2024 01:09:00 - INFO - main - ***** Running training *****
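
The run sizes reported in the logs (10 examples, batch size 1, gradient accumulation 1, 100 total steps) can be used to check where each failure falls. The sketch below is a minimal calculation, assuming the usual ceil-based step arithmetic; the helper name `training_schedule` is hypothetical and not part of the training script.

```python
import math


def training_schedule(num_examples, train_batch_size, gradient_accumulation_steps,
                      max_train_steps, checkpointing_steps):
    """Hypothetical helper: derive epoch and checkpoint boundaries for a single-process run."""
    batches_per_epoch = math.ceil(num_examples / train_batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    num_epochs = math.ceil(max_train_steps / steps_per_epoch)
    # Global steps at which a checkpoint would be written.
    checkpoint_steps = list(range(checkpointing_steps, max_train_steps + 1, checkpointing_steps))
    return steps_per_epoch, num_epochs, checkpoint_steps


# Values taken from the logs in this issue.
steps_per_epoch, num_epochs, ckpts_10 = training_schedule(10, 1, 1, 100, 10)
_, _, ckpts_20 = training_schedule(10, 1, 1, 100, 20)

print(steps_per_epoch, num_epochs)  # 10 10
print(ckpts_10[:3])                 # [10, 20, 30]
print(ckpts_20[:3])                 # [20, 40, 60]
```

With checkpointing_steps=20 the first checkpoint would be at step 20, yet the second traceback appears at step 10, which is the end of the first epoch. That suggests, but does not prove, that the NameError is tied to the epoch boundary rather than to checkpoint saving.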

Reproduction

%cd /content
!mkdir /content/cache
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch v0.29.2-patch https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training

https://colab.research.google.com/github/josemerinom/test/blob/master/test.ipynb

Logs 1 (checkpointing_steps=10)

06/30/2024 01:08:45 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'prediction_type', 'variance_type', 'dynamic_thresholding_ratio', 'clip_sample_range', 'thresholding', 'timestep_spacing', 'rescale_betas_zero_snr', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'use_post_quant_conv', 'force_upcast', 'use_quant_conv', 'latents_std', 'scaling_factor', 'shift_factor', 'latents_mean'} was not found in config. Values will be initialized to default values.
{'num_class_embeds', 'encoder_hid_dim', 'projection_class_embeddings_input_dim', 'time_embedding_act_fn', 'use_linear_projection', 'resnet_skip_time_act', 'mid_block_only_cross_attention', 'dual_cross_attention', 'attention_type', 'time_cond_proj_dim', 'addition_embed_type_num_heads', 'time_embedding_type', 'conv_out_kernel', 'reverse_transformer_layers_per_block', 'class_embeddings_concat', 'resnet_time_scale_shift', 'class_embed_type', 'transformer_layers_per_block', 'encoder_hid_dim_type', 'conv_in_kernel', 'only_cross_attention', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'cross_attention_norm', 'addition_embed_type', 'time_embedding_dim', 'mid_block_type', 'dropout', 'num_attention_heads', 'timestep_post_act', 'upcast_attention'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
06/30/2024 01:09:00 - INFO - __main__ -   Num examples = 10
06/30/2024 01:09:00 - INFO - __main__ -   Num batches each epoch = 10
06/30/2024 01:09:00 - INFO - __main__ -   Num Epochs = 10
06/30/2024 01:09:00 - INFO - __main__ -   Instantaneous batch size per device = 1
06/30/2024 01:09:00 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:09:00 - INFO - __main__ -   Gradient Accumulation steps = 1
06/30/2024 01:09:00 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:07<00:50,  1.80it/s, loss=0.00439, lr=0.0001]06/30/2024 01:09:07 - INFO - accelerate.accelerator - Saving current state to /content/drive/MyDrive/train/checkpoint-10
/usr/local/lib/python3.10/dist-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /content/drive/MyDrive/zero/zero15 - will assume that the vocabulary was not modified.
  warnings.warn(
Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1791, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
    hook(self._models, weights, output_dir)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
    raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps:  10% 10/100 [00:07<01:11,  1.25it/s, loss=0.00439, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

Logs 2 (checkpointing_steps=20)

06/30/2024 01:11:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'rescale_betas_zero_snr', 'variance_type', 'sample_max_value', 'thresholding', 'timestep_spacing', 'dynamic_thresholding_ratio', 'clip_sample_range', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'latents_std', 'latents_mean', 'shift_factor', 'scaling_factor', 'force_upcast', 'use_quant_conv', 'use_post_quant_conv'} was not found in config. Values will be initialized to default values.
{'encoder_hid_dim', 'dropout', 'attention_type', 'resnet_out_scale_factor', 'time_embedding_type', 'conv_out_kernel', 'mid_block_only_cross_attention', 'transformer_layers_per_block', 'addition_embed_type_num_heads', 'num_attention_heads', 'only_cross_attention', 'num_class_embeds', 'time_embedding_act_fn', 'mid_block_type', 'addition_time_embed_dim', 'encoder_hid_dim_type', 'resnet_time_scale_shift', 'dual_cross_attention', 'class_embed_type', 'upcast_attention', 'resnet_skip_time_act', 'use_linear_projection', 'class_embeddings_concat', 'time_embedding_dim', 'addition_embed_type', 'conv_in_kernel', 'reverse_transformer_layers_per_block', 'timestep_post_act', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'time_cond_proj_dim'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:11:15 - INFO - __main__ - ***** Running training *****
06/30/2024 01:11:15 - INFO - __main__ -   Num examples = 10
06/30/2024 01:11:15 - INFO - __main__ -   Num batches each epoch = 10
06/30/2024 01:11:15 - INFO - __main__ -   Num Epochs = 10
06/30/2024 01:11:15 - INFO - __main__ -   Instantaneous batch size per device = 1
06/30/2024 01:11:15 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:11:15 - INFO - __main__ -   Gradient Accumulation steps = 1
06/30/2024 01:11:15 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:08<00:51,  1.74it/s, loss=0.125, lr=0.0001]  Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1854, in main
    images = [
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1855, in <listcomp>
    pipeline(**pipeline_args, generator=generator).images[0]
NameError: free variable 'pipeline' referenced before assignment in enclosing scope
Steps:  10% 10/100 [00:08<01:18,  1.14it/s, loss=0.125, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
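
The NameError above is what Python raises when a list comprehension reads a local variable that was never assigned on the path taken. The pull request linked to this issue is titled "Fix indent in dreambooth lora advanced SD 15 script", so a mis-indented block is a plausible cause. The following is a minimal sketch of that failure mode, not the script's actual code: the function names and the assumption that `pipeline` is only created when a validation prompt is set are hypothetical.

```python
def run_validation_buggy(validation_prompt, num_validation_images=2):
    """Toy stand-in for an end-of-epoch block (hypothetical, not the real script)."""
    if validation_prompt is not None:
        # `pipeline` is only bound on this branch.
        def pipeline(prompt):
            return f"image for {prompt!r}"

    # Mis-indented: runs even when the branch above was skipped.
    return [pipeline(validation_prompt) for _ in range(num_validation_images)]


def run_validation_fixed(validation_prompt, num_validation_images=2):
    """Same logic with the comprehension moved inside the guarded branch."""
    images = []
    if validation_prompt is not None:
        def pipeline(prompt):
            return f"image for {prompt!r}"

        images = [pipeline(validation_prompt) for _ in range(num_validation_images)]
    return images


try:
    run_validation_buggy(None)
except NameError as err:
    # UnboundLocalError subclasses NameError, so this also covers Python
    # versions that report the unbound local differently.
    print(type(err).__name__, err)

print(run_validation_fixed(None))  # []
```

The exact exception text varies between Python versions; the message in the traceback above is the one produced by Python 3.10, which matches the System Info.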

System Info

🤗 Diffusers version: 0.29.2
Platform: Linux-6.1.85+-x86_64-with-glibc2.35
Running on a notebook?: No
Running on Google Colab?: No
Python version: 3.10.12
PyTorch version (GPU?): 2.3.0+cu121 (True)
Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
Jax version: 0.4.26
JaxLib version: 0.4.26
Huggingface_hub version: 0.23.4
Transformers version: 4.42.3
Accelerate version: 0.31.0
PEFT version: 0.11.1
Bitsandbytes version: not installed
Safetensors version: 0.4.3
xFormers version: not installed
Accelerator: Tesla T4, 15360 MiB VRAM
Using GPU in script?:
Using distributed or parallel set-up in script?:

sayakpaul · 2024-06-29T03:37:49Z

Cc: @linoytsaban

DN6 · 2024-07-01T09:59:31Z

@josemerinom Thanks for reporting. Opened a PR to fix #8753

josemerinom · 2024-07-01T17:27:09Z

@linoytsaban @DN6

Hello, I tested the changes in the dreambooth-advanced branch (--branch dreambooth-advanced).
I still get the error when saving, but only when I use the --train_text_encoder parameter.
Without --train_text_encoder, the checkpoint is saved.

Reproduction: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb

DN6 · 2024-07-02T05:58:40Z

@josemerinom Could you share the exact traceback here? Not a screenshot.

josemerinom · 2024-07-02T11:27:49Z

> @josemerinom Could you share the exact traceback here? Not a screenshot.

Here is test 2, which I ran with the code changes from that branch: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb

Here is what you requested:

Reproduction

%cd /content
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch dreambooth-advanced https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
!accelerate launch --num_cpu_threads_per_process=1 train_dreambooth_lora_sd15_advanced.py \
  --adam_beta1=0.9 \
  --adam_beta2=0.999 \
  --adam_epsilon=1e-8 \
  --adam_weight_decay=0.01 \
  --checkpointing_steps=10 \
  --dataloader_num_workers=0 \
  --gradient_accumulation_steps=1 \
  --instance_data_dir="/content/dataset" \
  --instance_prompt="c4myl4" \
  --learning_rate=1e-4 \
  --logging_dir="/content/log" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --max_train_steps=100 \
  --mixed_precision="fp16" \
  --optimizer="AdamW" \
  --output_dir="/content/train" \
  --pretrained_model_name_or_path="josemerinom/zero15" \
  --prior_loss_weight=1 \
  --rank=32 \
  --resolution=512 \
  --seed=0 \
  --text_encoder_lr=1e-4 \
  --train_batch_size=1 \
  --train_text_encoder \
  #

Logs

2024-07-01 17:07:37.574871: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-01 17:07:37.574933: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-01 17:07:37.576329: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-01 17:07:37.590021: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-01 17:07:39.007929: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
07/01/2024 17:07:41 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

tokenizer/tokenizer_config.json: 100% 806/806 [00:00<00:00, 4.23MB/s]
tokenizer/vocab.json: 100% 1.06M/1.06M [00:00<00:00, 7.92MB/s]
tokenizer/merges.txt: 100% 525k/525k [00:00<00:00, 2.65MB/s]
tokenizer/special_tokens_map.json: 100% 472/472 [00:00<00:00, 2.87MB/s]
text_encoder/config.json: 100% 617/617 [00:00<00:00, 3.92MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100% 308/308 [00:00<00:00, 1.88MB/s]
{'timestep_spacing', 'thresholding', 'sample_max_value', 'rescale_betas_zero_snr', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'prediction_type'} was not found in config. Values will be initialized to default values.
model.safetensors: 100% 492M/492M [00:10<00:00, 46.9MB/s]
vae/config.json: 100% 547/547 [00:00<00:00, 2.60MB/s]
diffusion_pytorch_model.safetensors: 100% 335M/335M [00:02<00:00, 134MB/s]
{'scaling_factor', 'use_post_quant_conv', 'shift_factor', 'latents_std', 'force_upcast', 'use_quant_conv', 'latents_mean'} was not found in config. Values will be initialized to default values.
unet/config.json: 100% 743/743 [00:00<00:00, 4.52MB/s]
diffusion_pytorch_model.safetensors: 100% 3.44G/3.44G [01:22<00:00, 41.4MB/s]
{'addition_embed_type', 'class_embed_type', 'mid_block_only_cross_attention', 'class_embeddings_concat', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'time_embedding_act_fn', 'reverse_transformer_layers_per_block', 'resnet_skip_time_act', 'attention_type', 'time_embedding_dim', 'resnet_time_scale_shift', 'conv_in_kernel', 'conv_out_kernel', 'timestep_post_act', 'num_class_embeds', 'upcast_attention', 'encoder_hid_dim', 'addition_embed_type_num_heads', 'mid_block_type', 'only_cross_attention', 'time_cond_proj_dim', 'time_embedding_type', 'encoder_hid_dim_type', 'dropout', 'dual_cross_attention', 'use_linear_projection', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'transformer_layers_per_block', 'num_attention_heads'} was not found in config. Values will be initialized to default values.
validation prompt: None
07/01/2024 17:09:25 - INFO - __main__ - ***** Running training *****
07/01/2024 17:09:25 - INFO - __main__ -   Num examples = 10
07/01/2024 17:09:25 - INFO - __main__ -   Num batches each epoch = 10
07/01/2024 17:09:25 - INFO - __main__ -   Num Epochs = 10
07/01/2024 17:09:25 - INFO - __main__ -   Instantaneous batch size per device = 1
07/01/2024 17:09:25 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
07/01/2024 17:09:25 - INFO - __main__ -   Gradient Accumulation steps = 1
07/01/2024 17:09:25 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:08<00:49,  1.81it/s, loss=0.00326, lr=0.0001]07/01/2024 17:09:34 - INFO - accelerate.accelerator - Saving current state to /content/train/checkpoint-10
Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2012, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1802, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
    hook(self._models, weights, output_dir)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
    raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps:  10% 10/100 [00:08<01:20,  1.12it/s, loss=0.00326, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sd15_advanced.py', '--adam_beta1=0.9', '--adam_beta2=0.999', '--adam_epsilon=1e-8', '--adam_weight_decay=0.01', '--checkpointing_steps=10', '--dataloader_num_workers=0', '--gradient_accumulation_steps=1', '--instance_data_dir=/content/dataset', '--instance_prompt=c4myl4', '--learning_rate=1e-4', '--logging_dir=/content/log', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_grad_norm=1', '--max_train_steps=100', '--mixed_precision=fp16', '--optimizer=AdamW', '--output_dir=/content/train', '--pretrained_model_name_or_path=josemerinom/zero15', '--prior_loss_weight=1', '--rank=32', '--resolution=512', '--seed=0', '--text_encoder_lr=1e-4', '--train_batch_size=1', '--train_text_encoder']' returned non-zero exit status 1.
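
The traceback shows accelerate calling the script's `save_model_hook(models, weights, output_dir)`, which raises when it meets a model type it does not recognise. The thread does not say why the text encoder failed to match, so the sketch below only illustrates the shape of such a type-dispatching hook. The stand-in classes and `make_save_model_hook` are hypothetical and do not reproduce the script.

```python
class UNetStandIn:
    """Hypothetical placeholder for the UNet class."""


class TextEncoderStandIn:
    """Hypothetical placeholder for CLIPTextModel."""


def make_save_model_hook(handled_types):
    """Build a hook that only accepts models whose type is in handled_types."""

    def save_model_hook(models, weights, output_dir):
        collected = {}
        for model in models:
            for name, model_type in handled_types.items():
                if isinstance(model, model_type):
                    collected[name] = model
                    break
            else:
                # No branch matched: same failure shape as in the traceback.
                raise ValueError(f"unexpected save model: {model.__class__}")
        # A real hook would write files under output_dir; returning the
        # mapping here is for illustration only.
        return collected

    return save_model_hook


# With --train_text_encoder, both models are handed to the hook.
models = [UNetStandIn(), TextEncoderStandIn()]

unet_only_hook = make_save_model_hook({"unet": UNetStandIn})
try:
    unet_only_hook(models, weights=[], output_dir="checkpoint-10")
except ValueError as err:
    print(err)

both_hook = make_save_model_hook(
    {"unet": UNetStandIn, "text_encoder": TextEncoderStandIn}
)
print(sorted(both_hook(models, weights=[], output_dir="checkpoint-10")))  # ['text_encoder', 'unet']
```

This is consistent with the report that saving works without --train_text_encoder: in that case only the UNet-like model would reach the hook, so the unmatched branch is never hit.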

DN6 · 2024-07-05T06:02:26Z

@josemerinom Should be fixed in main now.

josemerinom · 2024-07-05T13:00:33Z

@DN6

Test 3: --branch main

Reproduction

https://colab.research.google.com/github/josemerinom/test/blob/master/test3.ipynb

Results

training start: OK
save checkpoint: OK
training completed: OK
test no lora / step 50 / step 100: OK

Training completed, but I only used 5 images and 100 steps, so the model learned little (too few steps).

I will try training for more steps and using DoRA (this is the reason I want to use advanced training).

Thanks

josemerinom added the "bug" label Jun 29, 2024

DN6 mentioned this issue Jul 1, 2024 in "Fix indent in dreambooth lora advanced SD 15 script" #8753 (merged)

josemerinom closed this as completed Jul 5, 2024
