Advanced training SD1.5 has an issue when saving checkpoints #8732

Closed
josemerinom opened this issue Jun 29, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@josemerinom

josemerinom commented Jun 29, 2024

Describe the bug

Today I trained with examples/dreambooth/train_dreambooth_lora.py in Google Colab and everything worked.

I then tried examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py with the original Stable Diffusion 1.5 model (which I cloned to my HF account), but an error is raised when the script tries to save a checkpoint.

dataset = 10 images

checkpointing_steps=10 --> ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>

A different error occurs when I set checkpointing_steps to a value other than the number of images:
checkpointing_steps=20 --> NameError: free variable 'pipeline' referenced before assignment in enclosing scope

validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
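
One way this kind of NameError can arise, as a minimal sketch and not the script's actual code: a name is bound only inside a conditional branch (for example, only when a validation prompt is set) and is later read inside a list comprehension. The function names `run_validation` and `describe` are hypothetical. The log line `validation prompt: None` fits this pattern, but treating it as the cause is an assumption.

```python
def run_validation(validation_prompt, num_images=2):
    """Hypothetical stand-in for a validation step; not the script's code."""
    if validation_prompt is not None:
        # `pipeline` is bound only on this branch.
        def pipeline(prompt):
            return f"image for {prompt}"

    # The comprehension reads `pipeline` from the enclosing function scope,
    # so it fails when the branch above was skipped.
    return [pipeline(validation_prompt) for _ in range(num_images)]


def describe(validation_prompt):
    """Run the step and return either the images or a description of the error."""
    try:
        return run_validation(validation_prompt)
    except NameError as err:  # UnboundLocalError is a subclass of NameError
        return f"{type(err).__name__}: {err}"


if __name__ == "__main__":
    print(describe("a photo"))  # two placeholder "images"
    print(describe(None))       # an error message that mentions `pipeline`
```

On Python 3.10, as used in this report, the message should match the one above. Newer interpreters word it differently or raise UnboundLocalError, which is a subclass of NameError.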

Reproduction

%cd /content
!mkdir /content/cache
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch v0.29.2-patch https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training

https://colab.research.google.com/github/josemerinom/test/blob/master/test.ipynb

Logs 1 (checkpointing_steps=10)

06/30/2024 01:08:45 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'prediction_type', 'variance_type', 'dynamic_thresholding_ratio', 'clip_sample_range', 'thresholding', 'timestep_spacing', 'rescale_betas_zero_snr', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'use_post_quant_conv', 'force_upcast', 'use_quant_conv', 'latents_std', 'scaling_factor', 'shift_factor', 'latents_mean'} was not found in config. Values will be initialized to default values.
{'num_class_embeds', 'encoder_hid_dim', 'projection_class_embeddings_input_dim', 'time_embedding_act_fn', 'use_linear_projection', 'resnet_skip_time_act', 'mid_block_only_cross_attention', 'dual_cross_attention', 'attention_type', 'time_cond_proj_dim', 'addition_embed_type_num_heads', 'time_embedding_type', 'conv_out_kernel', 'reverse_transformer_layers_per_block', 'class_embeddings_concat', 'resnet_time_scale_shift', 'class_embed_type', 'transformer_layers_per_block', 'encoder_hid_dim_type', 'conv_in_kernel', 'only_cross_attention', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'cross_attention_norm', 'addition_embed_type', 'time_embedding_dim', 'mid_block_type', 'dropout', 'num_attention_heads', 'timestep_post_act', 'upcast_attention'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
06/30/2024 01:09:00 - INFO - __main__ -   Num examples = 10
06/30/2024 01:09:00 - INFO - __main__ -   Num batches each epoch = 10
06/30/2024 01:09:00 - INFO - __main__ -   Num Epochs = 10
06/30/2024 01:09:00 - INFO - __main__ -   Instantaneous batch size per device = 1
06/30/2024 01:09:00 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:09:00 - INFO - __main__ -   Gradient Accumulation steps = 1
06/30/2024 01:09:00 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:07<00:50,  1.80it/s, loss=0.00439, lr=0.0001]06/30/2024 01:09:07 - INFO - accelerate.accelerator - Saving current state to /content/drive/MyDrive/train/checkpoint-10
/usr/local/lib/python3.10/dist-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /content/drive/MyDrive/zero/zero15 - will assume that the vocabulary was not modified.
  warnings.warn(
Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1791, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
    hook(self._models, weights, output_dir)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
    raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps:  10% 10/100 [00:07<01:11,  1.25it/s, loss=0.00439, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

Logs 2 (checkpointing_steps=20)

06/30/2024 01:11:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'rescale_betas_zero_snr', 'variance_type', 'sample_max_value', 'thresholding', 'timestep_spacing', 'dynamic_thresholding_ratio', 'clip_sample_range', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'latents_std', 'latents_mean', 'shift_factor', 'scaling_factor', 'force_upcast', 'use_quant_conv', 'use_post_quant_conv'} was not found in config. Values will be initialized to default values.
{'encoder_hid_dim', 'dropout', 'attention_type', 'resnet_out_scale_factor', 'time_embedding_type', 'conv_out_kernel', 'mid_block_only_cross_attention', 'transformer_layers_per_block', 'addition_embed_type_num_heads', 'num_attention_heads', 'only_cross_attention', 'num_class_embeds', 'time_embedding_act_fn', 'mid_block_type', 'addition_time_embed_dim', 'encoder_hid_dim_type', 'resnet_time_scale_shift', 'dual_cross_attention', 'class_embed_type', 'upcast_attention', 'resnet_skip_time_act', 'use_linear_projection', 'class_embeddings_concat', 'time_embedding_dim', 'addition_embed_type', 'conv_in_kernel', 'reverse_transformer_layers_per_block', 'timestep_post_act', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'time_cond_proj_dim'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:11:15 - INFO - __main__ - ***** Running training *****
06/30/2024 01:11:15 - INFO - __main__ -   Num examples = 10
06/30/2024 01:11:15 - INFO - __main__ -   Num batches each epoch = 10
06/30/2024 01:11:15 - INFO - __main__ -   Num Epochs = 10
06/30/2024 01:11:15 - INFO - __main__ -   Instantaneous batch size per device = 1
06/30/2024 01:11:15 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:11:15 - INFO - __main__ -   Gradient Accumulation steps = 1
06/30/2024 01:11:15 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:08<00:51,  1.74it/s, loss=0.125, lr=0.0001]  Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1854, in main
    images = [
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1855, in <listcomp>
    pipeline(**pipeline_args, generator=generator).images[0]
NameError: free variable 'pipeline' referenced before assignment in enclosing scope
Steps:  10% 10/100 [00:08<01:18,  1.14it/s, loss=0.125, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

System Info

  • 🤗 Diffusers version: 0.29.2
  • Platform: Linux-6.1.85+-x86_64-with-glibc2.35
  • Running on a notebook?: No
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Huggingface_hub version: 0.23.4
  • Transformers version: 4.42.3
  • Accelerate version: 0.31.0
  • PEFT version: 0.11.1
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.3
  • xFormers version: not installed
  • Accelerator: Tesla T4, 15360 MiB VRAM
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:
josemerinom added the bug label Jun 29, 2024
@sayakpaul
Member

Cc: @linoytsaban

@DN6
Collaborator

DN6 commented Jul 1, 2024

@josemerinom Thanks for reporting. I opened a PR with a fix: #8753

@josemerinom
Author

josemerinom commented Jul 1, 2024

@linoytsaban @DN6

Hello, I tested the changes on the dreambooth-advanced branch (--branch dreambooth-advanced).
I still get the error when saving, but only when I pass --train_text_encoder.
Without --train_text_encoder, the checkpoint is saved.

Reproduction: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb
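
The behavior described here (saving works until a text encoder is also being trained) is what a type-dispatching save hook produces when one model type has no matching branch. The following is a minimal sketch under that assumption, not the script's actual code: `UNet`, `TextEncoder`, and `make_save_model_hook` are hypothetical names, and only the `(models, weights, output_dir)` signature and the error message are taken from the traceback.

```python
class UNet:
    """Hypothetical placeholder for the denoising model."""


class TextEncoder:
    """Hypothetical placeholder for a text encoder such as CLIPTextModel."""


def make_save_model_hook(handled_types):
    """Build a hook that accepts only models whose type is in `handled_types`.

    `handled_types` maps a label (e.g. "unet") to a class.
    """

    def save_model_hook(models, weights, output_dir):
        to_save = {}
        for model in models:
            for label, cls in handled_types.items():
                if isinstance(model, cls):
                    to_save[label] = model
                    break
            else:
                # No branch matched: same failure mode as in the traceback.
                raise ValueError(f"unexpected save model: {model.__class__}")
        # A real hook would write the collected weights under `output_dir`.
        return to_save

    return save_model_hook


if __name__ == "__main__":
    models = [UNet(), TextEncoder()]
    # The text encoder is being trained too, but the hook only knows the UNet.
    hook = make_save_model_hook({"unet": UNet})
    try:
        hook(models, [], "/tmp/checkpoint-10")
    except ValueError as err:
        print(err)
```

With only the UNet registered, the hook raises as soon as it sees the text encoder. Registering both types lets the same call succeed, which matches the observation that the checkpoint is saved when --train_text_encoder is not used.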

@DN6
Collaborator

DN6 commented Jul 2, 2024

@josemerinom Could you share the exact traceback here? Not a screenshot.

@josemerinom
Author

josemerinom commented Jul 2, 2024

> @josemerinom Could you share the exact traceback here? Not a screenshot.

Here is test 2, which I ran with the changes made to the code: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb

Here is what you requested:

Reproduction

%cd /content
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch dreambooth-advanced https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
!accelerate launch --num_cpu_threads_per_process=1 train_dreambooth_lora_sd15_advanced.py \
  --adam_beta1=0.9 \
  --adam_beta2=0.999 \
  --adam_epsilon=1e-8 \
  --adam_weight_decay=0.01 \
  --checkpointing_steps=10 \
  --dataloader_num_workers=0 \
  --gradient_accumulation_steps=1 \
  --instance_data_dir="/content/dataset" \
  --instance_prompt="c4myl4" \
  --learning_rate=1e-4 \
  --logging_dir="/content/log" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --max_train_steps=100 \
  --mixed_precision="fp16" \
  --optimizer="AdamW" \
  --output_dir="/content/train" \
  --pretrained_model_name_or_path="josemerinom/zero15" \
  --prior_loss_weight=1 \
  --rank=32 \
  --resolution=512 \
  --seed=0 \
  --text_encoder_lr=1e-4 \
  --train_batch_size=1 \
  --train_text_encoder \
  #

Logs

2024-07-01 17:07:37.574871: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-01 17:07:37.574933: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-01 17:07:37.576329: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-01 17:07:37.590021: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-01 17:07:39.007929: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
07/01/2024 17:07:41 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

tokenizer/tokenizer_config.json: 100% 806/806 [00:00<00:00, 4.23MB/s]
tokenizer/vocab.json: 100% 1.06M/1.06M [00:00<00:00, 7.92MB/s]
tokenizer/merges.txt: 100% 525k/525k [00:00<00:00, 2.65MB/s]
tokenizer/special_tokens_map.json: 100% 472/472 [00:00<00:00, 2.87MB/s]
text_encoder/config.json: 100% 617/617 [00:00<00:00, 3.92MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100% 308/308 [00:00<00:00, 1.88MB/s]
{'timestep_spacing', 'thresholding', 'sample_max_value', 'rescale_betas_zero_snr', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'prediction_type'} was not found in config. Values will be initialized to default values.
model.safetensors: 100% 492M/492M [00:10<00:00, 46.9MB/s]
vae/config.json: 100% 547/547 [00:00<00:00, 2.60MB/s]
diffusion_pytorch_model.safetensors: 100% 335M/335M [00:02<00:00, 134MB/s]
{'scaling_factor', 'use_post_quant_conv', 'shift_factor', 'latents_std', 'force_upcast', 'use_quant_conv', 'latents_mean'} was not found in config. Values will be initialized to default values.
unet/config.json: 100% 743/743 [00:00<00:00, 4.52MB/s]
diffusion_pytorch_model.safetensors: 100% 3.44G/3.44G [01:22<00:00, 41.4MB/s]
{'addition_embed_type', 'class_embed_type', 'mid_block_only_cross_attention', 'class_embeddings_concat', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'time_embedding_act_fn', 'reverse_transformer_layers_per_block', 'resnet_skip_time_act', 'attention_type', 'time_embedding_dim', 'resnet_time_scale_shift', 'conv_in_kernel', 'conv_out_kernel', 'timestep_post_act', 'num_class_embeds', 'upcast_attention', 'encoder_hid_dim', 'addition_embed_type_num_heads', 'mid_block_type', 'only_cross_attention', 'time_cond_proj_dim', 'time_embedding_type', 'encoder_hid_dim_type', 'dropout', 'dual_cross_attention', 'use_linear_projection', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'transformer_layers_per_block', 'num_attention_heads'} was not found in config. Values will be initialized to default values.
validation prompt: None
07/01/2024 17:09:25 - INFO - __main__ - ***** Running training *****
07/01/2024 17:09:25 - INFO - __main__ -   Num examples = 10
07/01/2024 17:09:25 - INFO - __main__ -   Num batches each epoch = 10
07/01/2024 17:09:25 - INFO - __main__ -   Num Epochs = 10
07/01/2024 17:09:25 - INFO - __main__ -   Instantaneous batch size per device = 1
07/01/2024 17:09:25 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
07/01/2024 17:09:25 - INFO - __main__ -   Gradient Accumulation steps = 1
07/01/2024 17:09:25 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:08<00:49,  1.81it/s, loss=0.00326, lr=0.0001]07/01/2024 17:09:34 - INFO - accelerate.accelerator - Saving current state to /content/train/checkpoint-10
Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2012, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1802, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
    hook(self._models, weights, output_dir)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
    raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps:  10% 10/100 [00:08<01:20,  1.12it/s, loss=0.00326, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sd15_advanced.py', '--adam_beta1=0.9', '--adam_beta2=0.999', '--adam_epsilon=1e-8', '--adam_weight_decay=0.01', '--checkpointing_steps=10', '--dataloader_num_workers=0', '--gradient_accumulation_steps=1', '--instance_data_dir=/content/dataset', '--instance_prompt=c4myl4', '--learning_rate=1e-4', '--logging_dir=/content/log', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_grad_norm=1', '--max_train_steps=100', '--mixed_precision=fp16', '--optimizer=AdamW', '--output_dir=/content/train', '--pretrained_model_name_or_path=josemerinom/zero15', '--prior_loss_weight=1', '--rank=32', '--resolution=512', '--seed=0', '--text_encoder_lr=1e-4', '--train_batch_size=1', '--train_text_encoder']' returned non-zero exit status 1.

@DN6
Collaborator

DN6 commented Jul 5, 2024

@josemerinom Should be fixed in main now.

@josemerinom
Author

josemerinom commented Jul 5, 2024

@DN6

Test 3: --branch main

Reproduction

https://colab.research.google.com/github/josemerinom/test/blob/master/test3.ipynb

Results

training start: OK
save checkpoint: OK
training completed: OK
test no lora / step 50 / step 100: OK

Training worked, but I only used 5 images and 100 steps, so the model learned little (too few steps).

I will try training for more steps and with DoRA (this is why I want to use the advanced training script).

Thanks
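
For a sense of scale, here is a rough sketch of how the step counts in the logs relate to dataset size. It assumes the usual relationship of steps per epoch = ceil(examples / batch size / gradient accumulation steps), which matches the logged values (10 examples, 10 batches per epoch, 10 epochs, 100 steps). The helper name `training_schedule` is hypothetical. Batch size 1, no gradient accumulation, and `checkpointing_steps=50` are assumptions for test 3, since its exact settings are not given in this thread.

```python
import math


def training_schedule(num_examples, batch_size, grad_accum_steps,
                      max_train_steps, checkpointing_steps):
    """Return (steps_per_epoch, num_epochs, checkpoint_steps) for a run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    num_epochs = math.ceil(max_train_steps / steps_per_epoch)
    # Global steps at which a checkpoint would be written.
    checkpoint_steps = list(
        range(checkpointing_steps, max_train_steps + 1, checkpointing_steps)
    )
    return steps_per_epoch, num_epochs, checkpoint_steps


if __name__ == "__main__":
    # Settings from the logged runs: 10 images, 100 steps, checkpoint every 10.
    print(training_schedule(10, 1, 1, 100, 10))
    # Assumed settings for test 3: 5 images, 100 steps, checkpoint every 50.
    print(training_schedule(5, 1, 1, 100, 50))
```

Under these assumptions, 100 steps on 5 images amounts to 20 passes over the dataset, so raising `max_train_steps` is the direct way to train longer.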
