Advanced training SD1.5 has an issue when saving checkpoints #8732
Comments
Cc: @linoytsaban
@josemerinom Thanks for reporting. Opened a PR to fix this: #8753
Hello, I tested the changes in the dreambooth-advanced branch. Reproduction: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb
@josemerinom Could you share the exact traceback here (not a screenshot)?
Here is test 2 that I ran with the changes made in the code: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb Here is what you requested:
Reproduction
%cd /content
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch dreambooth-advanced https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
!accelerate launch --num_cpu_threads_per_process=1 train_dreambooth_lora_sd15_advanced.py \
--adam_beta1=0.9 \
--adam_beta2=0.999 \
--adam_epsilon=1e-8 \
--adam_weight_decay=0.01 \
--checkpointing_steps=10 \
--dataloader_num_workers=0 \
--gradient_accumulation_steps=1 \
--instance_data_dir="/content/dataset" \
--instance_prompt="c4myl4" \
--learning_rate=1e-4 \
--logging_dir="/content/log" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_grad_norm=1 \
--max_train_steps=100 \
--mixed_precision="fp16" \
--optimizer="AdamW" \
--output_dir="/content/train" \
--pretrained_model_name_or_path="josemerinom/zero15" \
--prior_loss_weight=1 \
--rank=32 \
--resolution=512 \
--seed=0 \
--text_encoder_lr=1e-4 \
--train_batch_size=1 \
--train_text_encoder
Logs
2024-07-01 17:07:37.574871: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-01 17:07:37.574933: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-01 17:07:37.576329: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-01 17:07:37.590021: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-01 17:07:39.007929: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
07/01/2024 17:07:41 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
tokenizer/tokenizer_config.json: 100% 806/806 [00:00<00:00, 4.23MB/s]
tokenizer/vocab.json: 100% 1.06M/1.06M [00:00<00:00, 7.92MB/s]
tokenizer/merges.txt: 100% 525k/525k [00:00<00:00, 2.65MB/s]
tokenizer/special_tokens_map.json: 100% 472/472 [00:00<00:00, 2.87MB/s]
text_encoder/config.json: 100% 617/617 [00:00<00:00, 3.92MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100% 308/308 [00:00<00:00, 1.88MB/s]
{'timestep_spacing', 'thresholding', 'sample_max_value', 'rescale_betas_zero_snr', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'prediction_type'} was not found in config. Values will be initialized to default values.
model.safetensors: 100% 492M/492M [00:10<00:00, 46.9MB/s]
vae/config.json: 100% 547/547 [00:00<00:00, 2.60MB/s]
diffusion_pytorch_model.safetensors: 100% 335M/335M [00:02<00:00, 134MB/s]
{'scaling_factor', 'use_post_quant_conv', 'shift_factor', 'latents_std', 'force_upcast', 'use_quant_conv', 'latents_mean'} was not found in config. Values will be initialized to default values.
unet/config.json: 100% 743/743 [00:00<00:00, 4.52MB/s]
diffusion_pytorch_model.safetensors: 100% 3.44G/3.44G [01:22<00:00, 41.4MB/s]
{'addition_embed_type', 'class_embed_type', 'mid_block_only_cross_attention', 'class_embeddings_concat', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'time_embedding_act_fn', 'reverse_transformer_layers_per_block', 'resnet_skip_time_act', 'attention_type', 'time_embedding_dim', 'resnet_time_scale_shift', 'conv_in_kernel', 'conv_out_kernel', 'timestep_post_act', 'num_class_embeds', 'upcast_attention', 'encoder_hid_dim', 'addition_embed_type_num_heads', 'mid_block_type', 'only_cross_attention', 'time_cond_proj_dim', 'time_embedding_type', 'encoder_hid_dim_type', 'dropout', 'dual_cross_attention', 'use_linear_projection', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'transformer_layers_per_block', 'num_attention_heads'} was not found in config. Values will be initialized to default values.
validation prompt: None
07/01/2024 17:09:25 - INFO - __main__ - ***** Running training *****
07/01/2024 17:09:25 - INFO - __main__ - Num examples = 10
07/01/2024 17:09:25 - INFO - __main__ - Num batches each epoch = 10
07/01/2024 17:09:25 - INFO - __main__ - Num Epochs = 10
07/01/2024 17:09:25 - INFO - __main__ - Instantaneous batch size per device = 1
07/01/2024 17:09:25 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
07/01/2024 17:09:25 - INFO - __main__ - Gradient Accumulation steps = 1
07/01/2024 17:09:25 - INFO - __main__ - Total optimization steps = 100
Steps: 10% 10/100 [00:08<00:49, 1.81it/s, loss=0.00326, lr=0.0001]07/01/2024 17:09:34 - INFO - accelerate.accelerator - Saving current state to /content/train/checkpoint-10
Traceback (most recent call last):
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2012, in <module>
main(args)
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1802, in main
accelerator.save_state(save_path)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
hook(self._models, weights, output_dir)
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps: 10% 10/100 [00:08<01:20, 1.12it/s, loss=0.00326, lr=0.0001]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sd15_advanced.py', '--adam_beta1=0.9', '--adam_beta2=0.999', '--adam_epsilon=1e-8', '--adam_weight_decay=0.01', '--checkpointing_steps=10', '--dataloader_num_workers=0', '--gradient_accumulation_steps=1', '--instance_data_dir=/content/dataset', '--instance_prompt=c4myl4', '--learning_rate=1e-4', '--logging_dir=/content/log', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_grad_norm=1', '--max_train_steps=100', '--mixed_precision=fp16', '--optimizer=AdamW', '--output_dir=/content/train', '--pretrained_model_name_or_path=josemerinom/zero15', '--prior_loss_weight=1', '--rank=32', '--resolution=512', '--seed=0', '--text_encoder_lr=1e-4', '--train_batch_size=1', '--train_text_encoder']' returned non-zero exit status 1.
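For context on the traceback above: the script registers a save hook with Accelerate, and accelerator.save_state() passes every tracked model to that hook. With --train_text_encoder, the tracked models include a CLIPTextModel, and a hook that only recognizes the UNet falls through to the ValueError seen in the log. Below is a minimal, self-contained sketch of that mechanism; the stand-in classes replace the real UNet2DConditionModel / CLIPTextModel, so this illustrates the failure mode rather than reproducing the script's exact code.

# Sketch of the save-hook dispatch, assuming isinstance-based branching.
class UNetStandIn:  # stand-in for diffusers' UNet2DConditionModel
    pass

class TextEncoderStandIn:  # stand-in for transformers' CLIPTextModel
    pass

def save_model_hook(models, weights, output_dir):
    for model in models:
        if isinstance(model, UNetStandIn):
            pass  # save the UNet LoRA weights here
        elif isinstance(model, TextEncoderStandIn):
            pass  # the fix: also recognize the text encoder and save its LoRA weights
        else:
            # Without the elif branch above, a tracked text encoder lands here,
            # which is the crash reported in this issue.
            raise ValueError(f"unexpected save model: {model.__class__}")

# With --train_text_encoder, Accelerate tracks both models:
save_model_hook([UNetStandIn(), TextEncoderStandIn()], [], "/content/train/checkpoint-10")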
@josemerinom Should be fixed in main now.
Test 3: --branch main
Reproduction: https://colab.research.google.com/github/josemerinom/test/blob/master/test3.ipynb
Results: training start: OK. The training completed, but I only used 5 images and 100 steps, so the learning is low (few steps). I will try training with more steps and using DoRA (this is the reason I want to use advanced training). Thanks
Describe the bug
Today I trained using examples/dreambooth/train_dreambooth_lora.py in Google Colab, and everything was OK.
I wanted to try examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py. I use the original Stable Diffusion 1.5 model (which I cloned to my HF account), but when the script tries to save a checkpoint, an error is raised.
dataset = 10 images
checkpointing_steps=10 --> ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
A different error occurs when I change checkpointing_steps to a number different from the number of images (a minimal illustration of this NameError follows the log lines below):
checkpointing_steps=20 --> NameError: free variable 'pipeline' referenced before assignment in enclosing scope
validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
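As a side note on this second error: Python raises "free variable 'pipeline' referenced before assignment in enclosing scope" when a nested function reads a name that the enclosing function only binds on some code paths. Below is a minimal, self-contained reproduction of that language behavior; the function names and structure are illustrative, not the script's actual code.

def main(run_validation: bool):
    if run_validation:
        pipeline = object()  # 'pipeline' is only bound on the validation path

    def save_checkpoint():
        # 'pipeline' is a free variable here, looked up in main()'s scope.
        return pipeline

    # If validation never ran, the closure's cell was never filled,
    # so this raises (on Python 3.10, as in the Colab logs):
    # NameError: free variable 'pipeline' referenced before assignment
    # in enclosing scope
    return save_checkpoint()

main(run_validation=False)  # raises the NameError described above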
Reproduction
%cd /content
!mkdir /content/cache
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch v0.29.2-patch https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
https://colab.research.google.com/github/josemerinom/test/blob/master/test.ipynb
Logs 1 (checkpointing_steps=10)
Logs 2 (checkpointing_steps=20)
System Info