text-to-image training: resuming from a checkpoint raises an error #3871

yijinsheng · 2023-06-26T03:07:46Z

Describe the bug

I am trying to train a text-to-image model with the script in diffusers/examples/text_to_image.
Training from scratch succeeds,
but when I add the parameter --resume_from_checkpoint "checkpoint-10000",
I get this error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Reproduction

Use the script to train a text-to-image model:

export MODEL_NAME="/root/yjs/models--stabilityai--stable-diffusion-2-1/snapshots/845609e6cf0a060d8cd837297e5c169df5bff72c"
export TRAIN_DIR="/root/yjs/train_data"
export OUTPUT_DIR="./out_models"

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --use_ema \
  --resolution=768 --center_crop --random_flip \
  --train_batch_size=8 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --image_column="image" --caption_column='additional_feature' \
  --enable_xformers_memory_efficient_attention \
  --checkpointing_steps 1000

This produces many checkpoint folders in my out_models directory. I chose one, added the parameter --resume_from_checkpoint "checkpoint-10000", and ran the following script:

export MODEL_NAME="/root/yjs/models--stabilityai--stable-diffusion-2-1/snapshots/845609e6cf0a060d8cd837297e5c169df5bff72c"
export TRAIN_DIR="/root/yjs/train_data"
export OUTPUT_DIR="./out_models"

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --use_ema \
  --resolution=768 --center_crop --random_flip \
  --train_batch_size=8 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --image_column="image" --caption_column='additional_feature' \
  --enable_xformers_memory_efficient_attention \
  --checkpointing_steps 1000 \
  --resume_from_checkpoint "checkpoint-10000"

Logs

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/yjs/diffusers/examples/text_to_image/train_text_to_image.py:792 in <module>                │
│                                                                                                  │
│   789                                                                                            │
│   790                                                                                            │
│   791 if __name__ == "__main__":                                                                 │
│ ❱ 792 │   main()                                                                                 │
│   793                                                                                            │
│                                                                                                  │
│ /root/yjs/diffusers/examples/text_to_image/train_text_to_image.py:751 in main                    │
│                                                                                                  │
│   748 │   │   │   # Checks if the accelerator has performed an optimization step behind the sc   │
│   749 │   │   │   if accelerator.sync_gradients:                                                 │
│   750 │   │   │   │   if args.use_ema:                                                           │
│ ❱ 751 │   │   │   │   │   ema_unet.step(unet.parameters())                                       │
│   752 │   │   │   │   progress_bar.update(1)                                                     │
│   753 │   │   │   │   global_step += 1                                                           │
│   754 │   │   │   │   accelerator.log({"train_loss": train_loss}, step=global_step)              │
│                                                                                                  │
│ /root/.local/conda/envs/sd/lib/python3.10/site-packages/torch/utils/_contextlib.py:115 in        │
│ decorate_context                                                                                 │
│                                                                                                  │
│   112 │   @functools.wraps(func)                                                                 │
│   113 │   def decorate_context(*args, **kwargs):                                                 │
│   114 │   │   with ctx_factory():                                                                │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                                   │
│   116 │                                                                                          │
│   117 │   return decorate_context                                                                │
│   118                                                                                            │
│                                                                                                  │
│ /root/yjs/diffusers/examples/text_to_image/train_text_to_image.py:320 in step                    │
│                                                                                                  │
│   317 │   │                                                                                      │
│   318 │   │   for s_param, param in zip(self.shadow_params, parameters):                         │
│   319 │   │   │   if param.requires_grad:                                                        │
│ ❱ 320 │   │   │   │   s_param.sub_(one_minus_decay * (s_param - param))                          │
│   321 │   │   │   else:                                                                          │
│   322 │   │   │   │   s_param.copy_(param)                                                       │
│   323                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Steps:   1%|▍                                                      | 7/1000 [00:25<1:01:21,  3.71s/it, lr=1e-5, step_loss=0.372]

System Info

1. System info

Tesla V100 NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0
Linux dl-1626155627-pod-jupyter-67f8849f59-k8tff 3.10.0-1062.el7.bclinux.x86_64 #1 SMP Thu Mar 5 14:02:53 CST 2020 x86_64 x86_64 x86_64 GNU/Linux

2. Python env

diffusers version: 0.16.1
Platform: Linux-3.10.0-1062.el7.bclinux.x86_64-x86_64-with-glibc2.27
Python version: 3.10.11
PyTorch version (GPU?): 2.0.1+cu117 (True)
Huggingface_hub version: 0.14.1
Transformers version: 4.29.2
Accelerate version: 0.19.0
xFormers version: 0.0.20
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

No response


patrickvonplaten · 2023-06-28T13:22:53Z

cc @sayakpaul here

sayakpaul · 2023-06-29T05:43:48Z

Hmm, seems like this is a device placement issue. @yijinsheng would you maybe like to contribute a PR for this to be fixed?
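To make the device placement diagnosis concrete, here is a minimal sketch of what may be happening. It is not the script's actual code: `TinyEMA` is a hypothetical stand-in that only mirrors the update line from the traceback, and the "resume" is simulated by pushing the shadow tensors to the CPU. The assumption is that the restored EMA weights end up on the CPU while the UNet parameters live on `cuda:0`, so moving the EMA state back to the training device before the first `step` call should avoid the mismatch.

```python
import torch


class TinyEMA:
    """Hypothetical stand-in for the script's EMA helper (not the diffusers class)."""

    def __init__(self, parameters, decay=0.999):
        self.decay = decay
        # Shadow copies start on the same device as the parameters.
        self.shadow_params = [p.detach().clone() for p in parameters]

    def to(self, device):
        # Move every shadow tensor to the given device.
        self.shadow_params = [s.to(device) for s in self.shadow_params]

    @torch.no_grad()
    def step(self, parameters):
        one_minus_decay = 1.0 - self.decay
        for s_param, param in zip(self.shadow_params, parameters):
            if param.requires_grad:
                # Same update as in the traceback; both tensors must share a device.
                s_param.sub_(one_minus_decay * (s_param - param))
            else:
                s_param.copy_(param)


# Falls back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(4, 4).to(device)
params = list(model.parameters())

ema = TinyEMA(params)

# Simulated resume (assumption): the restored shadow tensors land on the CPU.
ema.shadow_params = [s.cpu() for s in ema.shadow_params]

# Assumed fix: move the EMA state back to the training device before stepping.
ema.to(device)
ema.step(params)

devices_match = all(s.device == p.device for s, p in zip(ema.shadow_params, params))
print("shadow params on training device:", devices_match)
```

On a GPU machine, skipping the `ema.to(device)` call in this sketch should produce the same kind of "Expected all tensors to be on the same device" error. Where the equivalent call belongs in train_text_to_image.py (for example, right after the checkpoint state is loaded) would need to be checked against the script itself.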

yijinsheng · 2023-06-29T05:53:42Z

> Hmm, seems like this is a device placement issue. @yijinsheng would you maybe like to contribute a PR for this to be fixed?

In fact, I don't know how to solve it and hope someone can help me 😅 @sayakpaul

github-actions · 2023-07-26T15:02:48Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

yijinsheng added the bug label Jun 26, 2023

patrickvonplaten assigned sayakpaul Jun 28, 2023

github-actions bot added the stale label Jul 26, 2023

github-actions bot closed this as completed Aug 3, 2023
