[Community] Help us fix the LR schedulers when num_train_epochs is passed in a distributed training env #8384


Open · 11 of 48 tasks
sayakpaul opened this issue Jun 3, 2024 · 5 comments · Fixed by #11240 · May be fixed by #11239

sayakpaul (Member) commented Jun 3, 2024

Context

Refer to #8312 for the full context. The changes introduced in that PR should be propagated to the following scripts, too (a rough sketch of the fix follows the list):

  • advanced_diffusion_training

    • train_dreambooth_lora_sd15_advanced.py
    • train_dreambooth_lora_sdxl_advanced.py
  • consistency_distillation

    • train_lcm_distill_lora_sdxl.py
  • controlnet

    • train_controlnet.py
    • train_controlnet_sdxl.py
  • custom_diffusion

    • train_custom_diffusion.py
  • dreambooth

    • train_dreambooth.py
    • train_dreambooth_lora.py
    • train_dreambooth_lora_sdxl.py
  • instruct_pix2pix

    • train_instruct_pix2pix.py
    • train_instruct_pix2pix_sdxl.py
  • kandinsky2_2/text_to_image

    • train_text_to_image_decoder.py
    • train_text_to_image_prior.py
    • train_text_to_image_lora_decoder.py
    • train_text_to_image_lora_prior.py
  • t2i_adapter

    • train_t2i_adapter_sdxl.py
  • text_to_image

    • train_text_to_image.py
    • train_text_to_image_sdxl.py
    • train_text_to_image_lora.py
    • train_text_to_image_lora_sdxl.py
  • textual_inversion

    • textual_inversion.py
    • textual_inversion_sdxl.py
  • unconditional_image_generation

    • train_unconditional.py
  • wuerstchen

    • text_to_image/train_text_to_image_prior.py
    • text_to_image/train_text_to_image_lora_prior.py
  • research_projects (low-priority)

    • consistency_training/train_cm_ct_unconditional.py
    • diffusion_dpo/train_diffusion_dpo.py
    • diffusion_dpo/train_diffusion_dpo_sdxl.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora.py
    • dreambooth_inpaint/train_dreambooth_inpaint.py
    • dreambooth_inpaint/train_dreambooth_inpaint_lora.py
    • instructpix2pix_lora/train_instruct_pix2pix_lora.py
    • intel_opts/textual_inversion/textual_inversion_bf16.py
    • intel_opts/textual_inversion_dfq/textual_inversion.py
    • lora/train_text_to_image_lora.py
    • multi_subject_dreambooth/train_multi_subject_dreambooth.py
    • multi_token_textual_inversion/textual_inversion.py
    • onnxruntime/text_to_image/train_text_to_image.py
    • onnxruntime/textual_inversion/textual_inversion.py
    • onnxruntime/unconditional_image_generation/train_unconditional.py
    • realfill/train_realfill.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora_sdxl.py
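
For reference, the core of the change introduced in #8312 looks roughly like the sketch below. This is a minimal, illustrative version that assumes the variable names commonly used in the diffusers training scripts (`args`, `optimizer`, `train_dataloader`, `accelerator`); check the diff of #8312 for the exact code to port.

```python
import math

from diffusers.optimization import get_scheduler


def build_lr_scheduler(args, optimizer, train_dataloader, accelerator):
    """Create the LR scheduler with a step count that matches distributed training.

    The scheduler is created *before* accelerator.prepare() shards the dataloader,
    so when --max_train_steps is derived from --num_train_epochs we first estimate
    the per-process dataloader length after sharding. The scheduler's warmup and
    total step counts are then scaled by accelerator.num_processes, because a
    scheduler prepared with accelerate is stepped num_processes times per update.
    """
    num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes
    if args.max_train_steps is None:
        # Estimate how long the dataloader will be on each process after sharding.
        len_train_dataloader_after_sharding = math.ceil(
            len(train_dataloader) / accelerator.num_processes
        )
        num_update_steps_per_epoch = math.ceil(
            len_train_dataloader_after_sharding / args.gradient_accumulation_steps
        )
        num_training_steps_for_scheduler = (
            args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
        )
    else:
        num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes

    return get_scheduler(
        args.lr_scheduler,
        optimizer=optimizer,
        num_warmup_steps=num_warmup_steps_for_scheduler,
        num_training_steps=num_training_steps_for_scheduler,
    )
```

The PR also recomputes `num_update_steps_per_epoch` from the sharded dataloader after `accelerator.prepare()` and logs a warning if the scheduler's step count no longer matches `args.max_train_steps * accelerator.num_processes`, so carry that consistency check over as well.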

The following scripts do not have the argument --num_train_epochs:

  • amused
    • train_amused.py
  • research_projects
    • multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py

So, they don't need to be updated.

Then we have the following scripts that don't use accelerate to prepare the datasets. Distributed dataset sharding in them is handled by WebDataset, not by the accelerator, so we can skip them for now:

  • consistency_distillation
    • train_lcm_distill_sd_wds.py
    • train_lcm_distill_sdxl_wds.py
    • train_lcm_distill_lora_sd_wds.py
    • train_lcm_distill_lora_sdxl_wds.py
  • research_projects
    • controlnet/train_controlnet_webdataset.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora_wds.py

Steps to follow when opening PRs

  • Target one AND only one training script in a single PR.
  • When you open a PR, please mention this issue.
  • Mention @sayakpaul and @geniuspatrick for a review.
  • Accompany your PR with a minimal training command that uses the num_train_epochs CLI arg (see the example command after this list).
  • Enjoy!
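
For example, a minimal command might look like the following (the script, model, and dataset here are placeholders; substitute the ones relevant to your PR):

```bash
accelerate launch --num_processes=2 train_text_to_image.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=2 \
  --output_dir="num-train-epochs-test"
```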
a-r-r-o-w (Member) commented

@sayakpaul Are all of the above resolved? Maybe we can complete the remaining ones ourselves; if so, do you know which ones are still open?

sayakpaul (Member, Author) commented

I think we can keep this open, as it's a good chance for the community to contribute.

kghamilton89 (Contributor) commented

@sayakpaul is this issue still looking for contributions?

sayakpaul (Member, Author) commented

@kghamilton89 I just updated the list to mark the scripts that have already been fixed. So, yes, this is very much open for contributions.

kghamilton89 (Contributor) commented

@sayakpaul, the above PRs close out the DreamBooth trainers.
