[Community] Help us fix the LR schedulers when num_train_epochs is passed in a distributed training env #8384
Comments
@sayakpaul Are all of the above resolved? Maybe for the ones that are remaining, we can complete them ourselves - in which case, do you know which ones are remaining?
I think we can keep this open as it's a good chance for the community to contribute.
@sayakpaul is this issue still looking for contributions?
@kghamilton89 I just updated the list of the scripts that have been updated. So, yes, very much open for contributions.
@sayakpaul, the above PRs close out the DreamBooth trainers.
Context
Refer to #8312 for the full context. The changes introduced in the PR should be propagated to the following scripts, too:
advanced_diffusion_training
consistency_distillation
controlnet
custom_diffusion
dreambooth
instruct_pix2pix
kandinsky2_2/text_to_image
t2i_adapter
text_to_image
textual_inversion
unconditional_image_generation
wuerstchen
research_projects (low-priority)
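The change from #8312 boils down to counting optimizer steps correctly when num_train_epochs is given: after accelerator.prepare() shards the dataloader, each process sees fewer batches, and the LR scheduler (which steps on every process under gradient accumulation) must be created with the per-process step count scaled by the number of processes. Here is a minimal, dependency-free sketch of that arithmetic; the function and parameter names are ours, not the scripts':

```python
import math

def scheduler_num_training_steps(
    dataset_len: int,
    batch_size: int,
    num_train_epochs: int,
    gradient_accumulation_steps: int,
    num_processes: int,
) -> int:
    """Sketch of the step counting behind the fix (names are illustrative).

    After distributed sharding, each process iterates over roughly
    ceil(dataset_len / (batch_size * num_processes)) batches per epoch.
    max_train_steps counts optimizer updates per process; the scheduler,
    stepped on every process, needs max_train_steps * num_processes steps.
    """
    # Batches seen by a single process per epoch after sharding.
    batches_per_process = math.ceil(dataset_len / (batch_size * num_processes))
    # Optimizer updates per epoch under gradient accumulation.
    num_update_steps_per_epoch = math.ceil(
        batches_per_process / gradient_accumulation_steps
    )
    # Total optimizer updates per process for the whole run.
    max_train_steps = num_train_epochs * num_update_steps_per_epoch
    # The scheduler advances on every process, so scale accordingly.
    return max_train_steps * num_processes
```

For example, 1000 samples with batch size 4 over 2 epochs yields the same total scheduler length (500) whether run on 1 or 2 processes, which is the invariant the fix restores.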
The following scripts do not have the --num_train_epochs argument, so they don't need to be updated.
Then we have the following scripts that don't use accelerator to prepare the datasets. Distributed dataset sharding there is done by WebDataset, not accelerator, so we can skip them for now.
Steps to follow when opening PRs
num_train_epochs CLI arg.