train_dreambooth_lora.py failed on two machines #3363
Comments
I have the same problem, the
@haowang1013
Yeah, I never had any problem with training, probably because I was only using one machine. That key error happened to me when I tried to load the LoRA state dict using
This fixed the loading error for me, in
Thank you! But I still get an error when using two machines.
Which version are you on? There's a commit with a number of LoRA-related fixes that is not included in 0.16.1. You may have to wait for the next release or install the latest version from GitHub.
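To answer the version question quickly, here is a minimal sketch that reports the installed `diffusers` version and whether it is newer than 0.16.1. The helper names (`release_tuple`, `is_newer_than_baseline`, `installed_version`) are made up for illustration, and a newer version number only suggests that the fixes may be present; it does not confirm that a particular commit is included.

```python
from importlib import metadata

# Release named in the comment above; anything newer *may* contain the fixes.
BASELINE = (0, 16, 1)


def release_tuple(version):
    """Return the leading numeric components, e.g. '0.17.0.dev0' -> (0, 17, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or 'post1'
        parts.append(int(piece))
    return tuple(parts)


def is_newer_than_baseline(version, baseline=BASELINE):
    """True if the numeric release is strictly newer than the baseline."""
    return release_tuple(version) > baseline


def installed_version(package="diffusers"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("diffusers is not installed in this environment")
    else:
        print(found, "newer than 0.16.1:", is_newer_than_baseline(found))
```

Running this on each machine is also a cheap way to confirm that both really have the same version installed.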
This seems to be related to #3353 - trying to fix it asap
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
I have found two errors.
Then I tried to fix this error using the method from issue #3284,
but I got this error:
I have two machines on the same local network, but when I monitor the network traffic with iftop, the TX and RX volumes for the model-parameter exchange are not the same.
Reproduction
I followed this dog example to run the program on two machines.
I have two laptops with NVIDIA RTX 3080 GPUs.
machine 1 IP is 192.168.1.123
machine 2 IP is 192.168.1.183
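As a quick sanity check on the two addresses above, the sketch below tests whether they fall in the same subnet using Python's standard `ipaddress` module. The `/24` prefix length is an assumption (the netmask is not stated here), and `same_subnet` is a hypothetical helper name.

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix_len=24):
    """Return True if ip_b lies in the network containing ip_a.

    prefix_len=24 is an assumed netmask (255.255.255.0); adjust it to
    match the actual network configuration.
    """
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in network


print(same_subnet("192.168.1.123", "192.168.1.183"))  # True under the /24 assumption
```

This only checks addressing; it does not show that the machines can actually reach each other on the port used for distributed training.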
The environment and package versions on the two machines are exactly the same,
and I ran this script on both machines:
Logs
System Info