fix QAT version dependency #1333
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1333
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 04ccbf2 with merge base 6a7951f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc @msaroufim
Assuming CI is green, this is OK to merge.
@felipemello1 Would you mind also double-checking our version guards for AO? We'll need to be extra careful around this now that we're relaxing this pin. CI will only catch our tests, which are a subset of how our library is used.
Not sure what you mean. Can you add an example or a link showing what you would like me to do?
This change allows stable builds of torchtune to break in the future. If there is a stable package of torchtune that works fine with torchao, and torchao then releases a new stable package with BC-breaking changes, our existing stable packages would try to install the new torchao package and break. We need to keep torchao pinned and then use a tool like dependabot to keep the pinned version up to date. For CI, we should decide if we want to pin to the latest version of PyTorch and possibly have separate tests for PyTorch nightlies with unpinned PyTorch libraries. @ebsmothers
TLDR: just update to 0.4. In tune CI you should always be testing all your latest stable dependencies and all your latest nightly dependencies. We should never be catching BC issues at release time but at nightly CI time; that way upgrading a stable release can be a safe activity. Personally I wouldn't wait more than a few days after an official AO release to make an upgrade.
@@ -65,7 +65,7 @@ enable_activation_checkpointing: True
 memory_efficient_fsdp_wrap: False

 # Reduced precision
-dtype: bf16
+dtype: fp32
cc @andrewor14
This doesn't look right. I will submit a PR to remove that assertion
Btw how come this wasn't caught in the tune nightly CI? @joecummings
@msaroufim we don't actually test with our "prod" configs; instead we define a set of test configs that we deem to be (pretty) representative of the configs we provide. Unfortunately, to do loss parity checks we tend to set dtype=fp32 in the tests (see here for the QAT test), so this one slipped by.
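For illustration, a minimal sketch of the kind of fp32 loss-parity check described above, assuming a hypothetical `run_recipe()` helper that returns per-step losses (the expected values are placeholders, not torchtune's actual test):

```python
# Minimal sketch of an fp32 loss-parity check. run_recipe() and the expected
# values are hypothetical placeholders, not torchtune's actual test code.
import torch

def test_qat_loss_parity(run_recipe):
    # Run in full precision so bf16 rounding doesn't mask (or fake) regressions.
    observed = torch.tensor(run_recipe(dtype="fp32", max_steps=4))
    expected = torch.tensor([10.52, 10.50, 10.47, 10.45])  # placeholder values
    torch.testing.assert_close(observed, expected, atol=1e-4, rtol=1e-4)
```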
The version upgrade looks fine to me; probably want Andrew to also review the reduced precision change you have, since I'm not sure if that has perf implications.
@felipemello1 How urgent is upgrading torchao? I will submit a fix in torchao itself to remove that assertion, but I don't think we want to change the default precision in the QAT recipes. We have another release planned in early September, but if that's too late for you maybe we can do a 0.4.1 release with the fix?
This was added originally for perf reasons specific to 8da4w, but the autograd.Function has since been adapted for more general use. A few users are hitting this assertion error. More context: pytorch/torchtune#1333
Thanks for the fix @andrewor14! If making a release is not a huge effort, this would solve multiple problems: our regression error, the import version, and the dtype. It would be convenient. However, I don't think we have a huge number of users using QAT, so waiting for September wouldn't be terrible. In summary, if making the release is easy, that would be very neat. But if it's going to take you a considerable amount of time, we can wait 2 weeks.
Is this on hold until the next torchao release then? And if so, are we gonna just bump to 0.5.0? If so, let's make sure that our nightly CI is green before that release.
That's my understanding.
Context
What is the purpose of this PR?
We updated torchtune to use torchao 0.4.0. It breaks unless the user has PyTorch 2.4.0. In our scripts, we were using import guards:
https://github.com/felipemello1/torchtune/blob/04ccbf2601653e0e2cceb75e59394df5517d26e3/torchtune/utils/quantization.py#L12
However, "TORCH_VERSION_AFTER_2_4" actually didn't include 2.4. This was fixed in torchao here: pytorch/ao#684, but it won't be available to us until their next release.
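For illustration, a minimal sketch of an inclusive version guard of the sort discussed above (the helper name is hypothetical, not torchtune's actual utility):

```python
# Minimal sketch of an inclusive version check; the helper name is hypothetical
# and not torchtune's actual API.
import torch
from packaging.version import Version

def torch_version_at_least(min_version: str) -> bool:
    """True if the installed torch is >= min_version, inclusive of min_version."""
    # Drop any local suffix such as "+cu121" before comparing.
    return Version(torch.__version__.split("+")[0]) >= Version(min_version)

# Unlike the exclusive TORCH_VERSION_AFTER_2_4 check, this passes on 2.4.0 itself.
_SUPPORTS_QAT = torch_version_at_least("2.4.0")
```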
After updating TorchAO and the import guards, another error was raised:
This is because the QAT recipe now requires the model to be in float32. More context here: https://github.com/pytorch/ao/blob/0b66ff01ab6ba4094823b8cb134ab5b5a744d73a/torchao/quantization/prototype/qat/utils.py#L39
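For context, a hedged illustration of the kind of dtype assertion involved; this is not the actual torchao source, and the function name is made up:

```python
# Hedged illustration only: NOT the actual torchao source. A check of roughly
# this shape fails when the QAT recipe runs a bf16 model, hence the fp32 change.
import torch

def _fake_quantize_input_check(x: torch.Tensor) -> torch.Tensor:
    assert x.dtype == torch.float32, (
        f"QAT fake-quantize expected a float32 tensor, got {x.dtype}"
    )
    return x

# Passes for fp32; a bf16 tensor would raise an AssertionError.
_fake_quantize_input_check(torch.randn(4, 4, dtype=torch.float32))
```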
Changing the QAT recipes to use dtype: fp32 solved it.
Changelog
Test plan
I was able to run the code below, but I did not try to compare against the previous version.