
[DONT MERGE] PR to debug CI failures on windows #6195


Conversation

@YosuaMichael (Contributor) commented Jun 23, 2022

This PR is only meant to investigate the CI failures described in issue #6189.

  • First, we skip the big models on CPU as well, to make sure the problem is not caused by the big models [confirmed GREEN for the Windows test]
  • Second, we test the big models with the torch nightly from 20220621 -> still failing
  • @vfdev-5 tried the torch nightly from 20220618 and it is green (confirmed)
  • Using inference_mode and skipping only the jit, fx, and backprop checks doesn't work (see the sketch after this list)
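For reference, here is a minimal sketch of the two mitigations tried above: skipping the big models on CPU and running the forward pass under torch.inference_mode. The BIG_MODELS set and the test body are illustrative assumptions, not the actual torchvision test code:

import pytest
import torch
import torchvision.models as models

# Illustrative skip list; the real one lives in test/test_models.py
BIG_MODELS = {"regnet_y_128gf", "vit_h_14"}

@pytest.mark.parametrize("model_name", ["resnet18", "regnet_y_128gf"])
def test_classification_model(model_name):
    if model_name in BIG_MODELS:
        pytest.skip(f"{model_name} skipped on CPU to limit memory usage")
    model = getattr(models, model_name)(weights=None).eval()
    with torch.inference_mode():  # no autograd bookkeeping -> smaller memory footprint
        out = model(torch.rand(1, 3, 224, 224))
    assert out.shape == (1, 1000)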

In summary, it seems there has been a change in core: the 20220618 nightly works fine, but later nightlies do not. The main suspect is memory usage; we need to double-check with a memory profiler whether that is indeed the case. If so, we should raise the memory issue with core.
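A minimal sketch of such a double-check, measuring how much the process RSS grows during a single forward pass (psutil is an assumed extra dependency here, and RSS measured after the call misses transient peaks, so treat the number only as a rough signal to compare across nightlies):

import psutil
import torch
import torchvision.models as models

proc = psutil.Process()
model = models.regnet_y_128gf(weights=None).eval()

rss_before = proc.memory_info().rss
with torch.inference_mode():
    model(torch.rand(1, 3, 224, 224))
rss_after = proc.memory_info().rss

# The delta is a lower bound on peak usage; run it under both the
# 20220618 nightly and a recent one and compare.
print(f"RSS grew by ~{(rss_after - rss_before) / 2**20:.0f} MiB")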

@vfdev-5 (Collaborator) commented Jun 23, 2022

I was debugging the failing job over ssh and can confirm that recent PyTorch nightlies fail even when running just a single test. However, if I install the June 18 nightly, the single test passes:

(C:\Users\circleci\project\env) C:\Users\circleci\project>pytest -vvv test/test_models.py::test_classification_model[cpu-regnet_y_128gf]
pytest -vvv test/test_models.py::test_classification_model[cpu-regnet_y_128gf]
============================= test session starts =============================
platform win32 -- Python 3.10.4, pytest-7.1.2, pluggy-1.0.0 -- C:\Users\circleci\project\env\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\circleci\project, configfile: pytest.ini
plugins: cov-3.0.0, mock-3.7.0
collecting ... collected 1 item

test/test_models.py::test_classification_model[cpu-regnet_y_128gf] PASSED [100%]

============================= 1 passed in 30.42s ==============================

(C:\Users\circleci\project\env) C:\Users\circleci\project>pip list | grep torch
pip list | grep torch
torch              1.13.0.dev20220618+cpu
torchvision        0.14.0a0+1eae59a       c:\users\circleci\project

Testing it with the latest commits.
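For bisecting, a specific dated nightly can be pinned like this (assuming the wheel is still hosted on the nightly index; old nightlies are eventually removed):

pip install --pre torch==1.13.0.dev20220618+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu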

@YosuaMichael (Contributor, Author) commented

(quoting @vfdev-5's comment above in full)

cc @datumbox

Given this finding, could there be some change in core that increased memory usage?

@YosuaMichael self-assigned this Jun 24, 2022
@datumbox (Contributor) commented Jun 24, 2022

@YosuaMichael Yes, it seems that way. We should raise this with Core and see if they are aware of anything that might have increased the memory requirements on Windows. This is going to be quite difficult to debug. It's worth creating an issue where you document and summarize the findings, providing references; this will help people investigate.
