🐛 Bug

We have observed a SIGSEGV (signal 11) very early in #9078 on GPU (T4). We suspect it is specific to the CI environment, since the same test passes on an A100 with the 2.7 release image us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.7.0_3.10_cuda_12.6. It is unclear whether the issue was surfaced by this PR, but the test succeeds on A100 GPU, TPU, CPU, and TRN. Log from the failing T4 run:
+ run_test /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
+ echo 'Running in PjRt runtime: /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py' --skip-gradient-checkpointing
Running in PjRt runtime: /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
++ command -v nvidia-smi
+ '[' -x /usr/bin/nvidia-smi ']'
+ '[' '' '!=' 0 ']'
+ PJRT_DEVICE=CUDA
+ run_coverage /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
+ '[' 0 '!=' 0 ']'
+ python3 /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:236: UserWarning: XLA_USE_SPMD is being deprecated. Use torch_xla.runtime.use_spmd() without setting XLA_USE_SPMD env-var.
warnings.warn("XLA_USE_SPMD is being deprecated. "
./usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:242: UserWarning: Replicating tensors already initialized on non-virtual XLA device for SPMD to force SPMD mode. This is one-time overhead to setup, and to minimize such, please set SPMD mode before initializting tensors (i.e., call use_spmd() in the beginning of the program).
warnings.warn(
*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(+0x6ff5636)[0x7fdd5f087636]
/lib/x86_64-linux-gnu/libc.so.6(+0x38de0)[0x7fdf961c5de0]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(_ZN9torch_xla7runtime21PjRtComputationClient15PjRtShardedData9GetHandleEv+0x7)[0x7fdd5f06b897]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(+0x6a0d08c)[0x7fdd5ea9f08c]
...
/usr/local/bin/../lib/libpython3.10.so.1.0(Py_BytesMain+0x39)[0x7fdf964f53a9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fdf961b0d7a]
python3(_start+0x2a)[0x5603066b607a]
*** END MANGLED STACK TRACE ***
*** Begin stack trace ***
tsl::CurrentStackTrace[abi:cxx11]()
torch_xla::runtime::PjRtComputationClient::PjRtShardedData::GetHandle()
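For context, the UserWarning in the log points to the newer runtime API for enabling SPMD. Below is a minimal sketch of that recommended pattern, assuming a toy linear model as a placeholder; it is not the actual contents of test_train_spmd_linear_model.py.

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

# Enable SPMD through the runtime API instead of the deprecated XLA_USE_SPMD
# env-var, and do it before any tensor is placed on the XLA device so the
# one-time replication overhead mentioned in the warning is avoided.
xr.use_spmd()

device = xm.xla_device()
# Toy model for illustration: tensors created after use_spmd() start in SPMD mode.
model = torch.nn.Linear(128, 10).to(device)
```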