SPMD Linear Model test failing with GA API refinement #9128

Closed
rpsilva-aws opened this issue May 9, 2025 · 1 comment
Labels: bug, CI, xla:gpu

Comments

@rpsilva-aws (Collaborator) commented May 9, 2025

🐛 Bug

We observed a SIGSEGV (signal 11) quite early in #9078 on GPU (T4). We suspect it is specific to the CI, since the test passes on an A100 for 2.7: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.7.0_3.10_cuda_12.6. It is unclear whether the issue was surfaced by this PR, but the test succeeds on A100 GPU, TPU, CPU, and TRN.

+ run_test /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
+ echo 'Running in PjRt runtime: /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py' --skip-gradient-checkpointing
Running in PjRt runtime: /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
++ command -v nvidia-smi
+ '[' -x /usr/bin/nvidia-smi ']'
+ '[' '' '!=' 0 ']'
+ PJRT_DEVICE=CUDA
+ run_coverage /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
+ '[' 0 '!=' 0 ']'
+ python3 /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:236: UserWarning: XLA_USE_SPMD is being deprecated. Use torch_xla.runtime.use_spmd() without setting XLA_USE_SPMD env-var.
  warnings.warn("XLA_USE_SPMD is being deprecated. "
./usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:242: UserWarning: Replicating tensors already initialized on non-virtual XLA device for SPMD to force SPMD mode. This is one-time overhead to setup, and to minimize such, please set SPMD mode before initializting tensors (i.e., call use_spmd() in the beginning of the program).
  warnings.warn(
*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(+0x6ff5636)[0x7fdd5f087636]
/lib/x86_64-linux-gnu/libc.so.6(+0x38de0)[0x7fdf961c5de0]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(_ZN9torch_xla7runtime21PjRtComputationClient15PjRtShardedData9GetHandleEv+0x7)[0x7fdd5f06b897]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(+0x6a0d08c)[0x7fdd5ea9f08c]
...
/usr/local/bin/../lib/libpython3.10.so.1.0(Py_BytesMain+0x39)[0x7fdf964f53a9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fdf961b0d7a]
python3(_start+0x2a)[0x5603066b607a]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
	tsl::CurrentStackTrace[abi:cxx11]()
	
	
	torch_xla::runtime::PjRtComputationClient::PjRtShardedData::GetHandle()
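As an aside, the UserWarnings above describe the setup the runtime expects. A minimal sketch of that recommendation (assuming the public torch_xla 2.7 APIs; this is not the test's actual code): enable SPMD before any XLA tensors are created, so the one-time replication overhead the warning mentions never kicks in.

    import torch
    import torch_xla.runtime as xr
    import torch_xla.core.xla_model as xm

    xr.use_spmd()                                  # enable SPMD mode first, instead of exporting XLA_USE_SPMD
    t = torch.randn(4, 4, device=xm.xla_device())  # tensors are initialized only after use_spmd()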
@ysiraichi added the bug, CI, and xla:gpu labels on May 12, 2025
@rpsilva-aws (Collaborator, Author) commented
It ended up being a race condition: we need to add a synchronous wait operation before retrieving all device data nodes in the graph.
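For illustration only, a user-level analogue of that fix, assuming torch_xla's public sync/wait APIs (the actual change lives in the C++ runtime path, not in the test itself): block on outstanding device work before anything reads back the sharded device data handles. The segfault in PjRtShardedData::GetHandle() above is consistent with a handle being read while its buffer is still in flight.

    import torch_xla
    import torch_xla.core.xla_model as xm

    # ... run the sharded training step ...
    torch_xla.sync()      # flush the pending graph (xm.mark_step() on older releases)
    xm.wait_device_ops()  # wait for async execution/transfers to finish, so the
                          # sharded device data handles are valid when they are read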
