🐛 Bug
Recent observations have made it clear that there are many problems with either the TPU implementation in Lightning or the test environment:
Attempting to address 1) further reveals that, among the tests that do run, many are decorated with the wrapper @pl_multi_process_test, which suppresses assertion errors and exceptions from broken tests.
The result is that we have a lot of tests that are broken but never surface in the CI.
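For context, the decorator follows roughly the pattern sketched below. This is a paraphrase for illustration, not the exact Lightning implementation: the test body runs in a child process, exceptions are caught there, and only a sentinel value crosses back, so the original failure is swallowed or reduced to an uninformative assertion.

import functools
import traceback
from multiprocessing import Process, Queue

def pl_multi_process_test(func):
    # Paraphrased sketch of the wrapper pattern, not the real implementation.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        queue = Queue()

        def child(q):
            try:
                func(*args, **kwargs)
                q.put(1)
            except Exception:
                traceback.print_exc()  # the real traceback only goes to stdout
                q.put(-1)              # ...and is reduced to a sentinel value

        proc = Process(target=child, args=(queue,))
        proc.start()
        proc.join()
        # The original assertion/exception is gone; at best the test fails
        # with an opaque "assert result == 1".
        assert queue.get() == 1

    return wrapper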
To Reproduce
A simple way to reproduce this is to remove all decorators, which is what I have done in #11098, and then let the tests run and fail. Attached is the full log file of such a CI run: tpu-logs-without-pl-multi.txt
In summary: 17 failed, 48 passed
There is of course the infamous cryptic error message for several test cases:
Exception in device=TPU:2: Cannot replicate if number of devices (1) is different from 8
which sometimes hints at the possibility that we are accessing xm.xla_device before spawning processes.
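If that hypothesis is right, the failure mode would look like the following sketch (an assumption based on the error message, not a confirmed diagnosis):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# BAD: touching the XLA device in the parent process acquires a single device,
# so the subsequent 8-process spawn fails with
# "Cannot replicate if number of devices (1) is different from 8"
device = xm.xla_device()

def _mp_fn(index):
    # OK: each spawned process acquires its own device here instead
    device = xm.xla_device()

xmp.spawn(_mp_fn, nprocs=8)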
Other examples:
self = <pytorch_lightning.trainer.connectors.accelerator_connector.AcceleratorConnector object at 0x7f7bbead03d0>

    @property
    def is_distributed(self) -> bool:
        # TODO: deprecate this property
        # Used for custom plugins.
        # Custom plugins should implement is_distributed property.
        if hasattr(self.strategy, "is_distributed") and not isinstance(self.accelerator, TPUAccelerator):
            return self.strategy.is_distributed
        distributed_strategy = (
            DDP2Strategy,
            DDPStrategy,
            DDPSpawnShardedStrategy,
            DDPShardedStrategy,
            DDPFullyShardedNativeStrategy,
            DDPFullyShardedStrategy,
            DDPSpawnStrategy,
            DeepSpeedStrategy,
            TPUSpawnStrategy,
            HorovodStrategy,
            HPUParallelStrategy,
        )
        is_distributed = isinstance(self.strategy, distributed_strategy)
        if isinstance(self.accelerator, TPUAccelerator):
>           is_distributed |= self.strategy.is_distributed
E           TypeError: unsupported operand type(s) for |=: 'bool' and 'NoneType'
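This one reproduces trivially in isolation. Reading the traceback, self.strategy.is_distributed presumably returns None for the TPU strategy here (an assumption on my part), and |= is not defined between bool and None:

is_distributed = False
strategy_is_distributed = None  # assumption: the TPU strategy property returns None

is_distributed |= strategy_is_distributed  # TypeError: unsupported operand type(s) for |=: 'bool' and 'NoneType'

# A possible defensive guard, sketched here, not the agreed-upon fix:
is_distributed |= bool(strategy_is_distributed)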
    def has_len_all_ranks(
        dataloader: DataLoader,
        training_type: "pl.Strategy",
        model: Union["pl.LightningModule", "pl.LightningDataModule"],
    ) -> bool:
        """Checks if a given Dataloader has ``__len__`` method implemented i.e. if it is a finite dataloader or
        infinite dataloader."""
        try:
            local_length = len(dataloader)
            total_length = training_type.reduce(torch.tensor(local_length).to(model.device), reduce_op="sum")
>           if total_length == 0:
E           RuntimeError: Not found: From /job:tpu_worker/replica:0/task:0:
E           2 root error(s) found.
E             (0) Not found: No subgraph found for uid 2894109085761937038
E               [[{{node XRTExecute}}]]
E             (1) Not found: No subgraph found for uid 2894109085761937038
E               [[{{node XRTExecute}}]]
E             [[XRTExecute_G29]]
E           0 successful operations.
E           0 derived errors ignored.
Furthermore, sometimes, non-deterministically, the CI just stops in the middle of execution:
....
profilers/test_xla_profiler.py::test_xla_profiler_instance FAILED [ 93%]
strategies/test_tpu_spawn.py::test_model_tpu_one_core PASSED [ 95%]
Done with log retrieval attempt.
Exited with code exit status 2
CircleCI received exit code 2
Expected behavior
It is unclear what the intention was when designing the test setup. The decorators were introduced way back in #2512 and have hardly changed since. Meanwhile, strategies and accelerators have undergone major design changes and countless refactors. I propose to re-evaluate whether the pl_multi_process_test decorator is still needed, and if so, to document why, how to use it, and when to use it correctly.
Possible Action
My suggestion is to:
1. Remove the decorator
2. Debug each test on the VM
3. Run tests that require it in standalone mode (see the sketch after this list)
4. Reduce the verbosity of the mind-boggling thousands of nonsense lines printed in the CI
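For point 3, a test that needs all cores could spawn them explicitly instead of relying on the decorator. A minimal sketch, assuming the torch_xla xmp.spawn API and a hypothetical test body:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # hypothetical test body, executed once per TPU core
    assert xm.xrt_world_size() == 8

def test_tpu_all_cores_standalone():
    # exceptions raised in the children propagate through xmp.spawn,
    # so a failure surfaces in CI instead of being swallowed by a wrapper
    xmp.spawn(_mp_fn, nprocs=8)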
cc @carmocca @akihironitta @Borda @kaushikb11 @rohitgr7