
XLA profiler cannot capture TPU device trace when running on a pod #3446


Closed
ronghanghu opened this issue Mar 24, 2022 · 4 comments

@ronghanghu
Collaborator

ronghanghu commented Mar 24, 2022

🐛 Bug / 🚀 Feature request

Currently, the XLA profiler can capture the TPU device trace when running on a v3-8 TPU VM, but it cannot capture the device trace in trace_viewer when running on a TPU pod (e.g. a v3-128). This is unlike the TensorFlow profiler, which can capture TPU device traces when running on a pod.

Since we have frequently observed higher overhead from pod training (compared to training on a v3-8 with the same per-TPU batch size, e.g. #3441), it would be great if the TPU device trace could also be captured in the pod case to help identify the performance bottlenecks.

(Not sure whether this should be a bug report or a feature request. Since many practical TPU use cases involve training in pods, it would be great if the XLA profiler could also work for the pod case.)

To Reproduce

While the XLA profiler can capture CPU host traces when doing TPU pod training (e.g. v3-128), it cannot capture the TPU device trace. In particular, no device trace shows up on the trace_viewer page in the profiler, as shown in the screenshot below (only CPU traces are captured).

[Screenshot: trace_viewer on a v3-128 pod, showing only CPU host traces and no TPU device trace]

The tensorboard output (including the profiler results) for this case is also uploaded to https://drive.google.com/file/d/108GRRqndJJyEQEhmICx1u4aaiF1vkQ_F/view?usp=sharing.


By contrast, the profiler successfully captures TPU device traces on a v3-8 (as in the screenshot below).

[Screenshot: trace_viewer on a v3-8, showing TPU device traces]


To reproduce the failure case above on a v3-128:

  1. Allocate a v3-128 TPU VM pod (e.g. with name tpu-debug-128) from the tpu-vm-pt-1.10 environment.
  2. Clone the PyTorch XLA repo onto the TPU VM to download the profiler script (do this on all the nodes in the TPU VM, e.g. through gcloud alpha compute tpus tpu-vm ssh --worker all):
mkdir -p /home/ronghanghu/workspace && git clone https://github.com/pytorch/xla /home/ronghanghu/workspace/xla
  3. Install the tensorboard profiler plugin on TPU VM node 0 and start a tensorboard session:
sudo pip3 install -U tensorboard-plugin-profile
mkdir -p /home/ronghanghu/workspace/tpu_pod_xla_profiler_debug/v3-128
tensorboard --logdir /home/ronghanghu/workspace/tpu_pod_xla_profiler_debug/v3-128
  4. Start training on the TPU pod with the profiler server (see the profiler API sketch after this list):
TPU_NAME=tpu-debug-128  # change to your TPU name

cd ${HOME} && python3 -m torch_xla.distributed.xla_dist --tpu=${TPU_NAME} --restart-tpuvm-pod-server -- \
python3 -u /home/ronghanghu/workspace/xla/test/test_profile_mp_mnist.py \
  --batch_size 16 --drop_last --num_epochs 200000 --lr 0.0
  5. Forward the tensorboard port 6006 to a local machine and capture a profile from localhost:9012 on the profile page in tensorboard. Then check its trace_viewer tool.
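
For reference, here is a minimal sketch of the torch_xla profiler APIs involved in the steps above. This is not the exact contents of test_profile_mp_mnist.py; it just assumes the torch_xla.debug.profiler module and reuses the port 9012 and the logdir from step 3:

import torch_xla.debug.profiler as xp

# In the training process: start the profiler server that tensorboard's
# capture button talks to. Keep a reference to the returned server object
# so it stays alive for the lifetime of the process.
server = xp.start_server(9012)

# As an alternative to the tensorboard UI, a trace can also be requested
# programmatically from worker 0; the profile is written under the same
# tensorboard logdir used in step 3.
xp.trace(
    'localhost:9012',
    '/home/ronghanghu/workspace/tpu_pod_xla_profiler_debug/v3-128',
    duration_ms=10000,
)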

Expected behavior

It would be great if TPU device traces can also be captured in pod training.

Environment

  • Reproducible on XLA backend [CPU/TPU]: TPU (v3-128 pod with tpu-vm-pt-1.10 runtime)
  • torch_xla version: 1.10

Additional context

This issue (the profiler not capturing the TPU device trace on a pod) can be reproduced on torch_xla 1.9, torch_xla 1.10, and the nightly 20220308 versions.

@miladm miladm self-assigned this Mar 24, 2022
@miladm
Collaborator

miladm commented Mar 24, 2022

Thanks @ronghanghu. I will do a repro and circle back. Looks like you tried a TPU VM here. Did you also try to run the profiler on a 2-VM setup?

@ronghanghu
Collaborator Author

Thanks @miladm! I've only tried running the XLA profiler for the pod case on TPU VMs (since that's my practical use case and TPU VMs are generally faster than TPU nodes), and haven't tried the older setup of a compute engine VM + TPU nodes.

I think this issue (that the profiler cannot capture the TPU device trace on a pod) can be reproduced on both a v3-32 pod (with 4 TPU VM nodes) and a v3-128 pod (with 16 TPU VM nodes) following the steps above.

@ronghanghu
Collaborator Author

Following up on this issue: in our internal test on TPU v4, the XLA profiler worked well on v4-8 but failed to capture TPU device traces on v4-32 or v4-128.

@miladm
Collaborator

miladm commented Jan 26, 2025

obsolete

@miladm miladm closed this as completed Jan 26, 2025