Idea: host-OS-independence with alternate syscall interface #7


Closed
adamierymenko opened this issue May 2, 2018 · 7 comments

@adamierymenko

adamierymenko commented May 2, 2018

I have an idea for this that builds on something I've wanted to do for a long time but have never had (and probably will not have) the time to implement.

The idea is to pseudo-virtualize Linux binaries by rebuilding them to call a function in user space instead of executing CPU-level syscalls. Apps would have to be rebuilt from source, making this effectively a different "hardware architecture" (e.g. x86_64-uvirt or something) from standard Linux, but otherwise fully compatible. This would eliminate the ptrace requirement, making it portable, and might also be faster. A rough sketch of the dispatch idea is below.
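A minimal sketch of what that user-space dispatch could look like, written in Go purely for illustration; the function name `uvirtSyscall`, the emulated PID, and the handled syscall numbers are all hypothetical, not part of gVisor or any existing toolchain:

```go
// Hypothetical user-space syscall dispatcher. A rebuilt "x86_64-uvirt" binary
// would call uvirtSyscall(nr, args...) wherever a normal Linux binary would
// execute the CPU-level syscall instruction. Nothing here is gVisor code.
package main

import (
	"fmt"
	"os"
	"unsafe"
)

const (
	sysWrite  = 1  // Linux x86_64 syscall numbers, for illustration only.
	sysGetpid = 39

	// Raw syscall-style error return: -ENOSYS (ENOSYS == 38).
	errNoSys = ^uintptr(37)
)

func uvirtSyscall(nr uintptr, args ...uintptr) uintptr {
	switch nr {
	case sysGetpid:
		// Emulated entirely in user space; no host kernel involved.
		return 4242
	case sysWrite:
		fd, buf, count := args[0], args[1], args[2]
		if fd == 1 || fd == 2 {
			// Forward stdout/stderr writes to the embedding process's own stdio.
			b := unsafe.Slice((*byte)(unsafe.Pointer(buf)), count)
			n, _ := os.Stdout.Write(b)
			return uintptr(n)
		}
		return errNoSys
	default:
		// Anything unimplemented fails loudly instead of reaching the host kernel.
		return errNoSys
	}
}

func main() {
	msg := []byte("hello from the emulated syscall layer\n")
	uvirtSyscall(sysWrite, 1, uintptr(unsafe.Pointer(&msg[0])), uintptr(len(msg)))
	fmt.Println("emulated getpid ->", uvirtSyscall(sysGetpid))
}
```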

It seems like it would be borderline trivial to do this with gVisor.

This would open up the possibility of embedding Linux hosts inside anything, including desktop and mobile apps or server processes on other OSes like Windows. You could port, desktop-ify, or even mobile-ify any complex Linux application.

@littleq0903

It would be like what WSL does on Windows.

@hugelgupf
Collaborator

If you ported gVisor to other OSes, i.e. came up with a KVM-like platform using an OS X / Windows / whatever hypervisor, and moved some Linux-specific bits in the sentry around, you wouldn't even need to modify the binaries you want to run.

You would have to rebuild if you wanted to skip the hypervisor/syscall-interception platform and move only the Linux-specific bits around. But that's kind of a totally different project from what gVisor is targeting.

@bluecmd

bluecmd commented May 5, 2018

Isn't the point of gVisor to secure untrusted workloads, though? If so, what is stopping the unknown software from embedding syscalls and not playing nice?

@adamierymenko
Author

@hugelgupf Yes that could work, but the thought I had was to make it possible to run Linux environments anywhere with no elevated permissions.

shentubot pushed a commit that referenced this issue Jul 3, 2018
glibc's malloc also uses SYS_TIME. Permit it.

#0  0x0000000000de6267 in time ()
#1  0x0000000000db19d8 in get_nprocs ()
#2  0x0000000000d8a31a in arena_get2.part ()
#3  0x0000000000d8ab4a in malloc ()
#4  0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) ()
#5  0x0000000000d4cd70 in __tsan_go_start ()
#6  0x00000000004617a3 in racecall ()
#7  0x00000000010f4ea0 in runtime.findfunctab ()
#8  0x000000000043f193 in runtime.racegostart ()

Signed-off-by: Dmitry Vyukov <[email protected]>
[[email protected]: updated comments and commit message]
Signed-off-by: Michael Pratt <[email protected]>

Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a
PiperOrigin-RevId: 203042627
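For reference, a tiny standalone check (a sketch, not part of the commit above) that issues the same raw SYS_TIME syscall glibc's `get_nprocs()`/malloc path ends up making, which is handy for confirming that a given seccomp filter permits it:

```go
// Issues the raw SYS_TIME syscall. Run it under the seccomp filter in
// question: if SYS_TIME is not permitted, the process is killed or the call
// fails instead of printing a timestamp. linux/amd64 only (arm64 has no
// SYS_TIME).
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	secs, _, errno := unix.Syscall(unix.SYS_TIME, 0, 0, 0)
	if errno != 0 {
		fmt.Println("SYS_TIME failed:", errno)
		return
	}
	fmt.Println("SYS_TIME ok, unix time =", secs)
}
```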
dvyukov added a commit to dvyukov/gvisor that referenced this issue Jul 4, 2018
glibc's malloc also uses SYS_TIME. Permit it.

@fvoznika
Member

Closing old bugs. Please reopen if you'd like to continue the discussion.

tonistiigi pushed a commit to tonistiigi/gvisor that referenced this issue Jan 30, 2019
glibc's malloc also uses SYS_TIME. Permit it.

@archerbroler

archerbroler commented Feb 26, 2019

Bluepill can be done on Windows too. Perhaps it's not so hard to port gVisor to Windows.

@amscanne
Contributor

I'm not familiar with the Windows hypervisor APIs; is there good reference documentation?

tanjianfeng pushed a commit to tanjianfeng/gvisor that referenced this issue Aug 2, 2019
The command below, under hostinet networking, will lead to a panic:

  $ cat /proc/net/tcp

It's caused by the wrong SizeOfTCPInfo.

  #0 runtime.panicindex()
  #1 encoding/binary.littleEndian.Uint64
  #2 encoding/binary.(*littleEndian).Uint64
  #3 gvisor.dev/gvisor/pkg/binary.unmarshal
  #4 gvisor.dev/gvisor/pkg/binary.unmarshal
  #5 gvisor.dev/gvisor/pkg/binary.Unmarshal
  #6 gvisor.dev/gvisor/pkg/sentry/socket/hostinet.(*socketOperations).State
  #7 gvisor.dev/gvisor/pkg/sentry/fs/proc.(*netTCP).ReadSeqFileData

Correct SizeOfTCPInfo from 104 to 192 to fix it.

Fixes google#640

Signed-off-by: Jianfeng Tan <[email protected]>
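A cheap guard against this class of bug is to compare the hand-maintained size constant with the struct it is supposed to describe. Below is a minimal sketch using `unix.TCPInfo` from golang.org/x/sys/unix; the constant simply mirrors the value from the commit message, and whether the numbers line up depends on the kernel and x/sys version in use:

```go
// Compares a hand-maintained size constant against the struct it describes.
// Parsers that unmarshal fixed-size kernel structs (as the hostinet TCP code
// does for tcp_info) go out of bounds or mis-parse when the two drift apart.
package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

// Mirrors the corrected value from the commit message; not imported from gVisor.
const sizeOfTCPInfo = 192

func main() {
	actual := unsafe.Sizeof(unix.TCPInfo{})
	if actual != sizeOfTCPInfo {
		fmt.Printf("SizeOfTCPInfo=%d does not match unix.TCPInfo (%d bytes)\n",
			sizeOfTCPInfo, actual)
		return
	}
	fmt.Println("SizeOfTCPInfo matches:", actual, "bytes")
}
```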
amscanne referenced this issue in amscanne/gvisor May 6, 2020
copybara-service bot pushed a commit that referenced this issue Jul 3, 2024
Distributed training isn't working with PyTorch on certain A100 nodes.

Adds the missing `UVM_UNMAP_EXTERNAL` ioctl, allowing certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) and fixing distributed training.

## Reproduction

This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB.

- **NVIDIA Driver Version**: 550.54.15
- **CUDA Version**: 12.4
- **NVIDIA device**: NVIDIA A100 80GB PCIe

### Steps

1. **Install gVisor**
```bash
ARCH=$(uname -m)  # e.g. x86_64; the release URL is arch-specific
URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}"
wget -nc "${URL}/runsc" "${URL}/runsc.sha512"
chmod +x runsc
sudo cp runsc /usr/local/bin/runsc
sudo /usr/local/bin/runsc install
sudo systemctl reload docker
```

2. **Add GPU-enabling gVisor options**

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```
Reload configs with `sudo systemctl reload docker`.

3. **Run reproduction NCCL test**

This test creates one main process and N peer processes. Each peer process sends a torch `Tensor` to the main process using NCCL.

```Dockerfile
# Dockerfile
FROM python:3.9.15-slim-bullseye

RUN pip install torch numpy
COPY <<EOF repro.py
import argparse
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600))
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def send_tensor(rank, world_size):
    try:
        setup(rank, world_size)

        # rank receiving all tensors
        target_rank = world_size - 1

        dist.barrier()

        tensor = torch.ones(5).cuda(rank)
        if rank < target_rank:
            print(f"[RANK {rank}] sending tensor: {tensor}")
            dist.send(tensor=tensor, dst=target_rank)
        elif rank == target_rank:
            for other_rank in range(target_rank):
                tensor = torch.zeros(5).cuda(target_rank)
                dist.recv(tensor=tensor, src=other_rank)
                print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}")

            print("PASS: NCCL working.")

    except Exception as e:
        print(f"[RANK {rank}] error in send_tensor: {e}")
        raise
    finally:
        cleanup()

def main(world_size: int = 2):
    mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run torch-based NCCL tests")
    parser.add_argument("world_size", type=int, help="number of GPUs to run test on")
    args = parser.parse_args()

    if args.world_size < 2:
        raise RuntimeError(f"world_size needs to be larger than 1 {args.world_size}")

    main(args.world_size)
EOF

ENTRYPOINT ["python", "repro.py", "4"]
```
Build image with:

```
docker build -f Dockerfile .
```

Then run it with:
```
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1
```

#### Failure (truncated)
```
...
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
...
```

### Fix
gVisor debug logs show:

```
W0702 20:36:17.577055  445833 uvm.go:148] [  22:  84] nvproxy: unknown uvm ioctl 66 = 0x42
```
I've implemented that ioctl in this PR. This is the output after the fix.

```
[RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2')
[RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0')
[RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1')
[RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3')
PASS: NCCL working.
```
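Conceptually, proxying an ioctl like this boils down to forwarding the application's request to the host `/dev/nvidia-uvm` file descriptor. Below is a minimal sketch of that core step, assuming a hypothetical `forwardUVMIoctl` helper and a placeholder parameter blob; the real nvproxy handler also validates and copies the driver-defined parameter struct between the sandbox and the host.

```go
// Conceptual sketch only: shows just the pass-through step for a UVM ioctl to
// the host /dev/nvidia-uvm file descriptor. Not gVisor's actual nvproxy code.
package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

// 66 (0x42) is the command number reported as unknown in the debug log above.
const uvmUnmapExternalCmd = 66

// Placeholder for the driver-defined parameter struct (not reproduced here).
type uvmUnmapExternalParams struct {
	raw [64]byte
}

func forwardUVMIoctl(hostFD int, cmd uintptr, params *uvmUnmapExternalParams) error {
	// Pass the command and parameter pointer straight through to the host driver.
	_, _, errno := unix.Syscall(unix.SYS_IOCTL, uintptr(hostFD), cmd,
		uintptr(unsafe.Pointer(params)))
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	fd, err := unix.Open("/dev/nvidia-uvm", unix.O_RDWR, 0)
	if err != nil {
		fmt.Println("no UVM device available:", err)
		return
	}
	defer unix.Close(fd)

	var params uvmUnmapExternalParams
	fmt.Println("forward result:", forwardUVMIoctl(fd, uvmUnmapExternalCmd, &params))
}
```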
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734
PiperOrigin-RevId: 649146570
copybara-service bot pushed a commit that referenced this issue Jul 3, 2024
Distributed training isn't working with PyTorch on certain A100 nodes.

copybara-service bot pushed a commit that referenced this issue Jul 3, 2024
Distributed training isn't working with PyTorch on certain A100 nodes.

copybara-service bot pushed a commit that referenced this issue Jul 8, 2024
Distributed training isn't working with PyTorch on certain A100 nodes.
