[Bug][V1]: TP is broken when torch compile cache is used #13435

Closed
1 task done
WoosukKwon opened this issue Feb 17, 2025 · 0 comments · Fixed by #13445
Labels
bug Something isn't working

Comments

@WoosukKwon
Collaborator

Your current environment

The output of `python collect_env.py`
Your output of `python collect_env.py` here

🐛 Describe the bug

I got the following error when running with tp_size=4:

(VllmWorker rank=2 pid=2307184) ERROR 02-17 14:48:01 multiproc_executor.py:374] ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

Importantly, the bug doesn't happen when the torch.compile cache is not used.

The error is raised at the first torch.compile-generated op for the embedding layer:

    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s0, 4096), (4096, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [ge, lt, and_, ge_1, lt_1, and__1, or_, masked_fill_, mul, mul_1, add, sub, mul_2, embedding], Original ATen: [aten.ge, aten.lt, aten.bitwise_and, aten.bitwise_or, aten.masked_fill, aten.mul, aten.add, aten.sub, aten.embedding]
        triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel = 4096*s0
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0.run(arg0_1, arg2_1, buf0, triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel, grid=grid(triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel), stream=stream0)

Here, the input arguments (arg0_1 and arg2_1, which correspond to input activations and weights) live on cuda:{rank}, while the output tensor (buf0) is allocated on cuda:0 regardless of the worker's actual rank.
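
One way to spot this kind of mismatch is to scan the generated code for hard-coded device indices. The following is a minimal sketch, not part of vLLM or PyTorch: the helper name `find_hardcoded_devices` and the sample string are hypothetical, and the regular expression only covers the three call sites visible in the snippet above (`torch.cuda._DeviceGuard`, `torch.cuda.set_device`, `get_raw_stream`).

```python
import re
from typing import List, Tuple

# Matches only the device-index call sites seen in the generated code above.
# Other call sites may exist in real Inductor output; this is an assumption.
_DEVICE_CALLS = re.compile(
    r"(torch\.cuda\._DeviceGuard|torch\.cuda\.set_device|get_raw_stream)\((\d+)\)"
)


def find_hardcoded_devices(generated_code: str, expected_rank: int) -> List[Tuple[str, int]]:
    """Return (call, device_index) pairs whose index differs from expected_rank.

    Hypothetical diagnostic helper for illustration only.
    """
    mismatches = []
    for match in _DEVICE_CALLS.finditer(generated_code):
        call, index = match.group(1), int(match.group(2))
        if index != expected_rank:
            mismatches.append((call, index))
    return mismatches


# Sample text abbreviated from the generated code quoted in this issue.
snippet = """
with torch.cuda._DeviceGuard(0):
    torch.cuda.set_device(0)
    stream0 = get_raw_stream(0)
"""

if __name__ == "__main__":
    # Check the snippet as if it were loaded by the worker with rank 2.
    for call, index in find_hardcoded_devices(snippet, expected_rank=2):
        print(f"{call} pins device {index}, expected 2")
```

For a worker with rank 2, all three calls would be flagged because each one pins device 0; for rank 0, nothing would be flagged, which is consistent with the error only showing up on non-zero ranks.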

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.