[Bug][V1]: TP is broken when torch compile cache is used #13435

WoosukKwon · 2025-02-17T22:58:05Z

🐛 Describe the bug

I got the following error message when using tp_size=4:

(VllmWorker rank=2 pid=2307184) ERROR 02-17 14:48:01 multiproc_executor.py:374] ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

Importantly, the bug doesn't happen when the torch.compile cache is not used.

The error is raised at the first torch.compile-generated op for the embedding layer:

    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s0, 4096), (4096, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [ge, lt, and_, ge_1, lt_1, and__1, or_, masked_fill_, mul, mul_1, add, sub, mul_2, embedding], Original ATen: [aten.ge, aten.lt, aten.bitwise_and, aten.bitwise_or, aten.masked_fill, aten.mul, aten.add, aten.sub, aten.embedding]
        triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel = 4096*s0
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0.run(arg0_1, arg2_1, buf0, triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel, grid=grid(triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel), stream=stream0)

Here, the input arguments (arg0_1 and arg2_1, which correspond to input activations and weights) live on cuda:{rank}, while the output tensor (buf0) is allocated on cuda:0 regardless of the worker's actual rank, because the generated code hard-codes device index 0.
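
One way to check a cached artifact for this symptom is to scan the generated wrapper code for hard-coded device indices. The following is only a minimal sketch and is not part of vLLM or PyTorch. The helper names `hardcoded_device_indices` and `matches_rank` are hypothetical, the regex covers only the three call patterns visible in the snippet above, and it assumes the worker's rank equals its local CUDA device index.

```python
import re

# Call patterns taken from the generated code quoted above; other generated
# code may pin the device in ways this pattern does not catch.
_DEVICE_CALL = re.compile(r"(?:_DeviceGuard|set_device|get_raw_stream)\((\d+)\)")


def hardcoded_device_indices(source: str) -> set[int]:
    """Return every literal device index found in generated wrapper code."""
    return {int(index) for index in _DEVICE_CALL.findall(source)}


def matches_rank(source: str, rank: int) -> bool:
    """True if the code pins no device, or pins only the given rank's device.

    Assumption: rank == local CUDA device index (single-node tensor parallel).
    """
    return hardcoded_device_indices(source) <= {rank}


generated = """
with torch.cuda._DeviceGuard(0):
    torch.cuda.set_device(0)
    stream0 = get_raw_stream(0)
"""

print(hardcoded_device_indices(generated))  # {0}
print(matches_rank(generated, rank=0))      # True
print(matches_rank(generated, rank=2))      # False
```

Under these assumptions, the code loaded by rank 2 in the report above would fail the check, which is consistent with buf0 being allocated on cuda:0.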


WoosukKwon added the bug label Feb 17, 2025

tlrmchlsmth mentioned this issue Feb 18, 2025

Revert "[V1][Core] Add worker_base for v1 worker (#12816)" #13440

Closed

youkaichao mentioned this issue Feb 18, 2025

[v1] fix parallel config rank #13445

Merged

WoosukKwon linked a pull request Feb 18, 2025 that will close this issue

[v1] fix parallel config rank #13445

Merged

youkaichao closed this as completed in #13445 Feb 18, 2025
