[Bug]: vllm==0.8.2, when using the "bad_words" argument of SamplingParams, there is a CUDA out-of-memory error at ParallelSampleSequenceGroup's deepcopy
#15976
completions: List[RequestOutput] = self.inference_engine.generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/utils.py", line 1072, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 457, in generate
self._validate_and_add_requests(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1308, in _validate_and_add_requests
self._add_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1326, in _add_request
self.llm_engine.add_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/utils.py", line 1072, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 791, in add_request
self._add_processed_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 593, in _add_processed_request
ParallelSampleSequenceGroup.add_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/sequence.py", line 1431, in add_request
params = copy.deepcopy(original_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 265, in _reconstruct
y = func(*args)
^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 264, in <genexpr>
args = (deepcopy(arg, memo) for arg in args)
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
y = copier(x, memo)
^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 206, in _deepcopy_list
append(deepcopy(a, memo))
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
y = copier(x, memo)
^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/_tensor.py", line 150, in __deepcopy__
new_storage = self._typed_storage()._deepcopy(memo)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 1136, in _deepcopy
return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 244, in __deepcopy__
new_storage = self.clone()
^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 258, in clone
return type(self)(self.nbytes(), device=self.device).copy_(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 79.11 GiB of which 2.94 MiB is free. Process 1978654 has 79.04 GiB memory in use. Of the allocated memory 71.66 GiB is allocated by PyTorch, with 182.77 MiB allocated in private pools (e.g., CUDA Graphs), and 4.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
🐛 Describe the bug
I tried to add a list of bad_words to Qwen2-VL because it sometimes outputs special tokens that I don't want.
The list contains only five special tokens, so it is not long (below is how I added it).
However, while handling the sampling params, vLLM apparently needs to deepcopy them. The copy only asks for 2 MiB of CUDA memory, yet it triggers the OOM above.
I already set gpu_memory_utilization to 0.4, which is quite low, but it doesn't help.
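Roughly what my setup looks like, as a simplified sketch rather than the exact veRL rollout code; the model path, prompt, and n value here are placeholders:

```python
from vllm import LLM, SamplingParams

# Simplified sketch of the setup described above (placeholder model path and prompt).
llm = LLM(
    model="Qwen/Qwen2-VL-7B",      # Qwen2-VL 7B base (not instruct)
    gpu_memory_utilization=0.4,
)

sampling_params = SamplingParams(
    n=4,                           # n > 1 goes through ParallelSampleSequenceGroup and its deepcopy
    temperature=1.0,
    max_tokens=512,
    bad_words=[
        "<|vision_start|>", "<|vision_end|>",
        "<|vision_pad|>", "<|image_pad|>", "<|video_pad|>",
    ],
)

completions = llm.generate(["placeholder prompt"], sampling_params)
```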
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
An interesting observation: when I use bad_words, gpu_memory_utilization seems to no longer take effect. GPU memory usage climbs to nearly full even though I set it to 0.4. I don't know whether the way I used bad_words is wrong or whether it's some other issue.
I tested it while running RL training in veRL, so the full repro might be complicated. But the key to reproducing it is simply to use Qwen2-VL 7B (base, not instruct) as the model and set bad_words to ["<|vision_start|>", "<|vision_end|>", "<|vision_pad|>", "<|image_pad|>", "<|video_pad|>"].
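For what it's worth, the traceback ends in torch's Tensor.__deepcopy__ cloning the underlying storage, so the copy has to allocate new memory on the same GPU. A tiny standalone snippet (no vLLM involved, just illustrating that torch behavior) shows the effect:

```python
import copy
import torch

# deepcopy of anything holding a CUDA tensor clones its storage on the same
# device; if the GPU is already full (e.g. occupied by the RL trainer), this
# is exactly where torch.OutOfMemoryError is raised.
state = {"mask": torch.zeros(1024, 1024, device="cuda")}   # ~4 MiB tensor

before = torch.cuda.memory_allocated()
state_copy = copy.deepcopy(state)    # Tensor.__deepcopy__ -> storage clone on the same GPU
after = torch.cuda.memory_allocated()

print(f"deepcopy allocated {(after - before) / 2**20:.1f} MiB of extra GPU memory")
```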