[Bug]: vllm==0.8.2, when using the "bad_words" argument of SamplingParams, there is CUDA out of memory issue at ParallelSampleSequenceGroup's deepcopy #15976

Open
panjiacheng opened this issue Apr 3, 2025 · 3 comments
Labels
bug Something isn't working

Comments


panjiacheng commented Apr 3, 2025

Your current environment

    completions: List[RequestOutput] = self.inference_engine.generate(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/utils.py", line 1072, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 457, in generate
    self._validate_and_add_requests(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1308, in _validate_and_add_requests
    self._add_request(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1326, in _add_request
    self.llm_engine.add_request(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/utils.py", line 1072, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 791, in add_request
    self._add_processed_request(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 593, in _add_processed_request
    ParallelSampleSequenceGroup.add_request(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/sequence.py", line 1431, in add_request
    params = copy.deepcopy(original_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 265, in _reconstruct
    y = func(*args)
        ^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 264, in <genexpr>
    args = (deepcopy(arg, memo) for arg in args)
            ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 206, in _deepcopy_list
    append(deepcopy(a, memo))
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/_tensor.py", line 150, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 1136, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 244, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 258, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 79.11 GiB of which 2.94 MiB is free. Process 1978654 has 79.04 GiB memory in use. Of the allocated memory 71.66 GiB is allocated by PyTorch, with 182.77 MiB allocated in private pools (e.g., CUDA Graphs), and 4.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

🐛 Describe the bug

  • I tried to add a list of bad_words to Qwen2-VL because it sometimes outputs special tokens that I don't want.
  • The list only contains five special tokens, so it is not long. Below is how I added it (a sketch of how these kwargs are wrapped into SamplingParams follows this list):

sampling_kwargs = {
    "max_tokens": config.response_length,
    "detokenize": False,
    "bad_words": ["<|vision_start|>", "<|vision_end|>", "<|vision_pad|>",
                  "<|image_pad|>", "<|video_pad|>"],
}

  • However, handling these sampling params apparently requires a deepcopy. The copy only asks for 2 MiB of CUDA memory, yet that is what triggers the OOM.
  • I already set gpu_memory_utilization to 0.4, which is quite low, but it doesn't help.
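For context on why the traceback ends in copy.deepcopy: ParallelSampleSequenceGroup.add_request is the code path taken when n > 1, and it deep-copies the per-request params. A minimal sketch of how the kwargs above would typically be wrapped, assuming an n > 1 sampling setup (the n and max_tokens values and the wrapping code are assumptions, not taken from the veRL integration):

from vllm import SamplingParams

sampling_kwargs = {
    "max_tokens": 512,  # stands in for config.response_length
    "detokenize": False,
    "bad_words": ["<|vision_start|>", "<|vision_end|>", "<|vision_pad|>",
                  "<|image_pad|>", "<|video_pad|>"],
}

# n > 1 (assumed here) is what routes add_request through
# ParallelSampleSequenceGroup and its copy.deepcopy of the params.
sampling_params = SamplingParams(n=4, **sampling_kwargs)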

panjiacheng (Author) commented Apr 3, 2025

Update:

  • The above problem isn't solved, but I found a workaround.
  • Using logits_processors instead (How to use logits_processors #1728) works; a minimal sketch is below.
  • An interesting observation: when I use bad_words, gpu_memory_utilization seems to stop having any effect. GPU memory usage climbs to nearly full even though I set it to 0.4. I don't know whether I'm using bad_words incorrectly or whether it's some other issue.
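For anyone who wants the workaround, here is a minimal sketch of banning the same special tokens with a per-request logits processor. The model path, max_tokens value, and processor body are assumptions rather than the exact veRL code; the token ids are resolved through the tokenizer instead of being hardcoded:

from transformers import AutoTokenizer
from vllm import SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B")  # assumed model path
banned_ids = tokenizer.convert_tokens_to_ids(
    ["<|vision_start|>", "<|vision_end|>", "<|vision_pad|>",
     "<|image_pad|>", "<|video_pad|>"])

def ban_special_tokens(generated_token_ids, logits):
    # Called at each decoding step with the next-token logits; masking the
    # banned ids with -inf makes them impossible to sample.
    logits[banned_ids] = -float("inf")
    return logits

sampling_params = SamplingParams(
    max_tokens=512,  # stands in for config.response_length
    detokenize=False,
    logits_processors=[ban_special_tokens],
)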

robertgshaw2-redhat (Collaborator) commented

Can you share a sample repro?

panjiacheng (Author) commented

Can you share a sample repro?

I tested it while running some RL training in veRL, so the repro might be complicated. But the key to reproducing it is just to use Qwen2-VL 7B (base, not instruct) as the model and pass the bad_words list ["<|vision_start|>", "<|vision_end|>", "<|vision_pad|>", "<|image_pad|>", "<|video_pad|>"].
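A standalone script along those lines might look like the sketch below (untested; the prompt, n, max_tokens, and gpu_memory_utilization values are plausible placeholders rather than the actual veRL configuration):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B", gpu_memory_utilization=0.4)

sampling_params = SamplingParams(
    n=4,  # n > 1 is needed to hit the ParallelSampleSequenceGroup deepcopy path
    max_tokens=512,
    detokenize=False,
    bad_words=["<|vision_start|>", "<|vision_end|>", "<|vision_pad|>",
               "<|image_pad|>", "<|video_pad|>"],
)

# detokenize=False means the text field stays empty, so inspect token ids instead.
outputs = llm.generate(["Describe a cat."], sampling_params)
print(outputs[0].outputs[0].token_ids)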
