[Bug]: vllm==0.8.2, when using the "bad_words" argument of SamplingParams, there is a CUDA out-of-memory error at ParallelSampleSequenceGroup's deepcopy
#15976
completions: List[RequestOutput] = self.inference_engine.generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/utils.py", line 1072, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 457, in generate
self._validate_and_add_requests(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1308, in _validate_and_add_requests
self._add_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1326, in _add_request
self.llm_engine.add_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/utils.py", line 1072, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 791, in add_request
self._add_processed_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 593, in _add_processed_request
ParallelSampleSequenceGroup.add_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/sequence.py", line 1431, in add_request
params = copy.deepcopy(original_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 265, in _reconstruct
y = func(*args)
^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 264, in <genexpr>
args = (deepcopy(arg, memo) for arg in args)
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
y = copier(x, memo)
^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 206, in _deepcopy_list
append(deepcopy(a, memo))
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
y = copier(x, memo)
^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/_tensor.py", line 150, in __deepcopy__
new_storage = self._typed_storage()._deepcopy(memo)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 1136, in _deepcopy
return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 244, in __deepcopy__
new_storage = self.clone()
^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/torch/storage.py", line 258, in clone
return type(self)(self.nbytes(), device=self.device).copy_(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 79.11 GiB of which 2.94 MiB is free. Process 1978654 has 79.04 GiB memory in use. Of the allocated memory 71.66 GiB is allocated by PyTorch, with 182.77 MiB allocated in private pools (e.g., CUDA Graphs), and 4.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
🐛 Describe the bug
I tried to add a list of bad_words to Qwen2-VL because it sometimes outputs special tokens that I don't want.
The list contains only five special tokens, so it is not long (below is how I added it).
However, while handling the sampling params, vLLM apparently needs to deepcopy them. The copy only asks for 2 MiB of CUDA memory, yet it triggers the OOM above.
I already set gpu_memory_utilization to 0.4, which is quite low, but it doesn't help.
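Roughly what my setup looks like, as a simplified sketch rather than the exact veRL rollout code; the model path, prompt, and n value here are placeholders:

```python
from vllm import LLM, SamplingParams

# Simplified sketch of the setup described above (placeholder model path and prompt).
llm = LLM(
    model="Qwen/Qwen2-VL-7B",      # Qwen2-VL 7B base (not instruct)
    gpu_memory_utilization=0.4,
)

sampling_params = SamplingParams(
    n=4,                           # n > 1 goes through ParallelSampleSequenceGroup and its deepcopy
    temperature=1.0,
    max_tokens=512,
    bad_words=[
        "<|vision_start|>", "<|vision_end|>",
        "<|vision_pad|>", "<|image_pad|>", "<|video_pad|>",
    ],
)

completions = llm.generate(["placeholder prompt"], sampling_params)
```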
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
An interesting observation: when I use bad_words, gpu_memory_utilization seems to no longer take effect. GPU memory usage climbs to nearly full even though I set it to 0.4. I don't know whether the way I used bad_words is wrong or whether it's some other issue.
I tested it while running RL training in veRL, so the full repro might be complicated. But the key to reproducing it is simply to use Qwen2-VL 7B (base, not instruct) as the model and set bad_words to ["<|vision_start|>", "<|vision_end|>", "<|vision_pad|>", "<|image_pad|>", "<|video_pad|>"].
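For what it's worth, the traceback ends in torch's Tensor.__deepcopy__ cloning the underlying storage, so the copy has to allocate new memory on the same GPU. A tiny standalone snippet (no vLLM involved, just illustrating that torch behavior) shows the effect:

```python
import copy
import torch

# deepcopy of anything holding a CUDA tensor clones its storage on the same
# device; if the GPU is already full (e.g. occupied by the RL trainer), this
# is exactly where torch.OutOfMemoryError is raised.
state = {"mask": torch.zeros(1024, 1024, device="cuda")}   # ~4 MiB tensor

before = torch.cuda.memory_allocated()
state_copy = copy.deepcopy(state)    # Tensor.__deepcopy__ -> storage clone on the same GPU
after = torch.cuda.memory_allocated()

print(f"deepcopy allocated {(after - before) / 2**20:.1f} MiB of extra GPU memory")
```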