
[Bug]: Temperature is ignored in vLLM 0.8.0/0.8.1 #15241


Closed
SorenNumind opened this issue Mar 20, 2025 · 11 comments
Labels
bug Something isn't working

Comments

@SorenNumind

Your current environment


Description

In vLLM 0.7 and earlier, using a high temperature (e.g. 10) with a random input string always returns "max_tokens" tokens (random output of the expected length).
With a temperature of 0, it returns something similar to "It seems like you've entered a string of characters that doesn't appear to be a meaningful word, phrase, or question."

With the 0.8.0 or 0.8.1 Docker image, no matter the temperature, it always answers something like "It seems like you've entered a string of characters that doesn't appear to be a meaningful word, phrase, or question."

Details

I tried with multiple models, and the temperature seems to be ignored for all of them.

🐛 Describe the bug

Reproduction

Starting a Docker container with:
docker run --gpus all \
    --entrypoint bash \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    -p 8000:8000 \
    -it \
    vllm/vllm-openai:v0.7.3
and running
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct --trust-remote-code --max-model-len 32768 --tensor-parallel-size 2 --gpu-memory-utilization 0.95
on the server side, and then running the following client code:

import random
import string

from openai import OpenAI

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

client.chat.completions.create(
    model=model_name,
    max_tokens=1000,
    temperature=10,
    messages=[
        {"role": "system", "content": "You are Qwen."},
        {
            "role": "user",
            "content": "".join(random.choices(string.ascii_letters + string.digits, k=10)),
        },
    ],
)

Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
@SorenNumind SorenNumind added the bug Something isn't working label Mar 20, 2025
@robertgshaw2-redhat
Collaborator

Hey, I tried this on current main and I didn't see this behavior

@robertgshaw2-redhat
Collaborator

Okay tried again with Qwen-VL and now I see the same result. This issue did not exist with Llama

@robertgshaw2-redhat
Collaborator

No issues with Qwen-2.5-7B-Instruct either

@ywang96
Member

ywang96 commented Mar 20, 2025

@SorenNumind Hey, can you try #15200 and see if it fixes your issue? Thanks! (nvm, it doesn't seem related)

@ywang96
Member

ywang96 commented Mar 20, 2025

@SorenNumind One more thing, can you check whether this issue persists on 0.7.3 when you specify VLLM_USE_V1=1? That will help us debug what's going on here - thanks!
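
For example (a sketch reusing the launch command from the report), the environment variable can be set inline when starting the server:

VLLM_USE_V1=1 python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct --trust-remote-code --max-model-len 32768 --tensor-parallel-size 2 --gpu-memory-utilization 0.95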

@ywang96
Member

ywang96 commented Mar 20, 2025

I can't reproduce this issue with the offline interface on main, and I suspect there's something wrong either with the OpenAI client or with our frontend, but at least this means the engine is functioning as expected.

Testing code:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "Hello, my name is",
    "Hello, my name is",
    "Hello, my name is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)

# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

With temperature=0, I get

Prompt: 'Hello, my name is', Generated text: ' John. I am a 15-year-old boy. I am a student'
Prompt: 'Hello, my name is', Generated text: ' John. I am a 15-year-old boy. I am a student'
Prompt: 'Hello, my name is', Generated text: ' John. I am a 15-year-old boy. I am a student'
Prompt: 'Hello, my name is', Generated text: ' John. I am a 15-year-old boy. I am a student'

With temperature=10, I get

Prompt: 'Hello, my name is', Generated text: 'مياه Designerdistribution民俗 등의集团コスト生命力闪闪xmlns dropping相机-support authorמקום'
Prompt: 'Hello, my name is', Generated text: ' Кон饴עמקleri,y-html-checkbox favoruyền Ramirez camslobs można pronouncedLesเรื่'
Prompt: 'Hello, my name is', Generated text: ' Vincent动手lon扭矩iga.dynamic.InnerException wrongly respecto\tpoints sido burglгер/storeURITY墙'
Prompt: 'Hello, my name is', Generated text: '(firstName الصحيةorderidstrasunpack Algebraに基办好充满 Rox Alypowiedzie/constants发现问题⯑ские'

I will keep investigating this and post updates here.

@ywang96
Member

ywang96 commented Mar 21, 2025

@SorenNumind I have found the root cause. #12622 changes the default sampling parameters to whatever the model repo's generation_config.json specifies, but only for online serving. For Qwen2.5-VL in particular, the defaults top_p=0.001 and top_k=1 essentially disable sampling no matter how high the temperature is.
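
To see why (a toy sketch with made-up logits, not vLLM's actual sampler): dividing the logits by the temperature reshapes the distribution, but top_k=1 keeps only the single most likely token, so decoding is effectively greedy at any temperature.

import numpy as np

# Made-up logits for a 5-token vocabulary (illustration only).
logits = np.array([2.0, 1.0, 0.5, 0.2, 0.1])

for temperature in (0.5, 1.0, 10.0):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # With top_k=1, everything except the most likely token is masked out
    # before sampling, so the "sampled" token is always the argmax.
    print(f"temperature={temperature}: token index {int(np.argmax(probs))}")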

I suggest adding --generation-config vllm when you launch the server, so that vLLM's own sampling defaults are used instead of the values from the model repo's generation_config.json.
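
For example (reusing the report's arguments), the launch command would look something like:

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct --trust-remote-code --max-model-len 32768 --tensor-parallel-size 2 --gpu-memory-utilization 0.95 --generation-config vllm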

@ywang96
Member

ywang96 commented Mar 21, 2025

Closing as this is not a bug per se but a change of default behavior - we will update our docs accordingly to reflect this! Sorry for the confusion.

@ywang96 ywang96 closed this as completed Mar 21, 2025
@SorenNumind
Author

I can confirm that adding --generation-config vllm indeed fixes the problem.
Thank you very much for your quick response and the help you provided.

@hmellor
Member

hmellor commented Mar 21, 2025

"only for online serving"

@ywang96 I don't think this is correct. The linked PR:

  • Changes the default behaviour for the ModelConfig and EngineArgs classes
  • Changes nothing functional in the OpenAI entrypoint (it only modifies an info log)

I don't see how that could affect the OpenAI entrypoint but not the LLM entrypoint, especially since the LLM entrypoint does read the default sampling params:

def get_default_sampling_params(self) -> SamplingParams:
    if self.default_sampling_params is None:
        self.default_sampling_params = (
            self.llm_engine.model_config.get_diff_sampling_param())
    if self.default_sampling_params:
        return SamplingParams.from_optional(**self.default_sampling_params)
    return SamplingParams()

@hmellor hmellor reopened this Mar 21, 2025
@hmellor hmellor closed this as completed Mar 21, 2025
@hmellor
Member

hmellor commented Mar 21, 2025

Ok, I have the missing piece:

if sampling_params is None:
    # Use default sampling params.
    sampling_params = self.get_default_sampling_params()

LLM does use the model defaults if you don't pass sampling params. However, if you pass sampling params it overwrites all of them, not just the one you specified.
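
A minimal sketch of that difference, assuming the same Qwen2.5-VL model as in the earlier test:

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct")

# No sampling params passed: LLM falls back to get_default_sampling_params(),
# i.e. the defaults from the model repo's generation_config.json
# (top_p=0.001, top_k=1 for Qwen2.5-VL).
outputs_repo_defaults = llm.generate(["Hello, my name is"])

# Sampling params passed explicitly: every field now comes from this object,
# so top_p/top_k fall back to SamplingParams' own defaults and the repo's
# values no longer apply - which is why temperature takes effect here.
outputs_overridden = llm.generate(["Hello, my name is"],
                                  SamplingParams(temperature=10))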
