Streaming broken in OpenAI server in v0.2.3 (0.2.2 works) #1967

Closed
casper-hansen opened this issue Dec 7, 2023 · 7 comments · Fixed by #1992

casper-hansen (Contributor) commented Dec 7, 2023

After upgrading to the new 0.2.3, I get the following error on a Mistral 7B finetune. I am not really sure why output.logprobs is None. I suspect the error was introduced by one of these PRs: #1504 #1756 (probably the first one).

Python Code:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

models = client.models.list()
model = models.data[0].id

completion = client.completions.create(
    model=model,
    prompt="Testing sequence",
    stream=True,
    temperature=0.8,
    max_tokens=512
)

for c in completion:
    print(c.choices[0].text, end="")

Traceback:

INFO 12-07 17:44:59 api_server.py:711] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='/mnt/workspace/', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='safetensors', dtype='float16', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=True, max_log_len=None)
WARNING 12-07 17:44:59 config.py:406] Casting torch.bfloat16 to torch.float16.
INFO 12-07 17:44:59 llm_engine.py:73] Initializing an LLM engine with config: model='/mnt/workspace/', tokenizer='/mnt/workspace/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=safetensors, tensor_parallel_size=1, quantization=None, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 12-07 17:45:12 llm_engine.py:222] # GPU blocks: 27702, # CPU blocks: 2048
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 12-07 17:45:13 api_server.py:113] Using default chat template:
INFO 12-07 17:45:13 api_server.py:113] {% for message in messages %}{{'<|im_start|>' + message['role'] + '
INFO 12-07 17:45:13 api_server.py:113] ' + message['content'] + '<|im_end|>' + '
INFO 12-07 17:45:13 api_server.py:113] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 12-07 17:45:13 api_server.py:113] ' }}{% endif %}
INFO:     Started server process [87856]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:38824 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     127.0.0.1:38824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 12-07 17:45:22 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/aioprometheus/asgi/middleware.py", line 184, in __call__
    await self.asgi_callable(scope, receive, wrapped_send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 567, in completion_stream_generator
    top_logprobs = output.logprobs[previous_num_tokens[i]:]
TypeError: 'NoneType' object is not subscriptable
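
For context, the failing line slices output.logprobs, which is None whenever the request does not ask for logprobs. A minimal sketch of the kind of None-guard that avoids the TypeError, written as a standalone helper for illustration (not necessarily how the actual fix is implemented):

from typing import Optional, Sequence

def slice_top_logprobs(logprobs: Optional[Sequence[dict]], start: int) -> Optional[Sequence[dict]]:
    # output.logprobs is None when logprobs were not requested, so slicing it
    # directly raises TypeError; pass None through to the response instead.
    if logprobs is None:
        return None
    return logprobs[start:]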

kg6-sleipnir (Contributor) commented Dec 7, 2023

Ok great, I am not crazy.
I have been trying to fix this all day but have not found a solution.

Here is the Dockerfile I've been using:

# FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 as base
FROM nvcr.io/nvidia/pytorch:23.04-py3 as base

WORKDIR /workspace

RUN apt update && \
    apt install -y python3-pip python3-packaging \
    git ninja-build && \
    pip3 install -U pip

# Tweak this list to reduce build time
# https://developer.nvidia.com/cuda-gpus
ENV TORCH_CUDA_ARCH_LIST "8.6"

RUN pip3 install "torch>=2.0.0"

RUN pip3 install "xformers>=0.0.22.post7" "transformers>=4.34.0" "fschat[model_worker]>=0.2.30" "numpy"
RUN pip3 install https://github.com/vllm-project/vllm/archive/main.zip

Note that neither of the base images above worked, and installing vllm from pip also did not work.

Update:
I just tried replacing the last line with RUN pip3 install vllm==0.2.2, and it worked.
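
For clarity, the tail of the Dockerfile after that change looks like this (everything above it unchanged):

RUN pip3 install "torch>=2.0.0"
RUN pip3 install "xformers>=0.0.22.post7" "transformers>=4.34.0" "fschat[model_worker]>=0.2.30" "numpy"
RUN pip3 install vllm==0.2.2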

Tostino (Contributor) commented Dec 7, 2023

@wanmok I don't believe this was caused by my code changes... but I haven't bisected yet to double-check.

EnnoAi commented Dec 8, 2023

Same problem with an OpenAI request.
It works with a plain REST call, but not with the OpenAI client (0.28.1).

It seems to happen only when stream=True.

EnnoAi commented Dec 8, 2023

To reproduce:

curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "facebook/opt-125m", "prompt":"Who won the world series in 2020?", "max_tokens": 20, "ignore_eos": true, "stream": true }'

curl: (18) transfer closed with outstanding read data remaining
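
The same request from Python, for reference (a sketch: it assumes the server from the curl command is running on localhost:8000 serving facebook/opt-125m, and that the requests package is installed):

import requests

# Same payload as the curl command above; on v0.2.3 the stream is cut off
# mid-response as soon as "stream" is set to true.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "Who won the world series in 2020?",
    "max_tokens": 20,
    "ignore_eos": True,
    "stream": True,
}

with requests.post("http://localhost:8000/v1/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())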

wanmok (Contributor) commented Dec 8, 2023

Hmm... this was a bug we encountered during development, but we fixed it before merging. I will take a look later.

HugoMichard commented

I'm having the same issue here; it happens only when stream=True.

casper-hansen changed the title from "Bug in OpenAI server in v0.2.3 (0.2.2 works)" to "Streaming broken in OpenAI server in v0.2.3 (0.2.2 works)" Dec 8, 2023
simon-mo self-assigned this Dec 8, 2023
simon-mo (Collaborator) commented Dec 8, 2023

Here's the fix: #1992

Feel free to check out my branch to address the production issue. We will be merging this in by EOD.
