Streaming broken in OpenAI server in v0.2.3 (0.2.2 works) #1967

Closed
casper-hansen opened this issue Dec 7, 2023 · 7 comments · Fixed by #1992

casper-hansen (Contributor) commented Dec 7, 2023

After upgrading to the new 0.2.3, I get the following error on a Mistral 7B finetune. I am not really sure why output.logprobs is None. I suspect the error was introduced by one of these PRs: #1504 #1756 (probably the first one).

Python Code:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

models = client.models.list()
model = models.data[0].id

completion = client.completions.create(
    model=model,
    prompt="Testing sequence",
    stream=True,
    temperature=0.8,
    max_tokens=512
)

for c in completion:
    print(c.choices[0].text, end="")

Traceback:

INFO 12-07 17:44:59 api_server.py:711] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='/mnt/workspace/', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='safetensors', dtype='float16', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=True, max_log_len=None)
WARNING 12-07 17:44:59 config.py:406] Casting torch.bfloat16 to torch.float16.
INFO 12-07 17:44:59 llm_engine.py:73] Initializing an LLM engine with config: model='/mnt/workspace/', tokenizer='/mnt/workspace/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=safetensors, tensor_parallel_size=1, quantization=None, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 12-07 17:45:12 llm_engine.py:222] # GPU blocks: 27702, # CPU blocks: 2048
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 12-07 17:45:13 api_server.py:113] Using default chat template:
INFO 12-07 17:45:13 api_server.py:113] {% for message in messages %}{{'<|im_start|>' + message['role'] + '
INFO 12-07 17:45:13 api_server.py:113] ' + message['content'] + '<|im_end|>' + '
INFO 12-07 17:45:13 api_server.py:113] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 12-07 17:45:13 api_server.py:113] ' }}{% endif %}
INFO:     Started server process [87856]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:38824 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     127.0.0.1:38824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 12-07 17:45:22 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/aioprometheus/asgi/middleware.py", line 184, in __call__
    await self.asgi_callable(scope, receive, wrapped_send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 567, in completion_stream_generator
    top_logprobs = output.logprobs[previous_num_tokens[i]:]
TypeError: 'NoneType' object is not subscriptable
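
For context, the failing line slices output.logprobs, which is None whenever the request does not ask for logprobs. A minimal sketch of the kind of None-guard that avoids the TypeError, written as a standalone helper for illustration (not necessarily how the actual fix is implemented):

from typing import Optional, Sequence

def slice_top_logprobs(logprobs: Optional[Sequence[dict]], start: int) -> Optional[Sequence[dict]]:
    # output.logprobs is None when logprobs were not requested, so slicing it
    # directly raises TypeError; pass None through to the response instead.
    if logprobs is None:
        return None
    return logprobs[start:]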

kg6-sleipnir (Contributor) commented Dec 7, 2023

Ok great, I am not crazy.
I have been trying to fix this all day but have not found a solution.

Here is the Dockerfile I've been using:

# FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 as base
FROM nvcr.io/nvidia/pytorch:23.04-py3 as base

WORKDIR /workspace

RUN apt update && \
    apt install -y python3-pip python3-packaging \
    git ninja-build && \
    pip3 install -U pip

# Tweak this list to reduce build time
# https://developer.nvidia.com/cuda-gpus
ENV TORCH_CUDA_ARCH_LIST "8.6"

RUN pip3 install "torch>=2.0.0"

RUN pip3 install "xformers>=0.0.22.post7" "transformers>=4.34.0" "fschat[model_worker]>=0.2.30" "numpy"
RUN pip3 install https://github.com/vllm-project/vllm/archive/main.zip

Note that neither of the base images above worked, and installing vllm from pip also did not work.

Update:
I just tried replacing the last line with RUN pip3 install vllm==0.2.2, and it worked.
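
For clarity, the tail of the Dockerfile after that change looks like this (everything above it unchanged):

RUN pip3 install "torch>=2.0.0"
RUN pip3 install "xformers>=0.0.22.post7" "transformers>=4.34.0" "fschat[model_worker]>=0.2.30" "numpy"
RUN pip3 install vllm==0.2.2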

Tostino (Contributor) commented Dec 7, 2023

@wanmok I don't believe this was caused by my code changes... but I haven't bisected yet to double-check.

EnnoAi commented Dec 8, 2023

Same problem with an OpenAI request.
It works with a plain REST call, but not with the OpenAI client (0.28.1).

It seems to happen only when stream=True.

EnnoAi commented Dec 8, 2023

To reproduce:

curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "facebook/opt-125m", "prompt":"Who won the world series in 2020?", "max_tokens": 20, "ignore_eos": true, "stream": true }'

curl: (18) transfer closed with outstanding read data remaining
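
The same request from Python, for reference (a sketch: it assumes the server from the curl command is running on localhost:8000 serving facebook/opt-125m, and that the requests package is installed):

import requests

# Same payload as the curl command above; on v0.2.3 the stream is cut off
# mid-response as soon as "stream" is set to true.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "Who won the world series in 2020?",
    "max_tokens": 20,
    "ignore_eos": True,
    "stream": True,
}

with requests.post("http://localhost:8000/v1/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())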

wanmok (Contributor) commented Dec 8, 2023

Hmm... this was a bug we encountered during development, but we fixed it before merging. I will take a look later.

HugoMichard commented

I'm having the same issue here; it happens only when stream=True.

casper-hansen changed the title from "Bug in OpenAI server in v0.2.3 (0.2.2 works)" to "Streaming broken in OpenAI server in v0.2.3 (0.2.2 works)" Dec 8, 2023
simon-mo self-assigned this Dec 8, 2023
simon-mo (Collaborator) commented Dec 8, 2023

Here's the fix: #1992

Feel free to check out my branch to address the production issue. We will be merging this in by EOD.
