
Not seeing any performance improvement from 0.2.0 #1269


Closed

nutmilk10 opened this issue Oct 5, 2023 · 1 comment

Comments

@nutmilk10

The latest release was announced with a 60% performance improvement. I'm unable to use quantization since my GPU isn't compatible with it, but compared to 0.1.8 the runtime is unchanged at 2.1 seconds. Am I doing something wrong in my config?

Here are my specs:

OS: Ubuntu 20.04
CUDA Version: 11.2
CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
CPU RAM: 200 GB
GPU: Tesla V100-SXM2 32GB

prompts = """
    Summarize the message below, delimited by triple backticks, using short bullet points.
    ```{message}```
    BULLET POINT SUMMARY:
"""
from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Llama-2-13b-chat-hf, trust_remote_code=True, dtype="float16", tensor_parallel_size=1, gpu_memory_utilization=.95, disable_log_stats=True, tokenizer='hf-internal-testing/llama-tokenizer')

sampling_params = SamplingParams(n = 1, best_of = 1, presence_penalty = 0, frequency_penalty = 0, temperature=0, top_p=1.0, top_k=-1, use_beam_search=False, stop="<|endoftext|>", max_tokens=1024)

outputs = llm.generate(prompts, sampling_params)
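
A minimal sketch of how the 2.1 s figure can be measured, reusing the llm, prompts, and sampling_params defined above (the timing harness itself is an assumption, not shown in the original report):

import time

# Hypothetical timing harness: end-to-end latency of one generate() call.
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
print(f"end-to-end latency: {time.perf_counter() - start:.2f} s")  # ~2.1 s reported on both 0.1.8 and 0.2.0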
@WoosukKwon
Collaborator

Hi @nutmilk10, in v0.2.0 we mostly optimized for throughput. The core optimizations were to the de-tokenizer (#984) and the sampler (#1048). These reduce a lot of per-step overhead when many requests are batched together, so single-request latency may not improve much.

Now we are focusing on reducing latency. Please stay tuned for the upcoming optimizations!
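
For illustration, the v0.2.0 gains show up when many prompts go into a single generate() call. A minimal sketch, assuming the same model as above (the prompt list, batch size, and token budget are illustrative, not from this thread):

import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", dtype="float16",
          tokenizer="hf-internal-testing/llama-tokenizer")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# A batch of illustrative prompts: the v0.2.0 de-tokenizer (#984) and
# sampler (#1048) optimizations amortize per-step overhead across the
# batch, so aggregate throughput (tokens/s) improves even though the
# latency of any single request stays roughly the same.
prompts = [f"Summarize document {i} in one sentence." for i in range(64)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{total_tokens} generated tokens in {elapsed:.1f} s ({total_tokens / elapsed:.0f} tokens/s)")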
