The latest release was said to have a 60% performance improvement. I'm unable to use quantization since my GPU is not compatible, but compared to 0.1.8 the runtime is unchanged at 2.1 seconds. Am I doing something wrong in my config?
Here are my specs:
OS: Ubuntu 20.04
CUDA Version: 11.2
CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
CPU RAM: 200
GPU: Tesla V100-SXM2 32GB
from vllm import LLM, SamplingParams

prompts = """
Summarize the message below, delimited by triple backticks, using short bullet points.
```{message}```
BULLET POINT SUMMARY:
"""

llm = LLM(model='meta-llama/Llama-2-13b-chat-hf', trust_remote_code=True, dtype="float16",
          tensor_parallel_size=1, gpu_memory_utilization=0.95, disable_log_stats=True,
          tokenizer='hf-internal-testing/llama-tokenizer')
sampling_params = SamplingParams(n=1, best_of=1, presence_penalty=0, frequency_penalty=0,
                                 temperature=0, top_p=1.0, top_k=-1, use_beam_search=False,
                                 stop="<|endoftext|>", max_tokens=1024)
outputs = llm.generate(prompts, sampling_params)
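As an aside, the `{message}` placeholder in the prompt template above is never substituted before the prompt is sent to the model. Filling it with `str.format()` first might look like the sketch below (the example message text is hypothetical):

```python
# Hypothetical example: fill the {message} placeholder via str.format()
# before passing the prompt to generate().
template = (
    "Summarize the message below, delimited by triple backticks, "
    "using short bullet points.\n"
    "```{message}```\n"
    "BULLET POINT SUMMARY:\n"
)

message = "Standup moved to 3pm; please review the report beforehand."  # placeholder input
prompt = template.format(message=message)
print(prompt)
```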
Hi @nutmilk10, in v0.2.0 we mostly optimized for throughput. The core optimizations were to the de-tokenizer (#984) and the sampler (#1048). These reduce a lot of overhead when many requests are batched, so single-request latency may not improve much.
Now we are focusing on reducing latency. Please stay tuned for the upcoming optimizations!
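Since the v0.2.0 gains show up under batching, one way to see them is to pass many prompts to a single `generate()` call and measure requests per second rather than single-request latency. A minimal, framework-agnostic timing sketch (the `generate_fn` callable is a stand-in for something like `lambda ps: llm.generate(ps, sampling_params)`):

```python
import time

def throughput(generate_fn, prompts):
    """Time one batched call and return requests per second.

    generate_fn: any callable that accepts a list of prompts,
    e.g. lambda ps: llm.generate(ps, sampling_params).
    """
    start = time.perf_counter()
    generate_fn(prompts)
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed

# Usage sketch (requires a running vLLM engine):
# rps = throughput(lambda ps: llm.generate(ps, sampling_params),
#                  [prompt_text] * 32)
```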