Replies: 3 comments
-
Hello! I have the same problem with:

import os
from vllm import LLM, SamplingParams

# Use the spawn start method for vLLM worker processes
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4"
os.environ["TOKENIZERS_PARALLELISM"] = "True"
os.environ["VLLM_USE_MODELSCOPE"] = "True"

def vllm_call(model_id, prompts, devices=1):
    sampling_params = SamplingParams(temperature=0.1, top_p=0.95)
    llm = LLM(model=model_id,
              quantization='bitsandbytes',
              load_format='bitsandbytes',
              max_model_len=4000,
              gpu_memory_utilization=0.95,
              pipeline_parallel_size=devices,
              # tensor_parallel_size=devices,
              enforce_eager=None)
    outputs = llm.generate(prompts, sampling_params)
    return outputs

if __name__ == "__main__":
    model_id = "./Mistral-Nemo-Instruct-2407_bab-4bit-double"
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    outputs = vllm_call(model_id, prompts, devices=4)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt}, Generated text: {generated_text}")
-
Same issue :P
-
I'm facing the same issue. Any updates?
-
I trained Llama-3.1 with QLoRA as below.
Run inference:
I got:
I use vllm==0.6.2.
Any suggestion or help is highly appreciated.
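The training code and the exact error output are not included above, so as a point of reference, here is a minimal sketch of how a bitsandbytes-quantized checkpoint is typically loaded for inference with vLLM 0.6.x. The model path, prompt, and sampling settings below are placeholders, not the original poster's code:

from vllm import LLM, SamplingParams

# Hypothetical path for illustration: a QLoRA checkpoint merged into the base model
model_path = "./llama-3.1-8b-qlora-merged"

llm = LLM(model=model_path,
          quantization="bitsandbytes",   # load weights via bitsandbytes 4-bit quantization
          load_format="bitsandbytes",
          max_model_len=4096,
          gpu_memory_utilization=0.9)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)

If the QLoRA adapter was not merged into the base model, vLLM's LoRA support (enable_lora=True on the LLM constructor plus a LoRARequest passed to generate) is the usual alternative.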