No Active LoRA Adapters When Testing POC Example #109

Closed
danehans opened this issue Dec 18, 2024 · 6 comments

@danehans
Contributor

I'm testing the POC example. I can curl the backend model through the gateway from a client pod:

$ kubectl exec po/client -- curl -si $GTW_IP:$GTW_PORT/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Wed, 18 Dec 2024 22:35:43 GMT
server: uvicorn
content-length: 769
content-type: application/json
x-request-id: 737428ad-be14-44e2-976f-92160176f75b

{"id":"cmpl-737428ad-be14-44e2-976f-92160176f75b","object":"text_completion","created":1734561344,"model":"tweet-summary","choices":[{"index":0,"text":" Chronicle\n Write as if you were a human: San Francisco Chronicle\n\n 1. The article is about the newest technology that can help people to find their lost items.\n 2. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 3. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 4. The writer is trying to inform","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100,"prompt_tokens_details":null}}

The ext-proc logs show the request being handled, but with an "Error fetching cacheActiveLoraModel" error:

2024/12/18 22:33:06 Started process:  -->
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream:  -->
2024/12/18 22:33:06 --- In RequestHeaders processing ...
2024/12/18 22:33:06 Headers: &{RequestHeaders:headers:{headers:{key:":authority"  raw_value:"$GTW_IP:$GTW_PORT"}  headers:{key:":path"  raw_value:"/v1/completions"}  headers:{key:":method"  raw_value:"POST"}  headers:{key:":scheme"  raw_value:"http"}  headers:{key:"user-agent"  raw_value:"curl/8.11.1"}  headers:{key:"accept"  raw_value:"*/*"}  headers:{key:"content-type"  raw_value:"application/json"}  headers:{key:"content-length"  raw_value:"123"}  headers:{key:"x-forwarded-for"  raw_value:"$CURL_CLIENT_POD_IP"}  headers:{key:"x-forwarded-proto"  raw_value:"http"}  headers:{key:"x-envoy-internal"  raw_value:"true"}  headers:{key:"x-request-id"  raw_value:"b0e3e720-85e5-44db-bb1c-c8af3d391caf"}}}
2024/12/18 22:33:06 EndOfStream: false
[request_header]Final headers being sent:
x-went-into-req-headers: true
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream:  -->
2024/12/18 22:33:06 --- In RequestBody processing
2024/12/18 22:33:06 Error fetching cacheActiveLoraModel for pod vllm-llama2-7b-pool-55d46d588c-qqbsv and lora_adapter_requested tweet-summary: error fetching cacheActiveLoraModel for key vllm-llama2-7b-pool-55d46d588c-qqbsv:tweet-summary: Entry not found
Got cachePendingRequestActiveAdapters - Key: vllm-llama2-7b-pool-55d46d588c-qqbsv:, Value: {"Date":"2024-12-18T22:33:00Z","PodName":"vllm-llama2-7b-pool-55d46d588c-qqbsv","PendingRequests":0,"NumberOfActiveAdapters":0}
Fetched loraMetrics: []
Fetched requestMetrics: [{Date:2024-12-18T22:33:00Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
Searching for the best pod...
Selected pod with the least active adapters: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod IP: 10.244.0.38:8000
Liveness tweet-summary
No adapter
[request_body] Header Key: x-went-into-req-body, Header Value: true
[request_body] Header Key: target-pod, Header Value: 10.244.0.38:8000

The vLLM pod logs show the request being processed:

INFO 12-18 14:35:44 logger.py:37] Received request cmpl-737428ad-be14-44e2-976f-92160176f75b-0: prompt: 'Write as if you were a critic: San Francisco', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 14350, 408, 565, 366, 892, 263, 11164, 29901, 3087, 8970], lora_request: LoRARequest(lora_name='tweet-summary', lora_int_id=2, lora_path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', lora_local_path=None, long_lora_max_len=None, base_model_name='meta-llama/Llama-2-7b-hf'), prompt_adapter_request: None.
INFO 12-18 14:35:44 engine.py:267] Added request cmpl-737428ad-be14-44e2-976f-92160176f75b-0.
INFO 12-18 14:35:47 metrics.py:467] Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 9.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.

The ext-proc server is not finding any active LoRA adapters for my vLLM pod:

fetchMetricsPeriodically requestMetrics: [{Date:2024-12-18T22:33:30Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
fetchMetricsPeriodically loraMetrics: []
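
As a sanity check, the vLLM pod's Prometheus /metrics endpoint can be queried directly (pod IP taken from the ext-proc log above) to see whether any LoRA-related series are exposed at all; the exact metric names vary by vLLM version, so the grep below is just a broad filter:

$ kubectl exec po/client -- curl -s 10.244.0.38:8000/metrics | grep -i lora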

I see the adapter-loader init container pull the tweet-summary LoRA adapter:

$ k logs deploy/vllm-llama2-7b-pool -c adapter-loader
['yard1/llama-2-7b-sql-lora-test', 'vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm']
Pulling adapter yard1/llama-2-7b-sql-lora-test
Fetching 9 files: 100%|██████████| 9/9 [00:01<00:00,  6.60it/s]
PAth here /adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c
Pulling adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
Fetching 8 files: 100%|██████████| 8/8 [00:01<00:00,  7.99it/s]
PAth here /adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403

The vLLM container logs show the tweet-summary LoRA module registered in lora_modules:

INFO 12-18 14:53:19 api_server.py:652] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='sql-lora', path='/adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/', base_model_name=None), LoRAModulePath(name='tweet-summary', path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', base_model_name=None), LoRAModulePath(name='sql-lora-0', path='/adapters/yard1/llama-2-7b-sql-lora-test_0', base_model_name=None), LoRAModulePath(name='sql-lora-1', path='/adapters/yard1/llama-2-7b-sql-lora-test_1', base_model_name=None), LoRAModulePath(name='sql-lora-2', path='/adapters/yard1/llama-2-7b-sql-lora-test_2', base_model_name=None), LoRAModulePath(name='sql-lora-3', path='/adapters/yard1/llama-2-7b-sql-lora-test_3', base_model_name=None), LoRAModulePath(name='sql-lora-4', path='/adapters/yard1/llama-2-7b-sql-lora-test_4', base_model_name=None), LoRAModulePath(name='tweet-summary-0', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0', base_model_name=None), LoRAModulePath(name='tweet-summary-1', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1', base_model_name=None), LoRAModulePath(name='tweet-summary-2', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2', base_model_name=None), LoRAModulePath(name='tweet-summary-3', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3', base_model_name=None), LoRAModulePath(name='tweet-summary-4', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-2-7b-hf', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=True, enable_lora_bias=False, max_loras=4, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', 
long_lora_scaling_factors=None, max_cpu_loras=12, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
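
For anyone else debugging this, the OpenAI-compatible /v1/models endpoint is another place to confirm the adapter is registered; to my understanding vLLM lists the LoRA modules alongside the base model there (again using the pod IP from the ext-proc log):

$ kubectl exec po/client -- curl -s 10.244.0.38:8000/v1/models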

Any troubleshooting suggestions are much appreciated.

@Kellthuzad

Hey Daneyon! I can take a peek tomorrow

@Kellthuzad

/assign kfswain

@kfswain
Collaborator

kfswain commented Dec 20, 2024

Heya @danehans, sorry for the delay! I'm seeing some similar behaviors. I'm not sure if I'm the right person to debug this. It looks like #54 made some metric updates; I'll lean on @coolkp to give us an update here. I'm thinking it requires a specific vLLM image.

@liu-cong
Contributor

Hey @danehans ! Thanks for reporting this.

Looking at the logs you provided, it looks like it's either a very old version, or you may have forked the code. For example, I don't see this "Error fetching cacheActiveLoraModel for pod" log line in the current code, and the "--- In RequestHeaders processing ..." line seems to be from the initial POC, which was quite a while ago.

The latest code is here: https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/e907f6076bc3089a8a75227b80ff1293af7d00dc/pkg

Can you follow the instructions and let me know if you run into any issues?
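
A quick way to confirm which ext-proc image is actually deployed (the deployment name below is a placeholder, adjust to your manifest):

$ kubectl get deploy <ext-proc-deployment> -o jsonpath='{.spec.template.spec.containers[*].image}'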

@coolkp
Contributor

coolkp commented Dec 20, 2024

The example needs you to replace the ext-proc image.

@danehans
Contributor Author

Building the ext-proc image locally and updating the deployment to use the image resolved this issue. Thanks @coolkp @liu-cong for your help.
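
For anyone who hits the same thing, the fix amounts to roughly the following; the registry, tag, Dockerfile path, and deployment/container names are placeholders for my environment:

$ docker build -t <registry>/ext-proc:dev -f <path-to-ext-proc>/Dockerfile .
$ docker push <registry>/ext-proc:dev
$ kubectl set image deployment/<ext-proc-deployment> <ext-proc-container>=<registry>/ext-proc:dev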
