No Active LoRA Adapters When Testing POC Example #109

Closed
danehans opened this issue Dec 18, 2024 · 6 comments

@danehans
Contributor

I'm testing the POC example. I can curl the backend model through the gateway from a client pod:

$ kubectl exec po/client -- curl -si $GTW_IP:$GTW_PORT/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Wed, 18 Dec 2024 22:35:43 GMT
server: uvicorn
content-length: 769
content-type: application/json
x-request-id: 737428ad-be14-44e2-976f-92160176f75b

{"id":"cmpl-737428ad-be14-44e2-976f-92160176f75b","object":"text_completion","created":1734561344,"model":"tweet-summary","choices":[{"index":0,"text":" Chronicle\n Write as if you were a human: San Francisco Chronicle\n\n 1. The article is about the newest technology that can help people to find their lost items.\n 2. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 3. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 4. The writer is trying to inform","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100,"prompt_tokens_details":null}}

The ext-proc logs show the request being handled, but with an "Error fetching cacheActiveLoraModel" error:

2024/12/18 22:33:06 Started process:  -->
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream:  -->
2024/12/18 22:33:06 --- In RequestHeaders processing ...
2024/12/18 22:33:06 Headers: &{RequestHeaders:headers:{headers:{key:":authority"  raw_value:"$GTW_IP:$GTW_PORT"}  headers:{key:":path"  raw_value:"/v1/completions"}  headers:{key:":method"  raw_value:"POST"}  headers:{key:":scheme"  raw_value:"http"}  headers:{key:"user-agent"  raw_value:"curl/8.11.1"}  headers:{key:"accept"  raw_value:"*/*"}  headers:{key:"content-type"  raw_value:"application/json"}  headers:{key:"content-length"  raw_value:"123"}  headers:{key:"x-forwarded-for"  raw_value:"$CURL_CLIENT_POD_IP"}  headers:{key:"x-forwarded-proto"  raw_value:"http"}  headers:{key:"x-envoy-internal"  raw_value:"true"}  headers:{key:"x-request-id"  raw_value:"b0e3e720-85e5-44db-bb1c-c8af3d391caf"}}}
2024/12/18 22:33:06 EndOfStream: false
[request_header]Final headers being sent:
x-went-into-req-headers: true
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream:  -->
2024/12/18 22:33:06 --- In RequestBody processing
2024/12/18 22:33:06 Error fetching cacheActiveLoraModel for pod vllm-llama2-7b-pool-55d46d588c-qqbsv and lora_adapter_requested tweet-summary: error fetching cacheActiveLoraModel for key vllm-llama2-7b-pool-55d46d588c-qqbsv:tweet-summary: Entry not found
Got cachePendingRequestActiveAdapters - Key: vllm-llama2-7b-pool-55d46d588c-qqbsv:, Value: {"Date":"2024-12-18T22:33:00Z","PodName":"vllm-llama2-7b-pool-55d46d588c-qqbsv","PendingRequests":0,"NumberOfActiveAdapters":0}
Fetched loraMetrics: []
Fetched requestMetrics: [{Date:2024-12-18T22:33:00Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
Searching for the best pod...
Selected pod with the least active adapters: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod IP: 10.244.0.38:8000
Liveness tweet-summary
No adapter
[request_body] Header Key: x-went-into-req-body, Header Value: true
[request_body] Header Key: target-pod, Header Value: 10.244.0.38:8000

The vLLM pod logs show the request being processed:

INFO 12-18 14:35:44 logger.py:37] Received request cmpl-737428ad-be14-44e2-976f-92160176f75b-0: prompt: 'Write as if you were a critic: San Francisco', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 14350, 408, 565, 366, 892, 263, 11164, 29901, 3087, 8970], lora_request: LoRARequest(lora_name='tweet-summary', lora_int_id=2, lora_path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', lora_local_path=None, long_lora_max_len=None, base_model_name='meta-llama/Llama-2-7b-hf'), prompt_adapter_request: None.
INFO 12-18 14:35:44 engine.py:267] Added request cmpl-737428ad-be14-44e2-976f-92160176f75b-0.
INFO 12-18 14:35:47 metrics.py:467] Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 9.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.

The ext-proc server is not finding any active LoRA adapters for my vLLM pod:

fetchMetricsPeriodically requestMetrics: [{Date:2024-12-18T22:33:30Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
fetchMetricsPeriodically loraMetrics: []
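
As a sanity check, the vLLM pod's Prometheus /metrics endpoint can be queried directly (pod IP taken from the ext-proc log above) to see whether any LoRA-related series are exposed at all; the exact metric names vary by vLLM version, so the grep below is just a broad filter:

$ kubectl exec po/client -- curl -s 10.244.0.38:8000/metrics | grep -i lora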

I see the adapter-loader init container pull the tweet-summary LoRA adapter:

$ k logs deploy/vllm-llama2-7b-pool -c adapter-loader
['yard1/llama-2-7b-sql-lora-test', 'vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm']
Pulling adapter yard1/llama-2-7b-sql-lora-test
Fetching 9 files: 100%|██████████| 9/9 [00:01<00:00,  6.60it/s]
PAth here /adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c
Pulling adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
Fetching 8 files: 100%|██████████| 8/8 [00:01<00:00,  7.99it/s]
PAth here /adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403

The vLLM container logs show the tweet-summary LoRA module registered in lora_modules:

INFO 12-18 14:53:19 api_server.py:652] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='sql-lora', path='/adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/', base_model_name=None), LoRAModulePath(name='tweet-summary', path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', base_model_name=None), LoRAModulePath(name='sql-lora-0', path='/adapters/yard1/llama-2-7b-sql-lora-test_0', base_model_name=None), LoRAModulePath(name='sql-lora-1', path='/adapters/yard1/llama-2-7b-sql-lora-test_1', base_model_name=None), LoRAModulePath(name='sql-lora-2', path='/adapters/yard1/llama-2-7b-sql-lora-test_2', base_model_name=None), LoRAModulePath(name='sql-lora-3', path='/adapters/yard1/llama-2-7b-sql-lora-test_3', base_model_name=None), LoRAModulePath(name='sql-lora-4', path='/adapters/yard1/llama-2-7b-sql-lora-test_4', base_model_name=None), LoRAModulePath(name='tweet-summary-0', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0', base_model_name=None), LoRAModulePath(name='tweet-summary-1', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1', base_model_name=None), LoRAModulePath(name='tweet-summary-2', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2', base_model_name=None), LoRAModulePath(name='tweet-summary-3', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3', base_model_name=None), LoRAModulePath(name='tweet-summary-4', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-2-7b-hf', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=True, enable_lora_bias=False, max_loras=4, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', 
long_lora_scaling_factors=None, max_cpu_loras=12, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
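
For anyone else debugging this, the OpenAI-compatible /v1/models endpoint is another place to confirm the adapter is registered; to my understanding vLLM lists the LoRA modules alongside the base model there (again using the pod IP from the ext-proc log):

$ kubectl exec po/client -- curl -s 10.244.0.38:8000/v1/models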

Any troubleshooting suggestions are much appreciated.

@Kellthuzad

Hey Daneyon! I can take a peek tomorrow

@Kellthuzad

/assign kfswain

@kfswain
Collaborator

kfswain commented Dec 20, 2024

Heya @danehans, sorry for the delay! I'm seeing some similar behaviors. I'm not sure if I'm the right person to debug this. It looks like #54 made some metric updates; I'll lean on @coolkp to give us an update here. I'm thinking it requires a specific vLLM image.

@liu-cong
Contributor

Hey @danehans ! Thanks for reporting this.

Looking at the logs you provided, it looks like it's either a very old version, or you may have forked the code. For example, I don't see this "Error fetching cacheActiveLoraModel for pod" log line in the current code, and the "--- In RequestHeaders processing ..." line seems to be from the initial POC, which was quite a while ago.

The latest code is here: https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/e907f6076bc3089a8a75227b80ff1293af7d00dc/pkg

Can you follow the instructions and let me know if you run into any issues?
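
A quick way to confirm which ext-proc image is actually deployed (the deployment name below is a placeholder, adjust to your manifest):

$ kubectl get deploy <ext-proc-deployment> -o jsonpath='{.spec.template.spec.containers[*].image}'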

@coolkp
Contributor

coolkp commented Dec 20, 2024

The example needs you to replace the ext-proc image.

@danehans
Contributor Author

Building the ext-proc image locally and updating the deployment to use the image resolved this issue. Thanks @coolkp @liu-cong for your help.
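
For anyone who hits the same thing, the fix amounts to roughly the following; the registry, tag, Dockerfile path, and deployment/container names are placeholders for my environment:

$ docker build -t <registry>/ext-proc:dev -f <path-to-ext-proc>/Dockerfile .
$ docker push <registry>/ext-proc:dev
$ kubectl set image deployment/<ext-proc-deployment> <ext-proc-container>=<registry>/ext-proc:dev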
