-
I know this might be pretty basic, but I am not able to find a solution. I am running the Llama 3 model with llama-server using the command below.

I have tried the two solutions below.

Note: I also tried running the server with --prompt-cache-all and --prompt-cache-ro, but I cannot see it reading from the cache in the responses. However, if I call the API and set cache_prompt to true, I can see the cache being used and the response is much faster.
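For reference, this is roughly what the API call that does hit the cache looks like for me; a minimal sketch, assuming llama-server is listening on its default port 8080 and using the native /completion endpoint (the prompt is just a placeholder):

```python
import requests

# Endpoint and port assume a default llama-server setup; the prompt is hypothetical.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Summarize the following text: ...",
        "n_predict": 128,
        "cache_prompt": True,  # ask the server to reuse the KV cache from the previous request
    },
)
print(resp.json()["content"])
```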
Replies: 1 comment 1 reply
-
In the chat completion call, pass `extra_body={"cache_prompt": True}` with the request.
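A minimal sketch of how that looks with the OpenAI Python client pointed at llama-server's OpenAI-compatible endpoint; the base_url, api_key, and model name are assumptions for a local setup:

```python
from openai import OpenAI

# base_url, api_key, and model name are placeholders for a local llama-server instance.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="llama-3",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"cache_prompt": True},  # forwarded to llama-server to reuse the prompt KV cache
)
print(response.choices[0].message.content)
```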