
Anthropic's prompt caching in langchain does not work with ChatPromptTemplate. #26701


Open
2 tasks done
raajChit opened this issue Sep 20, 2024 · 12 comments · May be fixed by Nilesh3105/langchain#1
Labels: 🤖:docs (Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder), investigate (Flagged for investigation.)

Comments

@raajChit

URL

https://python.langchain.com/docs/how_to/llm_caching/

Checklist

  • I added a very descriptive title to this issue.
  • I included a link to the documentation page I am referring to (if applicable).

Issue with current documentation:

I have not found any documentation for prompt caching in the LangChain documentation. There seems to be only one post on Twitter regarding prompt caching in LangChain. I am trying to implement prompt caching in my RAG system, which uses a history-aware retriever.

I have instantiated the model like this:

from langchain_anthropic import ChatAnthropic

llm_claude = ChatAnthropic(
    model="claude-3-5-sonnet-20240620",
    temperature=0.1,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)

And using the ChatPromptTemplate like this:

from langchain_core.prompts import ChatPromptTemplate

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        ("human", "{input}"),
    ]
)

I am not able to find a way to include prompt caching with this.
I tried making the prompt like this, but it still doesn't work.

from langchain_core.messages import HumanMessage, SystemMessage

prompt = ChatPromptTemplate.from_messages([
    SystemMessage(
        content=contextualize_q_system_prompt,
        additional_kwargs={"cache_control": {"type": "ephemeral"}},
    ),
    HumanMessage(content="{input}"),
])

Please help me with how I should enable prompt caching in langchain.

Idea or request for content:

The LangChain documentation should be updated to explain how to use prompt caching with different prompt templates, and especially with a RAG system.

langcarl bot added the investigate (Flagged for investigation.) label on Sep 20, 2024
dosubot bot added the 🤖:docs (Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder) label on Sep 20, 2024
raajChit changed the title from "DOC: Anthropic's prompt caching in langchain does not work with ChatPromptTemplate." to "Anthropic's prompt caching in langchain does not work with ChatPromptTemplate." on Sep 20, 2024

dosubot bot commented Dec 20, 2024

Hi, @raajChit. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You raised a concern about the lack of documentation for implementing prompt caching in LangChain.
  • Specifically, you are looking for guidance on using ChatPromptTemplate with Anthropic's models.
  • Your goal is to integrate prompt caching into a retrieval-augmented generation (RAG) system.
  • You suggested updating the LangChain documentation to include instructions for different prompt templates.
  • There have been no further comments or developments on this issue.

Next Steps:

  • Please let us know if this issue is still relevant to the latest version of the LangChain repository by commenting here.
  • If there is no further activity, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot bot added the stale (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) label on Dec 20, 2024
@drorm

drorm commented Dec 23, 2024

This is a major issue. Without caching it's just too slow and too expensive. I tried:

  • Adding additional_kwargs={"cache_control": {"type": "ephemeral"}}
  • Adding {"cache_control": {"type": "ephemeral"}} to the message

and caching doesn't happen.

The only thing that does get cached is the system prompt.

dosubot bot removed the stale (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) label on Dec 23, 2024

dosubot bot commented Dec 23, 2024

@eyurtsev, the user has indicated that the issue regarding prompt caching is still a major concern, as it significantly impacts performance and cost. Could you please assist them with this matter?

@kornicameister

@raajChit @drorm @eyurtsev — I'm struggling with this one too. I have an 80k-token prompt that has some placeholders in it. The moment I use SystemMessage, caching works, but I lose the tooling that fills in the placeholders. If I replace it with SystemMessagePromptTemplate, the placeholders receive values, but caching does not work.

Is it not possible by design to cache this because it is not a static prompt?
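
One workaround I'm considering (just a sketch, untested): pre-format the placeholders myself with a plain PromptTemplate and wrap the result in a SystemMessage whose content block carries cache_control:

from langchain_core.messages import SystemMessage
from langchain_core.prompts import PromptTemplate

# LONG_SYSTEM_PROMPT stands in for the ~80k-token prompt containing {placeholders}.
system_template = PromptTemplate.from_template(LONG_SYSTEM_PROMPT)

def build_system_message(**values) -> SystemMessage:
    text = system_template.format(**values)
    # Content-block form lets cache_control ride along to the Anthropic client.
    return SystemMessage(
        content=[
            {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
        ]
    )

Of course, the cache only helps if the formatted text comes out identical between calls, so this only pays off when the placeholder values are stable.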

@DonghaeSuh

I've posted a discussion on #29747 to enable the feature mentioned in this issue.

@Nilesh3105

After initial investigation, it appears the only method for passing arbitrary key-value pairs (e.g., cache_control) in chat or system messages to the Anthropic client is by structuring the message content as a list[dict] when creating an instance of a BaseMessage subclass.

An example, for reference:

from langchain_core.messages import HumanMessage

messages = [
    HumanMessage(
        content=[
            {
                "type": "text",
                "text": TRANSCRIPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
    ),
    HumanMessage(
        content="Summarize the transcript in 2-3 sentences.",
    ),
]

response = model.invoke(messages)
response.usage_metadata

Now, when working with _StringImageMessagePromptTemplate subclasses (e.g., HumanMessagePromptTemplate), this doesn't work. I looked at the code and found the reason in the following code blocks:

In the _StringImageMessagePromptTemplate.from_template method:

elif isinstance(template, list):
    if (partial_variables is not None) and len(partial_variables) > 0:
        msg = "Partial variables are not supported for list of templates."
        raise ValueError(msg)
    prompt = []
    for tmpl in template:
        if isinstance(tmpl, str) or isinstance(tmpl, dict) and "text" in tmpl:
            if isinstance(tmpl, str):
                text: str = tmpl
            else:
                text = cast(_TextTemplateParam, tmpl)["text"]  # type: ignore[assignment]
            prompt.append(
                PromptTemplate.from_template(
                    text, template_format=template_format
                )
            )

In the _StringImageMessagePromptTemplate.format method:

if isinstance(self.prompt, StringPromptTemplate):
    text = self.prompt.format(**kwargs)
    return self._msg_class(
        content=text, additional_kwargs=self.additional_kwargs
    )
else:
    content: list = []
    for prompt in self.prompt:
        inputs = {var: kwargs[var] for var in prompt.input_variables}
        if isinstance(prompt, StringPromptTemplate):
            formatted: Union[str, ImageURL] = prompt.format(**inputs)
            content.append({"type": "text", "text": formatted})
So you can see that when it receives a list of dicts with a "text" key, it converts each one to a StringPromptTemplate and drops all the additional properties/kwargs. In the format method, it creates a new dict, but it cannot carry over the additional properties/kwargs present in the original message.


One solution I can think of, with minimal change, is to store these additional properties when creating the PromptTemplate instance and re-attach them in the format method before returning the formatted message.
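
Roughly, the idea would look like this (a standalone sketch of the mechanism only; the class name and structure are hypothetical, not the actual LangChain implementation):

from langchain_core.messages import HumanMessage
from langchain_core.prompts import PromptTemplate


class BlockPreservingTemplate:
    """Sketch: keep a content block's extra keys next to its template."""

    def __init__(self, block: dict):
        self._template = PromptTemplate.from_template(block["text"])
        # Everything other than "type"/"text" (e.g. cache_control) is preserved.
        self._extra = {k: v for k, v in block.items() if k not in ("type", "text")}

    def format(self, **kwargs) -> dict:
        # Re-attach the preserved keys to the freshly formatted block.
        return {"type": "text", "text": self._template.format(**kwargs), **self._extra}


block = {"type": "text", "text": "{context}", "cache_control": {"type": "ephemeral"}}
message = HumanMessage(content=[BlockPreservingTemplate(block).format(context="...")])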

@baskaryan @ccurme If this sounds good, then I can potentially open a PR for this.

@drorm

drorm commented Mar 4, 2025

I have a partial solution. It reduced my cost by 2/3, but there's still some work to do here.
I created my own version of https://github.com/langchain-ai/langchain/blob/master/libs/partners/anthropic/langchain_anthropic/chat_models.py
at
https://github.com/drorm/vmpilot/blob/main/src/vmpilot/caching/chat_models.py
and then in https://github.com/drorm/vmpilot/blob/main/src/vmpilot/ you can see how I mark blocks as ephemeral (a rough sketch of the pattern follows the excerpts):

agent.py:79:                    block["cache_control"] = {"type": "ephemeral"}
agent.py:83:                message.additional_kwargs["cache_control"] = {"type": "ephemeral"}
agent.py:175:            system_content["cache_control"] = {"type": "ephemeral"}
vmpilot.py:261:                    "type": "ephemeral"
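
The marking pattern, roughly (a simplified sketch of the idea, not the actual vmpilot code):

def mark_ephemeral(system_content: dict, messages: list) -> None:
    # Cache the large, stable system prompt ...
    system_content["cache_control"] = {"type": "ephemeral"}
    # ... and the latest message, so the growing conversation prefix gets reused.
    last = messages[-1]
    if isinstance(last.content, list) and last.content:
        last.content[-1]["cache_control"] = {"type": "ephemeral"}
    else:
        last.additional_kwargs["cache_control"] = {"type": "ephemeral"}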

This results in:

INFO:     157.131.22.45:0 - "POST /chat/completions HTTP/1.1" 200 OK                                                                                                                                                                   
{'cache_creation_input_tokens': 1694, 'cache_read_input_tokens': 2521, 'input_tokens': 4, 'output_tokens': 172}                                                                                                                        
{'cache_creation_input_tokens': 208, 'cache_read_input_tokens': 4215, 'input_tokens': 6, 'output_tokens': 317}                                                                                                                         
{'cache_creation_input_tokens': 331, 'cache_read_input_tokens': 4423, 'input_tokens': 6, 'output_tokens': 98}                                                                                                                          
{'cache_creation_input_tokens': 352, 'cache_read_input_tokens': 4754, 'input_tokens': 6, 'output_tokens': 475}                                                                                                                         
{'cache_creation_input_tokens': 514, 'cache_read_input_tokens': 5106, 'input_tokens': 6, 'output_tokens': 157}                                                                                                                         
{'cache_creation_input_tokens': 195, 'cache_read_input_tokens': 5620, 'input_tokens': 6, 'output_tokens': 309}                                                                                                                         
{'cache_creation_input_tokens': 347, 'cache_read_input_tokens': 5815, 'input_tokens': 6, 'output_tokens': 155}                                                                                                                         
{'cache_creation_input_tokens': 193, 'cache_read_input_tokens': 6162, 'input_tokens': 6, 'output_tokens': 330}                                                                                                                         
{'cache_creation_input_tokens': 368, 'cache_read_input_tokens': 6355, 'input_tokens': 6, 'output_tokens': 118}                                                                                                                         
{'cache_creation_input_tokens': 174, 'cache_read_input_tokens': 6723, 'input_tokens': 6, 'output_tokens': 102}                                                                                                                         
{'cache_creation_input_tokens': 298, 'cache_read_input_tokens': 6897, 'input_tokens': 6, 'output_tokens': 107}                                                                                                                         
{'cache_creation_input_tokens': 332, 'cache_read_input_tokens': 7195, 'input_tokens': 6, 'output_tokens': 211}                                                                                                                         
ne 25 steps in a row. Let me know if you'd like me to continue.                                                                                                                                                                        
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK                                                                                                                                                                              
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK                                                                                                                                                                              
vmpilot.anthropic                                                                                                                                                                                                                      
vmpilot.anthropic                                                                                                                                                                                                                      
INFO:     157.131.22.45:0 - "POST /chat/completions HTTP/1.1" 200 OK                                                                                                                                                                   
{'cache_creation_input_tokens': 980, 'cache_read_input_tokens': 4215, 'input_tokens': 4, 'output_tokens': 732}                                                                                                                         
{'cache_creation_input_tokens': 743, 'cache_read_input_tokens': 5195, 'input_tokens': 6, 'output_tokens': 99}                                                                                                                          
{'cache_creation_input_tokens': 324, 'cache_read_input_tokens': 5938, 'input_tokens': 6, 'output_tokens': 488}                                                                                                                         
{'cache_creation_input_tokens': 525, 'cache_read_input_tokens': 6262, 'input_tokens': 6, 'output_tokens': 107}                                                                                                                         
{'cache_creation_input_tokens': 345, 'cache_read_input_tokens': 6787, 'input_tokens': 6, 'output_tokens': 86}                                                                                                                          
{'cache_creation_input_tokens': 498, 'cache_read_input_tokens': 7132, 'input_tokens': 6, 'output_tokens': 86}                                                                                                                          
{'cache_creation_input_tokens': 329, 'cache_read_input_tokens': 7630, 'input_tokens': 6, 'output_tokens': 85}                                                                                                                          
{'cache_creation_input_tokens': 472, 'cache_read_input_tokens': 7959, 'input_tokens': 6, 'output_tokens': 85}                                                                                                                          
{'cache_creation_input_tokens': 577, 'cache_read_input_tokens': 8431, 'input_tokens': 6, 'output_tokens': 84}                                                                                                                          
{'cache_creation_input_tokens': 289, 'cache_read_input_tokens': 9008, 'input_tokens': 6, 'output_tokens': 435}                                                                                                                         
{'cache_creation_input_tokens': 472, 'cache_read_input_tokens': 9297, 'input_tokens': 6, 'output_tokens': 401}  

I'm planning on fixing this in drorm/vmpilot#27. Subscribe if you want to keep track. I expect an improvement of another 20%-40%, but won't know for sure till I've implemented it.

I didn't submit this as a pull request because this needs much more work and testing for a generalized solution. For my purpose, this works fine.

@0ca

0ca commented Mar 27, 2025

@Nilesh3105

Nilesh3105 commented Mar 28, 2025

@0ca this was working before as well, but it's great to see it being added to the official docs as an example.

This issue is actually focused on making it work with ChatPromptTemplate. As I mentioned above, ChatPromptTemplate currently only looks at the type and text fields, ignores additional fields like cache_control, and recreates the message again.

Right now, to work around this, we manually call format_messages first and then add cache_control back to the output messages.
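
A minimal sketch of that workaround (assuming a system + human template where the system message comes first, and with model and LONG_CONTEXT as placeholder names; untested as written):

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [("system", "{context}"), ("human", "{query}")]
)

def format_with_cache_control(**values):
    messages = prompt.format_messages(**values)
    system_msg = messages[0]
    # Re-wrap the plain string content as a content block carrying cache_control.
    system_msg.content = [
        {
            "type": "text",
            "text": system_msg.content,
            "cache_control": {"type": "ephemeral"},
        }
    ]
    return messages

# model is a ChatAnthropic instance; LONG_CONTEXT is the large, stable context string.
response = model.invoke(format_with_cache_control(context=LONG_CONTEXT, query="..."))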

Nilesh3105 pushed a commit to Nilesh3105/langchain that referenced this issue Mar 28, 2025
…Template

The ChatPromptTemplate previously did not preserve additional fields like
`cache_control` when formatting messages, which prevented using Anthropic's
prompt caching feature. This change:

- Introduces `_PromptBlockWrapper` class to preserve additional fields in content blocks
- Modifies message template formatting to maintain field structure
- Preserves arbitrary fields (e.g. cache_control) in both text and image content
- Works with async operations and partial variables
- Adds comprehensive test coverage for field preservation

Fixes langchain-ai#26701
@drorm

drorm commented Mar 28, 2025

This is all working correctly for me now, but I had to hack the code. Here's a log demonstrating it:

INFO:     157.131.22.45:0 - "POST /chat/completions HTTP/1.1" 200 OK
2025-03-27 19:04:35,385 - vmpilot.chat - INFO - Changed to project directory: /home/dror/vmpilot
2025-03-27 19:04:35,388 - vmpilot.exchange - INFO - New exchange started for chat_id: 5384alGfieQq
2025-03-27 19:04:35,402 - vmpilot.agent - INFO - Started new chat session with thread_id: 5384alGfieQq
2025-03-27 19:04:38,571 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 2634, 'cache_read_input_tokens': 0, 'input_tokens': 4, 'output_tokens': 107}
2025-03-27 19:04:40,969 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 398, 'cache_read_input_tokens': 2634, 'input_tokens': 6, 'output_tokens': 144}
2025-03-27 19:04:43,338 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 861, 'cache_read_input_tokens': 3032, 'input_tokens': 6, 'output_tokens': 108}
2025-03-27 19:04:45,100 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 534, 'cache_read_input_tokens': 3893, 'input_tokens': 6, 'output_tokens': 103}
2025-03-27 19:04:47,716 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 2335, 'cache_read_input_tokens': 4427, 'input_tokens': 6, 'output_tokens': 135}
2025-03-27 19:04:50,531 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 419, 'cache_read_input_tokens': 6762, 'input_tokens': 6, 'output_tokens': 122}
2025-03-27 19:04:53,134 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 435, 'cache_read_input_tokens': 7181, 'input_tokens': 6, 'output_tokens': 133}
2025-03-27 19:04:55,857 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 1018, 'cache_read_input_tokens': 7616, 'input_tokens': 6, 'output_tokens': 138}
2025-03-27 19:05:07,889 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 325, 'cache_read_input_tokens': 8634, 'input_tokens': 6, 'output_tokens': 1086}
2025-03-27 19:05:14,317 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 1126, 'cache_read_input_tokens': 8959, 'input_tokens': 6, 'output_tokens': 431}
2025-03-27 19:05:18,895 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 472, 'cache_read_input_tokens': 10085, 'input_tokens': 6, 'output_tokens': 347}
2025-03-27 19:05:22,480 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 388, 'cache_read_input_tokens': 10557, 'input_tokens': 6, 'output_tokens': 123}
2025-03-27 19:05:26,969 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 344, 'cache_read_input_tokens': 10945, 'input_tokens': 6, 'output_tokens': 122}
2025-03-27 19:05:32,816 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 249, 'cache_read_input_tokens': 11289, 'input_tokens': 6, 'output_tokens': 300}
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
vmpilot2.anthropic
vmpilot2.anthropic
INFO:     157.131.22.45:0 - "POST /chat/completions HTTP/1.1" 200 OK
2025-03-27 19:07:03,426 - vmpilot.exchange - INFO - New exchange started for chat_id: 5384alGfieQq
2025-03-27 19:07:03,438 - vmpilot.agent - INFO - Retrieved previous conversation state with 28 messages for thread_id: 5384alGfieQq
2025-03-27 19:07:07,353 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 325, 'cache_read_input_tokens': 11538, 'input_tokens': 4, 'output_tokens': 172}
2025-03-27 19:07:10,519 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 293, 'cache_read_input_tokens': 11863, 'input_tokens': 6, 'output_tokens': 131}
2025-03-27 19:07:13,566 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 1010, 'cache_read_input_tokens': 12156, 'input_tokens': 6, 'output_tokens': 132}
2025-03-27 19:07:16,565 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 653, 'cache_read_input_tokens': 13166, 'input_tokens': 6, 'output_tokens': 128}
2025-03-27 19:07:19,667 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 441, 'cache_read_input_tokens': 13819, 'input_tokens': 6, 'output_tokens': 134}
2025-03-27 19:07:22,972 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 548, 'cache_read_input_tokens': 14260, 'input_tokens': 6, 'output_tokens': 118}
2025-03-27 19:07:25,680 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 171, 'cache_read_input_tokens': 14808, 'input_tokens': 6, 'output_tokens': 114}
2025-03-27 19:07:29,306 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 347, 'cache_read_input_tokens': 14979, 'input_tokens': 6, 'output_tokens': 152}
2025-03-27 19:07:38,938 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 434, 'cache_read_input_tokens': 15326, 'input_tokens': 6, 'output_tokens': 574}
2025-03-27 19:07:42,775 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 615, 'cache_read_input_tokens': 15760, 'input_tokens': 6, 'output_tokens': 179}
2025-03-27 19:07:45,667 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 193, 'cache_read_input_tokens': 16375, 'input_tokens': 6, 'output_tokens': 110}
2025-03-27 19:07:49,393 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 288, 'cache_read_input_tokens': 16568, 'input_tokens': 6, 'output_tokens': 170}
2025-03-27 19:07:54,977 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 211, 'cache_read_input_tokens': 16856, 'input_tokens': 6, 'output_tokens': 261}
2025-03-27 19:08:02,024 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 302, 'cache_read_input_tokens': 17067, 'input_tokens': 6, 'output_tokens': 440}
INFO:     157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK

Notice how the second exchange continues with 'cache_read_input_tokens': 11538, which is the sum of the cache creation and cache read tokens (249 + 11289) from the last message of the previous exchange.

What I did:
I created my own version of
https://github.com/langchain-ai/langchain/blob/master/libs/partners/anthropic/langchain_anthropic/chat_models.py
in
https://github.com/drorm/vmpilot/blob/main/src/vmpilot/caching/chat_models.py
and then in
https://github.com/drorm/vmpilot/blob/main/src/vmpilot/
look at agent.py and vmpilot.py.
You can see how I set ephemeral.
(The code is a bit messy right now. Claude and I are refactoring it :-)).

I've gone up to 50K cached tokens, but starting around 20K - 30K the quality starts degrading and the speed becomes painful and I start seeing timeouts.

@glenilame21

@ccurme
Collaborator

ccurme commented Apr 23, 2025

This should be closed by #30967 upon release. Here's an example:

Define prompt:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate(
    [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a technology expert.",
                },
                {
                    "type": "text",
                    "text": "{context}",
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "{query}",
        },
    ]
)

Usage:

import requests
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-7-sonnet-20250219")

# Pull LangChain readme
get_response = requests.get(
    "https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md"
)
readme = get_response.text

chain = prompt | llm

response_1 = chain.invoke(
    {
        "context": readme,
        "query": "What's LangChain, according to its README?",
    }
)
response_2 = chain.invoke(
    {
        "context": readme,
        "query": "Extract a link to the LangChain tutorials.",
    }
)

usage_1 = response_1.usage_metadata["input_token_details"]
usage_2 = response_2.usage_metadata["input_token_details"]

print(f"First invocation:\n{usage_1}")
print(f"\nSecond:\n{usage_2}")

# First invocation:
# {'cache_read': 0, 'cache_creation': 1519}

# Second:
# {'cache_read': 1519, 'cache_creation': 0}
