Draft: Add caching to MLX model #999

Open
wants to merge 2 commits into main

Conversation

@kilianyp commented Mar 16, 2025

Run without caching: [screenshot]
Run with caching: [screenshot]

Also fixes a bug with found_stop_sequence

EDIT: I just noticed that sum(token) with caching doesn't match the run without caching ⚠️ I'm not sure why; my first attempt was to cache based on the number of messages, but that led to different behaviour. Is there a unit test I can use?

There's another bug when multiple prompts are passed to the agent.
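For reference, a minimal diagnostic along those lines (not an existing smolagents test; the model id below is just an example, and any small MLX model should do): run the same two-turn transcript with greedy decoding, once re-encoding the full transcript each turn and once feeding only the new turn into a prompt cache, then compare the outputs.

```python
import mlx_lm
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")
turns = ["What is 2 + 2?", "Multiply that by 10."]

def run(use_cache: bool) -> list[str]:
    cache = make_prompt_cache(model) if use_cache else None
    messages, outputs = [], []
    for user in turns:
        messages.append({"role": "user", "content": user})
        if use_cache:
            # Feed only the new turn; earlier turns are already in the cache.
            prompt = tokenizer.apply_chat_template(messages[-1:], add_generation_prompt=True)
        else:
            # Re-encode the whole transcript every turn.
            prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        text = mlx_lm.generate(model, tokenizer, prompt, max_tokens=64, prompt_cache=cache)
        outputs.append(text)
        messages.append({"role": "assistant", "content": text})
    return outputs

print(run(use_cache=False))
print(run(use_cache=True))  # any divergence here reproduces the mismatch
```

Since decoding is greedy by default, the two runs should agree token for token; where they diverge is where the cached state and the re-encoded transcript disagree.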

@kilianyp changed the title from "Add caching to MLX model" to "Draft: Add caching to MLX model" on Mar 16, 2025
@kilianyp (Author) commented Mar 17, 2025

I think the integration here doesn't really make sense. How should caching be handled? Purely on the model side? @g-eoj, since you added the original implementation, do you have an idea?

@g-eoj (Contributor) commented Mar 17, 2025

@kilianyp the way you implemented caching makes sense to me. I don't know what sum(token) refers to; can you describe the issue in more detail?

@g-eoj (Contributor) commented Mar 17, 2025

I get unexpected output when a prompt cache is used. I don't understand the bug yet but it is clear that my output is different (and wrong) when using prompt cache vs. no prompt cache.

My setup has multiple agents using the same model. It looks like the cache reuses the first context it was given for a model, even if the context changes. I'm not sure, but it looks like the wrong output I get is always caused by the first agent's context.

@g-eoj (Contributor) commented Mar 17, 2025

If I modify smolagents so I can pass the cache at the agent level, I don't see errors:

from smolagents import CodeAgent
import mlx_lm.models.cache

# prompt_cache is a locally patched CodeAgent argument, not part of smolagents.
agent = CodeAgent(
    model=model,
    tools=[],
    prompt_cache=mlx_lm.models.cache.make_prompt_cache(model.model),
)

Basically each agent needs its own cache. I'm not sure if it'll address the issues seen "when multiple prompts are passed to the agent". @kilianyp if you want to paste a repro here, I'll take a look at it.
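For example, two agents sharing the same underlying model but each holding its own cache (again, prompt_cache is the hypothetical keyword from the local patch above, and the model id is just an example):

```python
from smolagents import CodeAgent, MLXModel
from mlx_lm.models.cache import make_prompt_cache

model = MLXModel(model_id="mlx-community/Qwen2.5-Coder-32B-Instruct-4bit")

# Each agent gets its own KV cache, so one agent's context never leaks into
# the other's generations even though the weights are shared.
agent_a = CodeAgent(model=model, tools=[], prompt_cache=make_prompt_cache(model.model))
agent_b = CodeAgent(model=model, tools=[], prompt_cache=make_prompt_cache(model.model))
```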

@kilianyp (Author) commented Apr 7, 2025

@g-eoj thanks for investigating 🙏 I wonder what the general strategy for caching here is; I don't see any references in the code base. @aymeric-roucher could you comment on how KV caching should be integrated?
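For reference, the model-side pattern mlx_lm uses in its own chat loop looks roughly like this (class name and model handling are illustrative, not smolagents code): keep one cache per conversation and template only the new user turn, since the model's previous replies are already in the cache from generation.

```python
import mlx_lm
from mlx_lm.models.cache import make_prompt_cache

class CachedMLXChat:
    """Illustrative only: one prompt cache per conversation."""

    def __init__(self, model_id: str):
        self.model, self.tokenizer = mlx_lm.load(model_id)
        self.cache = make_prompt_cache(self.model)

    def ask(self, user_message: str, max_tokens: int = 512) -> str:
        # Template only the new user turn; earlier turns (including the
        # model's own replies) are already represented in the cache.
        prompt = self.tokenizer.apply_chat_template(
            [{"role": "user", "content": user_message}],
            add_generation_prompt=True,
        )
        return mlx_lm.generate(
            self.model, self.tokenizer, prompt,
            max_tokens=max_tokens, prompt_cache=self.cache,
        )
```

The open question for smolagents is that the agent rebuilds the full message list every step, so a model-side cache would also need to detect whether the new prompt actually extends what is already cached; that is probably where the mismatches above come from.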
