Draft: Add caching to MLX model #999

Open
wants to merge 2 commits into main

Conversation

@kilianyp commented Mar 16, 2025

Run without caching: [screenshot]
Run with caching: [screenshot]

Also fixes a bug with found_stop_sequence

EDIT: I just noticed that sum(token) with caching doesn't match the run without caching ⚠️ I'm not sure why; my first attempt was to cache based on the number of messages, but that led to different behaviour. Is there a unit test I can use?

There's another bug when multiple prompts are passed to the agent.
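For reference, a minimal diagnostic along those lines (not an existing smolagents test; the model id below is just an example, and any small MLX model should do): run the same two-turn transcript with greedy decoding, once re-encoding the full transcript each turn and once feeding only the new turn into a prompt cache, then compare the outputs.

```python
import mlx_lm
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")
turns = ["What is 2 + 2?", "Multiply that by 10."]

def run(use_cache: bool) -> list[str]:
    cache = make_prompt_cache(model) if use_cache else None
    messages, outputs = [], []
    for user in turns:
        messages.append({"role": "user", "content": user})
        if use_cache:
            # Feed only the new turn; earlier turns are already in the cache.
            prompt = tokenizer.apply_chat_template(messages[-1:], add_generation_prompt=True)
        else:
            # Re-encode the whole transcript every turn.
            prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        text = mlx_lm.generate(model, tokenizer, prompt, max_tokens=64, prompt_cache=cache)
        outputs.append(text)
        messages.append({"role": "assistant", "content": text})
    return outputs

print(run(use_cache=False))
print(run(use_cache=True))  # any divergence here reproduces the mismatch
```

Since decoding is greedy by default, the two runs should agree token for token; where they diverge is where the cached state and the re-encoded transcript disagree.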

@kilianyp changed the title from "Add caching to MLX model" to "Draft: Add caching to MLX model" on Mar 16, 2025
@kilianyp (Author) commented Mar 17, 2025

I think the integration here doesn't really make sense. How should caching be handled? Purely on the model side? @g-eoj, since you added the original implementation, do you have an idea?

@g-eoj (Contributor) commented Mar 17, 2025

@kilianyp the way you implemented caching makes sense to me. I don't know what sum(token) refers to; can you describe the issue in more detail?

@g-eoj (Contributor) commented Mar 17, 2025

I get unexpected output when a prompt cache is used. I don't understand the bug yet but it is clear that my output is different (and wrong) when using prompt cache vs. no prompt cache.

My setup has multiple agents using the same model. It looks like the cache reuses the first context it was given for a model, even if the context changes. I'm not sure, but it looks like the wrong output I get is always caused by the first agent's context.

@g-eoj (Contributor) commented Mar 17, 2025

If I modify smolagents so I can pass the cache at the agent level, I don't see errors:

from smolagents import CodeAgent
import mlx_lm.models.cache

# prompt_cache is a locally patched CodeAgent argument, not part of smolagents.
agent = CodeAgent(
    model=model,
    tools=[],
    prompt_cache=mlx_lm.models.cache.make_prompt_cache(model.model),
)

Basically each agent needs its own cache. I'm not sure if it'll address the issues seen "when multiple prompts are passed to the agent". @kilianyp if you want to paste a repro here, I'll take a look at it.
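For example, two agents sharing the same underlying model but each holding its own cache (again, prompt_cache is the hypothetical keyword from the local patch above, and the model id is just an example):

```python
from smolagents import CodeAgent, MLXModel
from mlx_lm.models.cache import make_prompt_cache

model = MLXModel(model_id="mlx-community/Qwen2.5-Coder-32B-Instruct-4bit")

# Each agent gets its own KV cache, so one agent's context never leaks into
# the other's generations even though the weights are shared.
agent_a = CodeAgent(model=model, tools=[], prompt_cache=make_prompt_cache(model.model))
agent_b = CodeAgent(model=model, tools=[], prompt_cache=make_prompt_cache(model.model))
```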

@kilianyp (Author) commented Apr 7, 2025

@g-eoj thanks for investigating 🙏 I wonder what the general strategy for caching here is; I don't see any references in the code base. @aymeric-roucher could you comment on how KV caching should be integrated?
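For reference, the model-side pattern mlx_lm uses in its own chat loop looks roughly like this (class name and model handling are illustrative, not smolagents code): keep one cache per conversation and template only the new user turn, since the model's previous replies are already in the cache from generation.

```python
import mlx_lm
from mlx_lm.models.cache import make_prompt_cache

class CachedMLXChat:
    """Illustrative only: one prompt cache per conversation."""

    def __init__(self, model_id: str):
        self.model, self.tokenizer = mlx_lm.load(model_id)
        self.cache = make_prompt_cache(self.model)

    def ask(self, user_message: str, max_tokens: int = 512) -> str:
        # Template only the new user turn; earlier turns (including the
        # model's own replies) are already represented in the cache.
        prompt = self.tokenizer.apply_chat_template(
            [{"role": "user", "content": user_message}],
            add_generation_prompt=True,
        )
        return mlx_lm.generate(
            self.model, self.tokenizer, prompt,
            max_tokens=max_tokens, prompt_cache=self.cache,
        )
```

The open question for smolagents is that the agent rebuilds the full message list every step, so a model-side cache would also need to detect whether the new prompt actually extends what is already cached; that is probably where the mismatches above come from.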
