chat example (command line) #277

Open · wants to merge 2 commits into main

Conversation

@davidkoski (Collaborator) commented Apr 21, 2025

@ibrahimcetin here is the command line chat

./mlx-run llm-tool chat --model mlx-community/Qwen2-VL-2B-Instruct-4bit --image /Users/dkoski/Desktop/IMG_0912.jpeg --resize 512 --system "You are a helpful assistant who answers questions in English."

> what animal is in the picture?
The animal in the picture is a dog.

> what is behind the dog?
Behind the dog is a Christmas tree.

> what else?
Yes, there is also a fireplace with a Christmas tree and decorations behind the dog.

queries = rotaryEmbedding(queries, offset: offset)
keys = rotaryEmbedding(keys, offset: offset)

if let cache {
    (keys, values) = cache.update(keys: keys, values: values)
}

let mask = mask?[.ellipsis, 0 ..< keys.dim(-2)]

I found that the dimensions were mismatched on the second message -- the keys dimension needs to be taken after the KVCache update. In mlx-vlm it works out ok because:

  1. it looks like the KVCache isn't used persistently
  2. the KVCache implementation doesn't window on the cache
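A minimal sketch of the point, with a hypothetical helper (not the PR's actual attention code): once the cache concatenates prior keys, the key sequence length exceeds the incoming query length, so the mask has to be sliced against the post-cache key dimension, not the new-token count.

```swift
import MLX

// Sketch only -- hypothetical helper illustrating the ordering issue.
// After the cache update, `keys` includes all prior tokens, so the mask
// must be sliced to keys.dim(-2); slicing before the update would use
// the wrong (shorter) length on the second message.
func applyCache(
    keys: MLXArray, values: MLXArray,
    cached: (keys: MLXArray, values: MLXArray)?, mask: MLXArray?
) -> (keys: MLXArray, values: MLXArray, mask: MLXArray?) {
    var keys = keys
    var values = values
    if let cached {
        // concatenate along the sequence axis
        keys = concatenated([cached.keys, keys], axis: -2)
        values = concatenated([cached.values, values], axis: -2)
    }
    return (keys, values, mask?[.ellipsis, 0 ..< keys.dim(-2)])
}
```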

/// Command line arguments for controlling generation of text.
struct GenerateArguments: ParsableArguments, Sendable {

struct PromptArguments: ParsableArguments, Sendable {

I split this out because chat doesn't need a prompt.
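With swift-argument-parser this kind of split composes via `@OptionGroup`. A hedged sketch with stub fields (the real structs carry many more options):

```swift
import ArgumentParser

// Stubs for illustration -- not the actual option lists.
struct GenerateArguments: ParsableArguments, Sendable {
    @Option var system: String = ""
}
struct PromptArguments: ParsableArguments, Sendable {
    @Option var prompt: String = ""
}

// eval-style commands compose both groups...
struct EvalCommand: ParsableCommand {
    @OptionGroup var generate: GenerateArguments
    @OptionGroup var prompt: PromptArguments
    func run() throws { /* one-shot generation */ }
}

// ...while chat omits PromptArguments and reads turns interactively.
struct ChatCommand: ParsableCommand {
    @OptionGroup var generate: GenerateArguments
    func run() throws { /* interactive loop */ }
}
```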

size = CGSize(width: v0, height: v1)
}
userInput.processing.resize = size
}

The processing part is the only interesting piece, I think -- this should probably go into reusable arguments anyway
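For context, a sketch of what a reusable resize argument might parse into (`parseResize` is a hypothetical helper; the real parsing lives in the PR's argument structs):

```swift
import CoreGraphics

// Hypothetical helper: "--resize 512" -> 512x512, "--resize 512,384" -> 512x384.
func parseResize(_ spec: String) -> CGSize? {
    let parts = spec.split(separator: ",").compactMap { Int($0) }
    switch parts.count {
    case 1: return CGSize(width: parts[0], height: parts[0])
    case 2: return CGSize(width: parts[0], height: parts[1])
    default: return nil
    }
}
```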

let modelContainer = try await memory.start { [args] in
    do {
        return try await args.load(
            defaultModel: defaultModel.name, modelFactory: LLMModelFactory.shared)

So how do we know VLM vs. LLM? We could probe the registry, but what if the id isn't registered? We can't rely on an image in the parameters because the user may load one at chat time.
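One possible shape for the probe, as a sketch only -- the set of known VLM ids is passed in here precisely because the registry lookup is the open question, and the sketch exposes the failure mode described above (an unregistered VLM id falls through to the LLM factory):

```swift
import MLXLLM
import MLXLMCommon
import MLXVLM

// Hypothetical: check against known VLM ids, fall back to the LLM factory.
// A VLM whose id isn't registered would be loaded with the wrong factory.
func chooseFactory(modelId: String, knownVLMIds: Set<String>) -> ModelFactory {
    knownVLMIds.contains(modelId) ? VLMModelFactory.shared : LLMModelFactory.shared
}
```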


// TODO: need to figure out the proper ownership for this -- maybe the loop
// below needs to go inside the context?
var cache: [KVCache]?

This still needs work, but I think a chat application should show how to use a KVCache
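The natural loop shape looks something like this sketch (names follow MLXLMCommon, but treat the exact signatures as assumptions; eos/stop handling and streaming output are omitted). The key point is creating the cache once, outside the loop, so each turn only processes its own new tokens:

```swift
import MLXLMCommon

// Sketch of a chat loop with a persistent KVCache.
func chatLoop(context: ModelContext, parameters: GenerateParameters) async throws {
    // one cache for the whole conversation; prior keys/values are reused
    let cache = context.model.newCache(parameters: parameters)
    while true {
        print("> ", terminator: "")
        guard let line = readLine(), !line.isEmpty else { break }
        let input = try await context.processor.prepare(input: UserInput(prompt: line))
        // TokenIterator gains a cache parameter in this PR
        var iterator = try TokenIterator(
            input: input, model: context.model, cache: cache,
            parameters: parameters)
        var tokens: [Int] = []
        while let token = iterator.next() { tokens.append(token) }
        print(context.tokenizer.decode(tokens: tokens))
    }
}
```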

if chat.count >= 2 {
    chatWithMedia[1].images = images
    chatWithMedia[1].videos = videos
}

@ibrahimcetin

Still working on this -- it works great with a single set of media, but the model / image processing code doesn't expect two messages with independent sets of media.

On the swift side:

public struct LMInput {
    public let text: Text
    public let image: ProcessedImage?
    public let video: ProcessedVideo?

Tokens + one set of media. The python code in mlx-vlm has a similar setup. I think maybe that should be:

public struct LMInput {
    public let text: Text
    public let image: [ProcessedImage]
    public let video: [ProcessedVideo]

and then the model can inject the appropriate piece.

The Hashable conformance in the UserInput was me thinking that perhaps we should cache the ProcessedImage / Video so it doesn't need to be reprocessed.

Another option is to use the KVCache -- it will capture the image embedding and we could then omit the image from subsequent generations. Except we need to represent it space-wise so the size of the context remains consistent (we can't just drop the image).

FWIW @Blaizzy mlx-vlm's chat implementation has a problem here I think:

The code only presents the image for the first message and then discards it. It doesn't use a KVCache so it actually "forgets" what the image is, though the model is pretty good at faking it:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
what animal is in the picture.  answer in english<|im_end|>
<|im_start|>assistant
The animal in the picture is a dog.<|im_end|>
<|im_start|>user
what is the dog wearing?<|im_end|>
<|im_start|>assistant
The dog is wearing a collar.<|im_end|>
<|im_start|>user
what is behind the dog?<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant
There is a person behind the dog.<|im_end|>

There are maybe two problems here:

  • dropping the image data
  • the fact that the image is attached to the last message -- I found that if the image data is present it needs to be associated with the first message that refers to it

Anyway, with just a single image I think this is pretty close. It needs to be cleaned up, but it works. Maybe we only deal with one image set in the thread for now.


@ibrahimcetin an example of the multi-image in chat issue:

[screenshot]

so perhaps we document that as a known issue (or disable the ability to add multiple media sets) until we figure it out. I don't think it needs to block the chat examples.

// TODO: figure out ownership here
if cache == nil {
    cache = context.model.newCache(parameters: generate.generateParameters)
}

The cache is mutable and I am playing fast and loose with ownership here. To be fixed.

// it should be a parameter to the higher level generate?
var iterator = try TokenIterator(
    input: input, model: context.model, cache: cache,
    parameters: generate.generateParameters)

I don't care about the iterator -- I think the generate() call should probably take an optional cache.

@davidkoski requested a review from awni on April 23, 2025
@@ -729,10 +730,10 @@ public func generate(
 /// }
 /// ```
 public func generate(
-    input: LMInput, parameters: GenerateParameters, context: ModelContext
+    input: LMInput, cache: [KVCache]? = nil, parameters: GenerateParameters, context: ModelContext

Easily pass the optional cache in to the iterator
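In sketch form (a hypothetical `generateSketch` standing in for the real `generate()`, which also drives the token loop and reporting), the point is the `?? newCache` fallback: a nil cache preserves the pre-PR single-shot behavior.

```swift
import MLXLMCommon

// Sketch only -- shows how the optional cache threads into the iterator.
public func generateSketch(
    input: LMInput, cache: [KVCache]? = nil,
    parameters: GenerateParameters, context: ModelContext
) throws -> TokenIterator {
    try TokenIterator(
        input: input, model: context.model,
        // nil -> fresh cache, matching the behavior before this PR
        cache: cache ?? context.model.newCache(parameters: parameters),
        parameters: parameters)
}
```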

@@ -97,4 +97,7 @@ public class KVCacheSimple: KVCache, Evaluatable {
         )
     }
 
+    public var debugDescription: String {
+        "\(String(describing: Self.self)) \(Unmanaged.passUnretained(self).toOpaque()), offset: \(offset), step: \(step), keys: \(keys?.shape.description ?? "-"), values: \(values?.shape.description ?? "-")"

Just a convenience I needed while debugging Qwen2

processing: media.processing,
images: media.images, videos: media.videos,
chat: [.system(generate.system)],
cache: context.model.newCache(parameters: parameters))

We may want a follow up PR for #276 to add the KVCache, but without the change to generate() it is hard to use.

}

/// Argument package for supplying media files
struct MediaArguments: ParsableArguments, Sendable {

Also split this out because we use it in eval and chat
