-
Notifications
You must be signed in to change notification settings - Fork 230
Implement Structured Chat Messages #257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Structured Chat Messages #257
Conversation
Libraries/MLXLMCommon/Chat.swift
Outdated
public let images: [UserInput.Image] | ||
|
||
/// Array of video data associated with the message. | ||
public let videos: [UserInput.Video] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought -- this seems like a better structure than UserInput
(which is too flat). I think UserInput
was largely a reflection of what mlx-vlm
(python) had for the command line tool, and while it was OK for that it is too simplistic.
We don't have to solve it all in one PR, but if we adopt this I think we should start deprecating pieces in UserInput that overlap with this. Possibly we can cover these values (perhaps referring to the first or last message), but we shouldn't have two ways of representing images in a conversation.
At the very least, I think part of this PR should mark the images
etc. properties in UserInput
as deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also be a good idea to try and integrate this into the examples (specifically as a multi-turn chat) -- it will help us make sure this covers the use cases. Again, that doesn't have to happen in this PR, I think we can iterate on it a bit before we cut tags for a release.
b70781e
to
9cf39e7
Compare
@ibrahimcetin sorry to take so long to get back to this! I have some suggested changes based on what I was saying / we were discussing about the structure of UserInput -- helping it convert fully to Chat.Message. How would you like to proceed?
I think the first option is easiest and will let us land this in a more complete state, but please let me know what you think! |
@davidkoski That works for me too, please feel free to proceed in the way that’s easiest for you. I'm also sorry for the delay. I didn’t expect it to take this long. Starting this weekend I’ll be available again and I’m working on a project that uses these changes. Thanks a lot for your patience and for following up! |
I was thinking that we should somehow force the prompt to be in the new format, but after playing with that I think that isn't quite right. I think what you have right now is the right approach. We have:
Getting rid of the raw prompt isn't right, but we can convert the examples to use the structured prompt -- it is easier to read and portable across models. |
Yes, I agree with you and I believe these changes are mostly ready. We should also add some unit tests to ensure everything works as expected. I also think the examples should be updated, so I’d suggest handling that in a separate PR. Let me know if you have any other thoughts or suggestions. |
I am concerned that the UserInput.images and prompt can get out of sync -- I think we can make that work using the same thing you did in the init for the Chat (that is what I am playing with now). |
Trying to set up tests but the way the project is set up is making it difficult. I think we want tests to be define in the Package.swift, but those are not callable from the enclosing xcodeproj. I tried making a shadow unit test target that referenced the same files but it is unable to link against the local Package.swift (e.g. MLXLMCommon). |
OK, got the tests building -- just something trivial to start with. |
/// public func prepare(input: UserInput) async throws -> LMInput { | ||
/// let messages = Qwen2VLMessageGenerator().generate(from: input) | ||
/// ... | ||
/// ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some documentation
/// | ||
/// If the ``prompt-swift.property`` is a ``Prompt-swift.enum/chat(_:)`` this will | ||
/// collect the images from the chat messages, otherwise these are the stored images with the ``UserInput``. | ||
public var images: [Image] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ibrahimcetin I made two changes of note, see what you think of them. The first is to make images
and video
reflect the chat messages OR store the values. This way you can mutate the prompt and these properties will still contain the right contents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The computed property compute images every time. To avoid this, we may use didSet
on prompt
property like this:
public var prompt: Prompt {
didSet {
switch prompt {
case .text, .messages:
break
case .chat(let messages):
_images = messages.reduce(into: []) { result, message in
result.append(contentsOf: message.images)
}
_videos = messages.reduce(into: []) { result, message in
result.append(contentsOf: message.videos)
}
}
}
}
public var images: [Image] {
get {
_images
}
set {
switch prompt {
case .text, .messages:
_images = newValue
case .chat:
break
}
}
}
public var videos: [Video] {
get {
_videos
}
set {
switch prompt {
case .text, .messages:
_videos = newValue
case .chat:
break
}
}
}
And we can update the computed properties like this.
Even, maybe we can just use the didSet
and keep the images
and videos
as they were.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a thought. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, good idea -- I think that would simplify things
I am going to rebase on main and then write some tests |
01b974f
to
5f439e9
Compare
@ibrahimcetin I really like what you have built! Take a look at the couple of changes I made and see what you think. I also added tests and some documentation. I think in a follow-on PR(s) we can:
and then perhaps:
|
@davidkoski Thank you for these changes, and most of the changes look good to me. Please review my comments on the code. I agree with you on following-on PRs and I can update the examples as soon as this PR is merged. I also consider to update ToolSpec similar to the ollama-swift tool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome addition, thank you!
I didn't see any new comments -- anything left before we merge? |
@@ -221,7 +221,7 @@ public class SmolVLMProcessor: UserInputProcessor { | |||
} | |||
|
|||
public func prepare(input: UserInput) async throws -> LMInput { | |||
let messages = input.prompt.asMessages() | |||
let messages = Qwen2VLMessageGenerator().generate(from: input) // TODO: Create SmolVLM2MessageGenerator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I temporarily used Qwen2VLMessageGenerator
, but I’m not certain if this is correct. Therefore, we might need to consider adding SmolVLM2MessageGenerator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it is roughly equivalent to what we had before with the raw messages. Perhaps we address it with the followup:
/// | ||
/// If the ``prompt-swift.property`` is a ``Prompt-swift.enum/chat(_:)`` this will | ||
/// collect the images from the chat messages, otherwise these are the stored images with the ``UserInput``. | ||
public var images: [Image] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The computed property compute images every time. To avoid this, we may use didSet
on prompt
property like this:
public var prompt: Prompt {
didSet {
switch prompt {
case .text, .messages:
break
case .chat(let messages):
_images = messages.reduce(into: []) { result, message in
result.append(contentsOf: message.images)
}
_videos = messages.reduce(into: []) { result, message in
result.append(contentsOf: message.videos)
}
}
}
}
public var images: [Image] {
get {
_images
}
set {
switch prompt {
case .text, .messages:
_images = newValue
case .chat:
break
}
}
}
public var videos: [Video] {
get {
_videos
}
set {
switch prompt {
case .text, .messages:
_videos = newValue
case .chat:
break
}
}
}
And we can update the computed properties like this.
Even, maybe we can just use the didSet
and keep the images
and videos
as they were.
My mistake, sorry. Now you are able to see. |
Good feedback! I made the changes to the images/videos properties as you suggested -- ready to merge? |
Everything looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent addition, thank you!
This PR introduces structured chat messages, addressing issue #234.
I initially made these changes a few days ago but held off on opening a PR because I wanted to create an example app to test them. However, I haven't had time recently, so I’m submitting this now to continue the discussion.
For the implementation, I chose to add a new
chat
case to thePrompt
enum, as I believe it is the simplest and most compatible approach.I plan to create a project to demonstrate these changes when I have time.
Looking forward to feedback!