Skip to content

Structured Chat Messages #234

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ibrahimcetin opened this issue Mar 10, 2025 · 12 comments
Closed

Structured Chat Messages #234

ibrahimcetin opened this issue Mar 10, 2025 · 12 comments

Comments

@ibrahimcetin
Copy link
Contributor

I recently investigating ollama-swift package and I like its chat handling. Here is a part of it:

messages: [
    .system("You are a helpful assistant."),
    .user("In which city is Apple Inc. located?")
]

I think, we can create an abstraction similar to this on the current implementation.

Example usage:

let messages: [Qwen2VLChat.Message] = [
    .system("You are a helpful assistant."),
    .user("In which city is Apple Inc. located?"),
    .assistant("Cupertino, San Francisco.")
]

Then we can convert this messages to UserInput to generate.

I am not an expert and I am still learning but I assume that every model architecture have its own message template. For example, for Qwen2VL requires this:

[
    "role": role.rawValue,
    "content": [
        ["type": "text", "text": content]
    ]
        // Messages format for Qwen 2 VL, Qwen 2.5 VL. May need to be adapted for other models.
        + images.map { _ in
            ["type": "image"]
        }
        + videos.map { _ in
            ["type": "video"]
        },
]

If I am wrong please let me know.

I will continue based on this information from now. So, unlike ollama-swift implementation, we should create a protocol called Message and this protocol will have role, content, images, videos and maybe tool. And we can define a method like toRawMessage to convert it to [String: Any] format. The key is toRawMessage method. We will implement this method and it converts the messages.

Here is a basic definition, just for giving an idea:

enum Chat {
    public protocol Message {
        /// The role of the message sender.
        var role: Chat.MessageRole { get }

        /// The content of the message.
        var content: String { get }

        /// Array of image data associated with the message.
        var images: [UserInput.Image] { get }

        /// Array of video data associated with the message.
        var videos: [UserInput.Video] { get }

        static func system(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self

        static func assistant(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self

        static func user(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self

        func toRawMessage() -> [String: Any]
    }

    public enum MessageRole: String {
        case user
        case assistant
        case system
    }
}

extension UserInput {
    init(messages: [Chat.Message]) {
        var encoded: [[String: Any]] = []
        var images: [UserInput.Image] = []
        var videos: [UserInput.Video] = []

        for message in messages {
            encoded.append(message.encode())
            images += message.images
            videos += message.videos
        }

        self.init(messages: encoded, images: images, videos: videos)
    }
}

Here, as you can see, we can create UserInput from chat messages.

Finally, this is the implementation:

public enum Qwen2VLChat {
    public struct Message: Chat.Message {
        /* properties and static methods */

        func toRawMessage() -> [String : Any] {
            [
                "role": role.rawValue,
                "content": [
                    ["type": "text", "text": content]
                ]
                    // Messages format for Qwen 2 VL, Qwen 2.5 VL. May need to be adapted for other models.
                    + images.map { _ in
                        ["type": "image"]
                    }
                    + videos.map { _ in
                        ["type": "video"]
                    },
            ]
        }
    }
}

I want to discuss on this idea with you. What do you think?

@davidkoski
Copy link
Collaborator

I like the look of it -- I think this is the direction we should go. I wonder if the structured form is what UserInput should be?

@DePasqualeOrg any thoughts here? I think you have more experience with chat & chat history than I do.

@davidkoski
Copy link
Collaborator

I wonder if the structured form is what UserInput should be?

In particular the conversion to the dictionary form (to feed to the template) is model specific, so I think the UserInput should be generic and thus this structured form would make more sense.

@DePasqualeOrg
Copy link
Contributor

I think it will be okay, but it's hard for me to say without trying it in practice. Can we have an overload initializer on UserInput that accepts messages in the format [String: Any] as an escape hatch?

@davidkoski
Copy link
Collaborator

I think it will be okay, but it's hard for me to say without trying it in practice. Can we have an overload initializer on UserInput that accepts messages in the format [String: Any] as an escape hatch?

I think we could make something that would implement the protocol and hold a [String:Any] internally, so I think this would work fine.

I agree, it would be good to see this in practice, I wonder what is the best way to accomplish that? I guess it depends what you have time/inclination for.

Or @ibrahimcetin if you are interested you could try building it out and see how it works in your own apps or in the example apps.

@DePasqualeOrg
Copy link
Contributor

DePasqualeOrg commented Mar 11, 2025

@ibrahimcetin or whoever else is interested can take this on. I'm going to focus on vision models and improvements in swift-transformers in the near future. I'll be happy to test it out though.

@ibrahimcetin
Copy link
Contributor Author

I would happily work on this but first we have to decide on the design. I have another idea:

Instead Chat.Message protocol, we can make it a struct like this:

public enum Chat {
    public struct Message {
        /// The role of the message sender.
        var role: Role

        /// The content of the message.
        var content: String

        /// Array of image data associated with the message.
        var images: [UserInput.Image]

        /// Array of video data associated with the message.
        var videos: [UserInput.Video]

        static func system(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self {
            Self(role: .system, content: content, images: images, videos: videos)
        }

        static func assistant(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self {
            Self(role: .assistant, content: content, images: images, videos: videos)
        }

        static func user(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self {
            Self(role: .user, content: content, images: images, videos: videos)
        }

        public enum Role: String {
            case user
            case assistant
            case system
        }
    }
}
  • This also will avoid us declaring same static methods for every implementation.

Next, we can implement MessageGenerator protocol for each model architecture to generate [String: Any]:

public protocol MessageGenerator {
    /// Returns [String: Any] aka Message
    static func generate(message: Chat.Message) -> Message
}

public extension MessageGenerator {
    /// Returns array of [String: Any] aka Message
    static func generate(messages: [Chat.Message]) -> [Message] {
        var rawMessages: [Message] = []

        for message in messages {
            let raw = generate(message: message)
            rawMessages.append(raw)
        }

        return rawMessages
    }
}

It only has one method to implement: func generate(message: Chat.Message) -> Message. Here is an example:

public enum Qwen2VLMessageGenerator: MessageGenerator {
    public static func generate(message: Chat.Message) -> Message {
        [
            "role": message.role.rawValue,
            "content": [
                ["type": "text", "text": message.content]
            ]
            // Messages format for Qwen 2 VL, Qwen 2.5 VL. May need to be adapted for other models.
            + message.images.map { _ in
                ["type": "image"]
            }
            + message.videos.map { _ in
                ["type": "video"]
            },
        ]
    }
}

I think this way more clear than the first. But, I cannot find a way to use it within UserInput for now. What do you think about this? and Can we use it with current implementation?

Lastly, I can create an example chat app for this. Also, there is no example chat app in the repo currently, so it would be beneficial.

@davidkoski
Copy link
Collaborator

I think the API is roughly the same between the protocol and struct, which is what I was looking at. The protocol would allow applications to conform their own types. If we had the protocol we would probably want a default implementation of it so that users wouldn't have to implement their own.

The struct only approach would be more along the lines of what we have now: a structure to be filled in by the caller. This is fine too, but I think this is where it would be useful to see which was more useful.

The MessageGenerator protocol looks like the right approach. Perhaps UserInputProcessor could require a property / function that would return one (that ties it to the model but allows for multiple models to use the same one). The protocol could have a default implementation.

As to how the would fit into UserInput, we could potentially leave that until we are satisfied with the API. It doesn't have to be in UserInput but I don't know if there is value in having a third layer. UserInput vs LMInput makes sense to me: one is what the application wants to deal with and the other is what the model requires. Anyway, here are the important bits of UserInput:

public typealias Message = [String: Any]

    public enum Prompt: Sendable, CustomStringConvertible {
        case text(String)
        case messages([Message])

        public func asMessages() -> [Message] {

    public var prompt: Prompt

    public init(
        prompt: String, images: [Image] = [Image](), videos: [Video] = [Video](),

    public init(
        messages: [Message], images: [Image] = [Image](), videos: [Video] = [Video](),

    public init(
        prompt: Prompt, images: [Image] = [Image](), processing: Processing = .init(),

Roughly the prompt is either a single string or an array of dictionaries. Callers should be able to initialize a UserInput with a single string or an array of dictionaries.

After integrating your API we would want (I think):

  • be able to initialize with Chat (if this is going to be the container) or [Chat.Message] (if it is just a namespace)
  • the internal storage would be Chat / [Chat.Message] as a new property, e.g. chat or messages
  • the above init methods would construct [Chat.Message] as needed
  • prompt and Prompt would be deprecated but would still exist and would convert from [Chat.Message] into the current format

This would let users of the old API continue to work without change and let us switch to the new API gracefully.

Something along those lines anyway.

@ibrahimcetin
Copy link
Contributor Author

Maybe we can add new case to Prompt named chat([Chat.Message]) to be compatible with the current code (the name may be change). Then, UserInputProcessor takes appropriate MessageGenerator and generates the array of dictionaries. (Even further, we may don't need to toMessages() method and remove it. (?))

public enum Prompt: Sendable, CustomStringConvertible {
        case text(String)
        case messages([Message])
        case chat([Chat.Message])

       public init(chat: [Chat.Message], images: [Image] = [Image](), videos: [Video] = [Video]())
}

My goal is to make UserInput generic, type safe and easy to use. I think the message generation should be an implementation detail. So, my second approach is better to achieve these goals.

UserInput vs LMInput makes sense to me

I agree with you. Adding another layer may be unnecessary. We should continue to use UserInput. I am thinking that we should create and maintain MessageGenerators (as Models) and find a way to choose the correct generator to pass UserInputProcessor to generate correct array of dictionaries and we can use it directly.

Using Chat as a container will make it complex and unnecessary IMO. Keeping it as a namespace is simpler.

In conclusion, what do you think adding a new case to Prompt and using MessageGenerator types for each model which maintained by us to generate raw messages (array of dictionaries)?

@davidkoski
Copy link
Collaborator

I think we still need toMessages() (or equivalent) because the Jinja template processor expects an array of dictionaries.

@davidkoski
Copy link
Collaborator

Adding it to Prompt could work too -- I figured it might be obsolete. If we have easy ways to generate Chat.Message then ending up with a single backing representation that the MessageGenerator needs to handle it probably best.

I guess I would approach it as: if you were designing this from scratch what would you do? Let's build that and figure out how to make the old API work for any clients that still use it and we can mark those as deprecated. I think the bridge to map to the old API will be straightforward.

@ibrahimcetin
Copy link
Contributor Author

Thanks for your responses. I will give it a try.

@johnmai-dev
Copy link
Contributor

johnmai-dev commented Mar 26, 2025

(_ content: String, images: [UserInput.Image], videos: [UserInput.Video])
I believe the above method cannot confirm the order between text, images, and videos. I suggest adhering to the transformers' standards. Although I haven't found information on whether the order of content arrays affects the model, I think there might be in the future.

https://huggingface.co/docs/transformers/chat_templating_multimodal
https://huggingface.co/docs/transformers/chat_templating

Based on code standards and considerations for future scalability, I've made the following optimizations to the code:
Image

import Foundation
import Testing

@testable import Settings

typealias Image = URL
typealias Video = URL
    
enum Chat {
    enum Content: CustomStringConvertible {
        case text(String)
        case image(Image)
        case video(Video)
        
        var description: String {
            switch self {
            case .text(let text):
                text
            case .image(let image):
                // example
                image.absoluteString
            case .video(let video):
                // example
                video.absoluteString
            }
        }
    }
    
    public protocol MessageProtocol {
        var role: MessageRole { get }
        var content: [Content] { get }

        static func system(_ content: String) -> Self
        static func system(_ contents: Content...) -> Self
        static func assistant(_ content: String) -> Self
        static func assistant(_ contents: Content...) -> Self
        static func user(_ content: String) -> Self
        static func user(_ contents: Content...) -> Self
        func toRawMessage() -> [String: Any]
    }

    public enum MessageRole: String {
        case user
        case assistant
        case system
    }
    
    struct Message: MessageProtocol {
        var role: MessageRole
        
        var content: [Content]
        
        static func system(_ content: String) -> Message {
            Message(role: .system, content: [.text(content)])
        }
        
        static func system(_ content: Content...) -> Message {
            Message(role: .system, content: content)
        }
        
        static func assistant(_ content: String) -> Message {
            Message(role: .assistant, content: [.text(content)])
        }
        
        static func assistant(_ content: Content...) -> Message {
            Message(role: .assistant, content: content)
        }
        
        static func user(_ content: String) -> Message {
            Message(role: .user, content: [.text(content)])
        }
        
        static func user(_ content: Content...) -> Message {
            Message(role: .user, content: content)
        }
        
        func toRawMessage() -> [String: Any] {
            if content.count == 1, case .text(let text) = content[0] {
                return [
                    "role": role.rawValue,
                    "content": text
                ]
            }
                
            return [
                "role": role.rawValue,
                "content": content.map { content in
                    switch content {
                    case .text(let text):
                        ["type": "text", "text": text]
                    case .image(let image):
                        ["type": "image", "image": image]
                    case .video(let video):
                        ["type": "video", "video": video]
                    }
                }
            ]
        }
    }
}

@Test func example() async throws {
    let messages: [Chat.Message] = [
        .system("You are assistant."),
        .user("Hello"),
        .assistant("Hi"),
        .user(
            .image(URL(string: "https://example.com/image.jpg")!),
            .text("Image?")
        ),
        .assistant("I'm fine."),
        .user(
            .text("Video?"),
            .video(URL("https://example.com/video.mp4")!)
        )
    ]
    
    let rawMessages: [[String: Any]] = messages.map { $0.toRawMessage() }
    
    print(rawMessages)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants