-
Notifications
You must be signed in to change notification settings - Fork 229
Structured Chat Messages #234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I like the look of it -- I think this is the direction we should go. I wonder if the structured form is what UserInput should be? @DePasqualeOrg any thoughts here? I think you have more experience with chat & chat history than I do. |
In particular the conversion to the dictionary form (to feed to the template) is model specific, so I think the UserInput should be generic and thus this structured form would make more sense. |
I think it will be okay, but it's hard for me to say without trying it in practice. Can we have an overload initializer on |
I think we could make something that would implement the protocol and hold a I agree, it would be good to see this in practice, I wonder what is the best way to accomplish that? I guess it depends what you have time/inclination for. Or @ibrahimcetin if you are interested you could try building it out and see how it works in your own apps or in the example apps. |
@ibrahimcetin or whoever else is interested can take this on. I'm going to focus on vision models and improvements in swift-transformers in the near future. I'll be happy to test it out though. |
I would happily work on this but first we have to decide on the design. I have another idea: Instead public enum Chat {
public struct Message {
/// The role of the message sender.
var role: Role
/// The content of the message.
var content: String
/// Array of image data associated with the message.
var images: [UserInput.Image]
/// Array of video data associated with the message.
var videos: [UserInput.Video]
static func system(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self {
Self(role: .system, content: content, images: images, videos: videos)
}
static func assistant(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self {
Self(role: .assistant, content: content, images: images, videos: videos)
}
static func user(_ content: String, images: [UserInput.Image], videos: [UserInput.Video]) -> Self {
Self(role: .user, content: content, images: images, videos: videos)
}
public enum Role: String {
case user
case assistant
case system
}
}
}
Next, we can implement public protocol MessageGenerator {
/// Returns [String: Any] aka Message
static func generate(message: Chat.Message) -> Message
}
public extension MessageGenerator {
/// Returns array of [String: Any] aka Message
static func generate(messages: [Chat.Message]) -> [Message] {
var rawMessages: [Message] = []
for message in messages {
let raw = generate(message: message)
rawMessages.append(raw)
}
return rawMessages
}
} It only has one method to implement: public enum Qwen2VLMessageGenerator: MessageGenerator {
public static func generate(message: Chat.Message) -> Message {
[
"role": message.role.rawValue,
"content": [
["type": "text", "text": message.content]
]
// Messages format for Qwen 2 VL, Qwen 2.5 VL. May need to be adapted for other models.
+ message.images.map { _ in
["type": "image"]
}
+ message.videos.map { _ in
["type": "video"]
},
]
}
} I think this way more clear than the first. But, I cannot find a way to use it within Lastly, I can create an example chat app for this. Also, there is no example chat app in the repo currently, so it would be beneficial. |
I think the API is roughly the same between the protocol and struct, which is what I was looking at. The protocol would allow applications to conform their own types. If we had the protocol we would probably want a default implementation of it so that users wouldn't have to implement their own. The struct only approach would be more along the lines of what we have now: a structure to be filled in by the caller. This is fine too, but I think this is where it would be useful to see which was more useful. The As to how the would fit into UserInput, we could potentially leave that until we are satisfied with the API. It doesn't have to be in UserInput but I don't know if there is value in having a third layer. UserInput vs LMInput makes sense to me: one is what the application wants to deal with and the other is what the model requires. Anyway, here are the important bits of UserInput: public typealias Message = [String: Any]
public enum Prompt: Sendable, CustomStringConvertible {
case text(String)
case messages([Message])
public func asMessages() -> [Message] {
public var prompt: Prompt
public init(
prompt: String, images: [Image] = [Image](), videos: [Video] = [Video](),
public init(
messages: [Message], images: [Image] = [Image](), videos: [Video] = [Video](),
public init(
prompt: Prompt, images: [Image] = [Image](), processing: Processing = .init(), Roughly the prompt is either a single string or an array of dictionaries. Callers should be able to initialize a UserInput with a single string or an array of dictionaries. After integrating your API we would want (I think):
This would let users of the old API continue to work without change and let us switch to the new API gracefully. Something along those lines anyway. |
Maybe we can add new case to public enum Prompt: Sendable, CustomStringConvertible {
case text(String)
case messages([Message])
case chat([Chat.Message])
public init(chat: [Chat.Message], images: [Image] = [Image](), videos: [Video] = [Video]())
} My goal is to make
I agree with you. Adding another layer may be unnecessary. We should continue to use Using In conclusion, what do you think adding a new case to |
I think we still need |
Adding it to Prompt could work too -- I figured it might be obsolete. If we have easy ways to generate I guess I would approach it as: if you were designing this from scratch what would you do? Let's build that and figure out how to make the old API work for any clients that still use it and we can mark those as deprecated. I think the bridge to map to the old API will be straightforward. |
Thanks for your responses. I will give it a try. |
https://huggingface.co/docs/transformers/chat_templating_multimodal Based on code standards and considerations for future scalability, I've made the following optimizations to the code: import Foundation
import Testing
@testable import Settings
typealias Image = URL
typealias Video = URL
enum Chat {
enum Content: CustomStringConvertible {
case text(String)
case image(Image)
case video(Video)
var description: String {
switch self {
case .text(let text):
text
case .image(let image):
// example
image.absoluteString
case .video(let video):
// example
video.absoluteString
}
}
}
public protocol MessageProtocol {
var role: MessageRole { get }
var content: [Content] { get }
static func system(_ content: String) -> Self
static func system(_ contents: Content...) -> Self
static func assistant(_ content: String) -> Self
static func assistant(_ contents: Content...) -> Self
static func user(_ content: String) -> Self
static func user(_ contents: Content...) -> Self
func toRawMessage() -> [String: Any]
}
public enum MessageRole: String {
case user
case assistant
case system
}
struct Message: MessageProtocol {
var role: MessageRole
var content: [Content]
static func system(_ content: String) -> Message {
Message(role: .system, content: [.text(content)])
}
static func system(_ content: Content...) -> Message {
Message(role: .system, content: content)
}
static func assistant(_ content: String) -> Message {
Message(role: .assistant, content: [.text(content)])
}
static func assistant(_ content: Content...) -> Message {
Message(role: .assistant, content: content)
}
static func user(_ content: String) -> Message {
Message(role: .user, content: [.text(content)])
}
static func user(_ content: Content...) -> Message {
Message(role: .user, content: content)
}
func toRawMessage() -> [String: Any] {
if content.count == 1, case .text(let text) = content[0] {
return [
"role": role.rawValue,
"content": text
]
}
return [
"role": role.rawValue,
"content": content.map { content in
switch content {
case .text(let text):
["type": "text", "text": text]
case .image(let image):
["type": "image", "image": image]
case .video(let video):
["type": "video", "video": video]
}
}
]
}
}
}
@Test func example() async throws {
let messages: [Chat.Message] = [
.system("You are assistant."),
.user("Hello"),
.assistant("Hi"),
.user(
.image(URL(string: "https://example.com/image.jpg")!),
.text("Image?")
),
.assistant("I'm fine."),
.user(
.text("Video?"),
.video(URL("https://example.com/video.mp4")!)
)
]
let rawMessages: [[String: Any]] = messages.map { $0.toRawMessage() }
print(rawMessages)
}
|
I recently investigating ollama-swift package and I like its chat handling. Here is a part of it:
I think, we can create an abstraction similar to this on the current implementation.
Example usage:
Then we can convert this messages to
UserInput
to generate.I am not an expert and I am still learning but I assume that every model architecture have its own message template. For example, for Qwen2VL requires this:
If I am wrong please let me know.
I will continue based on this information from now. So, unlike ollama-swift implementation, we should create a protocol called
Message
and this protocol will haverole
,content
,images
,videos
and maybetool
. And we can define a method liketoRawMessage
to convert it to [String: Any] format. The key istoRawMessage
method. We will implement this method and it converts the messages.Here is a basic definition, just for giving an idea:
Here, as you can see, we can create
UserInput
from chat messages.Finally, this is the implementation:
I want to discuss on this idea with you. What do you think?
The text was updated successfully, but these errors were encountered: