MSC2516: Add a new message type for voice messages #2516
@@ -0,0 +1,66 @@
# Add a separate message type for voice messages

In the Matrix spec right now, there is a message type `m.audio` for audio files.

Review comment: link to https://matrix.org/docs/spec/client_server/r0.6.1#m-audio would be helpful here.

In other messaging apps, there is also a special type for voice memos,
since they carry a different meaning and call for different behaviour.
This MSC calls for the introduction of an `m.voice` message type.

Even if it's not the primary mode of communication for nerds,
voice memos are very important to a lot of users of modern instant messaging services.
In order to provide awesome voice messages, they need to be treated differently from generic audio files.

For example, WhatsApp renders them differently, to highlight that they are a way of communication.
WhatsApp also always force-downloads them, because like a text message,
they should be available to consume as early as possible.
This lets the recipient know at a glance that they are expected
to listen to the voice messages now, instead of later.

The presentation of voice messages should reinforce the authenticity
and potential urgency of the audio content.


## Proposal

I propose to introduce a new message type `m.voice` with the same
contents as `m.audio`.
Voice messages MUST be Ogg files, Opus encoded.

Review comment: I'm starting to wonder about the choice of ogg here: it looks like webm might be edging out ogg as a more widely supported container format, plus all the common browsers can produce it natively (chrome can't mux into ogg). Would be good to at least note why we picked ogg.

Review comment: I would also prefer something more easily accessible on client platforms, which indeed seems to be WebM, at least for the web browser case. (I assume for native platforms, the various containers are roughly identical in implementation complexity.) What's the reason for selecting Ogg? Could we use WebM instead?

Review comment: I'd really like us to stick with opus for several reasons, some of which are already discussed in #matrix-spec (and I thought translated here, but apparently not):
...and a couple more less important reasons that hurt to type on a phone, but that's the gist of it

Review comment: To be clear: I'm just questioning the container format, not the codec, ie. Opus in WebM, so the file size would be basically the same. Ogg does seem to be what other messaging platforms use though, so yeah, bridges would have to remux.

Review comment: ah, fair enough. We could probably get away with webm though I think we'd also have to have a good reason for going against the grain, imo

Other files can be sent as `m.audio` or `m.file`.
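
As a rough illustration of the proposal text, a client might assemble the `content` of such a message as sketched below. The fields (`body`, `url`, `info.mimetype`, `info.duration`, `info.size`) are the ones `m.audio` already uses. The helper name `build_voice_content`, the use of the `audio/ogg` MIME type to detect Ogg, and the example `mxc://` URI are assumptions made for this sketch, not part of the MSC.

```python
def build_voice_content(mxc_url: str, filename: str, duration_ms: int,
                        size_bytes: int, mimetype: str = "audio/ogg") -> dict:
    """Build message content for a voice recording (hypothetical helper)."""
    # The proposal requires Ogg/Opus for m.voice; other files go out as m.audio.
    # Assumption: the container is identified by the "audio/ogg" MIME type.
    msgtype = "m.voice" if mimetype == "audio/ogg" else "m.audio"
    return {
        "msgtype": msgtype,
        "body": filename,
        "url": mxc_url,
        "info": {
            "mimetype": mimetype,
            "duration": duration_ms,  # milliseconds, as in m.audio
            "size": size_bytes,
        },
    }


# Example usage with made-up values.
content = build_voice_content("mxc://example.org/abc123", "voice.ogg", 4200, 15000)
```

This check looks only at the declared MIME type; a stricter client could also inspect the file to confirm that the codec really is Opus.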

Review comment: Another thought here: can we add something about audio format, ie sample rate / channel count here? I think not mandating one is probably fine, but if so we should make it clear that clients should expect anything.

Review comment: fwiw the more I think about this the more I draw a conclusion that the spec shouldn't care about a mandated set of values, but it should probably recommend some sane defaults (for those who just want to whack in some libraries and call it good). Clients expecting anything is fairly on-par with the latest directions of Matrix, anyhow.

### Related links:
- [A long-standing issue on Riot Web that calls for voice messages](https://github.com/vector-im/riot-web/issues/1358)
- [An earlier proposal to send m.typing-like status codes when recording](https://github.com/matrix-org/matrix-doc/pull/310)
- [Telegram API for voice messages](https://core.telegram.org/bots/api#sendvoice)

## Potential issues

Introducing a new message type means that client developers will have to
do work to implement it, or their users won't be able to use the feature.

## Alternatives

Alternatively, a flag `voice = true` or `type = "voice"` could be created inside of the `m.audio` event.
I'm not sure what the more canonical way of doing things would be here.

This alternative version (extending the `m.audio` message type) has the benefit
that it comes with backwards compatibility for free. However, we should keep
types as simple as possible.
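
To make the two designs concrete, here is a minimal sketch of how a receiving client might recognise a voice message under either of them. The function name `is_voice_message` is hypothetical, and the exact placement of the flag (a top-level boolean key `voice` in the content) is an assumption, since the text above leaves the flag's shape open.

```python
def is_voice_message(content: dict) -> bool:
    """Return True if the message content denotes a voice message."""
    msgtype = content.get("msgtype")
    # Design 1 (this proposal): a dedicated message type.
    if msgtype == "m.voice":
        return True
    # Design 2 (the alternative): a flag inside an ordinary m.audio event.
    # Assumption: the flag is a top-level boolean key named "voice".
    return msgtype == "m.audio" and content.get("voice") is True


# A client that ignores the flag still sees a normal m.audio message,
# which is where the "backwards compatibility for free" comes from.
flagged = {"msgtype": "m.audio", "body": "voice.ogg", "voice": True}
dedicated = {"msgtype": "m.voice", "body": "voice.ogg"}
plain = {"msgtype": "m.audio", "body": "song.ogg"}
```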

Review comment: Since

Review comment: I don't think we should try and overload

Review comment: Not trying to repeat the discussion we had in #spec, but instead just documenting it, so that it doesn't get lost: Well, this isn't necessarily a spec defined representation, but a content hint, that may change the client side representation. A lot of clients may opt to represent them the same way initially. We already have an info object for most media types, that describes the content more exactly by adding a filename, thumbnail, size, duration, etc. Using a new message type will regress the UX for users first, since most clients will not understand the msgtype and will default to showing the voice message as text. Some clients may not focus on implementing them in the near future (location messages, proper reply rendering, and similar events are not necessarily implemented in clients currently for example), which will make for a worse experience when sending a voice message over an audio message. Each new message type also increases the implementation burden for new clients. So basically we have, in my eyes:
I guess choosing one over the other is a trade off, I prefer the flag though, since it is easier to implement and will cause less bug reports (because one side doesn't support it).

Review comment: I'm honestly not convinced that we should be complicating the spec with a third way of representing events. Yes, older clients will not be able to render the events properly however realistically they should be updated shortly after a spec release to handle this sort of thing, even if it is just adding
I would not anticipate an overwhelming number of bug reports for lack of support given historical cases of similar things (largely within the Element clients, but occasionally from the spec too). For example, we've not seen an overwhelming number of bug reports on Matrix projects relating to SSO despite major servers offering it as the only login option.

Review comment: I think that it's a question of whether there's a semantic difference between voice messages and audio files, or if it's just an aesthetic difference. That is, are they different things, or are they the same thing but merely rendered differently? If they are different things, then it should be a different msgtype (in the same way that

Review comment: I guess that is a good point. They probably have more of a semantic difference to the user and some clients will render them differently, but they are mostly the same on the backend.
This is similar to stickers, which are usually represented very similarly and behave the same way in the backend (apart from the image source mostly), but we have different event types for them. For consistency it makes sense to have a different msgtype then, because they are different semantically imo. It just sucks for backwards compatibility, but if we want to be forward thinking, a different msgtype is certainly cleaner. Especially considering that notices and voice messages usually also cause a change in notification behaviour. (At least I could imagine setting a different notification sound or treating them more like a one sided call.)

Review comment: I think the semantic difference is pretty weak. We also see some talk here about including metadata for music files separately. At this point it seems quite weird to have 3 different "styles" of inline-playable audio clips that happen to have all or mostly the same parameters. Even in the examples showed from other messaging services the UI differences are minor. They appear to be the same type of widget with minor tweaks. That being said, I agree it isn't major enough to block the RFC, but it does seem funny to add all of the spec and discussion for minor behaviour tweaks.

There is also #1767 (display hints) which tackles the same issue more generally,
but it is not ready, and voice messages should come first.

## Security considerations

@uhoreg offers:
> Auto-downloading of files (if clients follow WhatsApp's example) sounds
like it could be a security issue. (e.g. DoS by using up users' bandwidth,
could cause malicious content to be automatically downloaded)

Review comment (on lines +55 to +58): I don't think this is a concern.

This could be solved by having clients handle auto-download responsibly,
e.g. only auto-download voice messages from trusted contacts.
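
One way a client could act on this is sketched below: auto-download only when the sender is on a user-maintained trust list and the advertised size is under a cap. Everything here is hypothetical (the function name, the trust list, and the 5 MiB default); the MSC does not define what "trusted" means or set any limit.

```python
from typing import Set

# Hypothetical size cap for automatic downloads (5 MiB); not from the proposal.
DEFAULT_MAX_BYTES = 5 * 1024 * 1024


def should_auto_download(sender: str, reported_size: int,
                         trusted_senders: Set[str],
                         max_bytes: int = DEFAULT_MAX_BYTES) -> bool:
    """Decide whether to fetch a voice message without asking the user."""
    # Only senders the user has explicitly marked as trusted qualify.
    if sender not in trusted_senders:
        return False
    # reported_size comes from the sender-supplied info.size, so a real
    # client would also need to enforce the cap while downloading.
    return 0 <= reported_size <= max_bytes
```

The size check addresses the bandwidth concern only partially, because the advertised size is controlled by the sender; the trust list is what limits exposure to malicious content.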

Review comment: It would be great to better understand what "trusted contacts" means here - verified devices maybe?

Review comment: I don't think it would necessarily be the same as verified users, since they are "trusted" in different ways. "Trusted" here means more that you trust the other person to not behave badly (e.g. by not sending you a massive file, or a file that might trigger a vulnerability in your audio player), whereas verification indicates that you trust that the person is who they claim to be. But you could verify someone without trusting that they are not malicious (e.g. your friend who loves playing bad practical jokes, or an employee of a rival company that you're collaborating with but is still a rival). It would probably be up to the client to figure out how to ask the user who is considered "trusted". Maybe it could do something like what mail clients do with images, where they ask you if you want to load the images in a message, and allow you to select whether to always allow from that sender.

## Unstable prefix

Review comment: ftr Element Web is going to use

While this MSC is not considered a stable part of the specification,
implementations should use `org.matrix.msc2516.voice` in place of `m.voice`.
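
In practice this usually means sending the unstable identifier while accepting both on receipt. A minimal sketch, with hypothetical helper names and a hypothetical `msc_is_stable` switch that an implementation would flip once the MSC lands in the spec:

```python
UNSTABLE_VOICE = "org.matrix.msc2516.voice"
STABLE_VOICE = "m.voice"


def outgoing_voice_msgtype(msc_is_stable: bool = False) -> str:
    """Pick the msgtype to send, depending on whether the MSC is stable yet."""
    return STABLE_VOICE if msc_is_stable else UNSTABLE_VOICE


def is_voice_msgtype(msgtype: str) -> bool:
    """Accept both identifiers so messages from older and newer senders render."""
    return msgtype in (STABLE_VOICE, UNSTABLE_VOICE)
```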

Review comment: Having worked with this for a while now, it seems it would in fact be best to go with an extensible events format on top of `m.audio`. This might block the MSC behind extensible events, but the intention would be to land with an `m.voice` event type that has `content` containing `m.message`, `m.audio`, and `m.voice`. As a fallback, implementations would use a regular `m.room.message` event with `msgtype: "m.audio"` and `"m.voice": {}` in the `content` (`org.matrix.msc2516.voice` during unstable implementation).

Does this sound sane? If it's too different from what you're comfortable with, let me know and I can open a new MSC to describe it in detail.
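
For illustration only, the fallback content described in this comment might be built as follows. The helper name and the example field values are hypothetical, and the extensible-events form itself is not sketched here because the comment does not spell out its shape.

```python
def build_fallback_content(audio_content: dict, stable: bool = False) -> dict:
    """Mark ordinary m.audio content as a voice message (fallback form)."""
    content = dict(audio_content)  # shallow copy; leave the caller's dict alone
    content["msgtype"] = "m.audio"
    # An empty object under the voice key acts as the marker; the unstable
    # identifier is used until the MSC is accepted.
    key = "m.voice" if stable else "org.matrix.msc2516.voice"
    content[key] = {}
    return content


# Example usage with made-up values.
fallback = build_fallback_content(
    {"body": "voice.ogg", "url": "mxc://example.org/abc123"}
)
```

Clients that do not know the marker key would simply treat the event as a regular audio message.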

Review comment: I've since split it out: #3245