MSC2516: Add a new message type for voice messages #2516

Closed
wants to merge 9 commits into from
66 changes: 66 additions & 0 deletions proposals/2516-new-type-for-voice-messages.md
@@ -0,0 +1,66 @@
# Add a separate message type for voice messages
Member

Having worked with this for a while now, it seems it would in fact be best to go with an extensible events format on top of m.audio. This might block the MSC behind extensible events, but the intention would be to land with an m.voice event type that has content containing m.message, m.audio, and m.voice. As a fallback, implementations would use a regular m.room.message event for msgtype: "m.audio" and "m.voice": {} in the content (org.matrix.msc2516.voice during unstable implementation).

Does this sound sane? If it's too different from what you're comfortable with, let me know and I can open a new MSC to describe it in detail.
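
The fallback described in this comment could be detected on the receiving side roughly as follows. This is a minimal sketch, not a reference implementation: the function name `is_voice_fallback` is hypothetical, and it assumes events arrive as plain dictionaries.

```python
# Keys taken from the comment above; the unstable one applies while
# the MSC is not yet part of the stable specification.
STABLE_KEY = "m.voice"
UNSTABLE_KEY = "org.matrix.msc2516.voice"


def is_voice_fallback(event: dict) -> bool:
    """Return True for an m.room.message with msgtype m.audio that is flagged as voice.

    Hypothetical helper: it only checks for the presence of the (empty)
    voice object in the content.
    """
    if event.get("type") != "m.room.message":
        return False
    content = event.get("content", {})
    if content.get("msgtype") != "m.audio":
        return False
    return STABLE_KEY in content or UNSTABLE_KEY in content
```

A client that does not know about the flag would still see an ordinary `m.audio` message, which is the point of the fallback.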

Member

I've since split it out: #3245


In the Matrix spec right now, there is a message type `m.audio` for audio files.
In other messaging apps, there is also a special type for voice memos,
since they carry a different meaning and call for different behaviour.
This MSC calls for the introduction of an `m.voice` message type.

Even if it's not the primary mode of communication for nerds,
voice memos are very important to a lot of users of modern instant messaging services.
In order to provide awesome voice messages, they need to be treated differently from generic audio files.

For example, WhatsApp renders them differently, to highlight that they are a means of communication.
WhatsApp also always force-downloads them, because like a text message,
they should be available to consume as early as possible.
This lets the recipient know at a glance that they are expected
to listen to the voice message now, instead of later.

The presentation of voice messages should reinforce the authenticity
and potential urgency of the audio content.

## Proposal

I propose to introduce a new message type `m.voice` with the same
contents as `m.audio`.
Voice messages MUST be OGG files, Opus encoded. Other files can be
Member

I'm starting to wonder about the choice of ogg here: it looks like webm might be edging out ogg as a more widely supported container format, plus all the common browsers can produce it natively (chrome can't mux into ogg). Would be good to at least note why we picked ogg.

Contributor (@jryans, Mar 23, 2021)

I would also prefer something more easily accessible on client platforms, which indeed seems to be WebM, at least for the web browser case. (I assume for native platforms, the various containers are roughly identical in implementation complexity.)

What's the reason for selecting Ogg? Could we use WebM instead?

Member

I'd really like us to stick with opus for several reasons, some of which are already discussed in #matrix-spec (and I thought translated here, but apparently not):

  1. File size is small in nearly every case.
  2. Playback is supported on all modern environments - anything not modern is unlikely to be using voice messages (IoT) or will be unable to do other things anyways.
  3. It's what all the other platforms do, making it easy/trivial for compatibility. If we had to transcode like bridges already do for video then we've somewhat failed to maintain our interoperability flag. (Video is transcoded because remote networks use insane formats, but opus/ogg isn't that insane)

...and a couple more less important reasons that hurt to type on a phone, but that's the gist of it

Member

To be clear: I'm just questioning the container format, not the codec, ie. Opus in WebM, so the file size would be basically the same. Ogg does seem to be what other messaging platforms use though, so yeah, bridges would have to remux.
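
To make the container-versus-codec distinction concrete, a bridge or client could tell the two containers apart by their leading magic bytes before deciding whether a remux is needed. This is a rough sketch only: `sniff_container` is a hypothetical name, it inspects just the first four bytes, and it does not verify that the stream inside is actually Opus.

```python
OGG_MAGIC = b"OggS"               # capture pattern at the start of every Ogg page
EBML_MAGIC = b"\x1a\x45\xdf\xa3"  # EBML header ID used by Matroska and WebM


def sniff_container(data: bytes) -> str:
    """Guess the container format from the first bytes of a file.

    Returns "ogg", "webm-or-matroska", or "unknown". WebM and Matroska
    share the same EBML magic, so they cannot be separated at this level.
    """
    if data.startswith(OGG_MAGIC):
        return "ogg"
    if data.startswith(EBML_MAGIC):
        return "webm-or-matroska"
    return "unknown"
```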

Member

ah, fair enough. We could probably get away with webm though I think we'd also have to have a good reason for going against the grain, imo

sent as `m.audio` or `m.file`.
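
As an illustration of "the same contents as `m.audio`", the content of such a message might be assembled as below. This is a sketch under assumptions: `build_voice_content` is a hypothetical helper, the `audio/ogg` MIME type is an assumed label for Ogg/Opus files, and the unstable identifier is used by default because the MSC is not part of the stable spec.

```python
def build_voice_content(mxc_url: str, body: str, size_bytes: int,
                        duration_ms: int, stable: bool = False) -> dict:
    """Build message content shaped like m.audio, with a voice msgtype.

    The keys (body, url, info.mimetype, info.size, info.duration) mirror
    those of m.audio. Only the msgtype differs.
    """
    msgtype = "m.voice" if stable else "org.matrix.msc2516.voice"
    return {
        "msgtype": msgtype,
        "body": body,
        "url": mxc_url,
        "info": {
            "mimetype": "audio/ogg",  # assumption: Ogg container, Opus codec
            "size": size_bytes,
            "duration": duration_ms,
        },
    }
```

For example, `build_voice_content("mxc://example.org/abc123", "Voice message.ogg", 12345, 4000)` yields content with `msgtype` set to `org.matrix.msc2516.voice`; the `mxc://` URI here is a made-up placeholder.
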
Member

Another thought here: can we add something about audio format, ie sample rate / channel count here? I think not mandating one is probably fine, but if so we should make it clear that clients should expect anything.

Member

fwiw the more I think about this the more I draw a conclusion that the spec shouldn't care about a mandated set of values, but it should probably recommend some sane defaults (for those who just want to whack in some libraries and call it good). Clients expecting anything is fairly on-par with the latest directions of Matrix, anyhow.

@dbkr #3052 has that feature.


### Related links:
- [A long-standing issue on Riot Web that calls for voice messages
](https://github.com/vector-im/riot-web/issues/1358)
- [An earlier proposal to send m.typing-like status codes when recording
](https://github.com/matrix-org/matrix-doc/pull/310)
- [Telegram API for voice messages
](https://core.telegram.org/bots/api#sendvoice)

## Potential issues

Introducing a new message type means that client developers will have to
do work to implement it, or their users won't be able to use the feature.

## Alternatives

Alternatively, a flag `voice = true` or `type = "voice"` could be created inside of the `m.audio` event.
I'm not sure what the more canonical way of doing things would be here.

This alternative version (extending the m.audio message type) has the benefit
that it comes with backwards compatibility for free. However, we should keep
types as simple as possible.
Member

Since m.audio has quite a bit of metadata defined for it, and this type basically copies it, I don't see how adding a single flag (or even specifying a MIME type) would complicate the type much. Meanwhile, if another type is made as suggested here, readers of the spec will be left with the task of visually comparing the two types to figure out what the difference between them is; and client authors will basically be forced to add some code that would determine this type in the first place.

Member

I don't think we should try and overload m.audio here - it would be the third attempt at spec-defined representations of events (first being msgtypes themselves, second being m.sticker). I really think we should just use the system we have (msgtype) and let some other MSC like MSC1767 solve it properly.

Contributor

Not trying to repeat the discussion we had in #spec, but instead just documenting it, so that it doesn't get lost:

Well, this isn't necessarily a spec defined representation, but a content hint, that may change the client side representation. A lot of clients may opt to represent them the same way initially. We already have an info object for most media types, that describes the content more exactly by adding a filename, thumbnail, size, duration, etc.

Using a new message type will regress the UX for users first, since most clients will not understand the msgtype and will default to showing the voice message as text. Some clients may not focus on implementing them in the near future (location messages, proper reply rendering, and similar events are not necessarily implemented in clients currently for example), which will make for a worse experience when sending a voice message over an audio message. Each new message type also increases the implementation burden for new clients.

So basically we have, in my eyes:

| msgtype | flag |
| --- | --- |
| future-proof | backwards compatible |
| duplication | multiple roles for one type |
| flat type hierarchy | deeply nested type tree |

I guess choosing one over the other is a trade-off. I prefer the flag though, since it is easier to implement and will cause fewer bug reports (because one side doesn't support it).

Member

I'm honestly not convinced that we should be complicating the spec with a third way of representing events. Yes, older clients will not be able to render the events properly; however, realistically they should be updated shortly after a spec release to handle this sort of thing, even if it is just adding `|| content['msgtype'] === 'm.voice'` to their ladder.

I would not anticipate an overwhelming number of bug reports for lack of support given historical cases of similar things (largely within the Element clients, but occasionally from the spec too). For example, we've not seen an overwhelming number of bug reports on Matrix projects relating to SSO despite major servers offering it as the only login option.
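
The "ladder" change mentioned here might look something like the following in a client's rendering dispatch. This is a hypothetical sketch, not any real client's code: `pick_renderer` and the renderer names are invented for illustration, and unknown msgtypes fall back to a plain-text rendering of `body`, as discussed earlier in this thread.

```python
# msgtypes that can share the existing audio player widget
AUDIO_LIKE = {"m.audio", "m.voice", "org.matrix.msc2516.voice"}


def pick_renderer(content: dict) -> str:
    """Choose a renderer name for message content (hypothetical names)."""
    msgtype = content.get("msgtype")
    if msgtype in AUDIO_LIKE:
        return "audio_player"
    if msgtype == "m.image":
        return "image"
    # Anything unrecognised is shown as text using its body.
    return "text"
```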

Member

I think that it's a question of whether there's a semantic difference between voice messages and audio files, or if it's just an aesthetic difference. That is, are they different things, or are they the same thing but merely rendered differently? If they are different things, then it should be a different msgtype (in the same way that m.text and m.notice are both text messages, but have semantic differences and are different msgtypes). But if they are the same thing with different renderings, then I think it should be a flag.

Contributor

I guess that is a good point. They probably have more of a semantic difference to the user and some clients will render them differently, but they are mostly the same on the backend. This is similar to stickers, which are usually represented very similarly and behave the same way in the backend (apart from the image source mostly), but we have different event types for them. For consistency it makes sense to have a different msgtype then, because they are different semantically imo. It just sucks for backwards compatibility, but if we want to be forward thinking, a different msgtype is certainly cleaner. Especially considering that notices and voice messages usually also cause a change in notification behaviour. (At least I could imagine setting a different notification sound or treating them more like a one sided call.)

Contributor

I think the semantic difference is pretty weak. We also see some talk here about including metadata for music files separately. At this point it seems quite weird to have 3 different "styles" of inline-playable audio clips that happen to have all or mostly the same parameters. Even in the examples shown from other messaging services the UI differences are minor. They appear to be the same type of widget with minor tweaks.

That being said, I agree it isn't major enough to block the RFC, but it does seem funny to add all of the spec and discussion for minor behaviour tweaks.


There is also #1767 (display hints) which tackles the same issue more generally,
but it is not ready, and voice messages should come first.

## Security considerations

@uhoreg offers:
> Auto-downloading of files (if clients follow WhatsApp's example) sounds
like it could be a security issue. (e.g. DoS by using up users' bandwidth,
could cause malicious content to be automatically downloaded)
Comment on lines +55 to +58
Contributor

I don't think this is a concern.

  • The client should consider the size before deciding to download or cut off the download past an "acceptable size" (AKA cache the prefix of the file) if bandwidth usage is a concern. This can be applied to all media content.
  • Most clients will already auto-download things like images, so if the payload can be triggered just by downloading this adds no new attack surface. (Assuming that the file isn't decoded before play is pressed).
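
The size check from the first bullet could be sketched as below. The 1 MiB threshold and the name `should_auto_download` are assumptions for illustration. Since `info.size` is supplied by the sender, a real client would presumably also cap the bytes it actually reads.

```python
MAX_AUTO_DOWNLOAD_BYTES = 1024 * 1024  # hypothetical "acceptable size" (1 MiB)


def should_auto_download(content: dict, limit: int = MAX_AUTO_DOWNLOAD_BYTES) -> bool:
    """Decide whether to auto-download media based on its advertised size.

    A missing or malformed size is treated as "do not auto-download".
    """
    size = content.get("info", {}).get("size")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(size, int) or isinstance(size, bool) or size < 0:
        return False
    return size <= limit
```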


This could be solved by having clients handle auto-download responsibly,
e.g. only auto-download voice messages from trusted contacts.
Member

It would be great to better understand what "trusted contacts" means here - verified devices maybe?

Member

I don't think it would necessarily be the same as verified users, since they are "trusted" in different ways. "Trusted" here means more that you trust the other person to not behave badly (e.g. by not sending you a massive file, or a file that might trigger a vulnerability in your audio player), whereas verification indicates that you trust that the person is who they claim to be. But you could verify someone without trusting that they are not malicious (e.g. your friend who loves playing bad practical jokes, or an employee of a rival company that you're collaborating with but is still a rival).

It would probably be up to the client to figure out how to ask the user who is considered "trusted". Maybe it could do something like what mail clients do with images, where they ask you if you want to load the images in a message, and allow you to select whether to always allow from that sender.
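
One way a client might model the mail-client-style choice described here is a per-sender allow-list that the user extends when they pick "always allow". This is a sketch only: `AutoDownloadPolicy` and its methods are hypothetical names, and the user IDs below are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class AutoDownloadPolicy:
    """Tracks senders the user has chosen to always auto-download from."""

    allowed_senders: set = field(default_factory=set)

    def allow_always(self, sender: str) -> None:
        # Called when the user selects "always allow from this sender".
        self.allowed_senders.add(sender)

    def may_auto_download(self, sender: str) -> bool:
        # Trust here is a user decision and is independent of device verification.
        return sender in self.allowed_senders
```

For instance, after `policy.allow_always("@alice:example.org")`, voice messages from that user would be fetched automatically while those from other senders would still wait for a tap.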


## Unstable prefix
Member

ftr Element Web is going to use org.matrix.experimental.* to see how this MSC can benefit from something like #1767


While this MSC is not considered a stable part of the specification,
implementations should use `org.matrix.msc2516.voice` in place of `m.voice`.