Commit a7c28fd

Merge pull request #3 from matrix-org/human-id-rules
Proposal for human ID rules.
2 parents d80a019 + aebfcda commit a7c28fd

1 file changed

+114
-63
lines changed

drafts/human-id-rules.rst

Lines changed: 114 additions & 63 deletions
@@ -1,81 +1,132 @@
1-
This document outlines the format for human-readable IDs within matrix.
1+
Abstract
2+
========
23

3-
Overview
4-
--------
5-
UTF-8 is quickly becoming the standard character encoding set on the web. As
6-
such, Matrix requires that all strings MUST be encoded as UTF-8. However,
4+
This document outlines the format for human-readable IDs within Matrix.
5+
6+
Background
7+
----------
8+
UTF-8 is the dominant character encoding for Unicode on the web. However,
79
using Unicode as the character set for human-readable IDs is troublesome. There
810
are many different characters which appear identical to each other, but would
9-
identify different users. In addition, there are non-printable characters which
10-
cannot be rendered by the end-user. This opens up a security vulnerability with
11+
produce different IDs. In addition, there are non-printable characters which
12+
cannot be rendered by the end-user. This creates an opportunity for
1113
phishing/spoofing of IDs, commonly known as a homograph attack.
1214

13-
Web browers encountered this problem when International Domain Names were
15+
Web browsers encountered this problem when International Domain Names were
1416
introduced. A variety of checks were put in place in order to protect users. If
1517
an address failed the check, the raw punycode would be displayed to
16-
disambiguate the address. Similar checks are performed by homeservers in
17-
Matrix. However, Matrix does not use punycode representations, and so does not
18-
show raw punycode on a failed check. Instead, homeservers must outright reject
19-
these misleading IDs.
18+
disambiguate the address.
2019

21-
Types of human-readable IDs
22-
---------------------------
23-
There are two main human-readable IDs in question:
20+
The human-readable IDs in Matrix are Room Aliases and User IDs.
21+
Room aliases look like ``#localpart:domain``. These aliases point to opaque
22+
non human-readable room IDs. These pointers can change to point at a different
23+
room ID at any time. User IDs look like ``@localpart:domain``. These represent
24+
actual end-users (there is no indirection).
2425

25-
- Room aliases
26-
- User IDs
26+
Proposal
27+
========
2728

28-
Room aliases look like ``#localpart:domain``. These aliases point to opaque
29-
non human-readable room IDs. These pointers can change, so there is already an
30-
issue present with the same ID pointing to a different destination at a later
31-
date.
32-
33-
User IDs look like ``@localpart:domain``. These represent actual end-users, and
34-
unlike room aliases, there is no layer of indirection. This presents a much
35-
greater concern with homograph attacks.
36-
37-
Checks
38-
------
39-
- Similar to web browsers.
40-
- blacklisted chars (e.g. non-printable characters)
41-
- mix of language sets from 'preferred' language not allowed.
42-
- Language sets from CLDR dataset.
43-
- Treated in segments (localpart, domain)
44-
- Additional restrictions for ease of processing IDs.
45-
46-
- Room alias localparts MUST NOT have ``#`` or ``:``.
47-
- User ID localparts MUST NOT have ``@`` or ``:``.
48-
49-
Rejecting
50-
---------
51-
- Homeservers MUST reject room aliases which do not pass the check, both on
52-
GETs and PUTs.
53-
- Homeservers MUST reject user ID localparts which do not pass the check, both
54-
on creation and on events.
55-
- Any homeserver whose domain does not pass this check, MUST use their punycode
56-
domain name instead of the IDN, to prevent other homeservers rejecting you.
57-
- Error code is ``M_FAILED_HUMAN_ID_CHECK``. (generic enough for both failing
58-
due to homograph attacks, and failing due to including ``:`` s, etc)
59-
- Error message MAY go into further information about which characters were
60-
rejected and why.
61-
- Error message SHOULD contain a ``failed_keys`` key which contains an array
62-
of strings which represent the keys which failed the check e.g::
63-
64-
failed_keys: [ user_id, room_alias ]
65-
66-
Other considerations
67-
--------------------
68-
- Basic security: Informational key on the event attached by HS to say "unsafe
29+
User IDs and Room Aliases MUST be Unicode encoded as UTF-8. Checks are performed on
30+
these IDs by homeservers to protect users from phishing/spoofing attacks.
31+
These checks are:
32+
33+
User ID Localparts:
34+
- MUST NOT contain a ``:`` or start with a ``@`` or ``.``
35+
- MUST NOT contain one of the 107 blacklisted characters on this list:
36+
http://kb.mozillazine.org/Network.IDN.blacklist_chars
37+
- After stripping 0-9, +, -, [, ], _, and the space character, it MUST NOT
38+
contain characters from >1 language, defined by the `exemplar characters`_
39+
on http://cldr.unicode.org/
40+
41+
.. _exemplar characters: http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters
42+
43+
Room Alias Localparts:
44+
- MUST NOT contain a ``:``
45+
- MUST NOT contain one of the 107 blacklisted characters on this list:
46+
http://kb.mozillazine.org/Network.IDN.blacklist_chars
47+
- After stripping 0-9, +, -, [, ], _, and the space character, it MUST NOT
48+
contain characters from >1 language, defined by the `exemplar characters`_
49+
on http://cldr.unicode.org/
50+
51+
.. _exemplar characters: http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters
52+
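The localpart checks above can be sketched roughly as follows. This is a minimal illustration, not the proposal's reference implementation: the names ``check_user_localpart``, ``scripts_used`` and the ``blacklist`` parameter are hypothetical, the 107 blacklisted characters are not reproduced here, and the first word of each character's Unicode name is used as a crude stand-in for the CLDR exemplar character sets (which this sketch does not consult).

```python
import unicodedata

# Characters the proposal strips before the language check.
STRIPPED = set("0123456789+-[]_ ")


def scripts_used(text):
    """Approximate 'language' by the first word of each letter's Unicode name.

    Assumption: this is NOT the CLDR exemplar data, only a rough proxy.
    Non-letters are skipped so that punctuation does not count as a script.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch not in STRIPPED and ch.isalpha()
    }


def check_user_localpart(localpart, blacklist=frozenset()):
    """Return True if the localpart passes this simplified version of the checks."""
    # MUST NOT contain ':' or start with '@' or '.'
    if ":" in localpart or localpart.startswith(("@", ".")):
        return False
    # MUST NOT contain a blacklisted character (caller supplies the list).
    if any(ch in blacklist for ch in localpart):
        return False
    # MUST NOT mix characters from more than one 'language' (approximated).
    return len(scripts_used(localpart)) <= 1
```

A room alias localpart check would be the same minus the leading ``@`` / ``.`` rule. Because the proxy works per script rather than per language, it would wrongly reject names that legitimately mix scripts within one language (for example Japanese kana and kanji), which is one reason the proposal points at CLDR instead.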
53+
In the event of a failed user ID check, well-behaved homeservers MUST:
54+
- Rewrite user IDs in the offending events to be punycode with an additional ``@``
55+
prefix **before** delivering them to clients. There are no guarantees for
56+
consistency between homeserver ID checking implementations. As a result, user
57+
IDs MUST be sent in their *original* form over federation. This can be done in
58+
a stateless manner as the punycode form has no information loss.
59+
60+
In the event of a failed room alias check, well-behaved homeservers MUST:
61+
- Send an HTTP status code 400 with an ``errcode`` of ``M_FAILED_HUMAN_ID_CHECK``
62+
to the client if the client is attempting to *create* this alias.
63+
- Send an HTTP status code 400 with an ``errcode`` of ``M_FAILED_HUMAN_ID_CHECK``
64+
to the client if the client is attempting to *join* a room via this alias.
65+
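As a rough sketch of the rejection described above, a homeserver might build its reply like this. The function name ``failed_alias_response`` is hypothetical, and the human-readable ``error`` key is an assumption; the proposal only specifies the status code and the ``errcode``.

```python
import json


def failed_alias_response(alias):
    """Build the (status, body) pair for a room alias that failed the check."""
    body = {
        "errcode": "M_FAILED_HUMAN_ID_CHECK",
        # Assumption: a free-text message alongside the errcode.
        "error": "Room alias failed the human ID check: " + alias,
    }
    return 400, json.dumps(body)
```

The same response would be used for both the *create* and the *join* case.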
66+
Examples::
67+
68+
@ebаy:domain.com (Cyrillic 'a', everything else English)
69+
@@xn--eby-7cd:domain.com (Punycode with additional '@')
70+
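The rewrite in the example above can be sketched with Python's built-in ``punycode`` codec. The function names are hypothetical, and the sketch assumes the ID has the ``@localpart:domain`` shape; it also shows why the rewrite can be stateless, since decoding recovers the original ID.

```python
def rewrite_failed_user_id(user_id):
    """Rewrite '@localpart:domain' as '@@xn--<punycode>:domain' for clients."""
    # The localpart cannot contain ':', so the first ':' separates the segments.
    localpart, sep, domain = user_id[1:].partition(":")
    encoded = localpart.encode("punycode").decode("ascii")
    return "@@xn--" + encoded + sep + domain


def restore_user_id(rewritten):
    """Recover the original user ID from its rewritten form."""
    localpart, sep, domain = rewritten[len("@@xn--"):].partition(":")
    return "@" + localpart.encode("ascii").decode("punycode") + sep + domain
```

For the ID in the example, ``rewrite_failed_user_id("@eb\u0430y:domain.com")`` yields ``@@xn--eby-7cd:domain.com``, and ``restore_user_id`` maps it back to the original form that is sent over federation.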
71+
Homeservers SHOULD NOT allow two user IDs that differ only by case. This
72+
SHOULD be applied based on the capitalisation rules in the CLDR dataset:
73+
http://cldr.unicode.org/
74+
75+
This check SHOULD be applied when the user ID is created, in order to prevent
76+
registration with the same name and different capitalisations, e.g.
77+
``@foo:bar`` vs ``@Foo:bar`` vs ``@FOO:bar``. Homeservers MAY canonicalise
78+
the user ID to be completely lower-case if desired.
79+
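A registration-time check for case collisions might look like the sketch below. It uses Python's ``str.casefold()``, which applies Unicode default case folding, as a stand-in for the CLDR capitalisation rules; that substitution and the function name ``is_case_collision`` are assumptions of this example.

```python
def is_case_collision(new_id, existing_ids):
    """Return True if new_id differs from an existing user ID only by case."""
    key = new_id.casefold()
    return any(key == existing.casefold() for existing in existing_ids)
```

With ``@foo:bar`` already registered, ``@Foo:bar`` and ``@FOO:bar`` would both be reported as collisions, while ``@baz:bar`` would not.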
80+
Rationale
81+
=========
82+
83+
Each ID is split into segments (localpart/domain) around the ``:``. For
84+
this reason, ``:`` is a reserved character and cannot be a localpart character.
85+
The 107 blacklisted characters are used to prevent non-printable characters and
86+
spaces from being used. The decision to ban characters from more than 1 language
87+
matches the behaviour of `Google Chrome for IDN handling`_. This is to protect
88+
against common homograph attacks such as ebаy.com (Cyrillic "a", rest is
89+
English). This would always result in a failed check. Even so, there are
90+
limitations. For example, сахар is entirely Cyrillic, whereas caxap is entirely
91+
Latin, so both pass the check despite looking nearly identical.
92+
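The limitation can be seen in a few lines. The helper ``script_names`` is hypothetical and again uses the first word of each character's Unicode name as a rough proxy for its script, not the CLDR data.

```python
import unicodedata


def script_names(text):
    """Return the set of script-like prefixes of the characters' Unicode names."""
    return {unicodedata.name(ch).split()[0] for ch in text}


cyrillic = "\u0441\u0430\u0445\u0430\u0440"  # the Cyrillic word from the text
latin = "caxap"                              # its Latin look-alike
```

Each string uses a single script, so neither trips a mixed-language check, yet the two strings are different sequences of code points.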
93+
.. _Google Chrome for IDN handling: https://www.chromium.org/developers/design-documents/idn-in-google-chrome
94+
95+
User ID localparts cannot start with ``@`` so that a namespace of localparts
96+
beginning with ``@`` can be created. This namespace is used for user IDs which
97+
fail the ID checks. A failed ID could look like ``@@xn--c1yn36f:domain.com``.
98+
99+
If a user ID fails the check, the user ID on the event is renamed. This doesn't
100+
require extra work for clients, and users will see an odd user ID rather than a
101+
spoofed name. Renaming is done in order to protect users of a given HS, so if a
102+
malicious HS doesn't rename their IDs, it doesn't affect any other HS.
103+
104+
Room aliases cannot be rewritten as punycode and sent to the HS the alias is
105+
referring to as the HS will not necessarily understand the rewritten alias.
106+
107+
Other rejected solutions for failed checks
108+
------------------------------------------
109+
- Additional key: Informational key on the event attached by HS to say "unsafe
69110
ID". Problem: clients can just ignore it, and since it will appear only very
70111
rarely, easy to forget when implementing clients.
71-
- Moderate security: Requires client handshake. Forces clients to implement
112+
- Require client handshake: Forces clients to implement
72113
a check, else they cannot communicate with the misleading ID. However, this
73114
is extra overhead in both client implementations and round-trips.
74-
- High security: Outright rejection of the ID at the point of creation /
115+
- Reject event: Outright rejection of the ID at the point of creation /
75116
receiving event. Point of creation rejection is preferable to avoid the ID
76117
entering the system in the first place. However, malicious HSes can just
77118
allow the ID. Hence, other homeservers must reject them if they see them in
78119
events. Client never sees the problem ID, provided the HS is correctly
79-
implemented.
80-
- High security decided; client doesn't need to worry about it, no additional
81-
protocol complexity aside from rejection of an event.
120+
implemented. However, it is difficult to ensure that ALL HSes will come to the
121+
same conclusion (given the CLDR dataset does come out with new versions).
122+
123+
Outstanding Problems
124+
====================
125+
126+
Capitalisation
127+
--------------
128+
129+
The capitalisation rules outlined above are nice but do not fully resolve issues
130+
where ``@alice:example.com`` tries to speak with ``@bob:domain.com`` using
131+
``@Bob:domain.com``. It is up to ``domain.com`` to map ``Bob`` to ``bob`` in
132+
a sensible way.
