Skip to content
This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Commit 89a71e7

Browse files
authored
Fix a rare bug where initial /syncs would fail (#15383)
This change fixes a rare bug where initial /syncs would fail with a `KeyError` under the following circumstances: 1. A user fast joins a remote room. 2. The user is kicked from the room before the room's full state has been synced. 3. A second local user fast joins the room. 4. Events are backfilled into the room with a higher topological ordering than the original user's leave. They are assigned a negative stream ordering. It's not clear how backfill happened here, since it is expected to be equivalent to syncing the full state. 5. The second local user leaves the room before the room's full state has been synced. The homeserver does not complete the sync. 6. The original user performs an initial /sync with lazy_load_members enabled. * Because they were kicked from the room, the room is included in the /sync response even though the include_leave option is not specified. * To populate the room's timeline, `_load_filtered_recents` / `get_recent_events_for_room` fetches events with a lower stream ordering than the leave event and picks the ones with the highest topological orderings (which are most recent). This captures the backfilled events after the leave, since they have a negative stream ordering. These events are filtered out of the timeline, since the user was not in the room at the time and cannot view them. The sync code ends up with an empty timeline for the room that notably does not include the user's leave event. This seems buggy, but at least we don't disclose events the user isn't allowed to see. * Normally, `compute_state_delta` would fetch the state at the start and end of the room's timeline to generate the sync response. Since the timeline is empty, it fetches the state at `min(now, last event in the room)`, which corresponds with the second user's leave. The state during the entirety of the second user's membership does not include the membership for the first user because of partial state. This part is also questionable, since we are fetching state from outside the bounds of the user's membership. * `compute_state_delta` then tries and fails to find the user's membership in the auth events of timeline events. Because there is no timeline event whose auth events are expected to contain the user's membership, a `KeyError` is raised. Also contains a drive-by fix for a separate unlikely race condition. Signed-off-by: Sean Quah <[email protected]>
1 parent c0772b4 commit 89a71e7

File tree

2 files changed

+20
-5
lines changed

2 files changed

+20
-5
lines changed

changelog.d/15383.bugfix

+1
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Fix a rare bug introduced in Synapse 1.66.0 where initial syncs would fail when the user had been kicked from a faster joined room that had not finished syncing.

synapse/handlers/sync.py

+19-5
Original file line numberDiff line numberDiff line change
@@ -943,6 +943,8 @@ async def compute_state_delta(
943943

944944
timeline_state = {}
945945

946+
# Membership events to fetch that can be found in the room state, or in
947+
# the case of partial state rooms, the auth events of timeline events.
946948
members_to_fetch = set()
947949
first_event_by_sender_map = {}
948950
for event in batch.events:
@@ -964,9 +966,19 @@ async def compute_state_delta(
964966
# (if we are) to fix https://github.com/vector-im/riot-web/issues/7209
965967
# We only need apply this on full state syncs given we disabled
966968
# LL for incr syncs in #3840.
967-
members_to_fetch.add(sync_config.user.to_string())
968-
969-
state_filter = StateFilter.from_lazy_load_member_list(members_to_fetch)
969+
# We don't insert ourselves into `members_to_fetch`, because in some
970+
# rare cases (an empty event batch with a now_token after the user's
971+
# leave in a partial state room which another local user has
972+
# joined), the room state will be missing our membership and there
973+
# is no guarantee that our membership will be in the auth events of
974+
# timeline events when the room is partial stated.
975+
state_filter = StateFilter.from_lazy_load_member_list(
976+
members_to_fetch.union((sync_config.user.to_string(),))
977+
)
978+
else:
979+
state_filter = StateFilter.from_lazy_load_member_list(
980+
members_to_fetch
981+
)
970982

971983
# We are happy to use partial state to compute the `/sync` response.
972984
# Since partial state may not include the lazy-loaded memberships we
@@ -988,7 +1000,9 @@ async def compute_state_delta(
9881000
# sync's timeline and the start of the current sync's timeline.
9891001
# See the docstring above for details.
9901002
state_ids: StateMap[str]
991-
1003+
# We need to know whether the state we fetch may be partial, so check
1004+
# whether the room is partial stated *before* fetching it.
1005+
is_partial_state_room = await self.store.is_partial_state_room(room_id)
9921006
if full_state:
9931007
if batch:
9941008
state_at_timeline_end = (
@@ -1119,7 +1133,7 @@ async def compute_state_delta(
11191133
# If we only have partial state for the room, `state_ids` may be missing the
11201134
# memberships we wanted. We attempt to find some by digging through the auth
11211135
# events of timeline events.
1122-
if lazy_load_members and await self.store.is_partial_state_room(room_id):
1136+
if lazy_load_members and is_partial_state_room:
11231137
assert members_to_fetch is not None
11241138
assert first_event_by_sender_map is not None
11251139

0 commit comments

Comments
 (0)