Sender Data: Background task to retry fetching sender data for megolm sessions #3546

Closed
@andybalaam

Latest update: we think we don't need a background task at all. Retrying is only worthwhile if a /keys/query request returns more information about devices or users, so we will simply process any retries that are due whenever a /keys/query request ends.

Launch a background task when we start that retries fetching sender data for megolm sessions.

Update: originally we expected this background task not to handle "slow lane" items (new sessions for which we don't have data, but for which we suspect it may appear in the next keys/query response). We now expect that it will, so we need to make sure that we kick off a retry cycle whenever receive_keys_query_response completes. (This should not clash with normal retries, so we need a lock of some kind.)
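A minimal sketch of the "lock of some kind" idea, so that a retry cycle kicked off by receive_keys_query_response cannot overlap with one that is already running. `RetryCycle` and `try_run` are hypothetical names invented for illustration, and the sketch uses the synchronous `std::sync::Mutex` to stay dependency-free; the real code would presumably use an async lock.

```rust
use std::sync::Mutex;

/// Hypothetical guard ensuring only one retry cycle runs at a time.
pub struct RetryCycle {
    lock: Mutex<()>,
}

impl RetryCycle {
    pub fn new() -> Self {
        Self { lock: Mutex::new(()) }
    }

    /// Runs `cycle` only if no other retry cycle currently holds the lock.
    /// Returns `None` (bails out) if the lock is taken or poisoned.
    pub fn try_run<T>(&self, cycle: impl FnOnce() -> T) -> Option<T> {
        // `try_lock` never blocks: it fails immediately if the lock is held.
        let _guard = self.lock.try_lock().ok()?;
        Some(cycle())
    }
}
```

With this shape, both the normal retry path and the keys/query completion path would call `try_run`, and whichever arrives second simply skips its cycle.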

Part of #3544 which is part of Invisible Crypto.

The algorithm below mentions jumping to certain steps. These refer to steps of the algorithm in #3543.

Algorithm

This handles all of the following scenarios:

  • Legacy sessions - these effectively have missing device info.
    • These may already exist before this code is deployed, or be created by importing keys that are missing device info, or by restoring backups that are missing device info.
  • Non-legacy sessions with missing device info - we are waiting for it to appear.
  • Sessions with device info that is signed by a cross-signing key we don't recognise - we are waiting to see whether it will appear.
  • Sessions that we received but were interrupted before we finished processing them.
    • (Note: we don't store the session as soon as we receive it, but if we have not finished with it by the time the sync is finished, we will have stored it. If we were interrupted before the sync finished, we will see this to-device message again the next time we sync.)

In each case we will query the store for these sessions and retry them, updating their retry count if we need to continue waiting.

The background job begins on startup and repeatedly does the following:

  • Query the sessions in the store that are due for a retry because either:
    • they are in state UnknownDevice and their next_retry_time_ms property is before now
    • or they are in state DeviceInfo and their next_retry_time_ms property is before now
  • Order by next_retry_time_ms ascending and take the first $BATCH_SIZE (~200?).
  • For each
    • if state==UnknownDevice, try to take the lock (bail out if we can't), then jump to step C.3. (Within the algorithm in #3543, we don't technically need to spawn additional tasks in step F.2, but it won't do any harm to do so: they will probably complete immediately, because we are probably not waiting for a keys/query given that we just waited for one in C.3.)
    • else, state==DeviceInfo: try to take the lock (bail out if we can't), then jump to step F.3.
  • Await all of these tasks - they will complete quickly.
  • Repeat. If there were no pending jobs, go to sleep for a while.
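The selection part of the loop above (filter by state and due time, order by next_retry_time_ms, take a batch, pick the step to resume at) could be sketched roughly as follows. The types `RetryState`, `StoredSession` and `ResumeAt`, the `Done` variant, and the function `due_sessions` are all hypothetical and exist only for this illustration; a real implementation would query the store rather than filter an in-memory slice.

```rust
/// Hypothetical retry states, named after the states mentioned in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryState {
    UnknownDevice,
    DeviceInfo,
    /// Assumed catch-all for sessions that need no retry.
    Done,
}

/// Hypothetical minimal view of a stored megolm session.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredSession {
    pub session_id: String,
    pub state: RetryState,
    pub next_retry_time_ms: u64,
}

/// The step of the algorithm in #3543 at which a due session resumes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResumeAt {
    StepC3,
    StepF3,
}

fn resume_step(state: RetryState) -> Option<ResumeAt> {
    match state {
        RetryState::UnknownDevice => Some(ResumeAt::StepC3),
        RetryState::DeviceInfo => Some(ResumeAt::StepF3),
        RetryState::Done => None,
    }
}

/// Returns up to `batch_size` due sessions, earliest retry time first,
/// paired with the step each one should jump to.
pub fn due_sessions(
    sessions: &[StoredSession],
    now_ms: u64,
    batch_size: usize,
) -> Vec<(String, ResumeAt)> {
    let mut due: Vec<(&StoredSession, ResumeAt)> = sessions
        .iter()
        // Only sessions whose retry time is before now are due.
        .filter(|s| s.next_retry_time_ms < now_ms)
        // Sessions in a state that needs no retry are dropped here.
        .filter_map(|s| resume_step(s.state).map(|step| (s, step)))
        .collect();
    due.sort_by_key(|(s, _)| s.next_retry_time_ms);
    due.into_iter()
        .take(batch_size)
        .map(|(s, step)| (s.session_id.clone(), step))
        .collect()
}
```

An empty result from `due_sessions` corresponds to the "no pending jobs" case, where the loop would go to sleep.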

Each session gets its own independent background task, which is OK because tasks are light and they all just wait for the same keys/query to finish. After that, they may save to the store, which might cause some contention but is probably fine.
