Do not add an internal timeout for "extra" fetches used to compute directory size in the unsharding algorithm #100
Description
Continuation and expansion of #94 (comment). Discussing in its own issue as I think this might be too long for a review thread.
The final advice there was to remove the timeout and block on the fetch. I'm not challenging the given advice but giving the full context to make sure we're aligned here since this can have a potentially negative UX effect.
Brief background on the unsharding algorithm:
- The user modifies the HAMT directory (adds/removes an entry).
- The size has potentially changed. We need to evaluate if it has fallen below the threshold to switch to a basic directory. We do this in the same add/remove operation; we do not defer it to a later stage like the extraction of the CID in
GetNode
. - We don't want to fetch all the entries so we don't compute the total/exact size and instead just try to fetch as little as possible to prove we are still above the threshold.
- If some shards are unavailable in the network (high delay, missing/removed, etc.) we might not be able to retrieve enough shards to make sure we are above the threshold but we can't also claim we are below it.
- Since the priority is to have a deterministic relation between directory size and CID in case of doubt we fail the operation, even if we were able to perform the original add/remove.
- We can never really be sure when a shard is missing or actually has a long delay so we use a timeout to decide when to fail (for the reason explained above).
Here I'd say: the user has asked us to do an operation, we should keep trying until they cancel.
The user asked to do an add/remove operation but it may not be aware that it entails a potentially big fetch of shards well beyond the ones that were needed for the original operation.
-
Following this advice we would then remove the timeout and depend entirely on the client when to stop the operation the same way it had to, even before the sharding feature, as an add/remove operation can always be blocking if we don't find the necessary shards to operate on.
-
The advantage is the reduction of complexity as explained by Steven in the original comment.
-
The potential disadvantage is depending on a user-defined timeout that was designed for specific add/remove operations and not for
massivethese "extra" fetches that might exceed the ones needed originally by a noticeable degree.
If we're comfortable going this way a simple 'go for it' is enough; no need to keep expanding on the motivation to remove the timeout.