Description
The `internal:cluster/coordination/join/validate` and `internal:cluster/coordination/publish_state` messages both carry a (compressed) representation of the whole cluster state, which can in theory be arbitrarily large. In practice there's a 2GiB limit which, if hit, will break the cluster, but even messages under that limit can still be much larger than we should be allocating and sending over the transport layer in one go.
I'm proposing we instead spill the cluster state representation to a temporary file on disk and send it out in chunks, bounding the size of the individual messages on the wire. The receiving node can also accumulate the data in a temporary file on disk and then deserialize it from there.
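As a rough illustration of the sending side, here is a minimal sketch (class and method names are hypothetical, and the chunk size is tiny for demonstration; a real implementation would use a much larger bound and stream rather than buffer chunks in a list):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: spill the serialized cluster state to a temp file,
// then read it back in bounded chunks instead of one huge allocation.
public class ChunkedStateSpill {
    static final int CHUNK_SIZE = 4; // tiny for demo; real code would use e.g. 1 MiB

    // Write the full (compressed) representation to a temp file and return its path.
    static Path spill(byte[] serializedState) throws IOException {
        Path tmp = Files.createTempFile("cluster-state-", ".bin");
        Files.write(tmp, serializedState);
        return tmp;
    }

    // Read the file back as a sequence of chunks, each at most CHUNK_SIZE bytes,
    // bounding the size of any individual message put on the wire.
    static List<byte[]> readChunks(Path path) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        try (InputStream in = Files.newInputStream(path)) {
            byte[] buf = new byte[CHUNK_SIZE];
            int n;
            while ((n = in.read(buf)) != -1) {
                byte[] chunk = new byte[n];
                System.arraycopy(buf, 0, chunk, 0, n);
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        byte[] state = "0123456789".getBytes();
        Path tmp = spill(state);
        // 10 bytes with a 4-byte chunk size -> chunks of 4, 4, and 2 bytes
        List<byte[]> chunks = readChunks(tmp);
        System.out.println(chunks.size()); // prints 3
        Files.delete(tmp);
    }
}
```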
```mermaid
sequenceDiagram
    Master->>Peer: internal:cluster/coordination/publish_state request (first chunk)
    note right of Peer: store chunk to temp file
    loop with some amount of parallelism
        Peer-->>Master: internal:cluster/coordination/publish_state/chunk request
        note left of Master: read chunk from temp file
        Master->>Peer: internal:cluster/coordination/publish_state/chunk response (next chunk)
    end
    note right of Peer: deserialize entire message from temp file
    Peer->>Master: internal:cluster/coordination/publish_state response
```
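The receiving side of the sequence above could be sketched as follows (again hypothetical names, and ignoring the actual transport plumbing): each arriving chunk is appended to a temp file, and deserialization only happens once the final chunk has landed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of the peer: accumulate incoming chunks in a temp file
// on disk, then deserialize the whole message from that file at the end.
public class ChunkedStateReceive {
    private final Path tmp;

    ChunkedStateReceive() throws IOException {
        tmp = Files.createTempFile("incoming-state-", ".bin");
    }

    // Called once per chunk response; appends the bytes rather than buffering
    // the whole message on the heap.
    void onChunk(byte[] chunk) throws IOException {
        Files.write(tmp, chunk, StandardOpenOption.APPEND);
    }

    // Called after the last chunk; reads the complete serialized state back.
    // A real implementation would deserialize a cluster state here and then
    // delete the temp file.
    byte[] complete() throws IOException {
        byte[] all = Files.readAllBytes(tmp);
        Files.delete(tmp);
        return all;
    }

    public static void main(String[] args) throws IOException {
        ChunkedStateReceive receiver = new ChunkedStateReceive();
        receiver.onChunk("clus".getBytes());
        receiver.onChunk("ter-".getBytes());
        receiver.onChunk("state".getBytes());
        System.out.println(new String(receiver.complete())); // prints cluster-state
    }
}
```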
We would also likely need to do #80400 to avoid hitting an unnecessary timeout during this process.