Introduce ChunkedBlobOutputStream #74620
Conversation
Extracted the chunked output stream logic from elastic#74313 and added tests for it to make it easier to review.
Pinging @elastic/es-distributed (Team:Distributed)
LGTM - I don't find the usage of this class very obvious, and I wonder if we should instead keep the parts list and buffer private (passing them as parameters to the flush-buffer method), but given the deadlines here I'm fine with moving forward as it is.
protected ChunkedBlobOutputStream(BigArrays bigArrays, long maxBytesToBuffer) {
    this.bigArrays = bigArrays;
    this.maxBytesToBuffer = maxBytesToBuffer;
we should check maxBytesToBuffer > 0
++ added a check
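The check being discussed might look something like the following. This is a hedged sketch in plain Java, not the actual Elasticsearch code; BigArrays is omitted since it plays no part in the validation, and the class name is hypothetical.

```java
import java.io.OutputStream;

// Sketch only: a stand-in for the real ChunkedBlobOutputStream constructor,
// showing the kind of argument validation suggested in the review.
abstract class ChunkedStreamSketch extends OutputStream {
    protected final long maxBytesToBuffer;

    protected ChunkedStreamSketch(long maxBytesToBuffer) {
        // Reject non-positive buffer sizes up front, as requested above.
        if (maxBytesToBuffer <= 0L) {
            throw new IllegalArgumentException(
                "maximum buffer size must be positive but was [" + maxBytesToBuffer + "]");
        }
        this.maxBytesToBuffer = maxBytesToBuffer;
    }
}
```

Failing fast in the constructor means a misconfigured buffer size surfaces at stream creation rather than as a confusing flush loop later.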
public final void close() throws IOException {
    if (closed) {
        assert false : "this output stream should only be closed once";
        throw new AlreadyClosedException("already closed");
I wonder if we should just ignore double closing? I think that's what is usually done in many streams (but we should keep the assertion)
It's kind of weird to have an assertion here but quietly skip over double closing otherwise, though? If we had a bug here that only showed up in some corner case, we'd never see it in the logs otherwise.
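The behaviour being debated, asserting on double close so tests with -ea catch it, while still failing loudly at runtime, could be sketched like this. It is plain Java, with IllegalStateException standing in for Lucene's AlreadyClosedException, and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch of a close() that both asserts (visible in test runs with -ea)
// and throws (visible in production logs) on a double close.
class OnceClosedStream extends OutputStream {
    private boolean closed;

    @Override
    public void write(int b) throws IOException {
        if (closed) {
            throw new IOException("stream already closed");
        }
        // actual write logic elided in this sketch
    }

    @Override
    public void close() {
        if (closed) {
            assert false : "this output stream should only be closed once";
            throw new IllegalStateException("already closed");
        }
        closed = true;
    }
}
```

With assertions enabled the second close() trips the assertion immediately; with assertions disabled the exception still makes the bug visible instead of silently swallowing it.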
/**
 * Mark all blob bytes as properly received by {@link #write}, indicating that {@link #close} may finalize the blob.
 */
public final void markSuccess() {
Really a suggestion:
- public final void markSuccess() {
+ public final void done() {
@Override
protected void onCompletion() throws IOException {
    if (buffer.size() > 0) {
This should be handled by ChunkedBlobOutputStream itself I think
Unfortunately, I couldn't find a neat way of encapsulating this (at least not without adding even more complexity), because of the way onCompletion will do a normal write if nothing has been flushed yet, in which case it needs access to the buffer anyway.
final BytesReference bytes = buffer.bytes();
bytes.writeTo(out);
writtenBytesCounter.addAndGet(bytes.length());
finishPart(partIdSupplier.incrementAndGet());
I wonder if we could make flushBuffer() return the part identifier and have the logic that exists in finishPart() be private in ChunkedBlobOutputStream too.
There's a bit of complexity to doing this, I think, with the error handling and special casing being so different across implementations, but I'll try to see if I can do something nicer in the main PR (I don't want to break the API here now, and I'm having a bit of a hard time reasoning through all the edge cases around API changes from the main PR). But in general I'm ++ on the idea if possible.
LGTM, I agree with Tanguy, the usage of this class is a bit difficult to follow, but as we discussed it's difficult to handle logic failures not related to the upload itself.
Thanks Tanguy and Francisco!
This PR adds a new API for streaming serialization writes to a repository, enabling repository metadata of arbitrary size with bounded memory use during writing. The existing write APIs require the eventual blob size to be known up front, which forced us to materialize the serialized blob in memory before writing. That costs a lot of memory in the case of e.g. a very large RepositoryData, and limits us to a 2G maximum blob size. With this PR, the requirement to fully materialize the serialized metadata goes away, and the memory overhead becomes bounded entirely by the outbound buffer size of the repository implementation. As we move to larger repositories this makes master node stability much more predictable, since writing out RepositoryData no longer takes as much memory (the same applies to shard-level metadata). It also enables aggregating multiple metadata blobs into a single larger blob without massive overhead, and removes the 2G size limit on RepositoryData. Backport of #74313 and #74620.
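The bounded-memory idea described above can be illustrated with a small self-contained sketch. This is not the real ChunkedBlobOutputStream API; it merely shows bytes being accumulated in a fixed-size buffer and flushed as numbered parts, so memory use stays bounded by the buffer size however large the blob is. All names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: accumulate writes in a small buffer and emit a
// "part" whenever it fills, mimicking a chunked multipart upload.
class PartBufferingStream extends OutputStream {
    private final int maxBytesToBuffer;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    final List<byte[]> parts = new ArrayList<>(); // stands in for uploaded chunks

    PartBufferingStream(int maxBytesToBuffer) {
        this.maxBytesToBuffer = maxBytesToBuffer;
    }

    @Override
    public void write(int b) {
        buffer.write(b);
        if (buffer.size() >= maxBytesToBuffer) {
            flushPart();
        }
    }

    private void flushPart() {
        parts.add(buffer.toByteArray());
        buffer.reset(); // memory used never exceeds maxBytesToBuffer
    }

    @Override
    public void close() {
        if (buffer.size() > 0) {
            flushPart(); // final, possibly short, part
        }
    }
}
```

For instance, writing ten bytes through a four-byte buffer produces three parts of sizes 4, 4 and 2, regardless of how large the overall payload would be.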
See elastic#53119 for more context about why those tests are muted on JDK8. They have started failing more often recently, now that elastic#74313 and elastic#74620 have been merged, as reported in elastic#74739.