Keeps index commits up to the current global checkpoint #27367
Conversation
We need to keep index commits and translog operations up to the current global checkpoint for operation-based recovery. This can be done by introducing a new deletion policy. The new policy keeps the latest (e.g. youngest) commit whose local checkpoint is not greater than the current global checkpoint, and also keeps all subsequent commits. Once those commits are kept, a CombinedDeletionPolicy will retain translog operations at least up to the current global checkpoint.
looks great I left some comments
    engineConfig.getIndexSettings().getTranslogRetentionSize().getBytes(),
    engineConfig.getIndexSettings().getTranslogRetentionAge().getMillis()
);
this.deletionPolicy = new CombinedDeletionPolicy(
That you had to do this sucks. I think we should make createWriter static and pass all the things it needs to it, rather than expecting members to be initialized. This should be done in a followup.
Yeah. I will address this in a followup.
@Override
public void onCommit(List<? extends IndexCommit> commits) throws IOException {
    final long globalCheckpoint = globalCheckpointSupplier.getAsLong();
    for (int i = commits.size() - 1; i >= 0; i--) {
Can we keep this simple and just try to find the offset we need to delete from, then exit the loop? Then do the delete in a second loop outside of it. It would make it much simpler to read. Also, I think we can safely iterate from 0 to N for simplicity. This is not perf critical in that place.
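For illustration, the two-loop shape this comment suggests could look roughly like the sketch below. This is a hedged, self-contained sketch, not the PR's actual code: `FakeCommit` is an invented stand-in for Lucene's `IndexCommit`, commits carry only a checkpoint value, and the keep criterion (newest commit whose checkpoint is at most the global checkpoint) is assumed from this thread.

```java
import java.util.List;

public class TwoLoopSketch {
    // Invented stand-in for Lucene's IndexCommit, tracking whether delete() was called.
    static class FakeCommit {
        final long checkpoint;
        boolean deleted;
        FakeCommit(long checkpoint) { this.checkpoint = checkpoint; }
        void delete() { deleted = true; }
    }

    static void onCommit(List<FakeCommit> commits, long globalCheckpoint) {
        // Loop 1 (0..N, oldest to newest; not perf critical): find the newest
        // commit whose checkpoint is not greater than the global checkpoint.
        // If no commit qualifies, keptIndex stays 0 and nothing is deleted.
        int keptIndex = 0;
        for (int i = 0; i < commits.size(); i++) {
            if (commits.get(i).checkpoint <= globalCheckpoint) {
                keptIndex = i;
            }
        }
        // Loop 2: delete every commit older than the kept one.
        for (int i = 0; i < keptIndex; i++) {
            commits.get(i).delete();
        }
    }
}
```

Separating "find the offset" from "delete" keeps each loop trivially readable, which is the point of the suggestion.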
I pushed 9ee44d2
@s1monw, I have addressed your comments. Could you please take another look? Thank you.
I left a few comments.
import java.util.function.LongSupplier;

/**
 * An {@link IndexDeletionPolicy} keeps the latest (eg. youngest) commit whose local checkpoint is not
I think the wording in this Javadoc is confusing. We want to keep the oldest commits (not the latest, not the youngest).
    }
}

// commits are sorted by age (the 0th one is the oldest commit).
This comment is better placed inside the method.
    return -1;
}

private static long localCheckpoint(IndexCommit commit) throws IOException {
I think this should be called localCheckpointFromCommit, yet I question if a method is really needed?
public class KeepUntilGlobalCheckpointDeletionPolicyTests extends EngineTestCase {
    final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.UNASSIGNED_SEQ_NO);
    final AtomicInteger docId = new AtomicInteger();
Do these really have to be test instance fields? Can they be passed around? They are easy enough to construct.
@jasontedor I have addressed your feedback. Could you please have another quick look? Thank you.
import static org.hamcrest.Matchers.hasSize;

public class KeepUntilGlobalCheckpointDeletionPolicyTests extends EngineTestCase {
    final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.UNASSIGNED_SEQ_NO);
To be clear, I was referring to both globalCheckpoint and docId.
Sorry @jasontedor, I overlooked your comment. I pushed a8b3dd2.
LGTM
LGTM.
I think the implementation is incorrect. We should compare the current global checkpoint to the max sequence number in an index commit.
You are correct @dnhatn.
Thanks @jasontedor, I will update the PR.
@jasontedor, I have updated the implementation. Could you please take a look? Thank you!
Today, we keep only the last index commit and use only it to calculate the minimum required translog generation. This may no longer be correct as we introduced a new deletion policy which keeps multiple index commits. This change adjusts the `CombinedDeletionPolicy` so that it can work correctly with a new index deletion policy. Relates to elastic#10708, elastic#27367
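The adjustment described in this commit message amounts to taking the minimum translog generation over all retained commits rather than reading it from the last commit alone. A hedged sketch follows; the user-data key name "translog_generation" is an assumption (intended to match Translog's commit metadata key), and the method is invented for illustration, not taken from the PR.

```java
import java.util.List;
import java.util.Map;

public class MinTranslogGenSketch {
    // Assumed key under which each commit records its translog generation.
    static final String TRANSLOG_GENERATION_KEY = "translog_generation";

    // With several retained index commits, the translog must still cover the
    // OLDEST retained commit, so the minimum required generation is the minimum
    // over all retained commits, not the generation of the last commit.
    static long minRequiredTranslogGeneration(List<Map<String, String>> retainedCommitsUserData) {
        long min = Long.MAX_VALUE;
        for (Map<String, String> userData : retainedCommitsUserData) {
            min = Math.min(min, Long.parseLong(userData.get(TRANSLOG_GENERATION_KEY)));
        }
        return min;
    }
}
```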
Thx @dnhatn . I left a very small request to limit the scenarios in which we accept not finding a proper commit point in the deletion policy. Looks great.
We also need to discuss when and whether we clean up commits as soon as we can. Currently, I expect us to keep two commits most of the time. This is because when we commit, the global checkpoint will likely be lagging. It will quickly catch up, but we have no mechanism to clean up the unneeded old commit. To be clear - I don't think this is necessarily bad, and it should by no means stop this PR. I'm just double checking whether that has gotten some thought.
// Commits are sorted by age (the 0th one is the oldest commit).
for (int i = commits.size() - 1; i >= 0; i--) {
    final IndexCommit commit = commits.get(i);
    long maxSeqNoFromCommit = Long.parseLong(commit.getUserData().get(SequenceNumbers.MAX_SEQ_NO));
watch out for legacy indices - this needs to go to 6.x and thus needs to read 5.x commits. I'm fine with doing the backport as a separate PR.
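The legacy concern here is that a 5.x commit has no MAX_SEQ_NO entry in its user data, so `Long.parseLong` on a missing key would throw. A minimal sketch of a defensive read follows; the key name "max_seq_no" (intended to match SequenceNumbers.MAX_SEQ_NO) and the keep-if-absent behavior are assumptions drawn from this thread, not the PR's actual implementation.

```java
import java.util.Map;

public class LegacyCommitCheck {
    // Assumed to match SequenceNumbers.MAX_SEQ_NO; a 5.x commit has no such entry.
    static final String MAX_SEQ_NO = "max_seq_no";

    // Returns true if the commit's operations are all at or below the global
    // checkpoint. A pre-6.0 commit carries no sequence numbers at all, so it
    // cannot contain operations above the global checkpoint.
    static boolean atOrBelowGlobalCheckpoint(Map<String, String> commitUserData, long globalCheckpoint) {
        final String raw = commitUserData.get(MAX_SEQ_NO);
        if (raw == null) {
            return true; // legacy 5.x commit: no seq# info
        }
        return Long.parseLong(raw) <= globalCheckpoint;
    }
}
```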
I pushed b587e20
        return i;
    }
}
return -1;
I think we should hunt down when this is possible - I can't think of a case where an existing index (i.e. one that was fully initialized and its translog committed) has this trait. A peer recovery index (where we create the translog) may have an unknown GCP (and a single commit point, which we can assert on). Instead of blindly accepting the fact that we have found no commit point, can we maybe rely on the fact that the GCP is UNASSIGNED_SEQ_NO and otherwise throw an exception? Re empty indices - I'm not sure about the initialization order - we should check.
This was not possible before but can happen with a new limit.
This may happen when we upgrade from previous 6.x versions. In previous 6.x, we keep only the last commit - the max_seq_no of this commit is likely greater than the global checkpoint if indexing operations are in progress. Therefore, after upgrading to this version, we may not find a proper commit (e.g. one whose max_seq_no is less than or equal to the current global checkpoint) for an old index until we reserve proper commits.
I see. Can we assert that's the case? i.e. give the deletion policy the index creation version and assert that the index was created before 6.x and that the commit has MAX_SEQ_NO in it? also please add a comment.
This can also happen in peer recovery. If a file-based recovery happens, a replica will receive the latest commit from a primary. However, that commit may not be a safe commit if writes are in progress.
I've documented these two cases. I think we need to discuss the assertion.
Left some very minor comments. One more iteration and I think we're good.
final Map<String, String> commitUserData = commits.get(i).getUserData();
// Index from 5.x does not contain MAX_SEQ_NO, we should keep either the more recent commit with MAX_SEQ_NO,
// or the last commit.
if (commitUserData.containsKey(SequenceNumbers.MAX_SEQ_NO) == false) {
I wonder if this is correct? Shouldn't we keep the commit with no seq# info? It means the commit was made before sequence numbers were introduced and implicitly doesn't have ops above the global checkpoint? This also means we can just return i?
Both Math.min(i + 1, commits.size() - 1) and i are correct. Returning i is much simpler, but Math.min(i + 1, commits.size() - 1) will clean up unneeded commits sooner. It can be explained as follows. We have one commit (c1) without max_seq_no (from 5.x), then we have a new commit (c2) with max_seq_no. We don't need to keep the former commit (c1) if we keep c2. Returning i will keep both commits, but Math.min(i + 1, commits.size() - 1) will keep only c2.
However, I believe that I overthought it. I pushed 3d5d323 to remove this optimization.
@bleskes Could you please take another look? I've addressed your comments.
Today, we keep only the last index commit and use only it to calculate the minimum required translog generation. This may no longer be correct as we introduced a new deletion policy which keeps multiple index commits. This change adjusts the CombinedDeletionPolicy so that it can work correctly with a new index deletion policy. Relates to #10708, #27367
LGTM. Thanks for the extra iterations.
verify(commit2, times(0)).delete();
verify(commit3, times(0)).delete();

deletionPolicy.onCommit(Arrays.asList(commit1, commit2, commit3));
nit - add a check that when we have a good commit with sequence numbers, we keep that one and drop the old ones?
I've updated the test.
One more thought. Now that #27456 is merged, can we verify at the engine level that these components work correctly together? i.e., that we keep commits until the global checkpoint is incremented, that translog files are maintained, etc.
@bleskes, Unfortunately, these components don't work together. We've used a single-commit assumption in some places, for example in elasticsearch/core/src/main/java/org/elasticsearch/index/translog/Translog.java (lines 371 to 383 in 89ba899).
I am addressing these. Should I include it in this PR or make a separate PR before this? Thank you.
Aye. That's a good catch. +1 to fix this in a different PR first. You already mentioned one problem in #27456 concerning snapshots, which we have solved, but there's also a problem with the flushing frequency - see IndexShard#shouldFlush() (I think we will now flush on every write until the global checkpoint advances). Did you see anything else?
You're correct. I will fix it using the translog generation of the last commit to calculate …
The recovering commit (e.g. the first commit) of a shrunk index does not have … (see elasticsearch/core/src/main/java/org/elasticsearch/index/shard/StoreRecovery.java, lines 170 to 177 in 303e0c0).
Yeah, that's a bit funky. I do think it's OK as we always keep all commits when we don't have a global checkpoint. Once we have it we'll clean it up.
Today we cannot distinguish between index commits that are kept by the primary policy and those kept for snapshotting with a SnapshotDeletionPolicy. Since we enclose a SnapshotDeletionPolicy in a CombinedDeletionPolicy, we also cannot distinguish between those with a CombinedDeletionPolicy. This can be a problem if we update the TranslogDeletionPolicy to keep the minimum translog generation of undeleted index commits, as we may keep the translog of a snapshotting commit even though it is "deleted" by the primary policy. To solve this, we enclose a CombinedDeletionPolicy in a SnapshotDeletionPolicy and track whether an index commit is deleted by the primary policy, then use that value to maintain the translog rather than the actual deletion of an index commit. Relates elastic#27456 elastic#27367
This is superseded by #27606
We need to keep index commits and translog operations up to the current global checkpoint for operation-based recovery. This can be done by introducing a new deletion policy. The new policy keeps the oldest commit whose local checkpoint is not greater than the current global checkpoint, and also keeps all subsequent commits. Once those commits are kept, a CombinedDeletionPolicy will retain translog operations at least up to the current global checkpoint.
Relates to #10708
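Putting the thread's conclusions together — in particular the review's correction that the comparison should use the commit's max sequence number — the commit selection could be sketched roughly as below. This is a hedged illustration only: commits are modeled by their max_seq_no user-data values (index 0 is the oldest), and the class and method names are invented, not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

public class SafeCommitSketch {
    // Keep the newest commit whose max_seq_no is not greater than the global
    // checkpoint (the commit an operation-based recovery can start from),
    // plus every subsequent commit. If no commit qualifies, conservatively
    // keep everything.
    static List<Long> keptCommits(List<Long> maxSeqNos, long globalCheckpoint) {
        int keepFrom = 0;
        for (int i = 0; i < maxSeqNos.size(); i++) {
            if (maxSeqNos.get(i) <= globalCheckpoint) {
                keepFrom = i;
            }
        }
        return new ArrayList<>(maxSeqNos.subList(keepFrom, maxSeqNos.size()));
    }
}
```

For instance, with commits whose max_seq_no values are 5, 10, and 20 and a global checkpoint of 12, the commit at 10 and everything after it would be retained; the commit at 5 becomes deletable.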