Add Caching for RepositoryData in BlobStoreRepository #52341
Conversation
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
@@ -36,7 +36,7 @@
  */
 public class MockSecureSettings implements SecureSettings {

-    private Map<String, SecureString> secureStrings = new HashMap<>();
+    private Map<String, String> secureStrings = new HashMap<>();
Just fixing a bug here that prevented node restarts in tests for nodes that used secure settings: if we always return the same `SecureString` from `getString`, then once a node closes that `SecureString` it can never get the setting again.
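A minimal sketch of the shape of this fix, with a stand-in `SecureString` (the real class lives in `org.elasticsearch.common.settings`): plain `String`s are stored, and every `getString` call wraps the value in a fresh `SecureString`, so a caller closing its copy cannot invalidate the stored value for later reads.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.elasticsearch.common.settings.SecureString.
final class SecureString implements AutoCloseable {
    private final char[] chars;

    SecureString(char[] chars) {
        this.chars = chars;
    }

    @Override
    public void close() {
        Arrays.fill(chars, '\0'); // wipes only this copy's characters
    }
}

final class MockSecureSettingsSketch {
    // Plain Strings instead of SecureStrings, so nothing a caller closes is shared state.
    private final Map<String, String> secureStrings = new HashMap<>();

    void setString(String setting, String value) {
        secureStrings.put(setting, value);
    }

    SecureString getString(String setting) {
        final String value = secureStrings.get(setting);
        // Fresh instance per call: a restarting node can close its copy without
        // destroying the setting for the restarted node.
        return value == null ? null : new SecureString(value.toCharArray());
    }
}
```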
@@ -356,4 +367,9 @@ private void assertRepositoryBlocked(Client client, String repo, String existing
         assertThat(repositoryException4.getMessage(),
             containsString("Could not read repository data because the contents of the repository do not match its expected state."));
     }

+    private void fullRestart() throws Exception {
This is admittedly pretty dirty now test-wise. Obviously, it also illustrates that this change slightly weakens our ability to detect concurrent repository writes from multiple clusters. IMO this is a fair trade-off though, and one we've already been making over and over in other spots (any time we optimised away a load of `RepositoryData`, we obviously weakened that detection).
@@ -1359,6 +1398,7 @@ public void onFailure(String source, Exception e) {

             @Override
             public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
+                cacheRepositoryData(filteredRepositoryData.withGenId(newGen));
This is a little dirty and we talked about it in other PRs: the generation here is -1 due to the weird way the repository data is loaded initially, so we have to eventually set it. I'll clean that up in a follow-up; for now, just setting it here where it matters (it doesn't matter in the code above when serializing `filteredRepositoryData` to the repo) seemed the driest.
Maybe add a TODO then?
Not sure what to even put in it though. In the end, this is simply the way `RepositoryData` works for now. We have the same kind of code for the cluster state version as well, I guess. I don't have a good idea for a better abstraction yet :(
@@ -42,16 +42,14 @@
     private final int hashCode;

     public IndexId(final String name, final String id) {
-        this.name = name;
-        this.id = id;
+        this.name = name.intern();
Moving to interning here and for the snapshot id actually makes the on-heap `RepositoryData` about the same size as the serialized one (unfortunately, for the time being we serialize it as uncompressed JSON) => I think the 500kb safety check I added is valid.
I would prefer to avoid JVM String interning. If we can deduplicate the Strings ourselves, that's fine, but adding more and more stuff to be interned feels dangerous (another source of OOM).
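For illustration, manual deduplication could be as simple as the sketch below (names made up, not the PR's code): a map on the regular heap canonicalizes equal `String`s and, unlike the JVM intern table, it is released together with the object that owns it.

```java
import java.util.concurrent.ConcurrentHashMap;

// Canonicalizes equal Strings to one instance on the regular heap; the whole table
// becomes garbage-collectable with its owner, unlike entries added via String.intern().
final class StringDeduplicator {
    private final ConcurrentHashMap<String, String> canonical = new ConcurrentHashMap<>();

    String dedupe(String s) {
        final String existing = canonical.putIfAbsent(s, s);
        return existing == null ? s : existing;
    }
}
```

In the `IndexId` constructor above, `this.name = deduplicator.dedupe(name);` would then replace `name.intern()`.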
            try {
                final int len =
                    BytesReference.bytes(updated.snapshotsToXContent(XContentFactory.jsonBuilder(), true)).length();
                if (len > ByteSizeUnit.KB.toBytes(500)) {
For context: on heap, 1k shards in a snapshot means about 25kb of shard generations, which themselves make up ~50% of the size of `RepositoryData` on heap with the interning changes I added. The number of snapshots doesn't matter much by comparison. So in the Cloud case of 100 snapshots, this would allow for caching snapshots of up to ~15k shards, which should work fine for most users I'm assuming.
@@ -513,10 +514,13 @@ public void deleteSnapshot(SnapshotId snapshotId, long repositoryStateId, Versio
     private RepositoryData safeRepositoryData(long repositoryStateId, Map<String, BlobMetaData> rootBlobs) {
         final long generation = latestGeneration(rootBlobs.keySet());
         final long genToLoad;
+        final RepositoryData cached;
Maybe we can just cache the serialized (and compressed) bytes instead of the object. Decompressing should still be fast.
Yea, I experimented a little with this and it's really sweet, so I moved to that now. The compression ratio is massive (more than a factor of 10x for 100 snapshots with 1k shards each) and I can hardly imagine a repo that could break 500kb for the cached size (100 snapshots of 1k shards is only ~20kb, and adding additional snapshots is cheap as well, since each new snapshot effectively just adds that snapshot's uuid + name if it doesn't contain new shards).
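Roughly, the approach looks like this stand-alone sketch using GZIP from `java.util.zip` (the real code uses Elasticsearch's `BytesReference` and compression utilities; all names here are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class RepositoryDataCache {
    // Generation plus the compressed serialized RepositoryData at that generation.
    record CacheEntry(long generation, byte[] compressedJson) {}

    private final AtomicReference<CacheEntry> cached = new AtomicReference<>();

    void cache(long generation, byte[] serializedJson) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(serializedJson);
        }
        final byte[] compressed = out.toByteArray();
        if (compressed.length > 500 * 1024) {
            return; // safety check: skip caching unexpectedly large repository metadata
        }
        cached.set(new CacheEntry(generation, compressed));
    }

    byte[] getIfGeneration(long generation) throws IOException {
        final CacheEntry entry = cached.get();
        if (entry == null || entry.generation() != generation) {
            return null; // cache miss: caller falls back to reading the blob store
        }
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(entry.compressedJson()))) {
            return gzip.readAllBytes(); // decompressing is cheap next to a blob-store read
        }
    }
}
```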
                    return;
                }
            } catch (IOException e) {
                throw new AssertionError("Impossible, no IO happens here", e);
If the node can't serialize the `RepositoryData`, it would die here. Perhaps just catch and log (while also having an assert for our tests).
Sure thing, though I figured this is actually impossible because the same code must have serialized that repository data to even make it visible here (since we only cache what we previously wrote to the repo).
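The suggested catch-log-and-assert pattern might look something like this sketch (the helper and logger here are placeholders, not the PR's code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;

final class CacheWriteSketch {
    private static final Logger logger = Logger.getLogger(CacheWriteSketch.class.getName());

    void cacheSafely(Object repositoryData) {
        try {
            final byte[] serialized = serialize(repositoryData);
            // ... store `serialized` in the cache ...
        } catch (IOException e) {
            assert false : e; // trips in tests, where assertions are enabled
            logger.warning("failed to serialize repository data for caching: " + e); // in production: log and move on
        }
    }

    // Placeholder for the real XContent serialization.
    private byte[] serialize(Object repositoryData) throws IOException {
        return repositoryData.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```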
        if (bestEffortConsistency == false) {
            try {
                final int len =
                    BytesReference.bytes(updated.snapshotsToXContent(XContentFactory.jsonBuilder(), true)).length();
Maybe make `RepositoryData` implement `Accountable`? Should we craft an estimating function instead of serializing the whole `RepositoryData` back again? If we were serializing back to XContent, we could pass a `FilterOutputStream` to the `XContentBuilder` so that it just counts bytes and doesn't build everything in heap.
Or, if we follow Yannick's suggestion of caching the serialized and compressed bytes, then maybe we should keep track of the length of the original blob.
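The counting-stream idea needs nothing beyond `java.io`; a sketch of such a wrapper (in the real code it would sit underneath the `XContentBuilder` rather than be written to by hand):

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Counts bytes flowing through without buffering them, so the serialized size can be
// checked against the cache limit without materializing the whole blob in heap.
final class CountingOutputStream extends FilterOutputStream {
    private long count;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // bypass FilterOutputStream's byte-at-a-time default
        count += len;
    }

    long bytesWritten() {
        return count;
    }
}
```

Wrapping `OutputStream.nullOutputStream()` with this would give a pure byte counter with no heap cost for the output.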
Jup, went with Yannick's approach now :)
A few more small comments. Looking very good already though.
@@ -132,6 +132,8 @@ public void testEnforcedCooldownPeriod() throws IOException {
         }
     })));

+        // Master failover to clear RepositoryData cache
I think I would prefer an explicit undocumented setting to disable the cache. This might also turn out to be useful if we see any issues with this new functionality.
Yea that's much nicer indeed :) added that setting now and used it in tests.
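As an illustration only (the setting name and plumbing below are assumed, not quoted from the PR), such an undocumented opt-out could be a per-repository boolean read once at construction time:

```java
import java.util.Map;

final class CacheSettingSketch {
    // Hypothetical setting key; left out of the documentation on purpose.
    static final String CACHE_REPOSITORY_DATA = "cache_repository_data";

    private final boolean cacheRepositoryData;

    CacheSettingSketch(Map<String, String> repositorySettings) {
        // Defaults to true; operators can flip it off if the cache misbehaves.
        this.cacheRepositoryData =
            Boolean.parseBoolean(repositorySettings.getOrDefault(CACHE_REPOSITORY_DATA, "true"));
    }

    boolean shouldCache() {
        return cacheRepositoryData;
    }
}
```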
@@ -529,6 +536,9 @@ private RepositoryData safeRepositoryData(long repositoryStateId, Map<String, Bl
             throw new RepositoryException(metadata.name(), "concurrent modification of the index-N file, expected current generation [" +
                 repositoryStateId + "], actual current generation [" + genToLoad + "]");
         }
+        if (cached != null && cached.v1() == genToLoad && cached.v2() != null) {
Should `cached.v2()` always be non-null if `cached != null`?
Jup now it is.
@@ -1057,6 +1067,10 @@ public void endVerification(String seed) {
     // and concurrent modifications.
     private final AtomicLong latestKnownRepoGen = new AtomicLong(RepositoryData.UNKNOWN_REPO_GEN);

+    // Best effort cache of the latest known repository data and its generation, cached serialized as compressed json
+    private final AtomicReference<Tuple<Long, BytesReference>> latestKnownRepositoryData =
+        new AtomicReference<>(new Tuple<>(RepositoryData.EMPTY_REPO_GEN, null));
perhaps just initialize to null?
++
" repository behavior going forward.", metadata.name()); | ||
} | ||
// Set empty repository data to not waste heap for an outdated cached value | ||
latestKnownRepositoryData.set(new Tuple<>(RepositoryData.EMPTY_REPO_GEN, null)); |
just set to null?
++
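The agreed simplification, as a generic sketch (types assumed): `null` doubles as both the initial state and the invalidation marker, so the sentinel `Tuple` with a null payload disappears.

```java
import java.util.concurrent.atomic.AtomicReference;

final class NullableCache<T> {
    private final AtomicReference<T> cached = new AtomicReference<>(); // starts out null

    void set(T value) {
        cached.set(value);
    }

    void invalidate() {
        cached.set(null); // e.g. after detecting the cached generation is outdated
    }

    T get() {
        return cached.get(); // callers treat null as a cache miss
    }
}
```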
        }

    private RepositoryData repositoryDataFromCachedEntry(Tuple<Long, BytesReference> cacheEntry) throws IOException {
        if (cacheEntry.v1() == RepositoryData.EMPTY_REPO_GEN) {
I think this optimization is unnecessary, and it complicates the null checks (see my suggestions above).
++
Thanks Yannick, all addressed in 533b4f8 now I hope :)
LGTM
Thanks Yannick + Tanguy!
Cache latest `RepositoryData` on heap when it's absolutely safe to do so (i.e. when the repository is in strictly consistent mode). `RepositoryData` can safely be assumed not to grow to a size that would cause trouble, because we often have at least two copies of it loaded at the same time when doing repository operations. Also, concurrent snapshot API status requests currently load it independently of each other, and so on, making it safe to cache on heap and assume as "small" IMO.

The benefits of this move are:

* Much faster repository status API calls
  * listing all snapshot names becomes instant
* Other operations are sped up massively too, because they mostly operate in two steps: load repository data, then load multiple other blobs to get the additional data
* Additional cloud cost savings
* Better resiliency, saving another spot where an IO issue could break the snapshot
* We can simplify a number of spots in the current code that currently pass around the repository data in tricky ways to avoid loading it multiple times, in follow-ups
I know we are thinking about caching other repository metadata in an internal index, but I think this blob is better served from heap since it's so performance critical (this is irrelevant now to some degree but more interesting for concurrent repository operations).