
Add Caching for RepositoryData in BlobStoreRepository #52341


Conversation

@original-brownbear (Contributor) commented Feb 13, 2020

Cache latest RepositoryData on heap when it's absolutely safe to do so (i.e. when the repository is in strictly consistent mode).

RepositoryData can safely be assumed not to grow to a size that would cause trouble: we often have at least two copies of it loaded at the same time during repository operations anyway, and concurrent snapshot status API requests currently each load it independently, so caching it on heap and treating it as "small" is safe IMO.

The benefits of this move are:

  • Much faster repository status API calls
    • listing all snapshot names becomes instant
    • Other operations are sped up massively too because they mostly operate in two steps: load repository data then load multiple other blobs to get the additional data
  • Additional cloud cost savings
  • Better resiliency, saving another spot where an IO issue could break the snapshot
  • In follow-ups, we can simplify a number of spots in the current code that pass the repository data around in tricky ways just to avoid loading it multiple times.

I know we are thinking about caching other repository metadata in an internal index, but I think this blob is better served from heap since it is so performance-critical (this matters less right now, but becomes more interesting for concurrent repository operations).
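As a rough illustration of the idea only (not the exact code in this PR), the cache boils down to holding the latest RepositoryData next to its generation and hitting the blob store only when the cached generation is stale. The field and helper names below, in particular loadRepositoryDataFromBlobStore, are hypothetical placeholders:

```java
// Minimal conceptual sketch, not the PR's implementation: cache the latest
// RepositoryData together with the generation it was written under.
private final AtomicReference<Tuple<Long, RepositoryData>> latestKnownRepositoryData = new AtomicReference<>();

private RepositoryData getRepositoryData(long generation) {
    final Tuple<Long, RepositoryData> cached = latestKnownRepositoryData.get();
    if (cached != null && cached.v1() == generation) {
        return cached.v2(); // cache hit: no blob store round-trip
    }
    // hypothetical helper standing in for the existing index-N blob read
    final RepositoryData loaded = loadRepositoryDataFromBlobStore(generation);
    latestKnownRepositoryData.set(new Tuple<>(generation, loaded));
    return loaded;
}
```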

@original-brownbear added the >non-issue, WIP, and :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs) labels on Feb 13, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@@ -36,7 +36,7 @@
  */
 public class MockSecureSettings implements SecureSettings {

-    private Map<String, SecureString> secureStrings = new HashMap<>();
+    private Map<String, String> secureStrings = new HashMap<>();
@original-brownbear (Contributor, Author) commented Feb 14, 2020

Just fixing a bug here that prevented node restarts in tests for nodes that used secure settings: if we always return the same SecureString from getString, then once a node closes that SecureString it can never get the setting again.
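A hedged sketch of what the fix amounts to (the field and method shapes are assumptions based on the diff above, not the exact MockSecureSettings code): store plain Strings and wrap them in a fresh SecureString per call.

```java
private final Map<String, String> secureStrings = new HashMap<>();

@Override
public SecureString getString(String setting) {
    final String value = secureStrings.get(setting);
    // return a new SecureString per call, so closing a previously returned
    // instance cannot invalidate later reads after a node restart in tests
    return value == null ? null : new SecureString(value.toCharArray());
}
```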

@@ -356,4 +367,9 @@ private void assertRepositoryBlocked(Client client, String repo, String existing
assertThat(repositoryException4.getMessage(),
containsString("Could not read repository data because the contents of the repository do not match its expected state."));
}

private void fullRestart() throws Exception {
@original-brownbear (Contributor, Author)

This is admittedly pretty dirty test-wise. Obviously, it also illustrates that this change slightly weakens our ability to detect concurrent repository writes from multiple clusters.
IMO this is a fair trade-off though, and one we've already made over and over in other spots (any time we optimised away a load of RepositoryData, we obviously weakened the ability to detect concurrent modification).

@@ -1359,6 +1398,7 @@ public void onFailure(String source, Exception e) {

@Override
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
cacheRepositoryData(filteredRepositoryData.withGenId(newGen));
@original-brownbear (Contributor, Author)

This is a little dirty and we talked about it in other PRs: the generation here is -1 due to the weird way the repository data is loaded initially, and we have to set it eventually. I'll clean that up in a follow-up; for now, setting it here where it matters (it doesn't matter in the code above when serializing filteredRepositoryData to the repo) seemed the driest option.

Member

Maybe add a TODO then?

@original-brownbear (Contributor, Author)

Not sure what to even put in it though. In the end, this is simply the way RepositoryData works for now; we have the same kind of code for the cluster state version as well, I guess. I don't have a good idea for a better abstraction yet :(

@@ -42,16 +42,14 @@
private final int hashCode;

public IndexId(final String name, final String id) {
this.name = name;
this.id = id;
this.name = name.intern();
@original-brownbear (Contributor, Author)

Moving to interning here and for the snapshot id actually makes the on-heap RepositoryData about the same size as the serialized one (unfortunately, for the time being we serialize it as uncompressed JSON) => I think the 500kB safety check I added is valid.

Contributor

I would prefer to avoid JVM String interning. If we can deduplicate the Strings ourselves, that's fine, but adding more and more stuff to the intern pool feels dangerous (another source of OOM).
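For illustration, a minimal sketch of the "deduplicate ourselves" alternative (DEDUPLICATOR and deduplicate are made-up names, and in practice the map could be scoped to a single RepositoryData parse rather than being static):

```java
private static final ConcurrentMap<String, String> DEDUPLICATOR = new ConcurrentHashMap<>();

private static String deduplicate(String value) {
    // reuse a canonical instance without touching the JVM intern pool
    return DEDUPLICATOR.computeIfAbsent(value, Function.identity());
}
```

Unlike String.intern(), such a map is owned by the code and can be sized, cleared, or scoped as needed.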

try {
final int len =
BytesReference.bytes(updated.snapshotsToXContent(XContentFactory.jsonBuilder(), true)).length();
if (len > ByteSizeUnit.KB.toBytes(500)) {
@original-brownbear (Contributor, Author)

For context:

On heap, 1k shards in a snapshot means about 25kB of shard generations, which themselves make up ~50% of the size of RepositoryData on heap with the interning changes I added. The number of snapshots doesn't matter much by comparison, so in the Cloud case of 100 snapshots this would allow caching RepositoryData for up to ~15k shards, which I'm assuming should work fine for most users.

@original-brownbear (Contributor, Author)

@ywelsch @tlrx I optimized the on-heap size of RepositoryData a little and added a size limit for caching, as discussed yesterday :) Should be good for another review.


@@ -513,10 +514,13 @@ public void deleteSnapshot(SnapshotId snapshotId, long repositoryStateId, Versio
private RepositoryData safeRepositoryData(long repositoryStateId, Map<String, BlobMetaData> rootBlobs) {
final long generation = latestGeneration(rootBlobs.keySet());
final long genToLoad;
final RepositoryData cached;
Contributor

maybe we can just cache the serialized (and compressed) bytes instead of the object. Decompressing should still be fast.

@original-brownbear (Contributor, Author)

Yea, I experimented a little with this and it's really sweet, so I moved to that now. The compression ratio is massive (more than a factor of 10x for 100 snapshots with 1k shards each) and I can hardly imagine a repo that could break 500kB for the cached size (100 snapshots of 1k shards is only ~20kB, and adding additional snapshots is cheap as well, since each new snapshot effectively just adds that snapshot's UUID + name if it doesn't contain new shards).
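Roughly, the caching write path now looks like the following sketch; the cacheRepositoryData helper and the field are assumptions, only snapshotsToXContent and the 500kB limit appear in the snippets in this PR:

```java
private final AtomicReference<Tuple<Long, BytesReference>> latestKnownRepositoryData = new AtomicReference<>();

private void cacheRepositoryData(RepositoryData repositoryData) {
    try {
        final BytesStreamOutput out = new BytesStreamOutput();
        // serialize as JSON and compress before caching
        try (StreamOutput compressed = CompressorFactory.COMPRESSOR.streamOutput(out);
             XContentBuilder builder = XContentFactory.jsonBuilder(compressed)) {
            repositoryData.snapshotsToXContent(builder, true);
        }
        final BytesReference serialized = out.bytes();
        // only cache if the compressed blob stays under the safety limit
        if (serialized.length() <= ByteSizeUnit.KB.toBytes(500)) {
            latestKnownRepositoryData.set(new Tuple<>(repositoryData.getGenId(), serialized));
        }
    } catch (IOException e) {
        assert false : e; // no real IO happens here, see the discussion below
    }
}
```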

return;
}
} catch (IOException e) {
throw new AssertionError("Impossible, no IO happens here", e);
Contributor

If the node can't serialize the RepositoryData, it would die here. Perhaps just catch and log (while having an assert as well for our tests)

@original-brownbear (Contributor, Author)

Sure thing, though I figured this is actually impossible because the same code must have serialized that repository data to even make it visible here (since we only cache what we previously wrote to the repo).
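For reference, a hedged sketch of the suggested catch-assert-and-log shape (the logger message is illustrative, and the size check from the diff above is elided):

```java
try {
    final int len = BytesReference.bytes(
        updated.snapshotsToXContent(XContentFactory.jsonBuilder(), true)).length();
    // ... size check as in the diff above ...
} catch (IOException e) {
    assert false : e;                                       // fail hard in tests
    logger.warn("failed to serialize repository data", e);  // only log in production
}
```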

if (bestEffortConsistency == false) {
try {
final int len =
BytesReference.bytes(updated.snapshotsToXContent(XContentFactory.jsonBuilder(), true)).length();
Member

Maybe make RepositoryData implement Accountable? Or should we craft an estimating function instead of serializing the whole RepositoryData back again? If we keep serializing to XContent, we could pass a FilterOutputStream to the XContentBuilder so that it just counts bytes and doesn't build everything on heap.

Or, if we follow Yannick's suggestion of caching the serialized and compressed bytes, then maybe we should just keep track of the length of the original blob.
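A hedged sketch of the byte-counting idea (a plain OutputStream subclass works just as well as a FilterOutputStream here; this illustrates the suggestion, it is not code from the PR):

```java
final long[] counted = new long[1];
final OutputStream countingStream = new OutputStream() {
    @Override
    public void write(int b) { counted[0]++; }
    @Override
    public void write(byte[] b, int off, int len) { counted[0] += len; }
};
// stream the XContent into the counter so the full JSON never materializes on heap
try (XContentBuilder builder = XContentFactory.jsonBuilder(countingStream)) {
    updated.snapshotsToXContent(builder, true);
}
final long len = counted[0];
```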

@original-brownbear (Contributor, Author)

Jup, went with Yannick's approach now :)


@ywelsch (Contributor) left a comment

A few more small comments. Looking very good already though.

@@ -132,6 +132,8 @@ public void testEnforcedCooldownPeriod() throws IOException {
}
})));

// Master failover to clear RepositoryData cache
Contributor

I think I would prefer an explicit undocumented setting to disable the cache. This might also turn out to be useful if we see any issues with this new functionality

@original-brownbear (Contributor, Author)

Yea that's much nicer indeed :) added that setting now and used it in tests.
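A minimal sketch of such an escape hatch (the setting name cache_repository_data and the usage shown are assumptions, not necessarily what was added here):

```java
public static final Setting<Boolean> CACHE_REPOSITORY_DATA =
    Setting.boolSetting("cache_repository_data", true);

// read once per repository instance; when false, skip the cache entirely
private final boolean cacheRepositoryData = CACHE_REPOSITORY_DATA.get(metadata.settings());
```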

@@ -529,6 +536,9 @@ private RepositoryData safeRepositoryData(long repositoryStateId, Map<String, Bl
throw new RepositoryException(metadata.name(), "concurrent modification of the index-N file, expected current generation [" +
repositoryStateId + "], actual current generation [" + genToLoad + "]");
}
if (cached != null && cached.v1() == genToLoad && cached.v2() != null) {
Contributor

Should cached.v2() always be non-null if cached != null?

@original-brownbear (Contributor, Author)

Jup now it is.

@@ -1057,6 +1067,10 @@ public void endVerification(String seed) {
// and concurrent modifications.
private final AtomicLong latestKnownRepoGen = new AtomicLong(RepositoryData.UNKNOWN_REPO_GEN);

// Best effort cache of the latest known repository data and its generation, cached serialized as compressed json
private final AtomicReference<Tuple<Long, BytesReference>> latestKnownRepositoryData =
new AtomicReference<>(new Tuple<>(RepositoryData.EMPTY_REPO_GEN, null));
Contributor

perhaps just initialize to null?

@original-brownbear (Contributor, Author)

++

" repository behavior going forward.", metadata.name());
}
// Set empty repository data to not waste heap for an outdated cached value
latestKnownRepositoryData.set(new Tuple<>(RepositoryData.EMPTY_REPO_GEN, null));
Contributor

just set to null?

@original-brownbear (Contributor, Author)

++

}

private RepositoryData repositoryDataFromCachedEntry(Tuple<Long, BytesReference> cacheEntry) throws IOException {
if (cacheEntry.v1() == RepositoryData.EMPTY_REPO_GEN) {
Contributor

I think this optimization is unnecessary and complicates the null checks (see my suggestions above).

@original-brownbear (Contributor, Author)

++
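With that special case gone, the helper reduces to something like this sketch (the parser wiring is an assumption based on how index-N blobs are read elsewhere, not copied from the PR):

```java
private RepositoryData repositoryDataFromCachedEntry(Tuple<Long, BytesReference> cacheEntry) throws IOException {
    // decompress the cached bytes and parse them exactly like an index-N blob
    try (InputStream input = CompressorFactory.COMPRESSOR.streamInput(cacheEntry.v2().streamInput());
         XContentParser parser = XContentType.JSON.xContent()
             .createParser(NamedXContentRegistry.EMPTY, LoggingDeprecationHandler.INSTANCE, input)) {
        return RepositoryData.snapshotsFromXContent(parser, cacheEntry.v1());
    }
}
```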

@original-brownbear (Contributor, Author)

Thanks Yannick, all addressed in 533b4f8 now I hope :)

@ywelsch (Contributor) left a comment

LGTM

@original-brownbear (Contributor, Author)

Thanks Yannick + Tanguy!

@original-brownbear merged commit f5ca487 into elastic:master Feb 20, 2020
@original-brownbear deleted the cache-latest-repository-data branch February 20, 2020 11:58
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 20, 2020
original-brownbear added a commit that referenced this pull request Feb 21, 2020