Introduce repository test kit/analyser #67247

Merged

Conversation


@DaveCTurner DaveCTurner commented Jan 11, 2021

Today we rely on blob stores behaving in a certain way so that they can be used
as a snapshot repository. There is an increasing number of third-party blob
stores that claim to be S3-compatible but may not offer a suitably correct or
performant implementation of the S3 API. We rely on some subtle semantics with
concurrent readers and writers, which some blob stores may not implement
correctly. Hitting a corner case in the implementation may be rare in normal
use, and may be hard to reproduce or to distinguish from an Elasticsearch bug.

This commit introduces a new `POST /_snapshot/.../_analyse` API which exercises
the more problematic corners of the repository implementation, looking for
correctness bugs, and measures the details of the repository's performance
under concurrent load.
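For illustration, a request against such an API might look like the following. The parameter names shown here (`blob_count`, `max_blob_size`) are assumptions for illustration only; this PR does not pin down the final parameter set at this point:

```
POST /_snapshot/my_repository/_analyse?blob_count=100&max_blob_size=10mb
```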


@DaveCTurner DaveCTurner left a comment


Opening this draft for discussion on a few points before I proceed much further.

[[repo-speed-test-api-response-body]]
==== {api-response-body-title}

TODO
Contributor Author


Today we report the details of every read and write performed during the test. It would probably be useful to add some higher-level summary statistics too, perhaps only returning the low-level details if `?detailed` is passed. Suggestions for the higher-level stats are up for discussion.
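As a minimal sketch of what such a summary accumulator could look like (the class and method names here are illustrative, not the PR's actual implementation), a thread-safe per-operation summary tracking count, bytes, and latency:

```java
// Hypothetical sketch of a summary accumulator for per-operation statistics.
// Each read or write records its size and duration; the response can then
// report high-level stats instead of (or alongside) every individual operation.
public class OperationSummary {
    private long count;
    private long totalBytes;
    private long totalNanos;
    private long maxNanos;

    public synchronized void add(long bytes, long nanos) {
        count += 1;
        totalBytes += bytes;
        totalNanos += nanos;
        maxNanos = Math.max(maxNanos, nanos);
    }

    public synchronized long count() { return count; }

    public synchronized long totalBytes() { return totalBytes; }

    public synchronized long maxNanos() { return maxNanos; }

    public synchronized double meanNanos() {
        return count == 0 ? 0.0 : (double) totalNanos / count;
    }
}
```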

Contributor Author


Added simple accumulators in 0e904d8.

import static org.hamcrest.Matchers.lessThanOrEqualTo;
import static org.hamcrest.Matchers.startsWith;

public class RepositorySpeedTestIT extends AbstractSnapshotIntegTestCase {
Contributor Author


These tests are incomplete and do not currently ensure that the speed test detects various kinds of repository bug.

deleteContainerAndSendResponse();
} else {
logger.debug(
"expected blobs [{}] missing in [{}:{}], trying again; retry count = {}",
Contributor Author


Do we need this any more? AFAIK S3 was the only blob store that didn't have consistent listings, but it does now, doesn't it?

Contributor


I think we should require consistent listings (as well as test for it), as all currently officially supported blob stores support it.

Courtesy of http://weidagang.github.io/text-diagram/ with this input:

    object Writer Repo Readers

    note right of Writer: Write phase
    Writer -> Repo: Write blob with random content
    Writer -> Readers: Read range during write (rarely)
    Readers -> Repo: Read range
    Repo -> Readers: Contents of range, or "not found"
    Readers -> Writer: Acknowledge read, including checksum if found
    Repo -> Writer: Write complete
    space 5
    note right of Writer: Read phase
    Writer -> Readers: Read range [a,b)
    Readers -> Repo: Read range
    Writer -> Repo: Overwrite blob (rarely)
    Repo -> Readers: Contents of range
    Repo -> Writer: Overwrite complete
    Readers -> Writer: Ack read (with checksum)
    space 5
    note right of Writer: Verify phase
    Writer -> Writer: Confirm checksums
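The verify phase in the diagram hinges on the writer being able to recompute the checksum of any range a reader reports. A minimal sketch of that idea, assuming CRC32 as the checksum (the actual algorithm used in the PR is not stated here, and the class name is hypothetical):

```java
import java.util.zip.CRC32;

// Hypothetical sketch of the verify phase: the writer knows the blob's full
// contents, so it can recompute the checksum of any range [from, to) that a
// reader reported reading and confirm that the reader saw the same bytes.
public class RangeChecksum {

    public static long checksum(byte[] blob, int from, int to) {
        CRC32 crc = new CRC32();
        crc.update(blob, from, to - from); // checksum only the requested range
        return crc.getValue();
    }

    public static boolean verify(byte[] blob, int from, int to, long reportedChecksum) {
        return checksum(blob, from, to) == reportedChecksum;
    }
}
```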

@ywelsch ywelsch left a comment


I've done a first pass on this.

Regarding naming, I'm not sure I would phrase this as a "speed_test", and would like to bring the following suggestions into the discussion:

  • Repository compatibility checker (_compatibility_check)
  • Repository compatibility tester (_compatibility_test)

I think we should provide more documentation on what this tool is checking (e.g. that a read after a fresh write works as expected), describe its limitations (e.g. it can't check durability or data integrity over time, i.e. that data doesn't become corrupted, both of which are important properties of a repository), and say that it might be incomplete (it can only show the presence of issues, not their absence).

I would like the tool to better exercise the various upload techniques (e.g. single vs multi-part upload or resumable upload). This currently relies too implicitly on setting the right blob sizes.

Maybe I missed this in the docs, but should we advise to run this on a multi-node cluster? What size of cluster?

How much temp storage will be taken up in the repository if a user runs this? Let's be explicit about that in the docs.

Do we test anywhere that the deletes actually worked? (i.e. not only that the API says success).

There are quite a number of parameters to tune here, and even as someone familiar with the system I feel overloaded by the docs. Should we explicitly distinguish between reasonable settings to tune and expert settings?

);
}
final BlobContainer blobContainer = getBlobContainer();
final Map<String, BlobMetadata> blobsMap = blobContainer.listBlobs();
Contributor


should we check that sub-path semantics are properly implemented? (e.g. the trailing slash problem)
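To illustrate the trailing-slash problem mentioned above: naive path concatenation can yield `basechild` or `base//child` depending on whether the base path already ends in a separator. A sketch of the joining rule such a check would enforce (class and method names are hypothetical, not the repository's actual API):

```java
// Hypothetical sketch of sub-path joining. A repository analysis could write
// blobs under two sibling paths and verify that listing one path does not
// leak the other's blobs, which requires exactly one separator at the join.
public class BlobPaths {

    public static String join(String base, String child) {
        if (base.isEmpty()) {
            return child;
        }
        // ensure exactly one separator between the two components
        return base.endsWith("/") ? base + child : base + "/" + child;
    }
}
```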

return List.of(TestPlugin.class, LocalStateCompositeXPackPlugin.class, SnapshotRepositoryTestKit.class);
}

public void testFoo() {
Contributor


//TODO: better name

deleteContainerAndSendResponse();
} else {
logger.debug(
"expected blobs [{}] missing in [{}:{}], trying again; retry count = {}",
Contributor


I think we should require consistent listings (as well as test for it), as all currently officially supported blob stores support it.

@DaveCTurner
Contributor Author

Thanks for the review Yannick. +1 to most of the points you raise, some select comments inline:

Regarding naming, I'm not sure I would phrase this as a "speed_test", and would like to bring the following suggestions into the discussion

I know what you mean; I have the same concern with "compatibility test" or "compatibility check" as I did with "extended verification" in that if we return 200 OK then users will interpret that to mean "my repository is definitely compatible", regardless of any statements in the docs.

That said, I also find the speed test notion awkward, and I don't feel very strongly either way. I'd rather not go back and forth with the code, let's settle on a name we're happy with elsewhere and then move forward.

I would like the tool to better exercise the various upload techniques (e.g. single vs multi-part upload or resumable upload). This currently relies too implicitly on setting the right blob sizes.

Agreed, I think this needs a richer interface, something like BlobContainer#writeRandom, since each repository will need to implement this in its own way. Do you think this is needed for v1, however?

How much temp storage will be taken up in the repository if a user runs this? Let's be explicit about that in the docs.

WDYT about that being an input parameter? On reflection it sounds more useful for users to set the total size rather than the max blob size, and let us divide up the total as we see fit.
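To illustrate the total-size-as-input idea (the class, method, and parameter names here are hypothetical), the implementation could divide a total byte budget into randomly-sized, nonempty blobs whose sizes sum exactly to the budget:

```java
import java.util.Random;

// Hypothetical sketch of dividing a user-specified total size budget among
// the test blobs, rather than asking users to choose a max blob size.
// Requires totalBytes >= blobCount so that every blob is nonempty.
public class BlobSizes {

    public static long[] split(long totalBytes, int blobCount, Random random) {
        long[] sizes = new long[blobCount];
        long remaining = totalBytes;
        for (int i = 0; i < blobCount - 1; i++) {
            // leave at least one byte for each blob still to be sized
            long maxForThis = remaining - (blobCount - 1 - i);
            sizes[i] = 1 + (long) (random.nextDouble() * (maxForThis - 1));
            remaining -= sizes[i];
        }
        sizes[blobCount - 1] = remaining; // last blob takes whatever is left
        return sizes;
    }
}
```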

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 2, 2021
The repository analyzer does not write empty blobs to the repository,
and we assert that the blobs are nonempty, but the tests randomly check
for the empty case anyway. This commit ensures that the blobs used in
tests are always nonempty.

Relates elastic#67247
Labels: `:Distributed Coordination/Snapshot/Restore` (anything directly related to the `_snapshot/*` APIs), `>enhancement`, `release highlight`, `Team:Distributed (Obsolete)` (meta label for distributed team, obsolete; replaced by Distributed Indexing/Coordination), `v7.12.0`, `v8.0.0-alpha1`
6 participants