Introduce repository test kit/analyser #67247
Conversation
Today we rely on blob stores behaving in a certain way so that they can be used correctly as a snapshot repository. There are an increasing number of third-party blob stores that claim to be S3-compatible, but which may not offer a suitably performant implementation of the S3 API. We rely on S3's subtle semantics with concurrent readers and writers, but some blob stores may not implement them correctly. Hitting a corner case in the implementation may be rare in normal use, and may be hard to reproduce or to distinguish from an Elasticsearch bug. This commit introduces a new `POST _repository/.../_speed_test` API which measures the details of the performance of the repository under concurrent load, and exercises the more problematic corners of the API implementation looking for correctness bugs.
Opening this draft for discussion on a few points before I proceed much further.
[[repo-speed-test-api-response-body]]
==== {api-response-body-title}

TODO
Today we report the details of every read and write performed during the test. It'd probably be useful to add some higher-level summary statistics too, maybe only returning the low-level ones if ?detailed is passed. Suggestions for the higher-level stats are up for discussion.
Added simple accumulators in 0e904d8.
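For illustration, a minimal sketch of the kind of summary accumulator being discussed; the class and method names are hypothetical, and this is not the code from 0e904d8:

```java
import java.util.LongSummaryStatistics;

// Hypothetical sketch: accumulate per-operation statistics so the response can
// report a high-level summary, returning the low-level per-read/per-write
// details only when something like ?detailed is passed.
class OperationStatsAccumulator {
    private final LongSummaryStatistics elapsedNanos = new LongSummaryStatistics();
    private long totalBytes;

    synchronized void record(long nanos, long bytes) {
        elapsedNanos.accept(nanos);
        totalBytes += bytes;
    }

    synchronized String summary() {
        return "count=" + elapsedNanos.getCount()
            + ", totalBytes=" + totalBytes
            + ", minNanos=" + elapsedNanos.getMin()
            + ", maxNanos=" + elapsedNanos.getMax()
            + ", avgNanos=" + (long) elapsedNanos.getAverage();
    }
}
```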
import static org.hamcrest.Matchers.lessThanOrEqualTo;
import static org.hamcrest.Matchers.startsWith;

public class RepositorySpeedTestIT extends AbstractSnapshotIntegTestCase {
These tests are incomplete and do not currently ensure that the speed test detects various kinds of repository bug.
deleteContainerAndSendResponse();
} else {
    logger.debug(
        "expected blobs [{}] missing in [{}:{}], trying again; retry count = {}",
Do we need this any more? AFAIK S3 was the only blob store that didn't have consistent listings, but it does now, doesn't it?
I think we should require consistent listings (as well as test for it), as all currently officially supported blob stores support it.
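To make "test for it" concrete, here is a hedged sketch of such a check, reusing the `BlobContainer#listBlobs` call visible in the hunk above; the method name and failure behaviour are assumptions, not the actual implementation:

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

import org.elasticsearch.common.blobstore.BlobContainer;

// Hypothetical sketch: if consistent listings are required, a blob that was
// successfully written but is absent from a subsequent listing is a failure of
// the repository, not a condition to retry.
class ListingConsistencyCheck {
    static void verifyConsistentListing(BlobContainer blobContainer, Set<String> expectedBlobs) throws IOException {
        final Map<String, ?> listed = blobContainer.listBlobs();
        final List<String> missing = expectedBlobs.stream()
            .filter(name -> listed.containsKey(name) == false)
            .collect(Collectors.toList());
        if (missing.isEmpty() == false) {
            throw new IllegalStateException("expected blobs " + missing + " missing from listing");
        }
    }
}
```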
Courtesy of http://weidagang.github.io/text-diagram/ with this input:

object Writer Repo Readers

note right of Writer: Write phase
Writer -> Repo: Write blob with random content
Writer -> Readers: Read range during write (rarely)
Readers -> Repo: Read range
Repo -> Readers: Contents of range, or "not found"
Readers -> Writer: Acknowledge read, including checksum if found
Repo -> Writer: Write complete

space 5

note right of Writer: Read phase
Writer -> Readers: Read range [a,b)
Readers -> Repo: Read range
Writer -> Repo: Overwrite blob (rarely)
Repo -> Readers: Contents of range
Repo -> Writer: Overwrite complete
Readers -> Writer: Ack read (with checksum)

space 5

note right of Writer: Verify phase
Writer -> Writer: Confirm checksums
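The diagram's verify phase boils down to comparing checksums. A hedged sketch, assuming a CRC32 checksum (the real analyser may use something else):

```java
import java.util.zip.CRC32;

// Hypothetical sketch of the verify phase: the writer knows the bytes it wrote,
// so for each acknowledged read of a range [start, end) it can recompute the
// checksum and compare it against the one the reader reported, detecting stale
// or corrupted reads.
class RangeChecksum {
    static boolean checksumMatches(byte[] blobContents, int start, int end, long reportedChecksum) {
        final CRC32 crc = new CRC32();
        crc.update(blobContents, start, end - start);
        return crc.getValue() == reportedChecksum;
    }
}
```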
I've done a first pass on this.
Regarding naming, I'm not sure I would phrase this as a "speed_test", and would like to bring the following suggestions into the discussion:
- Repository compatibility checker (_compatibility_check)
- Repository compatibility tester (_compatibility_test)
I think we should provide more documentation on what this tool is checking (e.g. that a read after a fresh write works as expected). We should also describe its limitations: it can't check durability, which is another important property of a repository, nor data integrity over time (i.e. that data doesn't become corrupted). And we should say that the checks are necessarily incomplete: they can only show the presence of issues, not their absence.
I would like the tool to better exercise the various upload techniques (e.g. single vs multi-part upload or resumable upload). This currently relies too implicitly on setting the right blob sizes.
Maybe I missed this in the docs, but should we advise running this on a multi-node cluster? What size of cluster?
How much temp storage will be taken up in the repository if a user runs this? Let's be explicit about that in the docs.
Do we test anywhere that the deletes actually worked? (i.e. not only that the API says success).
There are quite a number of parameters to tune here, and even as someone familiar with the system I feel overloaded by the docs. Should we explicitly distinguish between reasonable settings to tune and expert settings?
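On the upload-techniques point above: one way to exercise both paths explicitly is for the analyser to generate blob sizes on either side of the repository's part-size threshold, rather than relying on user-chosen blob sizes. A hedged sketch (the threshold parameter and helper are assumptions):

```java
import java.util.Random;

// Hypothetical sketch: deliberately pick blob sizes on both sides of an assumed
// part-size threshold, so that single-part and multi-part upload code paths are
// both exercised instead of depending implicitly on user-chosen blob sizes.
class BlobSizePicker {
    static long pickBlobSize(Random random, long partSizeBytes) {
        if (random.nextBoolean()) {
            return 1 + (long) (random.nextDouble() * partSizeBytes);                 // size <= partSize: single-part upload
        } else {
            return partSizeBytes + 1 + (long) (random.nextDouble() * partSizeBytes); // size > partSize: multi-part upload
        }
    }
}
```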
);
}
final BlobContainer blobContainer = getBlobContainer();
final Map<String, BlobMetadata> blobsMap = blobContainer.listBlobs();
should we check that sub-path semantics are properly implemented? (e.g. the trailing slash problem)
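To illustrate the trailing-slash problem: "base/path" is a plain string prefix of "base/path-2/blob", so a check that forgets the separator misclassifies blobs under sibling paths. A self-contained sketch (not Elasticsearch code):

```java
// Hypothetical sketch of the trailing-slash problem: a naive startsWith check
// on "base/path" would wrongly treat "base/path-2/blob" as a child of
// "base/path". The prefix must include the trailing separator.
class SubPathCheck {
    static boolean isUnderPath(String blobKey, String path) {
        final String prefix = path.endsWith("/") ? path : path + "/";
        return blobKey.startsWith(prefix);
    }

    public static void main(String[] args) {
        assert isUnderPath("base/path/blob", "base/path");
        assert isUnderPath("base/path-2/blob", "base/path") == false; // naive startsWith would say true
    }
}
```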
return List.of(TestPlugin.class, LocalStateCompositeXPackPlugin.class, SnapshotRepositoryTestKit.class);
}

public void testFoo() {
//TODO: better name
Thanks for the review, Yannick. +1 to most of the points you raise, some select comments inline:
I know what you mean; I have the same concern with "compatibility test" or "compatibility check" as I did with "extended verification", in that if we return […] That said, I also find the speed test notion awkward, and I don't feel very strongly either way. I'd rather not go back and forth with the code, so let's settle on a name we're happy with elsewhere and then move forward.
Agreed, I think this needs a richer interface, something like
WDYT about that being an input parameter? On reflection it sounds more useful for users to set the total size rather than the max blob size, and let us divide up the total as we see fit.
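A hedged sketch of that idea, with a hypothetical helper that divides a user-supplied total into random blob sizes internally; it also keeps every blob nonempty, a property the commits below rely on:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: the user supplies a total size; the analyser divides it
// into blobCount randomly-sized pieces, guaranteeing that no blob is empty.
class BlobSizeDivider {
    static List<Long> divideTotalSize(long totalBytes, int blobCount, Random random) {
        assert totalBytes >= blobCount : "need at least one byte per blob";
        final List<Long> sizes = new ArrayList<>(blobCount);
        long remaining = totalBytes;
        for (int blobsLeft = blobCount; blobsLeft > 1; blobsLeft--) {
            // leave at least one byte for each of the remaining blobs
            final long maxForThisBlob = remaining - (blobsLeft - 1);
            final long size = 1 + (long) (random.nextDouble() * maxForThisBlob);
            sizes.add(size);
            remaining -= size;
        }
        sizes.add(remaining); // the last blob takes whatever is left
        return sizes;
    }
}
```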
…ution. Relates elastic#67247 Backport of elastic#69690
The repository analyzer does not write empty blobs to the repository, and we assert that the blobs are nonempty, but the tests randomly check for the empty case anyway. This commit ensures that the blobs used in tests are always nonempty. Relates elastic#67247
Relates elastic#67247 Backport of elastic#69316
Today we rely on blob stores behaving in a certain way so that they can be used as snapshot repositories. There are an increasing number of third-party blob stores that claim to be S3-compatible, but which may not offer a suitably correct or performant implementation of the S3 API. We rely on some subtle semantics with concurrent readers and writers, but some blob stores may not implement them correctly. Hitting a corner case in the implementation may be rare in normal use, and may be hard to reproduce or to distinguish from an Elasticsearch bug.
This commit introduces a new POST /_snapshot/.../_analyse API which exercises the more problematic corners of the repository implementation looking for correctness bugs and measures the details of the performance of the repository under concurrent load.
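For completeness, a hedged sketch of invoking the API via the low-level Java REST client; the repository name `my_repository` is a placeholder, and the `detailed` parameter is only the one floated earlier in this discussion, so treat both as assumptions:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RepositoryAnalysisExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // "my_repository" is a placeholder for a registered snapshot repository.
            final Request request = new Request("POST", "/_snapshot/my_repository/_analyse");
            // "detailed" is the parameter floated earlier for returning per-operation results.
            request.addParameter("detailed", "true");
            final Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```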