Fix deadlock bug exposed by a test #89934

Merged 4 commits on Sep 8, 2022

@grcevski grcevski commented Sep 8, 2022

A new test exposed a very rare bug that occurs when the node closes while the file settings service is in the middle of processing the file. Closing the node terminated the cluster state update task, but nothing released the latch that the watcher thread was awaiting. The fix allows the stop operation to properly terminate the watcher thread.

Essentially, the cluster state tasks we run while processing the settings.json file can terminate when the node is being shut down, without calling the task success or failure methods. This means that if we are waiting for all async tasks to finish inside the file watcher processing loop, we'll wait forever. This problem was exposed while writing a new integration test for the repository settings.

#89601

Unit tests added that expose the bug consistently.
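
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the Elasticsearch code: the class `WatcherShutdownSketch` and its methods are hypothetical names, and interrupting the watcher thread from `stop` is an assumed way of unblocking the wait, chosen only to show the idea. The watcher thread awaits a `CountDownLatch` that only a task callback would release; if the node shuts down and the callback never fires, `stop` has to unblock the thread itself.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for a file watcher that waits on async tasks. */
public class WatcherShutdownSketch {
    // Released only when a task reports success or failure.
    private final CountDownLatch processingLatch = new CountDownLatch(1);
    // Lets start() return only after the watcher thread is running.
    private final CountDownLatch started = new CountDownLatch(1);
    private final Thread watcherThread = new Thread(this::watchLoop, "sketch-file-watcher");

    public void start() throws InterruptedException {
        watcherThread.start();
        started.await();
    }

    private void watchLoop() {
        started.countDown();
        try {
            // If the task is dropped during shutdown, nobody calls countDown(),
            // so without outside help this await would block forever.
            processingLatch.await();
        } catch (InterruptedException e) {
            // stop() interrupted us: restore the flag and leave the loop.
            Thread.currentThread().interrupt();
        }
    }

    /** Normal path: the task's success or failure callback releases the latch. */
    public void onTaskFinished() {
        processingLatch.countDown();
    }

    /** Assumed fix: interrupt the watcher so a pending await cannot hang shutdown. */
    public boolean stop(long timeoutMillis) throws InterruptedException {
        watcherThread.interrupt();
        watcherThread.join(timeoutMillis);
        return !watcherThread.isAlive();
    }

    public static void main(String[] args) throws Exception {
        WatcherShutdownSketch watcher = new WatcherShutdownSketch();
        watcher.start();
        // Simulate node shutdown: onTaskFinished() is never called.
        boolean terminated = watcher.stop(2000);
        System.out.println("watcher terminated: " + terminated);
    }
}
```

In this sketch, removing the `interrupt()` call from `stop` reproduces the hang: the `join` times out and the watcher thread stays alive, blocked on the latch.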

@grcevski grcevski added >bug :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team auto-backport-and-merge v8.5.0 v8.4.2 labels Sep 8, 2022
@elasticsearchmachine

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine

Hi @grcevski, I've created a changelog YAML for you.

@grcevski grcevski changed the title Fix deadlock bug exposed by the test Fix deadlock bug exposed by a test Sep 8, 2022

@ChrisHegarty ChrisHegarty left a comment

LGTM.

@grcevski grcevski merged commit 36ed4a5 into elastic:main Sep 8, 2022
@grcevski

grcevski commented Sep 8, 2022

Thanks Chris!

@grcevski grcevski deleted the bug/file_settings_shutdown branch September 8, 2022 17:23
grcevski added a commit to grcevski/elasticsearch that referenced this pull request Sep 8, 2022
@elasticsearchmachine

💚 Backport successful

Branch: 8.4

elasticsearchmachine pushed a commit that referenced this pull request Sep 8, 2022
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 9, 2022
* main: (34 commits)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  Fix segment stats in tsdb (elastic#89754)
  Synthetic _source: support dense_vector (elastic#89840)
  REST tests fetching fields with synthetic _source (elastic#89888)
  Do not deserialize back BytesTransportRequest to clone a request in MockTransportService (elastic#89926)
  Add SDK request logging to debug failures of S3BlobStoreRepositoryTests#testRequestStats (elastic#89912)
  Fix SnapshotStatusApisIT.testGetSnapshotsWithSnapshotInProgress (elastic#89925)
  Document synthetic source for text and keyword (elastic#89893)
  Fix CloneSnapshotIT.testRemoveFailedCloneFromCSWithQueuedSnapshotInProgress (elastic#89914)
  Add missing index.mapping.total_fields.limit setting to the target index (elastic#89875)
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 9, 2022
* main: (176 commits)
  Fix RandomSamplerAggregatorTests testAggregationSamplingNestedAggsScaled test failure (elastic#89958)
  [Downsampling] Replace document map with SMILE encoded doc (elastic#89495)
  Remove full cluster state from error logging in MasterService (elastic#89960)
  [ML] Truncate categorization fields (elastic#89827)
  [TSDB] Removed `summary` and `histogram` metric types (elastic#89937)
  Update testNodeSelectorRouting so that it does not depend on iteration order (elastic#89879)
  Make sure listener is resolved when file queue is cleared (elastic#89929)
  [Stable plugin api] Extensible annotation (elastic#89903)
  Fix double sending of response in TransportOpenIdConnectPrepareAuthenticationAction (elastic#89930)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  ...