[CI] DocsClientYamlTestSuiteIT test {yaml=reference/cat/nodes/line_361} failing #124103

Closed
elasticsearchmachine opened this issue Mar 5, 2025 · 10 comments · Fixed by #124684
Labels
:Core/Infra/Core (Core issues without another label) · low-risk (An open issue or test failure that is a low risk to future releases) · Team:Core/Infra (Meta label for core/infra team) · >test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Mar 5, 2025

Build Scans:

Reproduction Line:

gradlew ":docs:yamlRestTest" --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/cat/nodes/line_361}" -Dtests.seed=245967704EFCFB2C -Dtests.locale=rn-BI -Dtests.timezone=ACT -Druntime.java=24

Applicable branches:
8.x

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: Failure at [reference/cat/nodes:15]: field [$body] was expected to match the provided regex but didn't
Expected: ip        \s+heap.percent \s+ram.percent \s+cpu \s+load_1m \s+load_5m \s+load_15m \s+node.role \s+master \s+name\s* 127.0.0.1           \s+\d+ \s+\d+ \s+\d+    \s+(\d+\.\d+( \s+\d+\.\d+ \s+(\d+\.\d+)?)?)?                  \s+.+       \s+[*]      \s+.+\s*
     but: was "ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name\n127.0.0.1           59          39  -1                          cdfhilmrstw *      node-0\n"
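The mismatch can be reproduced outside the YAML runner with the sketch below. It assumes, as the mix of literal spaces and `\s+` in the expected pattern suggests, that the runner compiles the pattern with `Pattern.COMMENTS` and applies it with `find()`; the class and method names are hypothetical. The third `\d+` after `127.0.0.1`, which stands for the `cpu` column, cannot match `-1`, so the whole pattern fails.

```java
import java.util.regex.Pattern;

// Hypothetical standalone reproduction of the failing regex assertion.
final class CatNodesRegexRepro {
    // Expected pattern copied from the failure message. Under Pattern.COMMENTS,
    // unescaped whitespace in the pattern is ignored, so only the \s tokens matter.
    static final String EXPECTED =
        "ip        \\s+heap.percent \\s+ram.percent \\s+cpu \\s+load_1m \\s+load_5m \\s+load_15m"
            + " \\s+node.role \\s+master \\s+name\\s* 127.0.0.1           \\s+\\d+ \\s+\\d+ \\s+\\d+"
            + "    \\s+(\\d+\\.\\d+( \\s+\\d+\\.\\d+ \\s+(\\d+\\.\\d+)?)?)?"
            + "                  \\s+.+       \\s+[*]      \\s+.+\\s*";

    // Body returned on Windows with OpenJDK 24: cpu is -1 and the load columns are blank.
    static final String ACTUAL =
        "ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name\n"
            + "127.0.0.1           59          39  -1                          cdfhilmrstw *      node-0\n";

    // Assumption: the runner compiles with COMMENTS and checks with find().
    static boolean matches(String regex, String body) {
        return Pattern.compile(regex, Pattern.COMMENTS).matcher(body).find();
    }

    public static void main(String[] args) {
        System.out.println("matches: " + matches(EXPECTED, ACTUAL));
    }
}
```

Replacing `-1` with a non-negative integer in `ACTUAL` makes the same pattern match, which isolates the `cpu` column as the only cause.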

Issue Reasons:

  • [8.x] 2 consecutive failures in test test {yaml=reference/cat/nodes/line_361}
  • [8.x] 8 consecutive failures in step windows-2019_checkpart1_platform-support-windows
  • [8.x] 9 consecutive failures in step windows-2022_checkpart1_platform-support-windows
  • [8.x] 9 consecutive failures in step part-1-windows
  • [8.x] 26 failures in test test {yaml=reference/cat/nodes/line_361} (6.5% fail rate in 401 executions)
  • [8.x] 8 failures in step windows-2019_checkpart1_platform-support-windows (100.0% fail rate in 8 executions)
  • [8.x] 9 failures in step windows-2022_checkpart1_platform-support-windows (100.0% fail rate in 9 executions)
  • [8.x] 9 failures in step part-1-windows (100.0% fail rate in 9 executions)
  • [8.x] 9 failures in pipeline elasticsearch-periodic-platform-support (100.0% fail rate in 9 executions)
  • [8.x] 5 failures in pipeline elasticsearch-pull-request (8.6% fail rate in 58 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Mar 5, 2025
elasticsearchmachine added a commit that referenced this issue Mar 5, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch 9.0

Mute Reasons:

  • [9.0] 2 failures in test test {yaml=reference/cat/nodes/line_361} (1.7% fail rate in 118 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added Team:Delivery Meta label for Delivery team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Mar 5, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-delivery (Team:Delivery)

@elasticsearchmachine
Copy link
Collaborator Author

This has been muted on branch 8.18

Mute Reasons:

  • [8.18] 6 consecutive failures in step windows-2019_checkpart1_platform-support-windows
  • [8.18] 6 consecutive failures in step part-1-windows
  • [8.18] 5 consecutive failures in step windows-2022_checkpart1_platform-support-windows
  • [8.18] 17 failures in test test {yaml=reference/cat/nodes/line_361} (7.2% fail rate in 236 executions)
  • [8.18] 6 failures in step windows-2019_checkpart1_platform-support-windows (100.0% fail rate in 6 executions)
  • [8.18] 6 failures in step part-1-windows (100.0% fail rate in 6 executions)
  • [8.18] 5 failures in step windows-2022_checkpart1_platform-support-windows (100.0% fail rate in 5 executions)
  • [8.18] 6 failures in pipeline elasticsearch-periodic-platform-support (100.0% fail rate in 6 executions)
  • [8.18] 4 failures in pipeline elasticsearch-pull-request (12.1% fail rate in 33 executions)

Build Scans:

@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 9 consecutive failures in step windows-2022_checkpart1_platform-support-windows
  • [8.x] 9 consecutive failures in step part-1-windows
  • [8.x] 7 consecutive failures in step windows-2019_checkpart1_platform-support-windows
  • [8.x] 25 failures in test test {yaml=reference/cat/nodes/line_361} (6.3% fail rate in 400 executions)
  • [8.x] 9 failures in step windows-2022_checkpart1_platform-support-windows (100.0% fail rate in 9 executions)
  • [8.x] 9 failures in step part-1-windows (100.0% fail rate in 9 executions)
  • [8.x] 7 failures in step windows-2019_checkpart1_platform-support-windows (100.0% fail rate in 7 executions)
  • [8.x] 9 failures in pipeline elasticsearch-periodic-platform-support (100.0% fail rate in 9 executions)
  • [8.x] 5 failures in pipeline elasticsearch-pull-request (8.6% fail rate in 58 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Mar 8, 2025
@slobodanadamovic slobodanadamovic self-assigned this Mar 12, 2025
jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Mar 13, 2025
The .security index is created asynchronously on a cluster startup. This
affects some of the docs YAML tests in a way that they need to account
for the existence of the .security index or wait for the index to be
created and green. This PR disables the feature for docs YAML tests.
Disabling the feature in docs YAML tests will solve the flakiness
without affecting the coverage.

Resolves elastic#122343
Resolves elastic#121748
Resolves elastic#121611
Resolves elastic#121345
Resolves elastic#121338
Resolves elastic#121337
Resolves elastic#121288
Resolves elastic#121287
Resolves elastic#121867
Resolves elastic#122335
Resolves elastic#122681
Resolves elastic#121976
Resolves elastic#123094
Resolves elastic#123192
Resolves elastic#122983
Resolves elastic#124671
Resolves elastic#124103
@nielsbauman
Contributor

Reopening this because the failure in the Failure Message is still relevant.

@nielsbauman nielsbauman added :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team and removed :Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team labels Mar 13, 2025
@nielsbauman nielsbauman reopened this Mar 13, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-core-infra (Team:Core/Infra)

@ldematte
Contributor

This never worked with OpenJDK on Windows, but we bundle the Oracle JDK (at least on Windows), and we use that for tests launched via Gradle.

This is the OpenJDK implementation:

// Windows doesn't provide a loadavg primitive so this is stubbed out for now.
// It does have primitives (PDH API) to get CPU usage and run queue length.
// "\\Processor(_Total)\\% Processor Time", "\\System\\Processor Queue Length"
// If we wanted to implement loadavg on Windows, we have a few options:
//
// a) Query CPU usage and run queue length and "fake" an answer by
//    returning the CPU usage if it's under 100%, and the run queue
//    length otherwise.  It turns out that querying is pretty slow
//    on Windows, on the order of 200 microseconds on a fast machine.
//    Note that on the Windows the CPU usage value is the % usage
//    since the last time the API was called (and the first call
//    returns 100%), so we'd have to deal with that as well.
//
// b) Sample the "fake" answer using a sampling thread and store
//    the answer in a global variable.  The call to loadavg would
//    just return the value of the global, avoiding the slow query.
//
// c) Sample a better answer using exponential decay to smooth the
//    value.  This is basically the algorithm used by UNIX kernels.
//
// Note that sampling thread starvation could affect both (b) and (c).
int os::loadavg(double loadavg[], int nelem) {
  return -1;
}
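Option (c) above refers to the exponentially damped average that UNIX kernels use for load averages. The sketch below shows the idea with doubles rather than the fixed-point arithmetic a kernel would use; it is not HotSpot code, and the class name and the 5-second sample / 60-second period parameters are illustrative assumptions (they mirror how Linux derives its 1-minute figure).

```java
// Hypothetical illustration of option (c): an exponentially damped load average.
final class DecayingLoadAverage {
    private final double decay; // weight kept from the previous value: e^(-interval / period)
    private double value;       // starts at 0, like a freshly booted system

    DecayingLoadAverage(double sampleIntervalSeconds, double periodSeconds) {
        this.decay = Math.exp(-sampleIntervalSeconds / periodSeconds);
    }

    // Folds one sample (e.g. the run-queue length) into the smoothed average.
    double update(double sample) {
        value = value * decay + sample * (1.0 - decay);
        return value;
    }

    double get() {
        return value;
    }

    public static void main(String[] args) {
        // 5-second samples averaged over one minute.
        DecayingLoadAverage load1m = new DecayingLoadAverage(5, 60);
        for (int i = 0; i < 12; i++) {
            load1m.update(2.0); // one minute of a constant run-queue length of 2
        }
        // After one full period the average has covered about 1 - 1/e (~63%) of the step from 0 to 2.
        System.out.printf("load_1m after 60s: %.3f%n", load1m.get());
    }
}
```

Because each update assumes exactly one interval has elapsed, a starved sampling thread skews the result, which is the caveat the comment raises for (b) and (c).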

Apparently, the Oracle JDK has a different (more complete, proprietary?) management bean for Windows.
But there is no Oracle JDK 24 yet, so tests now always run on OpenJDK 24, and this test fails on Windows.
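From Java code, the standard way to read the system load average is `OperatingSystemMXBean.getSystemLoadAverage()`, which is documented to return a negative value when the load average is not available. In the minimal sketch below, the class and the `format` helper are hypothetical: the helper prints nothing for a negative value, mirroring the blank load_1m/load_5m/load_15m columns in the failing output, and is not meant to describe how Elasticsearch itself formats these columns.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.Locale;

// Hypothetical probe showing how a negative "not available" load average can be handled.
final class LoadAverageProbe {
    // Render a load average for a text table; negative means "not available", so leave it blank.
    static String format(double loadAverage) {
        if (loadAverage < 0) {
            return "";
        }
        // Locale.ROOT keeps '.' as the decimal separator regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "%.2f", loadAverage);
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double raw = os.getSystemLoadAverage(); // negative if the platform cannot provide it
        System.out.println("raw=" + raw + " formatted='" + format(raw) + "'");
    }
}
```

Formatting with `Locale.ROOT` matters in this setting because the reproduction line runs with a randomized locale (`rn-BI`), whose default decimal separator may not be `.`.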

I think it's reasonable to adjust the test to allow -1. We wouldn't catch cases where the CPU load percentage is unexpectedly -1, but I don't think these cat API doc tests should be the ones asserting that.
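A minimal sketch of that relaxation, assuming only the `cpu` column needs to accept `-1`: the third `\d+` after `127.0.0.1` becomes `-?\d+`. This is illustrative and not necessarily the exact pattern that was eventually used. It assumes the runner compiles the pattern with `Pattern.COMMENTS` and checks it with `find()`; the class name and the "healthy" row are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical check of a relaxed cat/nodes pattern that tolerates cpu = -1.
final class RelaxedCatNodesPattern {
    // Same pattern as the failing assertion, except the cpu column is -?\d+ instead of \d+.
    static final String RELAXED =
        "ip        \\s+heap.percent \\s+ram.percent \\s+cpu \\s+load_1m \\s+load_5m \\s+load_15m"
            + " \\s+node.role \\s+master \\s+name\\s* 127.0.0.1           \\s+\\d+ \\s+\\d+ \\s+-?\\d+"
            + "    \\s+(\\d+\\.\\d+( \\s+\\d+\\.\\d+ \\s+(\\d+\\.\\d+)?)?)?"
            + "                  \\s+.+       \\s+[*]      \\s+.+\\s*";

    static final String HEADER =
        "ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name\n";

    // Output from the Windows / OpenJDK 24 failure: cpu is -1, load columns are blank.
    static final String WINDOWS_BODY =
        HEADER + "127.0.0.1           59          39  -1                          cdfhilmrstw *      node-0\n";

    // A made-up healthy row with cpu and all three load averages present.
    static final String HEALTHY_BODY =
        HEADER + "127.0.0.1           59          39   3    1.20    1.10     1.00 cdfhilmrstw *      node-0\n";

    // Assumption: compiled with COMMENTS (unescaped whitespace ignored) and checked with find().
    static boolean matches(String body) {
        return Pattern.compile(RELAXED, Pattern.COMMENTS).matcher(body).find();
    }

    public static void main(String[] args) {
        System.out.println(matches(WINDOWS_BODY) + " " + matches(HEALTHY_BODY));
    }
}
```

Only the sign is relaxed; a non-numeric `cpu` value would still fail the assertion.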

@ldematte ldematte added the low-risk An open issue or test failure that is a low risk to future releases label Mar 13, 2025
@elasticsearchmachine elasticsearchmachine removed the needs:risk Requires assignment of a risk label (low, medium, blocker) label Mar 13, 2025
slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this issue Mar 13, 2025
The .security index is created asynchronously on a cluster startup. This
affects some of the docs YAML tests in a way that they need to account
for the existence of the .security index or wait for the index to be
created and green. This PR disables the feature for docs YAML tests.
Disabling the feature in docs YAML tests will solve the flakiness
without affecting the coverage.

Resolves elastic#122343
Resolves elastic#121748
Resolves elastic#121611
Resolves elastic#121345
Resolves elastic#121338
Resolves elastic#121337
Resolves elastic#121288
Resolves elastic#121287
Resolves elastic#121867
Resolves elastic#122335
Resolves elastic#122681
Resolves elastic#121976
Resolves elastic#123094
Resolves elastic#123192
Resolves elastic#122983
Resolves elastic#124671
Resolves elastic#124103

(cherry picked from commit cac356a)

# Conflicts:
#	muted-tests.yml
slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this issue Mar 13, 2025
slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this issue Mar 13, 2025
elasticsearchmachine pushed a commit that referenced this issue Mar 13, 2025

The .security index is created asynchronously on a cluster startup. This
affects some of the docs YAML tests in a way that they need to account
for the existence of the .security index or wait for the index to be
created and green. This PR disables the feature for docs YAML tests.
Disabling the feature in docs YAML tests will solve the flakiness
without affecting the coverage.

Resolves #122343
Resolves #121748
Resolves #121611
Resolves #121345
Resolves #121338
Resolves #121337
Resolves #121288
Resolves #121287
Resolves #121867
Resolves #122335
Resolves #122681
Resolves #121976
Resolves #123094
Resolves #123192
Resolves #122983
Resolves #124671
Resolves #124103

(cherry picked from commit cac356a)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this issue Mar 13, 2025
elasticsearchmachine pushed a commit that referenced this issue Mar 13, 2025
@ldematte
Contributor

This is no longer reproducible because we are transitioning to the new docs system. As we discussed on Slack, the fix belongs in the test: -1 is an acceptable value, and we already accept it in related cat API tests (see, for example, rest-api-spec/test/cat.nodes/10_basic.yml).

@nielsbauman
Contributor

@ldematte should we add the -1 fix to the 8.x branches to prevent the bot from reopening this (and related) issues?

@ldematte
Contributor

The only branch that still has the old docs is 8.18. We can add the fix to that branch if we want.
