[CI] testSecurityActionsByLicenseType: Failed to execute phase [query] #30301

Closed
DaveCTurner opened this issue May 1, 2018 · 7 comments · Fixed by #38815
Assignees
Labels
:Security/Security Security issues without another label >test-failure Triaged test failures from CI

Comments

@DaveCTurner
Contributor

DaveCTurner commented May 1, 2018

In https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+multijob-unix-compatibility/os=fedora/16/console, org.elasticsearch.license.LicensingTests#testSecurityActionsByLicenseType failed. The REPRODUCE LINE doesn't reproduce a failure:

./gradlew :x-pack:plugin:security:test -Dtests.seed=993121463E9AC0EF -Dtests.class=org.elasticsearch.license.LicensingTests -Dtests.method="testSecurityActionsByLicenseType" -Dtests.security.manager=true -Dtests.locale=sr -Dtests.timezone=Africa/Accra

This stack trace appears, but there's nothing obviously related to it in the rest of the logs:

Failed to execute phase [query], all shards failed
  at __randomizedtesting.SeedInfo.seed([993121463E9AC0EF:5C5ECE7262D69883]:0)
  at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:288)
  at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:128)
  at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:249)
  at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:101)
  at org.elasticsearch.action.search.InitialSearchPhase.access$100(InitialSearchPhase.java:48)
  at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$onFailure$1(InitialSearchPhase.java:222)
  at org.elasticsearch.action.search.InitialSearchPhase.maybeFork(InitialSearchPhase.java:176)
  at org.elasticsearch.action.search.InitialSearchPhase.access$000(InitialSearchPhase.java:48)
  at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:222)
  at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73)
  at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51)
  at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:527)
  at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1095)
  at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1188)
  at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1172)
  at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:66)
  at org.elasticsearch.action.search.SearchTransportService$6$1.onFailure(SearchTransportService.java:385)
  at org.elasticsearch.search.SearchService$2.onFailure(SearchService.java:341)
  at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:335)
  at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:329)
  at org.elasticsearch.search.SearchService$3.doRun(SearchService.java:1019)
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724)
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Security/Security Security issues without another label labels May 1, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-security

@DaveCTurner
Contributor Author

Here are the logs related to this test: consoleText.txt.gz

@jdconrad
Contributor

jdconrad commented May 7, 2018

@polyfractal polyfractal added >test-failure Triaged test failures from CI and removed >test Issues or PRs that are addressing/adding tests labels May 9, 2018
@jaymode
Member

jaymode commented May 16, 2018

@jaymode jaymode self-assigned this May 16, 2018
jaymode added a commit that referenced this issue May 16, 2018
This commit increases the logging level around search to aid in
debugging failures in LicensingTests#testSecurityActionsByLicenseType
where we are seeing an "all shards failed" error while trying to search the
security index.

See #30301
@jaymode
Member

jaymode commented May 16, 2018

I pushed a change to increase logging around search to see exactly what is causing all of the shards to fail during the search.

jaymode added a commit to jaymode/elasticsearch that referenced this issue May 21, 2018
This commit changes the wait for a few netty threads to wait for these
threads to complete after the cluster has stopped. Previously, we were
waiting for these threads before the cluster was actually stopped; the
cluster is stopped in an AfterClass method of ESIntegTestCase, while
the wait was performed in the AfterClass of a class that extended
ESIntegTestCase, which is always executed before the AfterClass of
ESIntegTestCase.

Now, the wait is contained in an ExternalResource ClassRule that
implements the waiting for the threads to terminate in the after
method. This rule is executed after the AfterClass method in
ESIntegTestCase. The same fix has also been applied in
SecuritySingleNodeTestCase.

Closes elastic#30301
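
The ordering problem described in this commit message can be illustrated without JUnit. What follows is a minimal sketch in plain Java, not the code from the commit: `ThreadTerminationWaiter`, `TeardownOrderDemo`, and the thread name `demo-netty-worker-1` are hypothetical, and interrupting the worker merely stands in for stopping the cluster. It shows only that waiting for a thread before its owner has shut down times out, while the same wait after shutdown completes.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not taken from the Elasticsearch test framework. */
final class ThreadTerminationWaiter {

    /**
     * Polls until no live thread has a name starting with {@code namePrefix}.
     * Returns false if such a thread is still alive when the timeout elapses.
     */
    static boolean awaitTermination(String namePrefix, long timeout, TimeUnit unit)
            throws InterruptedException {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            boolean anyAlive = Thread.getAllStackTraces().keySet().stream()
                    .anyMatch(t -> t.isAlive() && t.getName().startsWith(namePrefix));
            if (!anyAlive) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(10);
        }
    }
}

final class TeardownOrderDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a transport thread that only exits once the cluster stops.
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // Interrupted: treat this as the shutdown signal and exit.
            }
        }, "demo-netty-worker-1");
        worker.setDaemon(true);
        worker.start();

        // Wrong order: waiting while the "cluster" is still running times out.
        boolean before = ThreadTerminationWaiter.awaitTermination(
                "demo-netty-worker", 100, TimeUnit.MILLISECONDS);

        worker.interrupt(); // stands in for stopping the cluster

        // Right order: once the owner has stopped, the wait completes.
        boolean after = ThreadTerminationWaiter.awaitTermination(
                "demo-netty-worker", 5, TimeUnit.SECONDS);

        System.out.println("terminated before stop: " + before + ", after stop: " + after);
    }
}
```

In a JUnit 4 suite the same idea amounts to placing the wait in a hook that runs after the base class has stopped the cluster, which is what the commit message describes doing with a class rule.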
ywelsch pushed a commit to ywelsch/elasticsearch that referenced this issue May 23, 2018
This commit increases the logging level around search to aid in
debugging failures in LicensingTests#testSecurityActionsByLicenseType
where we are seeing an "all shards failed" error while trying to search the
security index.

See elastic#30301
@danielmitterdorfer
Member

Another instance of this on 6.2 in https://internal-ci.elastic.co/job/elastic+x-pack-elasticsearch+6.2+matrix-java-periodic/ES_BUILD_JAVA=java9,ES_RUNTIME_JAVA=java8,nodes=linux/94/console

reproduction line:

./gradlew :x-pack-elasticsearch:plugin:security:test \
  -Dtests.seed=D37D8894A94058BA \
  -Dtests.class=org.elasticsearch.license.LicensingTests \
  -Dtests.method="testSecurityActionsByLicenseType" \
  -Dtests.security.manager=true \
  -Dtests.locale=fi \
  -Dtests.timezone=Africa/Porto-Novo

build log

jaymode added a commit to jaymode/elasticsearch that referenced this issue Feb 12, 2019
This change updates the authentication service to use a consistent view
of the realms based on the license state at the start of
authentication. Without this, the license can change during
authentication of a request and it will result in a failure if the
realm that extracted the token is no longer in the realm list. This
manifests in some tests as an authentication failure that should never
really happen; one example would be the test framework's transport
client user should always have a successful authentication but in the
LicensingTests this can fail and will show up as a
NoNodeAvailableException.

Additionally, the licensing tests have been updated to ensure that
there is consistency when changing the license. The license is changed
by modifying the internal xpack license state on each node, which has
no protection against being changed by some pending cluster action. The
methods to disable and enable now ensure we have a green cluster and
that the cluster is consistent before returning.

Closes elastic#30301
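
The second half of this commit message (not returning from the enable/disable helpers until the cluster is consistent) can be sketched as a bounded polling wait. This is an illustrative sketch under stated assumptions, not the test code from the commit: `ConsistencyWait`, `DemoNodeLicenseState`, and `DemoLicenseToggler` are hypothetical names, and a per-node boolean stands in for the internal license state. The real tests additionally wait for a green cluster, which this sketch does not model.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper. */
final class ConsistencyWait {

    /** Polls {@code condition} with a capped backoff; returns false on timeout. */
    static boolean awaitTrue(BooleanSupplier condition, long timeout, TimeUnit unit)
            throws InterruptedException {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        long sleepMillis = 1;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
        return true;
    }
}

/** Hypothetical stand-in for the license state held by one node. */
final class DemoNodeLicenseState {
    private volatile boolean securityEnabled = true;

    void update(boolean enabled) {
        securityEnabled = enabled;
    }

    boolean isSecurityEnabled() {
        return securityEnabled;
    }
}

final class DemoLicenseToggler {

    /** Disables security on every node and returns only once all nodes agree. */
    static boolean disableLicensing(List<DemoNodeLicenseState> nodes)
            throws InterruptedException {
        for (DemoNodeLicenseState node : nodes) {
            node.update(false);
        }
        // Do not hand control back to the test until every node reports the new state.
        return ConsistencyWait.awaitTrue(
                () -> nodes.stream().noneMatch(DemoNodeLicenseState::isSecurityEnabled),
                10, TimeUnit.SECONDS);
    }
}
```

Returning a boolean instead of throwing leaves it to the caller to decide whether an inconsistent cluster should fail the test outright.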
@jaymode
Member

jaymode commented Feb 12, 2019

I haven't been able to find a reproduction with the increased logging, but I did spot how there could be an issue with the test. The test relies on updating the internal license state of a node but does not have any protection against a new change overwriting the changed license state.

This test also triggers a NoNodeAvailableException when a node's client is trying to talk to itself. The underlying cause is an authentication failure due to the license state changing in the middle of authentication. A token gets extracted but cannot be authenticated because the realm is no longer returned by the Realms method. While this would be rare in production scenarios, this is a real bug.
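
To make the race concrete, here is a minimal sketch of the "consistent view" idea: read the license-dependent realm list once at the start of authentication and use that same snapshot for both token extraction and authentication. It is an illustration under stated assumptions, not the actual authentication service code: `DemoRealm`, `FixedTokenRealm`, and `SnapshotAuthenticator` are hypothetical, and a `Supplier` stands in for whatever returns the realms allowed by the current license.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical, heavily simplified realm; not the x-pack Realm class. */
interface DemoRealm {
    String name();

    Optional<String> extractToken(String header);

    boolean authenticate(String token);
}

/** Toy realm that accepts a single bearer token. */
final class FixedTokenRealm implements DemoRealm {
    private final String name;
    private final String secret;

    FixedTokenRealm(String name, String secret) {
        this.name = name;
        this.secret = secret;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public Optional<String> extractToken(String header) {
        return header.startsWith("Bearer ")
                ? Optional.of(header.substring("Bearer ".length()))
                : Optional.empty();
    }

    @Override
    public boolean authenticate(String token) {
        return secret.equals(token);
    }
}

final class SnapshotAuthenticator {
    // Assumption: each call returns the realms permitted by the license at that moment.
    private final Supplier<List<DemoRealm>> licensedRealms;

    SnapshotAuthenticator(Supplier<List<DemoRealm>> licensedRealms) {
        this.licensedRealms = licensedRealms;
    }

    /** Returns the name of the realm that authenticated the request, if any. */
    Optional<String> authenticate(String header) {
        // Read the license-dependent list exactly once per request.
        final List<DemoRealm> realms = new ArrayList<>(licensedRealms.get());
        for (DemoRealm extractor : realms) {
            Optional<String> token = extractor.extractToken(header);
            if (!token.isPresent()) {
                continue;
            }
            // Same snapshot: the realm that extracted the token is still present here,
            // even if the license has changed since the request started.
            for (DemoRealm realm : realms) {
                if (realm.authenticate(token.get())) {
                    return Optional.of(realm.name());
                }
            }
            return Optional.empty();
        }
        return Optional.empty();
    }
}
```

If the supplier were consulted a second time for the authentication step, a license change between the two reads could drop the realm that extracted the token, which is the failure mode described above.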

I've opened #38815 to address these items and will have that PR close this issue. If we see failures, we can re-open or create a new issue.

jaymode added a commit that referenced this issue Feb 14, 2019
This change updates the authentication service to use a consistent view
of the realms based on the license state at the start of
authentication. Without this, the license can change during
authentication of a request and it will result in a failure if the
realm that extracted the token is no longer in the realm list. This
manifests in some tests as an authentication failure that should never
really happen; one example would be the test framework's transport
client user should always have a successful authentication but in the
LicensingTests this can fail and will show up as a
NoNodeAvailableException.

Additionally, the licensing tests have been updated to ensure that
there is consistency when changing the license. The license is changed
by modifying the internal xpack license state on each node, which has
no protection against being changed by some pending cluster action. The
methods to disable and enable now ensure we have a green cluster and
that the cluster is consistent before returning.

Closes #30301
6 participants