[CI] RemoteClusterClientTests#testConnectAndExecuteRequest fails #41745
Comments
Pinging @elastic/es-search
Also seems to be an occasional issue on 7.x
The assertion that's failing checks that a remote node is connected through the remote cluster service (which comes from a mock in this case since it's an ESTestCase). The log shows test timeouts:
I can trigger this kind of error by adding a wait of more than 30s in RemoteClusterConnection#collectRemoteNodes() in the IOInterruptible function passed there, but I wonder why such a long wait could happen in practice, especially since this isn't an integration test. There has been another instance of this on 5.4 (https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-9/372/console), which also shows the same kind of timeout.
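For illustration only, here is a minimal, self-contained sketch of the failure mode described above: a connect step that blocks for longer than the caller's timeout, so the waiting future times out. The durations are scaled down and the class name is made up; none of this is the actual test or RemoteClusterConnection code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SlowConnectRepro {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        // Stand-in for the connect step hanging; in the real test the injected wait was > 30s.
        CompletableFuture<Void> connectFuture = CompletableFuture.runAsync(() -> {
            try {
                TimeUnit.SECONDS.sleep(2);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            // Stand-in for the 30s connect timeout the failing test runs into.
            connectFuture.get(1, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("connect attempt timed out, as in the reported failure");
        }
    }
}
```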
This part of the stack trace shows the ConnectionManager hangs somewhere when opening the internal connection. The future that gets interrupted is a listener passed to Transport#openConnection(), which in this case is probably the MockNioTransport.
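To make the described interaction concrete, a small hedged sketch follows: a hypothetical openConnection method stands in for Transport#openConnection(), the caller blocks on a future acting as the listener, and the waiting thread is interrupted while the connection is still pending, much like the InterruptedException in the stack trace. The names and structure here are illustrative, not the real transport API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class OpenConnectionHang {
    // Hypothetical stand-in for Transport#openConnection(node, profile, listener);
    // a real transport would complete the future once the channel is established.
    static void openConnection(CompletableFuture<String> listener) {
        // Intentionally never completes, modelling the hang inside the mock transport.
    }

    public static void main(String[] args) throws InterruptedException {
        CompletableFuture<String> connectionListener = new CompletableFuture<>();
        Thread waiter = new Thread(() -> {
            openConnection(connectionListener);
            try {
                connectionListener.get(); // blocks as long as the transport never answers
            } catch (InterruptedException e) {
                // This mirrors the InterruptedException seen when the test times out.
                System.out.println("waiting thread interrupted while the connection was still pending");
            } catch (ExecutionException e) {
                throw new RuntimeException(e);
            }
        });
        waiter.start();
        Thread.sleep(1000);
        waiter.interrupt(); // mirrors the test framework interrupting the stuck thread
        waiter.join();
    }
}
```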
The same problem as #41745 (comment) occurred in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/1360/, although the failing test was
Talked to @original-brownbear, who was investigating similar-looking problems elsewhere lately and who suggested, as a next step, using e.g. YourKit to identify potential locks and determine how much time threads spend waiting on them.
* Follow up to elastic#39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
* Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates elastic#41745
I opened #42000 to get us more useful logging in this situation
Dump Stacktrace on Slow IO-Thread Operations
* Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
* Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates #41745
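For readers unfamiliar with the approach, here is a generic sketch of the idea behind that change (not the actual #42000/#42572 implementation): a watchdog thread periodically checks how long a worker has been inside a callback and, while the worker is still blocked, dumps its current stack trace, so we see where it is stuck rather than only learning afterwards that it was slow. All names and thresholds below are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SlowCallbackWatchdog {
    private static final long THRESHOLD_MILLIS = 150; // hypothetical "slow" threshold
    private static final AtomicLong callbackStart = new AtomicLong(-1); // -1 means "not in a callback"

    public static void main(String[] args) throws InterruptedException {
        Thread ioThread = new Thread(() -> {
            callbackStart.set(System.currentTimeMillis());
            try {
                Thread.sleep(500); // stand-in for a slow read/write/connect callback
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                callbackStart.set(-1);
            }
        }, "mock-io-thread");

        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        watchdog.scheduleAtFixedRate(() -> {
            long start = callbackStart.get();
            if (start > 0 && System.currentTimeMillis() - start > THRESHOLD_MILLIS) {
                // Dump the stack while the thread is still stuck, so we see *where* it is blocked.
                System.out.println("io thread blocked for > " + THRESHOLD_MILLIS + "ms:");
                for (StackTraceElement frame : ioThread.getStackTrace()) {
                    System.out.println("\tat " + frame);
                }
            }
        }, 50, 50, TimeUnit.MILLISECONDS);

        ioThread.start();
        ioThread.join();
        watchdog.shutdownNow();
    }
}
```

The key point is that the stack is sampled while the callback is still running; a periodic sampler is enough for diagnostics and avoids instrumenting the callback itself.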
This appears to have failed again on master on May 4th (https://scans.gradle.com/s/ftmdnpa26cdty#tests)
Looks like the logging from #42572 wasn't backported to 7.2, unfortunately. I'll take a look to see if the logs reveal anything interesting anyway.
The test logs from the latest failure on 7.2
@cbuescher I'm on PTO starting effectively right now, so I might not get to this, but feel free to backport this to 7.2 to get a better shot at catching the deadlock here. If you backport the logging improvements though, please make sure to backport exactly #42572 and not the underlying commit to
Just checked again; we got another one with potentially better logging on master on 24 June 2019 here:
@cbuescher looks like a missed call to
Didn't mean to close this, only added more logging. I will revisit this in time to see if we got more of these failures. |
@original-brownbear do you think #44622 fixed this, or should we keep it open for investigation for a bit longer?
Let's wait for #44939, which should definitely fix this (or at least have it fail very differently :D).
We currently block the transport thread on startup, which has caused test failures. I think this is some kind of deadlock situation. I don't think we should even block a transport thread, and there's also no need to do so. We can just reject requests as long as we're not fully set up. Note that the HTTP layer is only started much later (after we've completed full start up of the transport layer), so that one should be completely unaffected by this. Closes #41745
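A hedged sketch of the fix described in that commit message, using illustrative names rather than the real Elasticsearch classes: instead of parking the transport thread until startup completes, track readiness with a volatile flag and reject requests that arrive before the node is fully set up.

```java
public class StartupGate {
    // Flipped exactly once when the transport layer has finished starting up.
    private volatile boolean started = false;

    void markStarted() {
        started = true;
    }

    // Called on the transport thread for every inbound request; never blocks.
    void handleRequest(String request) {
        if (!started) {
            // Reject instead of blocking the transport thread; callers can retry later.
            throw new IllegalStateException("node is not yet ready to handle request: " + request);
        }
        System.out.println("handling " + request);
    }

    public static void main(String[] args) {
        StartupGate gate = new StartupGate();
        try {
            gate.handleRequest("internal:cluster/remote/ping");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        gate.markStarted();
        gate.handleRequest("internal:cluster/remote/ping");
    }
}
```

Rejecting with an exception keeps the transport thread free, which avoids the deadlock the commit message describes.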
On master: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-9/369/console
Couldn't reproduce with: