Add Netty ByteBuf Leak Check to REST Test Clusters #64528
Conversation
We do this check in all tests that inherit from `EsTestCase`, but we didn't check for `ByteBuf` leaks in REST test clusters, which means we have very little coverage of the REST layer. With recent reports of very rare leak warnings in logs, I think it's worthwhile to do this check in REST tests as well.

NOTE: I'm not sure this way of doing the check is in line with how we do these things in the test infrastructure. Happy to rework this and do it differently, but it would be great to have this check in some form :)
Pinging @elastic/es-core-infra (:Core/Infra/Build)
Pinging @elastic/es-distributed (:Distributed/Network)
ping @mark-vieira just in case this got lost :) It's not super urgent though.
Sorry this slipped through the cracks @original-brownbear. The implementation seems fine, but my main concern is that we could introduce a lot of instability in CI for hard-to-track-down leaks. Given that CI right now is already quite unstable, I think it might be worth bringing this up for further discussion with the team.
For example, rather than enable this across the board on all builds, perhaps we could introduce a single periodic build with this enabled, so if we do legitimately introduce a leak, we don't disrupt all of CI.
An alternative idea: could we enable this everywhere but include instructions for muting it in the event that it does start to yield failures in future? I would rather we ran most builds with leak detection switched on so that we get told about leaks ASAP -- it's true that they can be hard to track down, but the sooner we hear about them the fewer commits we have to look at to find the culprit.
That seems fine as well, assuming it's convenient to do so. I still think this is one of those things that probably deserves a shout out to the ES devs, as it's something that is going to affect every developer's workflow. I have no intuition about how common the introduction of these leaks might be, or how sensitive Netty's detection might be, so maybe it's a non-issue, but I just want to be very conservative about adding broad failure conditions to our already tenuous CI situation.
++ I'd see this as something similar to running with
In general there should not be any leaks, ever. The detection at paranoid level (what I set in this PR) will catch close to 100% of leaks. If there is a bug that causes a leak consistently (i.e. makes every build red), that's a very serious bug I'd say and should be fixed right away anyway. If it's a bug that occurs sporadically, it should be easy to track down with the help of paranoid-level logging as well.
I'd assume this to be a non-issue. If we had a consistent memory leak it would show in Cloud production logs, which it doesn't. => Does this PR generally look ok? Should I add a way to mute this check and some documentation around that to it?
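For context, a minimal sketch of what paranoid-level detection means on the Netty side (illustrative code, not from this PR; it assumes the stock `ResourceLeakDetector` API and the `io.netty.leakDetection.level` system property):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.util.ResourceLeakDetector;

public class LeakDetectionSketch {
    public static void main(String[] args) throws Exception {
        // PARANOID tracks every allocation (roughly the same effect as running with
        // -Dio.netty.leakDetection.level=paranoid), which is why close to 100% of leaks get reported.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

        ByteBuf buf = PooledByteBufAllocator.DEFAULT.buffer(64);
        buf.writeInt(42);
        buf = null; // dropped without release() -> this is a leak

        // A leak is only reported after the leaked buffer becomes unreachable, is garbage
        // collected, and Netty next exercises its leak-tracking machinery (e.g. on a later
        // tracked allocation). That GC dependency is why a node shutting down early can,
        // in rare cases, miss a leak.
        System.gc();
        Thread.sleep(100);
        PooledByteBufAllocator.DEFAULT.buffer(1).release(); // may now trigger the "LEAK:" report
    }
}
```

Whether the report actually shows up depends on the logging backend that Netty's internal logger is wired to.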
One more question, do we expect this to add any significant performance hit to our test runtimes?
From the experience of running with these settings in unit tests and internal cluster tests already, I'd say no. Handling network requests is overall a pretty small contributor to total test runtime IMO. I couldn't see a measurable difference in runtimes locally when I ran this in a loop.
Sounds good. I'd say let's put in some toggle to easily turn this off in case something slips into a mainline branch and takes a bit to track down, as @DaveCTurner suggested.
Yes please, and ideally make it easy to determine when/how to mute it in the exception message itself.
Jenkins run elasticsearch-ci/2 (unrelated, known watcher issue)
Thanks David and Mark, I added a switch similar to what we have for the BwC tests now. Let me know what you think.
LGTM with a few caveats
- I'm relatively unfamiliar with the testclusters code.
- We might regret not having tests for this logic in future; it was already quite complicated and this makes it more so.
- I am taking it on trust that this actually detects the leaks it claims to 😄
Therefore I'm leaving the final OK to Mark or someone else from the build team.
Manually verified by adding some
Wouldn't be too hard in theory to add a real test for this I guess (my best idea would be some purposely broken plugin). The only painful thing about this would be that there is a tiny chance that a leak wouldn't be detected before a node shuts down (since the leak detection is tied to GC), making it a bit more involved to set all of this up. Not sure it's worth the time that would have to be invested into this at this point? (could add a TODO?)
    }
    }
    if (foundNettyLeaks) {
        final boolean leakTestsEnabled = (Boolean) project.getExtensions().getExtraProperties().get("netty_leak_tests_enabled");
We need to make this lenient. Calling `get()` here will throw an exception if that extra property isn't defined, which it may not be for external users of this plugin. We need to check for its existence first.
Fixed :)
    }
    if (foundNettyLeaks) {
        final ExtraPropertiesExtension extension = project.getExtensions().getExtraProperties();
        final boolean leakTestsEnabled = extension.has("netty_leak_tests_enabled") == false
I think we want this to be `final boolean leakTestsEnabled = extension.has("netty_leak_tests_enabled") && (Boolean) extension.get("netty_leak_tests_enabled");`. For external plugin authors we want this disabled by default.
Are we sure about that? Anything they build that has Netty leaks will be broken from the get-go, so why deny them this functionality by default? (I don't see how we would document it in a way that would make people aware of its existence.) I basically chose defaulting to `true` because this is a pretty catastrophic issue.
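As a sketch of the two defaulting strategies being debated here, using the same Gradle `ExtraPropertiesExtension` calls that appear in the diff above (the wrapper class and method names are only illustrative):

```java
import org.gradle.api.Project;
import org.gradle.api.plugins.ExtraPropertiesExtension;

final class NettyLeakCheckToggle {

    /** Default-false: the leak check only runs when the build explicitly opts in. */
    static boolean enabledDefaultFalse(Project project) {
        ExtraPropertiesExtension ext = project.getExtensions().getExtraProperties();
        return ext.has("netty_leak_tests_enabled") && (Boolean) ext.get("netty_leak_tests_enabled");
    }

    /** Default-true: the leak check runs unless the build explicitly opts out. */
    static boolean enabledDefaultTrue(Project project) {
        ExtraPropertiesExtension ext = project.getExtensions().getExtraProperties();
        return ext.has("netty_leak_tests_enabled") == false
            || (Boolean) ext.get("netty_leak_tests_enabled");
    }
}
```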
I'll defer to @rjernst on this one since it has plugin authoring implications.
In the past we have erred on the side of forcing anything we deem relevant to our plugins/modules to be used by all plugin authors. I think we have found this to be too burdensome on those users, as the vast majority don't care about our internal rules. I would say do not force this on plugin authors at first.
I'm also wondering whether, instead of having this on for all builds, we should have it as another CI job. It makes sense to set it for tests inside the netty module, since that is testing our Netty code, but forcing this on all developers seems like a step backwards towards the old pattern of "let's make everything an integration test and we will naturally catch distributed bugs". The cost is high when developers outside the distributed area hit an issue. I also do not think we should reuse the build.gradle based setting like with BwC tests, as that was specifically added to enable backports, while this seems like a general switch to disable the testing globally "just in case", which to me indicates it should be an isolated test run periodically.
> I think we have found this to be too burdensome on those users as the vast majority don't care about our internal rules.

This isn't an arbitrary code style rule though? Anything you build that leaks Netty buffers will be fundamentally broken and will OOM eventually. Also note that `EsTestCase` does similar checks for both Netty and our self-managed buffer pool by default for all unit tests, so I don't see why being more lax about this in integration tests would be necessary.

> this seems like a general switch to disable the testing globally "just in case", which to me indicates it should be an isolated test run periodically.

The thing is, running this in a periodic job makes it a lot less useful. David's argument here #64528 (comment) makes a lot of sense, and also, as explained above, in the real world this is expected to fail at an incredibly low rate (currently I'm seeing leak warnings in Cloud logs at a frequency of less than 10 globally per day, and an even lower rate for `7.x` versions). Again, if this triggers at any meaningful frequency, something is very broken.
> This isn't an arbitrary code style rule though? Anything you build that leaks Netty buffers will be fundamentally broken and will OOM eventually.

True, but we can't anticipate all plugin authors' use cases or intentions. What if they are testing something that purposefully creates a leak, etc.? Now they can't build their plugin. I vote for disabling this externally only because we've been so disruptive in the past and I don't want to risk any more side effects from stuff like this. Of course, external plugin authors that are doing `ByteBuf` allocation in their plugins (probably not common) should have leak detection enabled, but I agree with Ryan that we should lean towards being less prescriptive there, simply because historically it has caused lots of problems.
RE: the periodic job, we can start with doing this across the board and then potentially move it into a periodic job if it becomes too disruptive or costly.
I'm fine with moving to a periodic job as necessary, but then I don't think we should have an escape hatch. If it is problematic, let's move it at that time, rather than try to proactively allow for disabling across the board.
I think "problematic" needs to take into consideration the existence of an escape hatch. Just as we do for BWC testing (which is included in the pr/merge workflow). We would not add any check to the build that we could not somehow mute temporarily, and this is no different.
Muting temporarily does not mean it's problematic. It's a necessary tool to keep work flowing. If this becomes the norm (i.e. it's disable more often than it isn't) then that's very different.
Alright :) I disabled the check by default now so that plugin authors aren't affected. I'd still rather not make this a separate CI task for the previously stated reasons. Let me know what you think.
Jenkins run elasticsearch-ci/default-distro (various known but unrelated failures ...)
One comment about log level but otherwise LGTM.
+ "\"netty_leak_tests_enabled\" to false in the root level build.gradle file" | ||
); | ||
} else { | ||
LOGGER.error( |
Let's log this at `warn` level since we aren't failing the build.
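For readers without the full diff, a rough sketch of the shape the check takes after this suggestion: fail the task when the toggle is enabled, otherwise only warn. The exception type, logger field, and message text here are illustrative, not taken from the PR:

```java
if (foundNettyLeaks) {
    if (leakTestsEnabled) {
        // Fail loudly: the message tells developers how to temporarily mute the check.
        throw new IllegalStateException(
            "Netty ByteBuf leaks were detected in the test cluster logs. To temporarily mute this check, set "
                + "\"netty_leak_tests_enabled\" to false in the root level build.gradle file"
        );
    } else {
        // Check disabled (e.g. for external plugin builds): surface the leaks but don't fail the build.
        LOGGER.warn("Netty ByteBuf leaks were detected in the test cluster logs, but the leak check is disabled.");
    }
}
```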
Moving this comment out to the top level: I don't think we should have the escape hatch. If we find this is causing churn for developers or CI, we can disable it by removing it as the default and moving it out to a dedicated job. We shouldn't need a Gradle-level flag to disable/track.
@original-brownbear what's the likelihood of a change introducing a leak not being detected in PR checks but finding its way into intake? How accurate is the Netty leak detection? As I understand it, only a portion of buffers are instrumented, yes?
I would expect any leak we find to be extremely low frequency, so for an individual leak the chance of not being detected in a PR is probably very high. That said, it is extremely unlikely to run into one to begin with, and this is mainly a guard against bugs in Netty, extreme corner cases in exception handling, etc. We do have a number of unit tests that would catch an obvious leak in every PR.
With the settings used here essentially every leak will be caught. At "paranoid" level ~100% of buffers are instrumented when they are GCed, so only those few buffers that weren't GCed yet when a node shut down are technically not instrumented.
Would be fine by me as well to remove the hatch. Again, there's no way this will lead to any high-frequency failures unless there is a serious bug somewhere, because all our UTs and internal cluster tests (those that use Netty at least) run the same check. So this is really just to get some more coverage on the REST layer, but all the underlying logic is heavily tested for this already anyway.
👍
@rjernst dropped the escape hatch now :) |
LGTM
Thanks everyone!