Add Netty ByteBuf Leak Check to REST Test Clusters #64528
Conversation
We do this check in all tests that inherit from `EsTestCase`, but we didn't check for `ByteBuf` leaks in REST test clusters, which means we have very little coverage of the REST layer. With recent reports of very rare leak warnings in logs, I think it's worthwhile to do this check in REST tests as well.

NOTE: I'm not sure this way of doing the check is in line with how we do these things in the test infrastructure. Happy to rework this and do it differently, but it would be great to have this check in some form :)
Pinging @elastic/es-core-infra (:Core/Infra/Build)
Pinging @elastic/es-distributed (:Distributed/Network)
ping @mark-vieira just in case this got lost :) It's not super urgent though.
Sorry this slipped through the cracks @original-brownbear. The implementation seems fine, but my main concern is that we could introduce a lot of instability in CI for hard-to-track-down leaks. Given that CI right now is already quite unstable, I think it might be worth bringing this up for further discussion with the team.
For example, rather than enable this across the board on all builds, perhaps we could introduce a single periodic build with this enabled, so if we do legitimately introduce a leak, we don't disrupt all of CI.
An alternative idea: could we enable this everywhere but include instructions for muting it in the event that it does start to yield failures in future? I would rather we ran most builds with leak detection switched on so that we get told about leaks ASAP -- it's true that they can be hard to track down, but the sooner we hear about them the fewer commits we have to look at to find the culprit.
That seems fine as well, assuming it's convenient to do so. I still think this is one of those things that probably deserves a shout out to the ES devs, as it's something that is going to affect every developer's workflow. I have no intuition about how common the introduction of these leaks might be, or how sensitive Netty's detection might be, so maybe it's a non-issue, but I just want to be very conservative about adding broad failure conditions to our already tenuous CI situation.
++ I'd see this as something similar to running with
In general there should not be any leaks, ever. The detection at paranoid level (what I set in this PR) will catch close to 100% of leaks. If there is a bug that causes a leak consistently (i.e. makes every build red), that's a very serious bug I'd say and should be fixed right away anyway. If it's a bug that occurs sporadically, it should be easy to track down with the help of paranoid-level logging as well.
I'd assume this to be a non-issue. If we had a consistent memory leak it would show in Cloud production logs, which it doesn't. => Does this PR generally look ok? Should I add a way to mute this check and some documentation around that to it?
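For context, a minimal sketch of what paranoid-level detection means on the Netty side (illustrative code, not from this PR; it assumes the stock `ResourceLeakDetector` API and the `io.netty.leakDetection.level` system property):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.util.ResourceLeakDetector;

public class LeakDetectionSketch {
    public static void main(String[] args) throws Exception {
        // PARANOID tracks every allocation (roughly the same effect as running with
        // -Dio.netty.leakDetection.level=paranoid), which is why close to 100% of leaks get reported.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

        ByteBuf buf = PooledByteBufAllocator.DEFAULT.buffer(64);
        buf.writeInt(42);
        buf = null; // dropped without release() -> this is a leak

        // A leak is only reported after the leaked buffer becomes unreachable, is garbage
        // collected, and Netty next exercises its leak-tracking machinery (e.g. on a later
        // tracked allocation). That GC dependency is why a node shutting down early can,
        // in rare cases, miss a leak.
        System.gc();
        Thread.sleep(100);
        PooledByteBufAllocator.DEFAULT.buffer(1).release(); // may now trigger the "LEAK:" report
    }
}
```

Whether the report actually shows up depends on the logging backend that Netty's internal logger is wired to.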
One more question, do we expect this to add any significant performance hit to our test runtimes?
From the experience of running with these settings in unit tests and internal cluster tests already, I'd say no. Handling network requests is overall a pretty small contributor to total test runtime IMO. I couldn't see a measurable difference in runtimes locally when I ran this in a loop.
Sounds good. I'd say let's put in some toggle to easily turn this off in case something slips into a mainline branch and takes a bit to track down, as @DaveCTurner suggested.
Yes please, and ideally make it easy to determine when/how to mute it in the exception message itself.
Jenkins run elasticsearch-ci/2 (unrelated, known watcher issue)
Thanks David and Mark, I added a switch similar to what we have for the BwC tests now. Let me know what you think.
LGTM with a few caveats
- I'm relatively unfamiliar with the testclusters code.
- We might regret not having tests for this logic in future; it was already quite complicated and this makes it more so.
- I am taking it on trust that this actually detects the leaks it claims to 😄
Therefore I'm leaving the final OK to Mark or someone else from the build team.
Manually verified by adding some
Wouldn't be too hard in theory to add a real test for this I guess (my best idea would be some purposely broken plugin). The only painful thing about this would be that there is a tiny chance that a leak wouldn't be detected before a node shuts down (since the leak detection is tied to GC), making it a bit more involved to set all of this up. Not sure it's worth the time that would have to be invested into this at this point? (could add a TODO?)
    }
    }
    if (foundNettyLeaks) {
        final boolean leakTestsEnabled = (Boolean) project.getExtensions().getExtraProperties().get("netty_leak_tests_enabled");
We need to make this lenient. Calling `get()` here will throw an exception if that extra property isn't defined, which it may not be for external users of this plugin. We need to check for its existence first.
Fixed :)
    }
    if (foundNettyLeaks) {
        final ExtraPropertiesExtension extension = project.getExtensions().getExtraProperties();
        final boolean leakTestsEnabled = extension.has("netty_leak_tests_enabled") == false
I think we want this to be `final boolean leakTestsEnabled = extension.has("netty_leak_tests_enabled") && (Boolean) extension.get("netty_leak_tests_enabled");`. For external plugin authors we want this disabled by default.
Are we sure about that? Anything they build that has Netty leaks will be broken from the get-go, so why deny them this functionality by default? (I don't see how we would document it in a way that would make people aware of its existence.) I basically chose defaulting to `true` because this is a pretty catastrophic issue.
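As a sketch of the two defaulting strategies being debated here, using the same Gradle `ExtraPropertiesExtension` calls that appear in the diff above (the wrapper class and method names are only illustrative):

```java
import org.gradle.api.Project;
import org.gradle.api.plugins.ExtraPropertiesExtension;

final class NettyLeakCheckToggle {

    /** Default-false: the leak check only runs when the build explicitly opts in. */
    static boolean enabledDefaultFalse(Project project) {
        ExtraPropertiesExtension ext = project.getExtensions().getExtraProperties();
        return ext.has("netty_leak_tests_enabled") && (Boolean) ext.get("netty_leak_tests_enabled");
    }

    /** Default-true: the leak check runs unless the build explicitly opts out. */
    static boolean enabledDefaultTrue(Project project) {
        ExtraPropertiesExtension ext = project.getExtensions().getExtraProperties();
        return ext.has("netty_leak_tests_enabled") == false
            || (Boolean) ext.get("netty_leak_tests_enabled");
    }
}
```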
I'll defer to @rjernst on this one since it has plugin authoring implications.
In the past we have erred on the side of forcing anything we deem relevant to our plugins/modules to be used by all plugin authors. I think we have found this to be too burdensome on those users, as the vast majority don't care about our internal rules. I would say do not force this on plugin authors at first.
I'm also wondering whether, instead of having this on for all builds, we should have it as another CI job. It makes sense to set it for tests inside the netty module, since that is testing our Netty code, but forcing this on all developers seems like a step backwards towards the old pattern of "let's make everything an integration test and we will naturally catch distributed bugs". The cost is high when developers outside the distributed area hit an issue. I also do not think we should reuse the build.gradle based setting like with BwC tests, as that was specifically added to enable backports, while this seems like a general switch to disable the testing globally "just in case", which to me indicates it should be an isolated test run periodically.
> I think we have found this to be too burdensome on those users as the vast majority don't care about our internal rules.

This isn't an arbitrary code style rule though? Anything you build that leaks Netty buffers will be fundamentally broken and will OOM eventually. Also note that `EsTestCase` does similar checks for both Netty and our self-managed buffer pool by default for all unit tests, so I don't see why being more lax about this in integration tests would be necessary.

> this seems like a general switch to disable the testing globally "just in case", which to me indicates it should be an isolated test run periodically.

The thing is, running this in a periodic job makes it a lot less useful. David's argument here #64528 (comment) makes a lot of sense, and also, as explained above, in the real world this is expected to fail at an incredibly low rate (currently I'm seeing leak warnings in Cloud logs at a frequency of less than 10 globally per day, and an even lower rate for `7.x` versions). Again, if this triggers at any meaningful frequency, something is very broken.
> This isn't an arbitrary code style rule though? Anything you build that leaks Netty buffers will be fundamentally broken and will OOM eventually.

True, but we can't anticipate all plugin authors' use cases or intentions. What if they are testing something that purposefully creates a leak, etc.? Now they can't build their plugin. I vote for disabling this externally only because we've been so disruptive in the past and I don't want to risk any more side effects from stuff like this. Of course, external plugin authors that are doing `ByteBuf` allocation in their plugins (probably not common) should have leak detection enabled, but I agree with Ryan that we should lean towards being less prescriptive there, simply because historically it has caused lots of problems.
RE: the periodic job, we can start with doing this across the board and then potentially move it into a periodic job if it becomes too disruptive or costly.
I'm fine with moving to a periodic job as necessary, but then I don't think we should have an escape hatch. If it is problematic, let's move it at that time, rather than try to proactively allow for disabling across the board.
I think "problematic" needs to take into consideration the existence of an escape hatch. Just as we do for BWC testing (which is included in the pr/merge workflow). We would not add any check to the build that we could not somehow mute temporarily, and this is no different.
Muting temporarily does not mean it's problematic. It's a necessary tool to keep work flowing. If this becomes the norm (i.e. it's disable more often than it isn't) then that's very different.
Alright :) I disabled the check by default now so that plugin authors aren't affected. I'd still rather not make this a separate CI task for the previously stated reasons. Let me know what you think.
Jenkins run elasticsearch-ci/default-distro (various known but unrelated failures ...)
One comment about log level but otherwise LGTM.
+ "\"netty_leak_tests_enabled\" to false in the root level build.gradle file" | ||
); | ||
} else { | ||
LOGGER.error( |
Let's log this at `warn` level since we aren't failing the build.
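For readers without the full diff, a rough sketch of the shape the check takes after this suggestion: fail the task when the toggle is enabled, otherwise only warn. The exception type, logger field, and message text here are illustrative, not taken from the PR:

```java
if (foundNettyLeaks) {
    if (leakTestsEnabled) {
        // Fail loudly: the message tells developers how to temporarily mute the check.
        throw new IllegalStateException(
            "Netty ByteBuf leaks were detected in the test cluster logs. To temporarily mute this check, set "
                + "\"netty_leak_tests_enabled\" to false in the root level build.gradle file"
        );
    } else {
        // Check disabled (e.g. for external plugin builds): surface the leaks but don't fail the build.
        LOGGER.warn("Netty ByteBuf leaks were detected in the test cluster logs, but the leak check is disabled.");
    }
}
```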
Moving this comment out to the top level: I don't think we should have the escape hatch. If we find this is causing churn for developers or CI, we can disable it by removing it as the default and moving it out to a dedicated job. We shouldn't need a Gradle-level flag to disable/track.
@original-brownbear what's the likelihood of a change introducing a leak not being detected in PR checks but finding its way into intake? How accurate is the Netty leak detection? As I understand it, only a portion of buffers are instrumented, yes?
I would expect any leak we find to be extremely low frequency, so for an individual leak the chance of not being detected in a PR is probably very high. That said, it is extremely unlikely to run into one to begin with, and this is mainly a guard against bugs in Netty, extreme corner cases in exception handling, etc. We do have a number of unit tests that would catch an obvious leak in every PR.
With the settings used here essentially every leak will be caught. At "paranoid" level ~100% of buffers are instrumented when they are GCed, so only those few buffers that weren't GCed yet when a node shut down are technically not instrumented.
Would be fine by me as well to remove the hatch. Again, there's no way this will lead to any high-frequency failures unless there is a serious bug somewhere, because all our UTs and internal cluster tests (those that use Netty at least) run the same check. So this is really just to get some more coverage on the REST layer, but all the underlying logic is heavily tested for this already anyway.
👍
@rjernst dropped the escape hatch now :) |
LGTM
Thanks everyone!