
NETWORKING: RST Misbehaving Connections #34665


Closed

Conversation

original-brownbear
Contributor


This isn't a complete fix for #30876, but it is an improvement that saves us some TIME_WAIT situations that are clearly needless; an RST on a misbehaving client should be fine in production anyway.

* We should RST misbehaving connections instead of doing an active close that leaves us with a TIME_WAIT (see the sketch below)
* Relates elastic#30876
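
For context, here is a minimal sketch (not the PR's actual code) of the mechanism in plain java.nio: with SO_LINGER set to 0, close() aborts the connection with an RST instead of performing the normal FIN handshake, so the socket never enters TIME_WAIT. The class and helper name are made up for illustration.

    import java.io.IOException;
    import java.net.StandardSocketOptions;
    import java.nio.channels.SocketChannel;

    final class ResetClose {
        // Hypothetical helper: SO_LINGER = 0 turns an active close into an abortive
        // close, so the peer sees an RST and this side skips TIME_WAIT entirely.
        static void closeWithReset(SocketChannel channel) throws IOException {
            channel.setOption(StandardSocketOptions.SO_LINGER, 0);
            channel.close();
        }
    }
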
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

jasontedor left a comment


I left a question.

    try {
        channel.setSoLinger(0);
    } catch (IOException | RuntimeException e) {
        // Depending on the implementation, an already closed channel can throw either an IOException
        // or a RuntimeException, which we ignore because we are closing with an RST due to an error here anyway.
    }
Member


Which implementations throw a runtime exception?

Contributor Author


@jasontedor Netty4 throws a custom runtime exception here (we can't catch it specifically either, because we don't have Netty on the classpath here)

Member


Catching and swallowing a runtime exception is pretty broad. Am I reading the code correctly that it is a channel exception? I know it is hacky, but can we check the class name manually and re-throw if it is not an io.netty.channel.ChannelException?
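
A rough sketch of what that class-name check could look like, assuming the try/catch from the excerpt above and a channel with a setSoLinger(int) method that throws IOException (illustrative only, not the committed code):

    try {
        channel.setSoLinger(0);
    } catch (IOException e) {
        // best effort: the channel is being closed because of an error anyway
    } catch (RuntimeException e) {
        // Netty is not on the classpath here, so compare the class name instead of
        // catching io.netty.channel.ChannelException directly; re-throw anything else.
        if ("io.netty.channel.ChannelException".equals(e.getClass().getName()) == false) {
            throw e;
        }
    }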

Contributor Author


@jasontedor yea, you're correct. But actually I think I just found a better fix here.

In all other implementations but the Netty one, we do this:

    @Override
    public void setSoLinger(int value) throws IOException {
        if (isOpen()) {
            getRawChannel().setOption(StandardSocketOptions.SO_LINGER, value);
        }
    }

... check if the socket is open before trying to set the option.
Only the Netty 4 implementation doesn't do that, and it throws in these corner cases ... I'll just add that check there, I guess?

Member


+1

@original-brownbear
Contributor Author

original-brownbear commented Oct 20, 2018

Fixed the Netty4 SO_LINGER setting in 118d04a

    @@ -72,7 +72,9 @@ public void addCloseListener(ActionListener<Void> listener) {

         @Override
         public void setSoLinger(int value) {
    -        channel.config().setOption(ChannelOption.SO_LINGER, value);
    +        if (channel.isOpen()) {
    +            channel.config().setOption(ChannelOption.SO_LINGER, value);
Member


I guess this could still throw on us? How about we catch the channel exception here and re-throw it as an I/O exception? Then I think we have all our bases covered.
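
A sketch of the shape being suggested here, assuming a wrapper around a Netty io.netty.channel.Channel (a method-level illustration, not necessarily the exact code that ended up in 759f455): guard on isOpen() and translate Netty's unchecked ChannelException into the checked IOException the interface declares.

    import java.io.IOException;

    import io.netty.channel.Channel;
    import io.netty.channel.ChannelException;
    import io.netty.channel.ChannelOption;

    final class Netty4SoLingerSketch {
        // Only touch the option while the channel is open, and surface Netty's
        // ChannelException as an IOException so callers see a single checked type.
        static void setSoLinger(Channel channel, int value) throws IOException {
            if (channel.isOpen()) {
                try {
                    channel.config().setOption(ChannelOption.SO_LINGER, value);
                } catch (ChannelException e) {
                    throw new IOException(e);
                }
            }
        }
    }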

Contributor Author


+1 :)

@original-brownbear
Contributor Author

@jasontedor added the rethrow in 759f455

jasontedor previously approved these changes Oct 20, 2018
Member

jasontedor left a comment


LGTM.

@Tim-Brooks
Contributor

It's possible I missed a conversation where this change was decided, but it does not seem like a good idea to me, and I'm not clear on what is motivating it.

In my opinion, we should be reducing the instances where we force RSTs in production code, as opposed to increasing them. The conditions that TIME_WAIT protects against are rare, but they seem (to me) to outweigh optimizing for CI (and that could be addressed in other ways).

I am also not clear whether the linked ticket is still obviously an issue. I understand that we have fought back against it by reducing normal integration tests from 26 to 6 connections (or 2 for the blocking transport), and by forcing RSTs in some cases when we stop a node. Are these failures still occurring?

And is this caused by cases where exception conditions needed an RST? It seems like the connections that are two-way closed at node shutdown probably dwarf the exception case.

@Tim-Brooks
Contributor

I'd like to clarify my position:

  1. Disabling two-way close is normally discouraged.
  2. There is no API in Java that guarantees RSTs, just incidental behavior that survived removal in JDK 11 after Netty complained.

Therefore, we should do our best to avoid this.

@original-brownbear
Contributor Author

@tbrooks8

It's possible I missed a conversation where this change was decided, but it does not seem like a good idea to me, and I'm not clear on what is motivating it.

I think the motivation here is simply that we're still leaving behind a huge number of TIME_WAIT connections in networking-related tests (e.g. org.elasticsearch.transport.netty4.SimpleNetty4TransportTests leaves more than 1k TIME_WAITs behind). That does add up and, if you're unlucky with the ordering of tests, can run you out of resources and then trigger timeouts, though you're right that it has gotten rarer.

Are these failures still occurring?

Much less frequently than in the past, but I definitely remember running into this locally a few times lately, yes. And just running the tests and taking a look at netstat, we're still flying pretty close to the sun in terms of dangling connections.

The conditions that TIME_WAIT protects against are rare, but seem (to me) to outweigh optimizing for CI (and that could be addressed in other ways).

And is this caused by cases where exception conditions needed an RST? It seems like the connections that are two-way closed at node shutdown probably dwarf the exception case.

The exception case is definitely just a fraction of the overall connections that end up in TIME_WAIT. But, as you point out, you generally want to avoid RST unless your protocol contains a clean shutdown message that can be exchanged before triggering the RST, which we don't have and probably don't need as far as I can see. The exception case is one where RST is valid and standard though, I think, even in protocols that don't have a close/shutdown message.


=> This particular change doesn't save that many FDs, but I think it's a safe case for lowering the pressure via RST.
We can probably find many more cases where there's no good reason to keep connections open in test code or exception handling, and lower our resource use there. I tried forcing RSTs globally in Netty as suggested in the linked issue, but that caused a bunch of test failures, some of which were valid (e.g. running HTTP against the non-HTTP endpoint wouldn't get the error message back cleanly every time). => I think we should look for spots where RST makes sense, either because it's an exception case or because shutdown is implied by the protocol.

@Tim-Brooks
Contributor

I think the motivation here is simply that we're still leaving behind a huge number

And just running the tests and taking a look at netstat, we're still flying pretty close to the sun in terms of dangling connections.

Can we define this quantitatively?

When I run the check task for the netty4 module, I see the TIME_WAIT connections max out at 1491. As you note, most of those connections are from the abstract simple transport test. After 30 seconds pass (the TIME_WAIT period on Mac; on Linux I believe the TIME_WAIT period is 60 seconds), those drop immediately.

When I run server:integTest, I see the TIME_WAIT connections max out at 619. I assume this is greater on Linux (due to the higher TIME_WAIT period), and I assume it used to be much greater (when our integ test transport sometimes used 26 connections instead of 6).
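
For reference, these numbers can be roughly reproduced without netstat by counting sockets in the TIME_WAIT state directly from the Linux proc files; a minimal standalone sketch (not part of the build, Linux only, socket state code 06 is TCP_TIME_WAIT):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public final class TimeWaitCounter {
        public static void main(String[] args) throws IOException {
            long total = 0;
            for (String file : new String[] { "/proc/net/tcp", "/proc/net/tcp6" }) {
                Path path = Paths.get(file);
                if (Files.exists(path) == false) {
                    continue; // not on Linux
                }
                List<String> lines = Files.readAllLines(path);
                for (int i = 1; i < lines.size(); i++) { // skip the header row
                    String[] fields = lines.get(i).trim().split("\\s+");
                    if (fields.length > 3 && "06".equals(fields[3])) { // 4th column is the socket state
                        total++;
                    }
                }
            }
            System.out.println("TIME_WAIT sockets: " + total);
        }
    }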

I guess here is my point: when I checked the limits on a couple of different CI boxes a month ago, the file descriptor limit was set to 500,000 for the CI PID, so the exhausting-file-descriptors explanation has never really made sense to me. And the ip_local_port_range on SLES defaults to 28,231 ports; I believe that is the default Linux setting. Some distros (Debian) default to allowing 55,295 ports.

Exhausting 28,231 ports (the failure in the linked issue was on SLES) seems possible with the old integration tests (26 connections) and some extraordinarily bad luck, but it seems unlikely when we are talking about a few thousand TIME_WAIT connections. Are there different parts of the build that I should be looking at, where you see us flying pretty close to the sun? Or are we sure that we have a clear hypothesis for this issue? As I hinted at in my last comment, I have not noticed any CI failures due to this timeout issue since we reduced the MockNioTransport connections.

I guess, in my opinion (#34665 (comment)), we should have a strong status quo bias against using an undocumented API to force connection resets at the application level.

I would be interested in the following possible approaches:

  1. Reducing connections used in AbstractSimpleTransportTestCase. Some of the tests (testProfileSettings) do not actually use the transports that are opened and closed all the time. Additionally, some of the tests could probably use the same transports. This test case is (to my knowledge) our worst case of using tons of connections (and the test runs for like 5-6 different transports).
  2. Resetting "accepted" connections similar to how we do with client connections when the lifecycle has been stopped (at node shutdown).
  3. Consider eventually moving the resetting at node shutdown to be a CI only thing.

@ywelsch thoughts since we slacked about this earlier?

Let me know if there is something I misunderstand about the build, or if there is some section of it that is using so many connections that it would push against the ip_local_port_range.

@original-brownbear
Contributor Author

@tbrooks8

Can we define this quantitatively?

The last time I ran this against master, I saw the TIME_WAITs periodically and randomly (as in, not the same number on every run) go up to ~5k+ on my system with the default 60s socket wait, during server:check and I think a few others. -> With 2 min TIME_WAITs and the default port range you mentioned, I extrapolated that if you get unlucky with the test ordering and have a fast box, you could run out of ports.

That said, the more I think about it, the more I agree with you :) We maybe shouldn't force this code to RST too much just to make CI happy, when production use cases don't really suffer from running out of ports. =>
Why don't we just fix this at the infra level and set something like net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 1 there? (We could maybe go a little higher, like 5, but a defined value that is much less than the 2 minutes should make it impossible to ever run into a large number of dangling connections, even with unlucky ordering of tests.)

@Tim-Brooks
Contributor

We can talk to infra about something like that. We could also expand the ip_local_port_range on the Linux boxes, at least to what Debian uses (55,295).

One area that we could improve immediately: I do not think we are actually RSTing connections at node restart now? I only looked at this briefly today, but in TransportService we close the ConnectionManager before the TcpTransport. So the TcpTransport is not stopped when we close all the outbound client connections in the connection manager (hence no RSTing). We definitely want to close the ConnectionManager first, but we might need a different mechanism than the TcpTransport lifecycle to trigger the RSTs.

@original-brownbear
Contributor Author

One area that we could improve immediately is that I do not think we are actually RSTing connections at node restart now? I only looked at this briefly today. ...

Yeah, I found this too. If we fixed this, it would probably be a big improvement, as far as I understand the tests. I don't yet have enough understanding of the internals of these components to suggest a plan for setting state on the TcpTransport for the RST (I would be interested in learning more here though :)).

jasontedor dismissed their stale review October 24, 2018 19:23

Dismissing review based on additional discussion.

colings86 added v6.6.0 and removed v6.5.0 labels Oct 25, 2018
@original-brownbear
Contributor Author

Closing this because #34863 should make it obsolete :)

original-brownbear deleted the rst-bad-client branch October 25, 2018 21:16
Labels
:Distributed Coordination/Network Http and internode communication implementations >non-issue v6.6.0 v7.0.0-beta1