
NETWORKING: RST Misbehaving Connections #34665


Closed

Conversation

original-brownbear
Contributor


This isn't a complete fix for #30876, but it is an improvement that saves us some TIME_WAIT situations that are clearly needless; an RST on a misbehaving client should be fine in production anyway.

* We should RST misbehaving connections instead of doing an active close that leaves us with a TIME_WAIT (see the sketch below)
* Relates elastic#30876
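
For context, here is a minimal sketch (not the PR's actual code) of the mechanism in plain java.nio: with SO_LINGER set to 0, close() aborts the connection with an RST instead of performing the normal FIN handshake, so the socket never enters TIME_WAIT. The class and helper name are made up for illustration.

    import java.io.IOException;
    import java.net.StandardSocketOptions;
    import java.nio.channels.SocketChannel;

    final class ResetClose {
        // Hypothetical helper: SO_LINGER = 0 turns an active close into an abortive
        // close, so the peer sees an RST and this side skips TIME_WAIT entirely.
        static void closeWithReset(SocketChannel channel) throws IOException {
            channel.setOption(StandardSocketOptions.SO_LINGER, 0);
            channel.close();
        }
    }
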
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

jasontedor left a comment


I left a question.

    try {
        channel.setSoLinger(0);
    } catch (IOException | RuntimeException e) {
        // Depending on the implementation, an already closed channel can throw either an IOException
        // or a RuntimeException, which we ignore because we are closing with an RST due to an error here anyway.
    }
Member


Which implementations throw a runtime exception?

Contributor Author


@jasontedor Netty4 throws a custom runtime exception here (we can't catch it specifically either, because we don't have Netty on the classpath here)

Member


Catching and swallowing a runtime exception is pretty broad. Am I reading the code correctly that it is a channel exception? I know it is hacky, but can we check the class name manually and re-throw if it is not an io.netty.channel.ChannelException?
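
A rough sketch of what that class-name check could look like, assuming the try/catch from the excerpt above and a channel with a setSoLinger(int) method that throws IOException (illustrative only, not the committed code):

    try {
        channel.setSoLinger(0);
    } catch (IOException e) {
        // best effort: the channel is being closed because of an error anyway
    } catch (RuntimeException e) {
        // Netty is not on the classpath here, so compare the class name instead of
        // catching io.netty.channel.ChannelException directly; re-throw anything else.
        if ("io.netty.channel.ChannelException".equals(e.getClass().getName()) == false) {
            throw e;
        }
    }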

Contributor Author


@jasontedor yea, you're correct. But actually I think I just found a better fix here.

In all other implementations but the Netty one, we do this:

    @Override
    public void setSoLinger(int value) throws IOException {
        if (isOpen()) {
            getRawChannel().setOption(StandardSocketOptions.SO_LINGER, value);
        }
    }

... check if the socket is open before trying to set the option.
Only the Netty 4 implementation doesn't do that, and it throws in these corner cases ... I'll just add that check there, I guess?

Member


+1

@original-brownbear
Contributor Author

original-brownbear commented Oct 20, 2018

Fixed the Netty4 SO_LINGER setting in 118d04a

    @@ -72,7 +72,9 @@ public void addCloseListener(ActionListener<Void> listener) {

         @Override
         public void setSoLinger(int value) {
    -        channel.config().setOption(ChannelOption.SO_LINGER, value);
    +        if (channel.isOpen()) {
    +            channel.config().setOption(ChannelOption.SO_LINGER, value);
Member


I guess this could still throw on us? How about we catch the channel exception here and re-throw it as an I/O exception? Then I think we have all our bases covered.
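
A sketch of the shape being suggested here, assuming a wrapper around a Netty io.netty.channel.Channel (a method-level illustration, not necessarily the exact code that ended up in 759f455): guard on isOpen() and translate Netty's unchecked ChannelException into the checked IOException the interface declares.

    import java.io.IOException;

    import io.netty.channel.Channel;
    import io.netty.channel.ChannelException;
    import io.netty.channel.ChannelOption;

    final class Netty4SoLingerSketch {
        // Only touch the option while the channel is open, and surface Netty's
        // ChannelException as an IOException so callers see a single checked type.
        static void setSoLinger(Channel channel, int value) throws IOException {
            if (channel.isOpen()) {
                try {
                    channel.config().setOption(ChannelOption.SO_LINGER, value);
                } catch (ChannelException e) {
                    throw new IOException(e);
                }
            }
        }
    }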

Contributor Author


+1 :)

@original-brownbear
Contributor Author

@jasontedor added the rethrow in 759f455

jasontedor previously approved these changes Oct 20, 2018
Member

jasontedor left a comment


LGTM.

@Tim-Brooks
Contributor

It's possible I missed a conversation where this change was decided, but it does not seem like a good idea to me, and I'm not clear on what is motivating it.

In my opinion, we should be reducing the instances where we force RSTs in production code, as opposed to increasing them. The conditions that TIME_WAIT protects against are rare, but they seem (to me) to outweigh optimizing for CI (and that could be addressed in other ways).

I am also not clear whether the linked ticket is still obviously an issue. I understand that we have fought back against it by reducing normal integration tests from 26 to 6 connections (or 2 for the blocking transport), and by forcing RSTs in some cases when we stop a node. Are these failures still occurring?

And is this caused by cases where exception conditions needed an RST? It seems like the connections that are two-way closed at node shutdown probably dwarf the exception case.

@Tim-Brooks
Contributor

I'd like to clarify my position:

  1. Disabling two-way close is normally discouraged.
  2. There is no API in Java that guarantees RSTs, just incidental behavior that survived removal in JDK 11 after Netty complained.

Therefore, we should do our best to avoid this.

@original-brownbear
Contributor Author

@tbrooks8

It's possible I missed a conversation where this change was decided, but it does not seem like a good idea to me, and I'm not clear on what is motivating it.

I think the motivation here is simply that we're still leaving behind a huge number of TIME_WAIT connections in networking-related tests (e.g. org.elasticsearch.transport.netty4.SimpleNetty4TransportTests leaves more than 1k TIME_WAITs behind). That does add up and, if you're unlucky with the ordering of tests, can run you out of resources and then trigger timeouts, though you're right that it has gotten rarer.

Are these failures still occurring?

Much less frequently than in the past, but I definitely remember running into this locally a few times lately, yes. And just running the tests and taking a look at netstat, we're still flying pretty close to the sun in terms of dangling connections.

The conditions that TIME_WAIT protects against are rare, but seem (to me) to outweigh optimizing for CI (and that could be addressed in other ways).

And is this caused by cases where exception conditions needed an RST? It seems like the connections that are two-way closed at node shutdown probably dwarf the exception case.

The exception case is definitely just a fraction of the overall connections that end up in TIME_WAIT. But, as you point out, you generally want to avoid RST unless your protocol contains a clean shutdown message that can be exchanged before triggering the RST, which we don't have and probably don't need as far as I can see. The exception case is one where RST is valid and standard though, I think, even in protocols that don't have a close/shutdown message.


=> This particular change doesn't save that many FDs, but I think it's a safe case for lowering the pressure via RST.
We can probably find many more cases where there's no good reason to keep connections open in test code or exception handling, and lower our resource use there. I tried forcing RSTs globally in Netty as suggested in the linked issue, but that caused a bunch of test failures, some of which were valid (e.g. running HTTP against the non-HTTP endpoint wouldn't get the error message back cleanly every time). => I think we should look for spots where RST makes sense, either because it's an exception case or because shutdown is implied by the protocol.

@Tim-Brooks
Contributor

I think the motivation here is simply that we're still leaving behind a huge number

And just running the tests and taking a look at netstat, we're still flying pretty close to the sun in terms of dangling connections.

Can we define this quantitatively?

When I run the check task for the netty4 module, I see the TIME_WAIT connections max out at 1491. As you note, most of those connections are from the abstract simple transport test. After 30 seconds pass (the TIME_WAIT period on Mac; on Linux I believe the TIME_WAIT period is 60 seconds), those drop immediately.

When I run server:integTest, I see the TIME_WAIT connections max out at 619. I assume this is greater on Linux (due to the higher TIME_WAIT period), and I assume it used to be much greater (when our integ test transport sometimes used 26 connections instead of 6).
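
For reference, these numbers can be roughly reproduced without netstat by counting sockets in the TIME_WAIT state directly from the Linux proc files; a minimal standalone sketch (not part of the build, Linux only, socket state code 06 is TCP_TIME_WAIT):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public final class TimeWaitCounter {
        public static void main(String[] args) throws IOException {
            long total = 0;
            for (String file : new String[] { "/proc/net/tcp", "/proc/net/tcp6" }) {
                Path path = Paths.get(file);
                if (Files.exists(path) == false) {
                    continue; // not on Linux
                }
                List<String> lines = Files.readAllLines(path);
                for (int i = 1; i < lines.size(); i++) { // skip the header row
                    String[] fields = lines.get(i).trim().split("\\s+");
                    if (fields.length > 3 && "06".equals(fields[3])) { // 4th column is the socket state
                        total++;
                    }
                }
            }
            System.out.println("TIME_WAIT sockets: " + total);
        }
    }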

I guess here is my point: when I checked the limits on a couple of different CI boxes a month ago, the file descriptor limit was set to 500,000 for the CI PID, so the exhausting-file-descriptors explanation has never really made sense to me. And the ip_local_port_range on SLES defaults to 28,231 ports; I believe that is the default Linux setting. Some distros (Debian) default to allowing 55,295 ports.

Exhausting 28,231 ports (the failure in the linked issue was on SLES) seems possible with the old integration tests (26 connections) and some extraordinarily bad luck, but it seems unlikely when we are talking about a few thousand TIME_WAIT connections. Are there different parts of the build that I should be looking at, where you see us flying pretty close to the sun? Or are we sure that we have a clear hypothesis for this issue? As I hinted at in my last comment, I have not noticed any CI failures due to this timeout issue since we reduced the MockNioTransport connections.

I guess, in my opinion (#34665 (comment)), we should have a strong status quo bias against using an undocumented API to force connection resets at the application level.

I would be interested in the following possible approaches:

  1. Reducing connections used in AbstractSimpleTransportTestCase. Some of the tests (testProfileSettings) do not actually use the transports that are opened and closed all the time. Additionally, some of the tests could probably use the same transports. This test case is (to my knowledge) our worst case of using tons of connections (and the test runs for like 5-6 different transports).
  2. Resetting "accepted" connections similar to how we do with client connections when the lifecycle has been stopped (at node shutdown).
  3. Consider eventually moving the resetting at node shutdown to be a CI only thing.

@ywelsch thoughts since we slacked about this earlier?

Let me know if there is something I misunderstand about the build, or if there is some section of it that is using so many connections that it would push against the ip_local_port_range.

@original-brownbear
Contributor Author

@tbrooks8

Can we define this quantitatively?

The last time I ran this against master, I saw the TIME_WAITs periodically and randomly (as in, not the same number on every run) go up to ~5k+ on my system with the default 60s socket wait, during server:check and I think a few others. -> With 2 min TIME_WAITs and the default port range you mentioned, I extrapolated that if you get unlucky with the test ordering and have a fast box, you could run out of ports.

That said, the more I think about it, the more I agree with you :) We maybe shouldn't force this code to RST too much just to make CI happy, when production use cases don't really suffer from running out of ports. =>
Why don't we just fix this at the infra level and set something like net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 1 there? (We could maybe go a little higher, like 5, but a defined value that is much less than the 2 minutes should make it impossible to ever run into a large number of dangling connections, even with unlucky ordering of tests.)

@Tim-Brooks
Contributor

We can talk to infra about something like that. We could also expand the ip_local_port_range on the Linux boxes, at least to what Debian uses (55,295).

One area that we could improve immediately: I do not think we are actually RSTing connections at node restart now? I only looked at this briefly today, but in TransportService we close the ConnectionManager before the TcpTransport. So the TcpTransport is not stopped when we close all the outbound client connections in the connection manager (hence no RSTing). We definitely want to close the ConnectionManager first, but we might need a different mechanism than the TcpTransport lifecycle to trigger the RSTs.

@original-brownbear
Contributor Author

One area that we could improve immediately is that I do not think we are actually RSTing connections at node restart now? I only looked at this briefly today. ...

Yeah, I found this too. If we fixed this, it would probably be a big improvement, as far as I understand the tests. I don't yet have enough understanding of the internals of these components to suggest a plan for setting state on the TcpTransport for the RST (I would be interested in learning more here though :)).

jasontedor dismissed their stale review October 24, 2018 19:23

Dismissing review based on additional discussion.

colings86 added v6.6.0 and removed v6.5.0 labels Oct 25, 2018
@original-brownbear
Contributor Author

Closing this because #34863 should make it obsolete :)

original-brownbear deleted the rst-bad-client branch October 25, 2018 21:16
Labels
:Distributed Coordination/Network Http and internode communication implementations >non-issue v6.6.0 v7.0.0-beta1