NETWORKING: RST Misbehaving Connections #34665
Conversation
* We should just RST misbehaving connections instead of an active close that leaves us with a TIME_WAIT
* Relates elastic#30876
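For context, the "RST instead of active close" trick relies on SO_LINGER with a timeout of zero, which turns close() into an abortive close. Here is a minimal sketch using the plain java.net.Socket API; the PR itself goes through Elasticsearch's channel abstractions, so the names below are illustrative only:

    import java.io.IOException;
    import java.net.Socket;

    final class RstCloseExample {
        // Enabling SO_LINGER with a timeout of 0 makes close() send an RST instead of
        // performing the normal FIN handshake, so the local side never enters TIME_WAIT.
        static void closeWithRst(Socket socket) throws IOException {
            socket.setSoLinger(true, 0); // linger on, timeout 0 => abortive close (RST)
            socket.close();
        }
    }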
Pinging @elastic/es-distributed
I left a question.
    channel.setSoLinger(0);
} catch (IOException | RuntimeException e) {
    // Depending on the implementation an already closed channel can either throw an IOException or a RuntimeException
    // that we ignore because we are closing with RST because of an error here.
Which implementations throw a runtime exception?
@jasontedor Netty4 throws a custom runtime exception here (we can't catch it specifically either, because we don't have Netty on the classpath here)
Catching and swallowing a runtime exception is pretty broad. Am I reading the code correctly that it is a channel exception? I know it is hacky, but can we check the class name manually and re-throw if it is not an io.netty.channel.ChannelException?
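For illustration, here is roughly what that hacky check could look like. The interface and helper names are invented for the sketch; the real setSoLinger method lives on Elasticsearch's channel abstraction:

    import java.io.IOException;

    // Hypothetical channel abstraction standing in for the real Elasticsearch type.
    interface LingerableChannel {
        void setSoLinger(int value) throws IOException;
    }

    final class RstCloseSketch {
        // Swallow only Netty's ChannelException, matched by class name because Netty is
        // not on the classpath in the module doing the close; re-throw anything else.
        static void setSoLingerBeforeRstClose(LingerableChannel channel) throws IOException {
            try {
                channel.setSoLinger(0);
            } catch (RuntimeException e) {
                if ("io.netty.channel.ChannelException".equals(e.getClass().getName()) == false) {
                    throw e;
                }
                // Channel was already closed; ignore, since we are closing with RST due to an error anyway.
            }
        }
    }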
@jasontedor yea, you're correct. But actually I think I just found a better fix here.
In all other implementations but the Netty one, we do this:
@Override
public void setSoLinger(int value) throws IOException {
    if (isOpen()) {
        getRawChannel().setOption(StandardSocketOptions.SO_LINGER, value);
    }
}
... i.e. we check if the socket is open before trying to set the option.
Only in the Netty 4 one do we not do that, and it throws in these corner cases ... I'll just add that check there, I guess?
+1
Fixed the Netty4 SO_LINGER setting in 118d04a
@@ -72,7 +72,9 @@ public void addCloseListener(ActionListener<Void> listener) {
     @Override
     public void setSoLinger(int value) {
-        channel.config().setOption(ChannelOption.SO_LINGER, value);
+        if (channel.isOpen()) {
+            channel.config().setOption(ChannelOption.SO_LINGER, value);
I guess this could still throw on us? How about we catch the channel exception here and re-throw it as an I/O exception? Then I think we have all the bases covered.
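Presumably something along these lines (a sketch of the suggested catch-and-rethrow; not necessarily the exact code that ended up in the PR):

    import java.io.IOException;

    import io.netty.channel.Channel;
    import io.netty.channel.ChannelException;
    import io.netty.channel.ChannelOption;

    final class Netty4SoLingerSketch {
        // Only touch the option while the channel is open, and surface Netty's
        // unchecked ChannelException to callers as a checked IOException.
        static void setSoLinger(Channel channel, int value) throws IOException {
            if (channel.isOpen()) {
                try {
                    channel.config().setOption(ChannelOption.SO_LINGER, value);
                } catch (ChannelException e) {
                    throw new IOException(e);
                }
            }
        }
    }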
+1 :)
@jasontedor added the rethrow in 759f455
LGTM.
It's possible that I missed a conversation where this change was decided. But it seems like not a good idea to me, and I'm not clear on what is motivating it. In my opinion we should be reducing the instances where we force RSTs in production code, as opposed to increasing them. The conditions that TIME_WAIT protects against are rare, but seem (to me) to outweigh optimizing for CI (and that could be addressed in other ways). I am also not clear if the linked ticket is still obviously an issue. I understand that we have fought back against that by reducing the connections for normal integration tests from 26 to 6 (or 2 for the blocking transport), and by forcing RSTs in some cases when we stop a node. Are these failures still occurring? And is this caused by cases where exceptional conditions needed an RST? It seems like the connections that are two-way closed at node shutdown probably dwarf the exception case.
I'd like to clarify my position:
…
Therefore, we should do our best to avoid this.
@tbrooks8
I think the motivation here is simply that we're still leaving behind a huge number of TIME_WAIT connections.
Much less frequently than in the past, but I definitely remember running into this locally a few times lately, yes. And just running the tests and taking a look at …
The exception case is definitely just a fraction of the overall connections that end up in TIME_WAIT. => This particular one doesn't save that many FDs, but I think it's a safe case for lowering the pressure via RST.
Can we define this quantitatively? When I run the … I guess here is my point. When I checked …, exhausting 28,231 ports (the failure in the linked issue was …) … I guess, in my opinion (#34665 (comment)), we should have a strong status quo bias against using an undocumented API to force connection resets at the application level. I would be interested in the following possible approaches:
…
@ywelsch thoughts, since we slacked about this earlier? Let me know if there is something I misunderstand about the build, or if there is some section of it that is using so many connections that it would push against the available port range.
@tbrooks8
Last time I ran this against master, I saw the TIME_WAITs periodically and randomly (as in, not the same number on every run) go up to ~5k+ on my system with the default 60s socket wait during … That said, the more I think about it, the more I agree with you :) We maybe shouldn't force this code to RST too much just to make CI happy when production use cases don't really suffer from running out of ports. => …
We can talk to infra about something like that. We could also expand the … One area that we could improve immediately is that I do not think we are actually RSTing connections at node restart right now? I only looked at this briefly today. But in …
Yeah, I found this too. If we fixed this, it would probably be a big improvement as far as I understand the tests. I don't have enough understanding of the internals of these components yet, though, to suggest a plan for setting state on the …
Dismissing review based on additional discussion.
Closing this because #34863 should make it obsolete :)
This isn't a complete fix for #30876, but an improvement that saves us some TIME_WAIT situations that are clearly needless; an RST on a misbehaving client should be fine in production anyway.