Commit cdec0f8

Suggest reducing tcp_retries2
Adds documentation suggesting reducing `tcp_retries2` on Linux to detect network partitions more quickly. Relates elastic#34405
1 parent 3515909 commit cdec0f8

2 files changed: +52 -0 lines changed

docs/reference/setup/sysconfig.asciidoc

+3
@@ -14,6 +14,7 @@ The following settings *must* be considered before going to production:
 * <<max-number-of-threads,Ensure sufficient threads>>
 * <<networkaddress-cache-ttl,JVM DNS cache settings>>
 * <<executable-jna-tmpdir,Temporary directory not mounted with `noexec`>>
+* <<system-config-tcpretries,TCP retransmission timeout>>
 
 [[dev-vs-prod]]
 [float]
@@ -43,3 +44,5 @@ include::sysconfig/threads.asciidoc[]
 include::sysconfig/dns-cache.asciidoc[]
 
 include::sysconfig/executable-jna-tmpdir.asciidoc[]
+
+include::sysconfig/tcpretries.asciidoc[]

docs/reference/setup/sysconfig/tcpretries.asciidoc

+49
@@ -0,0 +1,49 @@
[[system-config-tcpretries]]
=== TCP retransmission timeout

Each pair of nodes in a cluster communicates via a number of TCP connections
which remain open until one of the nodes shuts down or communication between
the nodes is disrupted by a failure in the underlying infrastructure.

TCP provides reliable communication over occasionally-unreliable networks by
hiding temporary network disruptions from the communicating applications. Your
operating system will retransmit any lost messages a number of times before
informing the sender of any problem. Most Linux distributions default to
retransmitting any lost packets 15 times. Retransmissions back off
exponentially, so these 15 retransmissions take over 900 seconds to complete.
This means Linux takes more than 15 minutes to detect a network partition or a
failed node by this method. Windows defaults to just 5 retransmissions, which
corresponds to a timeout of around 6 seconds.
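
The arithmetic behind the 900-second figure can be sketched as follows. This is
an idealised model, not the kernel's implementation: it assumes an initial
retransmission timeout of 200 ms that doubles after each attempt up to a cap of
120 seconds, and it counts the final wait after the last retransmission. Real
connections derive their timeout from the measured round-trip time, so actual
figures vary. The function name `give_up_ms` is hypothetical.

```shell
#!/bin/sh
# Idealised model of exponential retransmission backoff.
# ASSUMPTIONS (not taken from the text above): initial timeout 200 ms,
# doubling after every attempt, capped at 120000 ms.

# give_up_ms RETRIES
# Prints the total time in milliseconds before the connection is abandoned:
# the wait before the first retransmission plus the wait after each of the
# RETRIES retransmissions.
give_up_ms() {
    retries=$1
    rto=200          # assumed initial retransmission timeout (ms)
    rto_max=120000   # assumed upper bound on the timeout (ms)
    total=0
    i=0
    while [ "$i" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        if [ "$rto" -gt "$rto_max" ]; then
            rto=$rto_max
        fi
        i=$((i + 1))
    done
    echo "$total"
}

echo "retries=15: $(give_up_ms 15) ms"
```

Under these assumptions the total for 15 retransmissions comes to roughly 925
seconds, which is in line with the "over 900 seconds" figure above.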

The Linux default allows for communication over networks that may experience
very long periods of packet loss, but this default is excessive for production
networks within a single data centre, as is the case for most {es} clusters.
Highly-available clusters must be able to detect node failures quickly so that
they can react promptly by reallocating lost shards, rerouting searches and
perhaps electing a new master node. Linux users should therefore reduce the
maximum number of TCP retransmissions.

You can decrease the maximum number of TCP retransmissions to `5` by running
the following command as `root`. Five retransmissions correspond to a
timeout of around 6 seconds.

[source,sh]
-------------------------------------
sysctl -w net.ipv4.tcp_retries2=5
-------------------------------------

To set this value permanently, update the `net.ipv4.tcp_retries2` setting in
`/etc/sysctl.conf`. To verify after rebooting, run
`sysctl net.ipv4.tcp_retries2`.
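
One way to make the permanent change is sketched below. The helper name
`persist_tcp_retries2` is hypothetical and not part of any tool. The sketch
assumes a `sed` that accepts `-i.bak` (GNU and BSD both do) and that the setting
lives in a single file, `/etc/sysctl.conf` by default. Some distributions also
read files under `/etc/sysctl.d/`, which this sketch does not inspect.

```shell
#!/bin/sh
# Hypothetical helper: persist_tcp_retries2 VALUE [FILE]
# Rewrites an existing net.ipv4.tcp_retries2 entry in FILE, or appends one if
# none is present. FILE defaults to /etc/sysctl.conf.
persist_tcp_retries2() {
    value=$1
    conf=${2:-/etc/sysctl.conf}
    if grep -q '^[[:space:]]*net\.ipv4\.tcp_retries2[[:space:]]*=' "$conf" 2>/dev/null; then
        # An entry already exists: replace it in place, keeping a .bak copy.
        sed -i.bak \
            "s/^[[:space:]]*net\.ipv4\.tcp_retries2[[:space:]]*=.*/net.ipv4.tcp_retries2=$value/" \
            "$conf"
    else
        # No entry yet: append one.
        echo "net.ipv4.tcp_retries2=$value" >> "$conf"
    fi
}

# Typical use, as root:
#   persist_tcp_retries2 5
#   sysctl -p                        # reload /etc/sysctl.conf now
#   sysctl net.ipv4.tcp_retries2     # confirm the active value
```

Editing the existing entry instead of always appending avoids leaving two
conflicting lines in the file.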

{es} also implements its own health checks with timeouts that are much shorter
than the default retransmission timeout on Linux. However, these health checks
must allow for application-level effects such as garbage collection pauses. We
do not recommend reducing any timeouts related to these application-level
health checks.

IMPORTANT: This setting applies to all TCP connections and will affect the
reliability of communication with systems outside your cluster too. If your
cluster communicates with external systems over an unreliable network then you
may need to select a higher value for `net.ipv4.tcp_retries2`. For this reason,
{es} does not adjust this setting automatically.