net/http: http2 clientConnPool may cache a dead connection forever (write tcp use of closed network connection) #39750
Comments
cc @fraenkel
This covers multiple issues already reported, some awaiting fixes, others awaiting responses. There is no guarantee that the http side and the http2 side are actually in agreement. There are many more cases where the http side believes that http2 connections are still alive and valid; the detection only kicks in when those connections are actually used. The only safe way out of many of these situations currently is setting write timeouts.
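A minimal sketch of that write-timeout idea, assuming one wraps the dialed connection (deadlineConn and newClient are illustrative names, not part of this report): every Write first arms a deadline on the underlying net.Conn, so a stuck write surfaces as an error instead of hanging.

```go
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

// deadlineConn wraps a net.Conn so that every Write first arms a write
// deadline on the underlying connection. (Illustrative sketch only.)
type deadlineConn struct {
	net.Conn
	writeTimeout time.Duration
}

func (c *deadlineConn) Write(p []byte) (int, error) {
	if err := c.Conn.SetWriteDeadline(time.Now().Add(c.writeTimeout)); err != nil {
		return 0, err
	}
	return c.Conn.Write(p)
}

// newClient returns an *http.Client whose TCP writes fail after d instead
// of blocking forever. (Hypothetical helper name.)
func newClient(d time.Duration) *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second, KeepAlive: 30 * time.Second}
	tr := &http.Transport{
		// Supplying a custom dialer disables automatic HTTP/2, so opt back in.
		ForceAttemptHTTP2: true,
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			conn, err := dialer.DialContext(ctx, network, addr)
			if err != nil {
				return nil, err
			}
			return &deadlineConn{Conn: conn, writeTimeout: d}, nil
		},
	}
	return &http.Client{Transport: tr}
}

func main() {
	client := newClient(10 * time.Second)
	_ = client // use client.Get(...) as usual
}
```

As the next comment points out, this only bounds how long a write can block; a failed write still leaves the dead connection in the pool.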
@fraenkel I just went through the cases you provided above. It looks quite similar to my problem. For now, the underlying TCP connections mostly rely on the readLoop for dead-connection detection.
I agree with this detection strategy. But in my case, the readLoop is blocked on an idle connection, so nothing is ever detected.
Setting a write timeout only fixes the application-hanging problem you mentioned above; it has no impact on dead connections. The code shows that we currently do not perform any detection on the write path: when a write fails, the error is returned to the caller and nothing else is done.
HTTP/2 support in Go has many problematic corner cases where dead connections would be kept: golang/go#32388, golang/go#39337, golang/go#39750. I suggest we disable HTTP/2 for now and enable it manually on the blackbox exporter. Signed-off-by: Julien Pivotto <[email protected]>
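For reference, the way to disable HTTP/2 on a Go client documented in net/http is to set Transport.TLSNextProto to a non-nil, empty map. A minimal sketch:

```go
package main

import (
	"crypto/tls"
	"net/http"
)

func main() {
	// Per the net/http docs, a non-nil, empty TLSNextProto map disables
	// HTTP/2; requests then use HTTP/1.1 even over TLS.
	tr := &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
	client := &http.Client{Transport: tr}
	_ = client // use client for plain HTTP/1.1 requests
}
```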
Ping @fraenkel. Any updates, or a plan to enable it by default?
Is this a duplicate of #39337?
@roidelapluie Thanks a lot. I will close this issue. |
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes. It happens after kubelet was re-compiled with Go 1.14.4.
But I saw a commit "connection health check" on golang/net@0ba52f6. It should partly help avoid this problem, yet it is not enabled by default.
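For readers looking for that mitigation, a minimal sketch of turning the health check on explicitly (assuming a golang.org/x/net version that includes the health-check change; the timeout values are illustrative):

```go
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t1 := &http.Transport{} // existing transport settings go here

	// ConfigureTransports wires HTTP/2 into t1 and returns the underlying
	// *http2.Transport so its fields can be tuned.
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		log.Fatal(err)
	}
	// If no frame is received for ReadIdleTimeout, send a PING as a health
	// check; close the connection if it is not answered within PingTimeout.
	t2.ReadIdleTimeout = 30 * time.Second
	t2.PingTimeout = 15 * time.Second

	client := &http.Client{Transport: t1}
	_ = client
}
```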
What operating system and processor architecture are you using (go env)?

What did you do?
If we close the connection via conn.Close, the readLoop will immediately receive a "read tcp ... use of closed connection" error. Then a cleanup will be performed: notifying every stream on this connection, removing the broken connection from the clientConnPool, etc. Everything seems fine, right?
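To make that read-side detection concrete, here is a standalone toy (not the actual http2 code; the setup is only illustrative) showing that a Read blocked on a connection returns immediately with a "use of closed network connection" error once the connection is closed locally:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Loopback listener standing in for the remote side.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		c, _ := ln.Accept()
		if c != nil {
			defer c.Close()
		}
		time.Sleep(time.Second) // the peer stays silent
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}

	// Stand-in for the http2 readLoop: a goroutine blocked in Read.
	readErr := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		_, err := conn.Read(buf)
		readErr <- err
	}()

	conn.Close() // closing locally wakes the blocked Read immediately
	fmt.Println("read returned:", <-readErr) // "... use of closed network connection"
}
```

The trouble described below is the opposite case: nothing ever arrives on the read side, so this detection never fires.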
But sometimes the readLoop of a clientConn may block on an idle connection. And the writeHeaders method of clientConn in roundTrip never closes a broken connection (it just reports errors to its caller). This is where the problem begins.
Imagine this: an incoming request is about to use this idle connection. Right after it is taken from the pool, the connection dies, so the writeHeaders method fails with a "write tcp ... use of closed connection" error and leaves the broken connection untouched in the pool.
https://github.com/golang/net/blob/627f9648deb96c27737b83199d44bb5c1010cbcf/http2/transport.go#L1083-L1100
Since there is no traffic on the connection, the readLoop will block, so no cleanup will be triggered either. Plus, incoming requests keep calling roundTrip -> newStream, which bumps the stream count on the conn and makes it look "busy", so closeIfIdle will not clean up the dead connection either.

An actual case
kubernetes/kubernetes#87615
We encountered this problem several times on highly loaded VMs. Heavy load may lead to some VMs being temporarily suspended by the master machine, which is why a connection can die right after it is taken from the pool. But this scenario is hard to reproduce.

kubelet keeps using dead connections.