
du and find on following dirs took long time #2212

Open
TonySun1994 opened this issue Apr 3, 2019 · 9 comments

@TonySun1994

Running cAdvisor as a container, the log shows the following:
I0328 13:41:29.716552 1 fsHandler.go:135] du and find on following dirs took 13.006500097s: [/rootfs/vdata/docker/overlay2/1259cbee3d31494942129513374cc8702ad68c676b700638534a52b19aadcb26/diff /rootfs/vdata/docker/containers/d866dd5101d9788bdf71a7aacd012bae15c2c304a4b70c80b2647c05a7ebc050]; will not log again for this container unless duration exceeds 5s
I0328 13:41:32.106341 1 fsHandler.go:135] du and find on following dirs took 13.308135764s: [/rootfs/vdata/docker/overlay2/5dfa8775376b9725b0af27edb4ba93a2a9da368d4099633b49f6dba3ae5e6a70/diff /rootfs/vdata/docker/containers/50a1a11de04b411c270683dc834696d1150433120d44deb70bae3ba109a54b06]; will not log again for this container unless duration exceeds 6s
I0328 13:41:32.204471 1 fsHandler.go:135] du and find on following dirs took 15.599689255s: [/rootfs/vdata/docker/overlay2/102debde6d41426b202167877b92bd61b4d4d2f3f9b67c9c4241b760898fa9cb/diff /rootfs/vdata/docker/containers/45b76dbc340d3bb358ec9ec08d5341822575ed68245ea2976596bf17dc2d3a9f]; will not log again for this container unless duration exceeds 6s
I0328 13:41:32.804843 1 fsHandler.go:135] du and find on following dirs took 14.19422831s: [/rootfs/vdata/docker/overlay2/a426d4717c6454ed568e028f95d055f86cebaba26048ac09c019a758d9456158/diff /rootfs/vdata/docker/containers/32d2324f8f4f5727eaf61579148f8915e6019361f032ddcc11c66cb9685b11ff]; will not log again for this container unless duration exceeds 6s
I0328 13:42:15.917654 1 fsHandler.go:135] du and find on following dirs took 11.919525783s: [/rootfs/vdata/docker/overlay2/f2d7561691dd2f3aaee9bf753a152c793edb987717d9e56381df65e4532ec4ee/diff /rootfs/vdata/docker/containers/4cec80a94dabed251376f50c9b24a578d0a6525bfbb096ce73b63adff97db315]; will not log again for this container unless duration exceeds 2s

I know this is caused by the deep file system hierarchy and the large number of small files, so I want to disable metrics collection for the file system and disk and focus on collecting CPU and memory instead. Right now the web service cannot be started and "/metrics" cannot be accessed.

@dashpole
Collaborator

dashpole commented Apr 3, 2019

try running with --disable_metrics=disk,tcp,udp,percpu
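
For reference, if cAdvisor runs as a Docker container, the flag goes after the image name. A minimal sketch, assuming the standard run command from the cAdvisor README (image tag, port, and mounts may differ in your setup):

    docker run -d --name=cadvisor \
      --volume=/:/rootfs:ro \
      --volume=/var/run:/var/run:ro \
      --volume=/sys:/sys:ro \
      --volume=/var/lib/docker/:/var/lib/docker:ro \
      --publish=8080:8080 \
      google/cadvisor:latest \
      --disable_metrics=disk,tcp,udp,percpu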

@TonySun1994
Author

> try running with --disable_metrics=disk,tcp,udp,percpu

I've tried it, but it's still stuck. I wonder if there are other file system operations.

@dashpole
Collaborator

dashpole commented Apr 4, 2019

what do you mean by stuck? What happens?

@TonySun1994
Author

> what do you mean by stuck? What happens?

That "/metrics" endpoint is not accessible or is very slow,and It crashed very quickly

@dashpole
Collaborator

dashpole commented Apr 4, 2019

Can you share the logs from when that happens?

@dcvtruong

dcvtruong commented Jun 11, 2019

I also encountered the same errors below and had to restart the affected master node. What is the root cause? Is it Docker?

k8s: 1.13.0
docker: 18.06.1-ce
k8s cluster: 3 master, 4 workers

Note that there are failures in the logs, but they are reported as info.

LOGS:

        Jun 11 11:58:33 ark2-westeuropetest01-master-01 kubelet: I0611 11:57:36.411957   11872 prober.go:111] Readiness probe for "weave-net-ljj47_kube-system(45efc710-2e41-11e9-8813-000d3a484762):weave" failed (failure): Get http://127.0.0.1:6784/status: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
	Jun 11 11:58:35 ark2-westeuropetest01-master-01 kubelet: I0611 11:57:10.110453   11872 prober.go:111] Liveness probe for "kube-scheduler-ark2-westeuropetest01-master-01_kube-system(4483c96edc67da591846e71847253bf4):kube-scheduler" failed (failure): Get http://127.0.0.1:10251/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
	Jun 11 12:00:12 ark2-westeuropetest01-master-01 kernel: xt_physdev: using --physdev-out and --physdev-is-out are only supported in the FORWARD and POSTROUTING chains with bridged traffic.
	Jun 11 12:00:58 ark2-westeuropetest01-master-01 Container_ContainerInventory[9333]: API call in GenerateImageNameMap to list images returned null
	Jun 11 12:01:40 ark2-westeuropetest01-master-01 kubelet: I0611 11:59:56.570016   11872 fsHandler.go:135] du and find on following dirs took 3m34.71022604s: [/var/lib/docker/overlay2/5944d0060bcf3480b4e500cfb2d4113c2b7ef767c30de2fbf2a6b567d331baeb/diff /var/lib/docker/containers/9bf1acadff45ad935ccab7daf539959089ea0436fd2a7e6619ff03dcc782416f]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:02:36 ark2-westeuropetest01-master-01 kubelet: I0611 12:00:07.220215   11872 fsHandler.go:135] du and find on following dirs took 3m35.410003254s: [/var/lib/docker/overlay2/36b6841b00063b2b222cad86fe3c032f239af2a7c3e6b3ef1c4ae404c34cf2b6/diff /var/lib/docker/containers/4065dcacca44d62b3538b1833f82921d73074a9d2e9b4e13da76b8795ca508f4]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:02:37 ark2-westeuropetest01-master-01 dbus[5049]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 44002ms)
	Jun 11 12:02:37 ark2-westeuropetest01-master-01 dbus[5049]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 40351ms)
	Jun 11 12:03:12 ark2-westeuropetest01-master-01 kubelet: I0611 12:00:06.119820   11872 fsHandler.go:135] du and find on following dirs took 3m35.310160444s: [/var/lib/docker/overlay2/10a8522ea927fa2c73076457d6af9d903066268f75bcabd6d1bb052f30425d77/diff /var/lib/docker/containers/614a407e36dc3c30256869a1d22956c4d1e1b0ff4e25ece89ca1e47b02a518ff]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:29 ark2-westeuropetest01-master-01 kubelet: I0611 12:01:25.973977   11872 fsHandler.go:135] du and find on following dirs took 3m36.410182743s: [/var/lib/docker/overlay2/1ddc2e77d8eb09c90e97b4b261780deed81b99c1d887db7a6a67a4d68217c50e/diff /var/lib/docker/containers/3058852d0566024eaa3e401650608b10ee5e728cee4fb226c94933addaf29996]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:41 ark2-westeuropetest01-master-01 systemd-logind: Failed to start session scope session-c9270099.scope: Connection timed out
	Jun 11 12:03:47 ark2-westeuropetest01-master-01 kubelet: I0611 12:01:59.925298   11872 fsHandler.go:135] du and find on following dirs took 3m34.360095219s: [/var/lib/docker/overlay2/cd3e4a6fffd9ef598b7818023716a351102d34eae18039915e62fd7b9c88cb00/diff /var/lib/docker/containers/4b9725f9a2a1fe79adc798332a8bfae9e21378e5b8e3e73821845a9a4dce3955]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:54 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:03:28.230979725Z" level=error msg="Handler for GET /containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
	Jun 11 12:03:56 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:03:28.231031827Z" level=error msg="Handler for GET /images/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
	Jun 11 12:03:56 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:03:28.230676609Z" level=error msg="Handler for GET /containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 dockerd: http: multiple response.WriteHeader calls
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: I0611 12:00:47.072801   11872 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /systemd/system.slice
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: I0611 11:58:40.965282   11872 prober.go:111] Liveness probe for "kube-apiserver-ark2-westeuropetest01-master-01_kube-system(1d41b2843c135cc770ca51bc51b321b5):kube-apiserver" failed (failure): Get https://192.168.6.5:6443/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: I0611 12:00:10.820233   11872 fsHandler.go:135] du and find on following dirs took 3m34.860232116s: [/var/lib/docker/overlay2/40e6d22c08c2210933f9dfeaf085901f39b5d6c3a8e00b4e906234b7d5e60bb8/diff /var/lib/docker/containers/7e63c9c0b6f78a55c0b7269bfcd2176362521b81c172eb3698bdb1f257d6ac99]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: E0611 11:59:32.117684   11872 remote_runtime.go:438] Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 dockerd: http: multiple response.WriteHeader calls
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: I0611 12:02:39.326982   11872 fsHandler.go:135] du and find on following dirs took 3m34.310701592s: [/var/lib/docker/overlay2/19ed09312413df38910a1da38fa44092169737c996d73b25cfee37e7e6011e1d/diff /var/lib/docker/containers/dbae1786b067f3b4331846437407f9e601dca5819e18e6eb2919fd89a7d8e8e6]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 Container_ImageInventory[9333]: API call in QueryAll to list images returned null
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 dockerd: http: multiple response.WriteHeader calls
	Jun 11 12:04:31 ark2-westeuropetest01-master-01 omiagent[9333]: API call in Container_HostInventory::InspectHost to get host info returned null
	Jun 11 12:04:53 ark2-westeuropetest01-master-01 kernel: xt_physdev: using --physdev-out and --physdev-is-out are only supported in the FORWARD and POSTROUTING chains with bridged traffic.
	Jun 11 12:05:58 ark2-westeuropetest01-master-01 Container_DaemonEvent[9333]: Attempt in QueryAll to get Docker events failed
	Jun 11 12:06:10 ark2-westeuropetest01-master-01 kubelet: E0611 12:00:00.819865   11872 remote_image.go:71] ListImages with filter nil from image service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
	Jun 11 12:06:24 ark2-westeuropetest01-master-01 dockerd: http: multiple response.WriteHeader calls
	Jun 11 12:06:34 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:05:58.587379844Z" level=error msg="Handler for GET /events returned error: write unix /var/run/docker.sock->@: write: broken pipe"
	Jun 11 12:06:41 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:05:58.587350043Z" level=error msg="Handler for GET /images/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"

@dashpole
Collaborator

@dcvtruong It looks like you have many symptoms, including slow du/find calls and timeouts to Docker. Most of the time, the root cause is resource constraints, including CPU and disk I/O contention. This could be made worse if the writable layers of the containers are very large.
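
One quick way to check whether large writable layers are a factor, assuming you have Docker CLI access on the node (the overlay2 path below is only an example mirroring the logs above), is:

    # The SIZE column reports each container's writable layer; very large values make du/find scans slow
    docker ps --size
    # Or measure one overlay2 upper layer directly
    du -sh /var/lib/docker/overlay2/<layer-id>/diff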

@dcvtruong

dcvtruong commented Jun 19, 2019

More RAM was added to the node to handle more requests.

@uniuuu

uniuuu commented Feb 15, 2023

Hi @dashpole @TonySun1994 @dcvtruong

I have got the same issue.
Kubernetes v1.11.0

Killing cmd [ionice -c3 nice -n 19 du -s /var/lib/docker/overlay2/53d0e673d5f10be5e46c013a1fa10965302fa103a127486a989b01f8df41ca54/diff] due to timeout 

fsHandler.go:135] du and find on following dirs took 5m52.632364142s: [/var/lib/docker/overlay2/53d0e673d5f10be5e46c013a1fa10965302fa103a127486a989b01f8df41ca54/diff /var/lib/docker/containers/e3556fe9c4ac466412f655b857a7e1f2449faf55c90d3ca6d90880a394567923]; will not log again for this container unless duration exceeds 2s

After troubleshooting, two long-running processes with high CPU usage were found in pods hosting WordPress websites.
These processes turned out to be a Coin Miner virus: https://www.virustotal.com/gui/file/9671bb38678257f57c3e8215ec164f8085b72466ddfbf94deaed4c971bde5a03
Killing these pods fixed the whole k8s cluster issue described above.
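
For anyone hitting the same symptom, a rough sketch of how such processes can be spotted (assuming metrics-server is installed for kubectl top, that the pod image ships ps, and noting that the --sort-by flag may not exist on older kubectl versions):

    # Pods consuming the most CPU across the cluster
    kubectl top pods --all-namespaces --sort-by=cpu | head
    # Then inspect the processes inside a suspicious pod
    kubectl exec -n <namespace> <pod> -- ps aux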

It was reported as a cAdvisor issue in:
Bug 1459265 - journalctl on node repeats: du and find on following dirs took
ContainerGCFailed / ImageGCFailed context deadline exceeded
and it should have been fixed by:
adaptive longOp for du operation

But the fact that it's only observed under high CPU load made this issue harder to detect.

> try running with --disable_metrics=disk,tcp,udp,percpu
>
> I've tried it, but it's still stuck. I wonder if there are other file system operations.

This was exactly how the cluster behaved. Any kubectl command got stuck, but if left running in the terminal it would eventually complete. So after waiting for the infected pods to be deleted, the cluster recovered to normal operation in the case described.
