
du and find on following dirs took long time #2212

Open
TonySun1994 opened this issue Apr 3, 2019 · 9 comments

@TonySun1994

Running cAdvisor as a container, the log shows the following:
I0328 13:41:29.716552 1 fsHandler.go:135] du and find on following dirs took 13.006500097s: [/rootfs/vdata/docker/overlay2/1259cbee3d31494942129513374cc8702ad68c676b700638534a52b19aadcb26/diff /rootfs/vdata/docker/containers/d866dd5101d9788bdf71a7aacd012bae15c2c304a4b70c80b2647c05a7ebc050]; will not log again for this container unless duration exceeds 5s
I0328 13:41:32.106341 1 fsHandler.go:135] du and find on following dirs took 13.308135764s: [/rootfs/vdata/docker/overlay2/5dfa8775376b9725b0af27edb4ba93a2a9da368d4099633b49f6dba3ae5e6a70/diff /rootfs/vdata/docker/containers/50a1a11de04b411c270683dc834696d1150433120d44deb70bae3ba109a54b06]; will not log again for this container unless duration exceeds 6s
I0328 13:41:32.204471 1 fsHandler.go:135] du and find on following dirs took 15.599689255s: [/rootfs/vdata/docker/overlay2/102debde6d41426b202167877b92bd61b4d4d2f3f9b67c9c4241b760898fa9cb/diff /rootfs/vdata/docker/containers/45b76dbc340d3bb358ec9ec08d5341822575ed68245ea2976596bf17dc2d3a9f]; will not log again for this container unless duration exceeds 6s
I0328 13:41:32.804843 1 fsHandler.go:135] du and find on following dirs took 14.19422831s: [/rootfs/vdata/docker/overlay2/a426d4717c6454ed568e028f95d055f86cebaba26048ac09c019a758d9456158/diff /rootfs/vdata/docker/containers/32d2324f8f4f5727eaf61579148f8915e6019361f032ddcc11c66cb9685b11ff]; will not log again for this container unless duration exceeds 6s
I0328 13:42:15.917654 1 fsHandler.go:135] du and find on following dirs took 11.919525783s: [/rootfs/vdata/docker/overlay2/f2d7561691dd2f3aaee9bf753a152c793edb987717d9e56381df65e4532ec4ee/diff /rootfs/vdata/docker/containers/4cec80a94dabed251376f50c9b24a578d0a6525bfbb096ce73b63adff97db315]; will not log again for this container unless duration exceeds 2s

I know this is caused by the deep file system hierarchy and the large number of small files, so I want to disable metrics collection for the file system and disk and focus on collecting CPU and memory instead. Right now the web service cannot be started and "/metrics" cannot be accessed.

@dashpole
Collaborator

dashpole commented Apr 3, 2019

try running with --disable_metrics=disk,tcp,udp,percpu
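
For reference, if cAdvisor runs as a Docker container, the flag goes after the image name. A minimal sketch, assuming the standard run command from the cAdvisor README (image tag, port, and mounts may differ in your setup):

    docker run -d --name=cadvisor \
      --volume=/:/rootfs:ro \
      --volume=/var/run:/var/run:ro \
      --volume=/sys:/sys:ro \
      --volume=/var/lib/docker/:/var/lib/docker:ro \
      --publish=8080:8080 \
      google/cadvisor:latest \
      --disable_metrics=disk,tcp,udp,percpu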

@TonySun1994
Author

> try running with --disable_metrics=disk,tcp,udp,percpu

I've tried it, but it's still stuck. I wonder if there are other file system operations.

@dashpole
Collaborator

dashpole commented Apr 4, 2019

what do you mean by stuck? What happens?

@TonySun1994
Author

> what do you mean by stuck? What happens?

That "/metrics" endpoint is not accessible or is very slow,and It crashed very quickly

@dashpole
Collaborator

dashpole commented Apr 4, 2019

Can you share the logs from when that happens?

@dcvtruong

dcvtruong commented Jun 11, 2019

I also encountered the same errors below and had to restart the affected master node. What is the root cause? Is it Docker?

k8s: 1.13.0
docker: 18.06.1-ce
k8s cluster: 3 master, 4 workers

Note that there are failures in the logs, but they are reported as info.

LOGS:

        Jun 11 11:58:33 ark2-westeuropetest01-master-01 kubelet: I0611 11:57:36.411957   11872 prober.go:111] Readiness probe for "weave-net-ljj47_kube-system(45efc710-2e41-11e9-8813-000d3a484762):weave" failed (failure): Get http://127.0.0.1:6784/status: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
	Jun 11 11:58:35 ark2-westeuropetest01-master-01 kubelet: I0611 11:57:10.110453   11872 prober.go:111] Liveness probe for "kube-scheduler-ark2-westeuropetest01-master-01_kube-system(4483c96edc67da591846e71847253bf4):kube-scheduler" failed (failure): Get http://127.0.0.1:10251/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
	Jun 11 12:00:12 ark2-westeuropetest01-master-01 kernel: xt_physdev: using --physdev-out and --physdev-is-out are only supported in the FORWARD and POSTROUTING chains with bridged traffic.
	Jun 11 12:00:58 ark2-westeuropetest01-master-01 Container_ContainerInventory[9333]: API call in GenerateImageNameMap to list images returned null
	Jun 11 12:01:40 ark2-westeuropetest01-master-01 kubelet: I0611 11:59:56.570016   11872 fsHandler.go:135] du and find on following dirs took 3m34.71022604s: [/var/lib/docker/overlay2/5944d0060bcf3480b4e500cfb2d4113c2b7ef767c30de2fbf2a6b567d331baeb/diff /var/lib/docker/containers/9bf1acadff45ad935ccab7daf539959089ea0436fd2a7e6619ff03dcc782416f]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:02:36 ark2-westeuropetest01-master-01 kubelet: I0611 12:00:07.220215   11872 fsHandler.go:135] du and find on following dirs took 3m35.410003254s: [/var/lib/docker/overlay2/36b6841b00063b2b222cad86fe3c032f239af2a7c3e6b3ef1c4ae404c34cf2b6/diff /var/lib/docker/containers/4065dcacca44d62b3538b1833f82921d73074a9d2e9b4e13da76b8795ca508f4]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:02:37 ark2-westeuropetest01-master-01 dbus[5049]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 44002ms)
	Jun 11 12:02:37 ark2-westeuropetest01-master-01 dbus[5049]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 40351ms)
	Jun 11 12:03:12 ark2-westeuropetest01-master-01 kubelet: I0611 12:00:06.119820   11872 fsHandler.go:135] du and find on following dirs took 3m35.310160444s: [/var/lib/docker/overlay2/10a8522ea927fa2c73076457d6af9d903066268f75bcabd6d1bb052f30425d77/diff /var/lib/docker/containers/614a407e36dc3c30256869a1d22956c4d1e1b0ff4e25ece89ca1e47b02a518ff]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:29 ark2-westeuropetest01-master-01 kubelet: I0611 12:01:25.973977   11872 fsHandler.go:135] du and find on following dirs took 3m36.410182743s: [/var/lib/docker/overlay2/1ddc2e77d8eb09c90e97b4b261780deed81b99c1d887db7a6a67a4d68217c50e/diff /var/lib/docker/containers/3058852d0566024eaa3e401650608b10ee5e728cee4fb226c94933addaf29996]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:41 ark2-westeuropetest01-master-01 systemd-logind: Failed to start session scope session-c9270099.scope: Connection timed out
	Jun 11 12:03:47 ark2-westeuropetest01-master-01 kubelet: I0611 12:01:59.925298   11872 fsHandler.go:135] du and find on following dirs took 3m34.360095219s: [/var/lib/docker/overlay2/cd3e4a6fffd9ef598b7818023716a351102d34eae18039915e62fd7b9c88cb00/diff /var/lib/docker/containers/4b9725f9a2a1fe79adc798332a8bfae9e21378e5b8e3e73821845a9a4dce3955]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:54 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:03:28.230979725Z" level=error msg="Handler for GET /containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
	Jun 11 12:03:56 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:03:28.231031827Z" level=error msg="Handler for GET /images/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
	Jun 11 12:03:56 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:03:28.230676609Z" level=error msg="Handler for GET /containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 dockerd: http: multiple response.WriteHeader calls
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: I0611 12:00:47.072801   11872 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /systemd/system.slice
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: I0611 11:58:40.965282   11872 prober.go:111] Liveness probe for "kube-apiserver-ark2-westeuropetest01-master-01_kube-system(1d41b2843c135cc770ca51bc51b321b5):kube-apiserver" failed (failure): Get https://192.168.6.5:6443/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: I0611 12:00:10.820233   11872 fsHandler.go:135] du and find on following dirs took 3m34.860232116s: [/var/lib/docker/overlay2/40e6d22c08c2210933f9dfeaf085901f39b5d6c3a8e00b4e906234b7d5e60bb8/diff /var/lib/docker/containers/7e63c9c0b6f78a55c0b7269bfcd2176362521b81c172eb3698bdb1f257d6ac99]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: E0611 11:59:32.117684   11872 remote_runtime.go:438] Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 dockerd: http: multiple response.WriteHeader calls
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 kubelet: I0611 12:02:39.326982   11872 fsHandler.go:135] du and find on following dirs took 3m34.310701592s: [/var/lib/docker/overlay2/19ed09312413df38910a1da38fa44092169737c996d73b25cfee37e7e6011e1d/diff /var/lib/docker/containers/dbae1786b067f3b4331846437407f9e601dca5819e18e6eb2919fd89a7d8e8e6]; will not log again for this container unless duration exceeds 2s
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 Container_ImageInventory[9333]: API call in QueryAll to list images returned null
	Jun 11 12:03:58 ark2-westeuropetest01-master-01 dockerd: http: multiple response.WriteHeader calls
	Jun 11 12:04:31 ark2-westeuropetest01-master-01 omiagent[9333]: API call in Container_HostInventory::InspectHost to get host info returned null
	Jun 11 12:04:53 ark2-westeuropetest01-master-01 kernel: xt_physdev: using --physdev-out and --physdev-is-out are only supported in the FORWARD and POSTROUTING chains with bridged traffic.
	Jun 11 12:05:58 ark2-westeuropetest01-master-01 Container_DaemonEvent[9333]: Attempt in QueryAll to get Docker events failed
	Jun 11 12:06:10 ark2-westeuropetest01-master-01 kubelet: E0611 12:00:00.819865   11872 remote_image.go:71] ListImages with filter nil from image service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
	Jun 11 12:06:24 ark2-westeuropetest01-master-01 dockerd: http: multiple response.WriteHeader calls
	Jun 11 12:06:34 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:05:58.587379844Z" level=error msg="Handler for GET /events returned error: write unix /var/run/docker.sock->@: write: broken pipe"
	Jun 11 12:06:41 ark2-westeuropetest01-master-01 dockerd: time="2019-06-11T12:05:58.587350043Z" level=error msg="Handler for GET /images/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"

@dashpole
Collaborator

@dcvtruong It looks like you have many symptoms, including slow du/find calls and timeouts to Docker. Most of the time, the root cause is resource constraints, including CPU and disk I/O contention. This could be made worse if the writable layers of the containers are very large.
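
One quick way to check whether large writable layers are a factor, assuming you have Docker CLI access on the node (the overlay2 path below is only an example mirroring the logs above), is:

    # The SIZE column reports each container's writable layer; very large values make du/find scans slow
    docker ps --size
    # Or measure one overlay2 upper layer directly
    du -sh /var/lib/docker/overlay2/<layer-id>/diff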

@dcvtruong

dcvtruong commented Jun 19, 2019

More RAM was added to the node to handle more requests.

@uniuuu

uniuuu commented Feb 15, 2023

Hi @dashpole @TonySun1994 @dcvtruong

I have got the same issue.
Kubernetes v1.11.0

Killing cmd [ionice -c3 nice -n 19 du -s /var/lib/docker/overlay2/53d0e673d5f10be5e46c013a1fa10965302fa103a127486a989b01f8df41ca54/diff] due to timeout 

fsHandler.go:135] du and find on following dirs took 5m52.632364142s: [/var/lib/docker/overlay2/53d0e673d5f10be5e46c013a1fa10965302fa103a127486a989b01f8df41ca54/diff /var/lib/docker/containers/e3556fe9c4ac466412f655b857a7e1f2449faf55c90d3ca6d90880a394567923]; will not log again for this container unless duration exceeds 2s

After troubleshooting, two long-running processes with high CPU usage were found in pods hosting WordPress websites.
These processes turned out to be a Coin Miner virus: https://www.virustotal.com/gui/file/9671bb38678257f57c3e8215ec164f8085b72466ddfbf94deaed4c971bde5a03
Killing these pods fixed the whole k8s cluster issue described above.
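
For anyone hitting the same symptom, a rough sketch of how such processes can be spotted (assuming metrics-server is installed for kubectl top, that the pod image ships ps, and noting that the --sort-by flag may not exist on older kubectl versions):

    # Pods consuming the most CPU across the cluster
    kubectl top pods --all-namespaces --sort-by=cpu | head
    # Then inspect the processes inside a suspicious pod
    kubectl exec -n <namespace> <pod> -- ps aux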

It was reported as a cAdvisor issue in:
Bug 1459265 - journalctl on node repeats: du and find on following dirs took
ContainerGCFailed / ImageGCFailed context deadline exceeded
and it should have been fixed by:
adaptive longOp for du operation

But the fact that it's only observed under high CPU load made this issue harder to detect.

> try running with --disable_metrics=disk,tcp,udp,percpu
>
> I've tried it, but it's still stuck. I wonder if there are other file system operations.

This was exactly how the cluster behaved. Any kubectl command got stuck, but if left running in the terminal it would eventually complete. So after waiting for the infected pods to be deleted, the cluster recovered to normal operation in the case described.
