
AKS | EMFILE: too many open files -> blobfuse-proxy causes trouble when a Node.js application opens a few thousand files #706


Closed
sphinx02 opened this issue Jun 24, 2022 · 3 comments · Fixed by #707


@sphinx02

What happened:
If we deploy the blob-csi-driver with blobfuse-proxy enabled, our deployed Node.js application hits the error EMFILE: too many open files after opening a few thousand files:
helm install blob-csi-driver blob-csi-driver/blob-csi-driver --namespace kube-system --version v1.15.0 --set node.enableBlobfuseProxy=true

If we deploy the blob-csi-driver without enabling blobfuse-proxy, the problem does not occur:
helm install blob-csi-driver blob-csi-driver/blob-csi-driver --namespace kube-system --version v1.15.0

What you expected to happen:
We expect to be able to open the same number of files when blobfuse-proxy is enabled.

How to reproduce it:

  • Spin up an AKS cluster
  • Install blob-csi-driver with blobfuse-proxy enabled: helm install blob-csi-driver blob-csi-driver/blob-csi-driver --namespace kube-system --version v1.15.0 --set node.enableBlobfuseProxy=true
  • Deploy an application that opens a few thousand files inside a mounted blob storage (for a rough approximation, see the shell sketch after this list)
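A rough shell approximation of such an application (untested sketch; the original app is Node.js, and /mnt/blob is a hypothetical mount path) is to hold a few thousand file descriptors open inside the mount:

# raise the shell's own soft limit first, if the hard limit allows it
ulimit -n 100000
cd /mnt/blob
for i in $(seq 1 5000); do
  echo x > "file$i"
  exec {fd}< "file$i"   # bash allocates a fresh fd each iteration and never closes it
done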

Environment:

  • CSI Driver version: v1.15.0
  • Kubernetes version (use kubectl version): v1.22.6 (AKS)
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS
  • Kernel (e.g. uname -a): Linux 5.4.0-1083-azure

Anything else we need to know?:

  • We already tried to increase the maxOpenFileNum setting of the Helm chart, but this does not help:
    helm install blob-csi-driver blob-csi-driver/blob-csi-driver --namespace kube-system --version v1.15.0 --set node.enableBlobfuseProxy=true --set node.blobfuseProxy.setMaxOpenFileNum=true --set node.blobfuseProxy.maxOpenFileNum=999000000
  • The problem seems to be similar to this closed (but unresolved) issue: Failed to mount storage due to too many open files in the daemonset/csi-blobfuse-proxy Azure/azure-storage-fuse#653
  • As the problem only happens when blobfuse-proxy is enabled, I guess the problem comes from blobfuse-proxy itself.
  • When the error occurs, lsof | wc -l counts only 69136 entries, which is far below maxOpenFileNum (see the per-process check after this list).
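To see whether the per-process limit of the proxy (rather than fs.file-max) is the bottleneck, one can inspect the running process directly on the node (assuming the binary is named blobfuse-proxy, as in the unit file shown below):

# effective per-process open-file limit of the running proxy
cat /proc/$(pidof blobfuse-proxy)/limits | grep 'open files'
# number of file descriptors the proxy currently holds
ls /proc/$(pidof blobfuse-proxy)/fd | wc -l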

Possible reason:
As far as I can see, blobfuse-proxy is installed as a systemd service on the nodes. Not everybody may be aware that processes started by systemd are subject to open-file limits imposed by the systemd daemon itself, see https://manpages.ubuntu.com/manpages/bionic/man5/systemd.exec.5.html#process%20properties
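These limits can be queried directly with systemctl (assuming the unit is named blobfuse-proxy.service):

# default open-file limit the systemd manager applies to units
systemctl show --property=DefaultLimitNOFILE
# limit configured on the blobfuse-proxy unit itself
systemctl show blobfuse-proxy.service --property=LimitNOFILE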

As far as I can tell, the current blobfuse-proxy service unit looks like this:

[Unit]
Description=Blobfuse proxy service

[Service]
ExecStart=/usr/bin/blobfuse-proxy --v=5 --blobfuse-proxy-endpoint=unix://var/lib/kubelet/plugins/blob.csi.azure.com/blobfuse-proxy.sock

[Install]
WantedBy=multi-user.target

That means the systemd limits are not explicitly set and may default to a low value. Setting sysctl -w fs.file-max=9000000, which the driver init does, will not get around this systemd limitation, because fs.file-max is a system-wide limit while the systemd NOFILE limit is applied per process. For comparison, the containerd service on AKS nodes is deployed like this:

[Unit]
Description=containerd daemon
After=network.target

[Service]
ExecStartPre=/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
Delegate=yes
KillMode=process
Restart=always
OOMScoreAdjust=-999
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
TasksMax=infinity

[Install]
WantedBy=multi-user.target

As one can see, LimitNOFILE and other limits are explicitly set to infinity, which allows the workloads running in that context to open more files than the systemd defaults allow.
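Until a fix lands in the driver, a possible workaround (untested sketch; assumes the unit is named blobfuse-proxy.service) would be a systemd drop-in override that raises the limit the same way containerd does:

mkdir -p /etc/systemd/system/blobfuse-proxy.service.d
cat <<'EOF' > /etc/systemd/system/blobfuse-proxy.service.d/override.conf
[Service]
LimitNOFILE=infinity
EOF
systemctl daemon-reload
systemctl restart blobfuse-proxy   # note: restarting the proxy may disrupt existing blobfuse mounts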

I could not test this, but the problem may well lie in these systemd limits. Let me know if I can provide further useful information.

@andyzhangx
Member

@sphinx02 thanks, that's a very nice analysis. I have worked out a PR to fix the issue: #707

@sphinx02
Author

@andyzhangx thank you very much for this speedy PR. Could you already check whether it fixes the issue with too many open files? I can test it as soon as a release is available. Is there a time schedule for the next release? And is there a reason why the 1.15 release is not yet shown under GitHub releases?
