To reduce the lock pressure on the kernfs file system, we should reduce the frequency of cgroup reads. #3528
Comments
Question: do you think this
Yes. On the machine, cadvisor, kubelet, and other processes that frequently read and write cgroups (e.g., node-manager) can run into a kernfs deadlock. The specific trigger is:

Source: When offline instances on large nodes are being suppressed/evicted, the CPU quota is set too low (1 CPU core), so the Java C1 compiler threads and other threads (2000+) in the offline instances are throttled frequently. The throttling intensifies while a throttled thread holds the kernfs_rwsem lock; with so many threads, lock waits range from 10 ms to several seconds.

Escalation: Lock chains further amplify the impact. Taking cgroup deletion as an example, the chain is inode.rw_sem (head lock) -> cgroup_mutex (middle lock) -> kernfs_rwsem (tail lock). If the tail lock is blocked, the wait time at the head of the chain increases exponentially, leading to stalls that last from minutes to hours and ultimately triggering system-wide freezes.
/assign @iwankgb @kolyshkin
Intro:
We call the cadvisor subcontainers API to obtain data for all cgroup containers. As QPS increases, the frequency of cgroup reads increases as well, which aggravates lock pressure on the kernfs file system and can even cause the machine to hang.
Question:
Do we need to read the cgroup less frequently, by caching the data, when we call `GetInfo` to fetch `ContainerInfo` data? https://github.com/google/cadvisor/blob/54dff2b8ccb147747d814b0ff3b4a3256dc569c0/manager/manager.go#L543C19-L543C47
If the `shouldUpdateSubcontainers` parameter of `GetInfo(shouldUpdateSubcontainers bool)` is set to true, the `updateSpec()` function is called on each request to read the cgroup data, which increases the pressure on the kernfs lock.