
To reduce the lock pressure on the kernfs file system, we should reduce the frequency of cgroup reads. #3528


xigang opened this issue May 9, 2024 · 3 comments

xigang commented May 9, 2024

Intro:
We call the cAdvisor subcontainers API to obtain data for all cgroup containers. As QPS increases, the frequency of cgroup reads grows, which aggravates contention on the kernfs file system lock and can even cause the machine to hang.

Question:
When we call GetInfo to fetch ContainerInfo data, should we cache the result so that the cgroup is read less frequently?

https://github.com/google/cadvisor/blob/54dff2b8ccb147747d814b0ff3b4a3256dc569c0/manager/manager.go#L543C19-L543C47

func (m *manager) containerDataToContainerInfo(cont *containerData, query *info.ContainerInfoRequest) (*info.ContainerInfo, error) {
	// Get the info from the container.
	cinfo, err := cont.GetInfo(true)  
	if err != nil {
		return nil, err
	}

	stats, err := m.memoryCache.RecentStats(cinfo.Name, query.Start, query.End, query.NumStats)
	if err != nil {
		return nil, err
	}

	// Make a copy of the info for the user.
	ret := &info.ContainerInfo{
		ContainerReference: cinfo.ContainerReference,
		Subcontainers:      cinfo.Subcontainers,
		Spec:               m.getAdjustedSpec(cinfo),
		Stats:              stats,
	}
	return ret, nil
}

If GetInfo(shouldUpdateSubcontainers bool) is called with shouldUpdateSubcontainers set to true, both updateSpec() and updateSubcontainers() run on every request and re-read cgroup data, which increases the pressure on the kernfs lock. (One possible mitigation is sketched after the code below.)

func (cd *containerData) GetInfo(shouldUpdateSubcontainers bool) (*containerInfo, error) {
	// Get spec and subcontainers.
	if cd.clock.Since(cd.infoLastUpdatedTime) > 5*time.Second || shouldUpdateSubcontainers {
		err := cd.updateSpec()
		if err != nil {
			return nil, err
		}
		if shouldUpdateSubcontainers {
			err = cd.updateSubcontainers()
			if err != nil {
				return nil, err
			}
		}
		cd.infoLastUpdatedTime = cd.clock.Now()
	}
	cd.lock.Lock()
	defer cd.lock.Unlock()
	cInfo := containerInfo{
		Subcontainers: cd.info.Subcontainers,
		Spec:          cd.info.Spec,
	}
	cInfo.Id = cd.info.Id
	cInfo.Name = cd.info.Name
	cInfo.Aliases = cd.info.Aliases
	cInfo.Namespace = cd.info.Namespace
	return &cInfo, nil
}
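
One possible direction, sketched below, is to let the existing time-based throttle in GetInfo also govern the subcontainer refresh, so that shouldUpdateSubcontainers=true only decides what gets refreshed within the throttled window instead of forcing a cgroup read on every request. This is a rough sketch against the containerData fields shown above, not a tested patch; the infoRefreshInterval constant is introduced here purely for illustration (the current code hard-codes 5*time.Second).

// Sketch: refresh the spec and (optionally) the subcontainers at most once
// per interval, even when callers pass shouldUpdateSubcontainers=true on
// every request. infoRefreshInterval is illustrative only.
const infoRefreshInterval = 5 * time.Second

func (cd *containerData) GetInfo(shouldUpdateSubcontainers bool) (*containerInfo, error) {
	if cd.clock.Since(cd.infoLastUpdatedTime) > infoRefreshInterval {
		if err := cd.updateSpec(); err != nil {
			return nil, err
		}
		if shouldUpdateSubcontainers {
			if err := cd.updateSubcontainers(); err != nil {
				return nil, err
			}
		}
		cd.infoLastUpdatedTime = cd.clock.Now()
	}
	cd.lock.Lock()
	defer cd.lock.Unlock()
	cInfo := containerInfo{
		Subcontainers: cd.info.Subcontainers,
		Spec:          cd.info.Spec,
	}
	cInfo.Id = cd.info.Id
	cInfo.Name = cd.info.Name
	cInfo.Aliases = cd.info.Aliases
	cInfo.Namespace = cd.info.Namespace
	return &cInfo, nil
}

The trade-off is that a freshly created subcontainer may not show up in API responses for up to one interval; whether that staleness is acceptable is exactly the question above.
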
bhaveshdavda commented

Question: do you think this frequent cadvisor kernfs locking could lead to a deadlock situation on K8s nodes like this?

  1. cadvisor crawls sysfs, grabbing the kernfs read-lock
  2. kubelet liveness probes on critical system components (e.g. mofed from NVIDIA's network-operator, which installs the mlx5_core drivers) time out because the probe (sh -c lsmod | grep mlx5_core) blocks on the kernfs read-lock while cadvisor scrapes
  3. On successive liveness probe failures, kubelet attempts to kill the pod, which in this case means killing containers with NICs driven by those mlx5_core drivers. This surely involves a kworker thread trying to grab a write-lock on kernfs to change sysfs nodes related to those devices and drivers
  4. The write-locker blocks on cadvisor, which holds the read-lock, and multiple other processes (potentially cadvisor itself too?) block waiting on that write-locker, leading to the deadlock
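
For intuition about step 4, the pile-up resembles ordinary reader/writer-lock convoying in user space: once a writer is queued behind a long-held read lock, later readers queue behind that writer as well. The Go sketch below uses a plain sync.RWMutex purely as an analogy; it does not reproduce kernfs_rwsem, whose kernel-side fairness rules differ.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy analogy: a long-held read lock (the "cadvisor scrape") delays a
// writer (the "kworker" tearing down sysfs nodes), and a later reader
// (the "liveness probe") queues behind that pending writer.
func main() {
	var rw sync.RWMutex

	rw.RLock() // long-running scrape holds the read lock

	go func() {
		rw.Lock() // writer blocks until the scrape releases the read lock
		defer rw.Unlock()
		fmt.Println("writer ran")
	}()
	time.Sleep(100 * time.Millisecond) // let the writer queue up

	done := make(chan struct{})
	go func() {
		rw.RLock() // new reader blocks behind the queued writer
		rw.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("probe ran immediately")
	case <-time.After(200 * time.Millisecond):
		fmt.Println("probe is stuck behind the queued writer")
	}

	rw.RUnlock() // scrape finishes; writer, then probe, proceed
	<-done
	fmt.Println("probe ran after the writer")
}

Here the "probe" reader stays blocked for as long as the initial read lock is held; in the node scenario above that role is played by every task touching sysfs, which is why the whole machine can appear to hang rather than a single process.
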


xigang commented Feb 16, 2025

> Question: do you think this frequent cadvisor kernfs locking could lead to a deadlock situation on K8s nodes like this?
>
>   1. cadvisor crawls sysfs, grabbing the kernfs read-lock
>   2. kubelet liveness probes on critical system components (e.g. mofed from NVIDIA's network-operator, which installs the mlx5_core drivers) time out because the probe (sh -c lsmod | grep mlx5_core) blocks on the kernfs read-lock while cadvisor scrapes
>   3. On successive liveness probe failures, kubelet attempts to kill the pod, which in this case means killing containers with NICs driven by those mlx5_core drivers. This surely involves a kworker thread trying to grab a write-lock on kernfs to change sysfs nodes related to those devices and drivers
>   4. The write-locker blocks on cadvisor, which holds the read-lock, and multiple other processes (potentially cadvisor itself too?) block waiting on that write-locker, leading to the deadlock

@bhaveshdavda

Yes. On such a machine, cadvisor, kubelet, and other processes (e.g., a node-manager that frequently reads and writes cgroups) can all hit a kernfs deadlock. The specific trigger is:

Source: When offline instances on large nodes are being suppressed/evicted, their CPU quota is set very low (1 CPU core), so the Java C1 compiler and other threads (2000+) in those instances are throttled frequently. The throttling intensity increases while a throttled thread is holding the kernfs_rwsem lock, and with that many threads the lock waits stretch from 10 ms to several seconds.

Escalation: Multiple lock chains further amplify the impact (taking cgroup deletion as an example: inode.rw_sem (head lock) -> cgroup_mutex (middle lock) -> kernfs_rwsem (tail lock)). When the tail lock is blocked, the wait time at the head of the chain grows exponentially, leading to stalls that last from minutes to hours and ultimately to system-wide freezes.
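
Independent of any change inside GetInfo, callers can shed most of this kernfs read traffic by caching API responses for a short TTL, so request QPS no longer translates one-to-one into cgroup reads. A minimal sketch of that pattern follows; the cachedSubcontainers type, its ttl field, and the fetch callback (for example an HTTP GET of cadvisor's /api/v1.3/subcontainers endpoint) are hypothetical and only illustrate the idea.

package subcache

import (
	"sync"
	"time"
)

// cachedSubcontainers collapses many concurrent readers onto at most one
// upstream fetch per TTL window. Everything here is illustrative; it is
// not part of cadvisor.
type cachedSubcontainers struct {
	mu        sync.Mutex
	ttl       time.Duration
	fetchedAt time.Time
	cached    []byte
	fetch     func() ([]byte, error) // e.g. GET /api/v1.3/subcontainers/<name>
}

func (c *cachedSubcontainers) Get() ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cached != nil && time.Since(c.fetchedAt) < c.ttl {
		return c.cached, nil // served from cache: no cgroup read upstream
	}
	data, err := c.fetch()
	if err != nil {
		return nil, err
	}
	c.cached, c.fetchedAt = data, time.Now()
	return data, nil
}

Holding the mutex across fetch() also deduplicates concurrent cache misses (a crude single-flight), which matters most during QPS spikes, at the cost of one slow upstream request briefly blocking its peers.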


xigang commented Feb 16, 2025

/assign @iwankgb @kolyshkin
