
To reduce the lock pressure on the kernfs file system, we should reduce the frequency of cgroup reads. #3528


xigang opened this issue May 9, 2024 · 3 comments

xigang commented May 9, 2024

Intro:
We call the cAdvisor subcontainers API to obtain data for all cgroup containers. As QPS increases, the frequency of cgroup reads grows, which aggravates contention on the kernfs file system lock and can even cause the machine to hang.

Question:
When we call GetInfo to fetch ContainerInfo data, should we cache the result so that the cgroup is read less frequently?

https://github.com/google/cadvisor/blob/54dff2b8ccb147747d814b0ff3b4a3256dc569c0/manager/manager.go#L543C19-L543C47

func (m *manager) containerDataToContainerInfo(cont *containerData, query *info.ContainerInfoRequest) (*info.ContainerInfo, error) {
	// Get the info from the container.
	cinfo, err := cont.GetInfo(true)  
	if err != nil {
		return nil, err
	}

	stats, err := m.memoryCache.RecentStats(cinfo.Name, query.Start, query.End, query.NumStats)
	if err != nil {
		return nil, err
	}

	// Make a copy of the info for the user.
	ret := &info.ContainerInfo{
		ContainerReference: cinfo.ContainerReference,
		Subcontainers:      cinfo.Subcontainers,
		Spec:               m.getAdjustedSpec(cinfo),
		Stats:              stats,
	}
	return ret, nil
}

If GetInfo(shouldUpdateSubcontainers bool) is called with shouldUpdateSubcontainers set to true, both updateSpec() and updateSubcontainers() run on every request and re-read cgroup data, which increases the pressure on the kernfs lock. (One possible mitigation is sketched after the code below.)

func (cd *containerData) GetInfo(shouldUpdateSubcontainers bool) (*containerInfo, error) {
	// Get spec and subcontainers.
	if cd.clock.Since(cd.infoLastUpdatedTime) > 5*time.Second || shouldUpdateSubcontainers {
		err := cd.updateSpec()
		if err != nil {
			return nil, err
		}
		if shouldUpdateSubcontainers {
			err = cd.updateSubcontainers()
			if err != nil {
				return nil, err
			}
		}
		cd.infoLastUpdatedTime = cd.clock.Now()
	}
	cd.lock.Lock()
	defer cd.lock.Unlock()
	cInfo := containerInfo{
		Subcontainers: cd.info.Subcontainers,
		Spec:          cd.info.Spec,
	}
	cInfo.Id = cd.info.Id
	cInfo.Name = cd.info.Name
	cInfo.Aliases = cd.info.Aliases
	cInfo.Namespace = cd.info.Namespace
	return &cInfo, nil
}
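
One possible direction, sketched below, is to let the existing time-based throttle in GetInfo also govern the subcontainer refresh, so that shouldUpdateSubcontainers=true only decides what gets refreshed within the throttled window instead of forcing a cgroup read on every request. This is a rough sketch against the containerData fields shown above, not a tested patch; the infoRefreshInterval constant is introduced here purely for illustration (the current code hard-codes 5*time.Second).

// Sketch: refresh the spec and (optionally) the subcontainers at most once
// per interval, even when callers pass shouldUpdateSubcontainers=true on
// every request. infoRefreshInterval is illustrative only.
const infoRefreshInterval = 5 * time.Second

func (cd *containerData) GetInfo(shouldUpdateSubcontainers bool) (*containerInfo, error) {
	if cd.clock.Since(cd.infoLastUpdatedTime) > infoRefreshInterval {
		if err := cd.updateSpec(); err != nil {
			return nil, err
		}
		if shouldUpdateSubcontainers {
			if err := cd.updateSubcontainers(); err != nil {
				return nil, err
			}
		}
		cd.infoLastUpdatedTime = cd.clock.Now()
	}
	cd.lock.Lock()
	defer cd.lock.Unlock()
	cInfo := containerInfo{
		Subcontainers: cd.info.Subcontainers,
		Spec:          cd.info.Spec,
	}
	cInfo.Id = cd.info.Id
	cInfo.Name = cd.info.Name
	cInfo.Aliases = cd.info.Aliases
	cInfo.Namespace = cd.info.Namespace
	return &cInfo, nil
}

The trade-off is that a freshly created subcontainer may not show up in API responses for up to one interval; whether that staleness is acceptable is exactly the question above.
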
bhaveshdavda commented

Question: do you think this frequent cadvisor kernfs locking could lead to a deadlock situation on K8s nodes like this?

  1. cadvisor crawls sysfs, grabbing the kernfs read-lock
  2. kubelet liveness probes on critical system components (e.g. mofed from NVIDIA's network-operator, which installs the mlx5_core drivers) time out because the probe (sh -c lsmod | grep mlx5_core) blocks on the kernfs read-lock while cadvisor scrapes
  3. On successive liveness probe failures, kubelet attempts to kill the pod, which in this case means killing containers with NICs driven by those mlx5_core drivers. This surely involves a kworker thread trying to grab a write-lock on kernfs to change sysfs nodes related to those devices and drivers
  4. The write-locker blocks on cadvisor, which holds the read-lock, and multiple other processes (potentially cadvisor itself too?) block waiting on that write-locker, leading to the deadlock
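
For intuition about step 4, the pile-up resembles ordinary reader/writer-lock convoying in user space: once a writer is queued behind a long-held read lock, later readers queue behind that writer as well. The Go sketch below uses a plain sync.RWMutex purely as an analogy; it does not reproduce kernfs_rwsem, whose kernel-side fairness rules differ.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy analogy: a long-held read lock (the "cadvisor scrape") delays a
// writer (the "kworker" tearing down sysfs nodes), and a later reader
// (the "liveness probe") queues behind that pending writer.
func main() {
	var rw sync.RWMutex

	rw.RLock() // long-running scrape holds the read lock

	go func() {
		rw.Lock() // writer blocks until the scrape releases the read lock
		defer rw.Unlock()
		fmt.Println("writer ran")
	}()
	time.Sleep(100 * time.Millisecond) // let the writer queue up

	done := make(chan struct{})
	go func() {
		rw.RLock() // new reader blocks behind the queued writer
		rw.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("probe ran immediately")
	case <-time.After(200 * time.Millisecond):
		fmt.Println("probe is stuck behind the queued writer")
	}

	rw.RUnlock() // scrape finishes; writer, then probe, proceed
	<-done
	fmt.Println("probe ran after the writer")
}

Here the "probe" reader stays blocked for as long as the initial read lock is held; in the node scenario above that role is played by every task touching sysfs, which is why the whole machine can appear to hang rather than a single process.
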


xigang commented Feb 16, 2025

> Question: do you think this frequent cadvisor kernfs locking could lead to a deadlock situation on K8s nodes like this?
>
>   1. cadvisor crawls sysfs, grabbing the kernfs read-lock
>   2. kubelet liveness probes on critical system components (e.g. mofed from NVIDIA's network-operator, which installs the mlx5_core drivers) time out because the probe (sh -c lsmod | grep mlx5_core) blocks on the kernfs read-lock while cadvisor scrapes
>   3. On successive liveness probe failures, kubelet attempts to kill the pod, which in this case means killing containers with NICs driven by those mlx5_core drivers. This surely involves a kworker thread trying to grab a write-lock on kernfs to change sysfs nodes related to those devices and drivers
>   4. The write-locker blocks on cadvisor, which holds the read-lock, and multiple other processes (potentially cadvisor itself too?) block waiting on that write-locker, leading to the deadlock

@bhaveshdavda

Yes. On such a machine, cadvisor, kubelet, and other processes (e.g., a node-manager that frequently reads and writes cgroups) can all hit a kernfs deadlock. The specific trigger is:

Source: When offline instances on large nodes are being suppressed/evicted, their CPU quota is set very low (1 CPU core), so the Java C1 compiler and other threads (2000+) in those instances are throttled frequently. The throttling intensity increases while a throttled thread is holding the kernfs_rwsem lock, and with that many threads the lock waits stretch from 10 ms to several seconds.

Escalation: Multiple lock chains further amplify the impact (taking cgroup deletion as an example: inode.rw_sem (head lock) -> cgroup_mutex (middle lock) -> kernfs_rwsem (tail lock)). When the tail lock is blocked, the wait time at the head of the chain grows exponentially, leading to stalls that last from minutes to hours and ultimately to system-wide freezes.
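
Independent of any change inside GetInfo, callers can shed most of this kernfs read traffic by caching API responses for a short TTL, so request QPS no longer translates one-to-one into cgroup reads. A minimal sketch of that pattern follows; the cachedSubcontainers type, its ttl field, and the fetch callback (for example an HTTP GET of cadvisor's /api/v1.3/subcontainers endpoint) are hypothetical and only illustrate the idea.

package subcache

import (
	"sync"
	"time"
)

// cachedSubcontainers collapses many concurrent readers onto at most one
// upstream fetch per TTL window. Everything here is illustrative; it is
// not part of cadvisor.
type cachedSubcontainers struct {
	mu        sync.Mutex
	ttl       time.Duration
	fetchedAt time.Time
	cached    []byte
	fetch     func() ([]byte, error) // e.g. GET /api/v1.3/subcontainers/<name>
}

func (c *cachedSubcontainers) Get() ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cached != nil && time.Since(c.fetchedAt) < c.ttl {
		return c.cached, nil // served from cache: no cgroup read upstream
	}
	data, err := c.fetch()
	if err != nil {
		return nil, err
	}
	c.cached, c.fetchedAt = data, time.Now()
	return data, nil
}

Holding the mutex across fetch() also deduplicates concurrent cache misses (a crude single-flight), which matters most during QPS spikes, at the cost of one slow upstream request briefly blocking its peers.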


xigang commented Feb 16, 2025

/assign @iwankgb @kolyshkin
