-
Notifications
You must be signed in to change notification settings - Fork 8
Export libcontainer/cgroups from runc #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Kir Kolyshkin <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
Kir Kolyshkin (24): *: stop using pkg/errors libct/cg: stop using pkg/errors libct/cg/ebpf: stop using pkg/errors libct/cg/devices: stop using pkg/errors .golangci.yml: enable errorlint *: ignore errorlint warnings about unix.* errors *: use errors.As and errors.Is tty.go: don't use pkg/errors, use errors.Is libct/keys: stop using pkg/errors libct: fix errorlint warning about strconv.NumError *: fmt.Errorf: use %w when appropriate libct/rootfs: improve some errors libct: wrap unix.Mount/Unmount errors libct/cg/fs2: fix/unify parsing errors libct/cg/fs: fix/unify parsing errors libct/cg/fscommon: introduce and use ParseError libct/cg/fs[2]: simplify getting pid stats libct/cg/fs/stats_util_test: fix errors libct/StartInitialization: fix errors libct/cg/fs/*_test: simplify errors ... LGTMs: AkihiroSuda cyphar Closes #3011
Signed-off-by: Kir Kolyshkin <[email protected]>
Since device updates in cgroup v2 are atomic for systemd, there is no need to freeze the processes before running the updates. Signed-off-by: Odin Ugedal <[email protected]>
If a control group is frozen, all its descendants will report FROZEN in freezer.state cgroup file. OTOH cgroup v2 cgroup.freeze is not reporting the cgroup as frozen unless it is frozen directly (i.e. not via an ancestor). Fix the discrepancy between v1 and v2 drivers behavior by looking into freezer.self_freezing cgroup file, which, according to kernel documentation, will show 1 iff the cgroup was frozen directly. Co-authored-by: Kir Kolyshkin <[email protected]> Signed-off-by: Odin Ugedal <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
This test the issues fixed by the two preceding commits. Co-Authored-By: Kir Kolyshkin <[email protected]> Signed-off-by: Odin Ugedal <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
ci: enable unconvert linter, fix its warnings
Odin Ugedal (2): libct/cg/sd: Don't freeze cgroup on cgroup v2 Set Update device update tests LGTMs: kolyshkin mrunalp cyphar Closes #3067
This is necessary in order for runc to be able to configure device cgroups with --systemd-cgroup on distributions that have very strict SELinux policies such as openSUSE MicroOS[1]. The core issue here is that systemd is adding its own BPF policy that has an SELinux label such that runc cannot interact with it. In order to work around this, we can just ignore the policy -- in theory this behaviour is not correct but given that the most obvious case (--systemd-cgroup) will still handle updates correctly, this logic is reasonable. [1]: https://bugzilla.suse.com/show_bug.cgi?id=1182428 Fixes: ea58c90 ("cgroup2: devices: replace all existing filters when attaching") Signed-off-by: Aleksa Sarai <[email protected]>
cgroupv2: ebpf: ignore inaccessible existing programs
m.Freeze method changes m.cgroups.Resources.Freezer field, which should not be done while we're temporarily freezing the cgroup in Set. If this field is changed, and r == m.cgroups.Resources (as it often happens), this results in inability to freeze the container using Set(). To fix, add and use a method which does not change r.Freezer field. A test case for the bug will be added separately. Signed-off-by: Kir Kolyshkin <[email protected]>
Introduce freezeBeforeSet, which contains the logic of figuring out whether we need to freeze/thaw around setting systemd unit properties. In particular, if SkipDevices is set, and the current unit properties allow all devices, there is no need to freeze and thaw, as systemd won't write any device rules in this case. Signed-off-by: Kir Kolyshkin <[email protected]>
This was initially added by commit 4a45e7a because Set (with r.Freezer = Frozen) was not able to freeze a container. Now (see a few previous commits) Set can do the freeze, so the explicit Freeze is no longer needed. Signed-off-by: Kir Kolyshkin <[email protected]>
TestPodSkipDevicesUpdate checks that updating a pod having SkipDevices: true does not result in spurious "permission denied" errors in a container running under the pod. The test is somewhat similar in nature to the @test "update devices [minimal transition rules]" in tests/integration, but uses a pod. This tests the validity of freezeBeforeSet in v1. Signed-off-by: Kir Kolyshkin <[email protected]>
cgroups: Set: fix freeze, avoid unnecessary freeze from systemd v1
Signed-off-by: Kir Kolyshkin <[email protected]>
This simplifies the code as no explicit cleanup is required. Signed-off-by: Kir Kolyshkin <[email protected]>
Replace ioutil.TempDir (mostly) with t.TempDir, which require no explicit cleanup. While at it, fix incorrect usage of os.ModePerm in libcontainer/intelrdt test. This is supposed to be a mask, not mode bits. Signed-off-by: Kir Kolyshkin <[email protected]>
Drop Go 1.13 support, require go 1.15+
This is just moving the code around to ease the code review, no other changes. Signed-off-by: Kir Kolyshkin <[email protected]>
Since every cgroup directory is guaranteed to have cgroup.procs file, we don't have to do filename comparison in GetAllPids() and just read cgroup.procs in every directory. While at it, switch readProcsFile to use our own OpenFile. Signed-off-by: Kir Kolyshkin <[email protected]>
filepath.WalkDir function, introduced in Go 1.16, doesn't do stat(2) on every entry, and is therefore somewhat faster (see below). Since we have to support Go 1.15, keep the old version for backward compatibility. Add a quick benchmark, which shows approximately 3x improvement: $ go1.15.15 test -bench AllPid -run xxx . BenchmarkGetAllPids-4 48 23528839 ns/op $ go version go version go1.16.6 linux/amd64 $ go test -bench AllPid -run xxx . BenchmarkGetAllPids-4 147 7700170 ns/op (Unrelated but worth noting -- go 1.17rc2 is pushing it even further) $ go1.17rc2 test -bench AllPid -run xxx . BenchmarkGetAllPids-4 164 6820994 ns/op Signed-off-by: Kir Kolyshkin <[email protected]>
This was initially added by commits 41fdf95 and 950f2c6, apparently to implement docker run --cgroup container:ID, which was never merged. Therefore, this code is not and was never used. It needs to be removed mainly because having it makes it much harder to understand how cgroup manager works (because with this in place we have not one or two but three sets of cgroup paths to think about). Note if the paths are known and there is a need to add a PID to existing cgroup, cgroup manager is not needed at all -- something like cgroups.WriteCgroupProc or cgroups.EnterPid is sufficient (and the latter is what runc exec uses in (*setnsProcess).start). Signed-off-by: Kir Kolyshkin <[email protected]>
Fix reading cgroup files from the top cgroup directory, i.e. /sys/fs/cgroup. The code was working for for any subdirectory of /sys/fs/cgroup, but for dir="/sys/fs/cgroup" a fallback (open and fstatfs) was used, because of the way the function worked with the dir argument. Fix those cases, and add unit tests to make sure they work. While at it, make the rules for dir and name components more relaxed, and add test cases for this, too. While at it, improve OpenFile documentation, and remove a duplicated doc comment for openFile. Without these fixes, the unit test fails the following cases: file_test.go:67: case {path:/sys/fs/cgroup name:cgroup.controllers}: fallback file_test.go:67: case {path:/sys/fs/cgroup/ name:cgroup.controllers}: openat2 /sys/fs/cgroup//cgroup.controllers: invalid cross-device link file_test.go:67: case {path:/sys/fs/cgroup/ name:/cgroup.controllers}: openat2 /sys/fs/cgroup///cgroup.controllers: invalid cross-device link file_test.go:67: case {path:/ name:/sys/fs/cgroup/cgroup.controllers}: fallback file_test.go:67: case {path:/ name:sys/fs/cgroup/cgroup.controllers}: fallback file_test.go:67: case {path:/sys/fs/cgroup/cgroup.controllers name:}: openat2 /sys/fs/cgroup/cgroup.controllers/: not a directory Here "fallback" means openat2-based implementation fails, and the fallback code is used (and works). Signed-off-by: Kir Kolyshkin <[email protected]>
Kir Kolyshkin (3): libct/cg: GetAllPids: optimize for go 1.16+ libct/cg: improve GetAllPids and readProcsFile libct/cg: move GetAllPids out of utils.go LGTMs: AkihiroSuda cyphar Closes #3133
opencontainers/runc issue 3026 describes a scenario in which OpenFile failed to open a legitimate existing cgroupfs file. Added debug (similar to what this commit does) shown that cgroupFd is no longer opened to "/sys/fs/cgroup", but to "/" (it's not clear what caused it, and the source code is not available, but they might be using the same process on the both sides of the container/chroot/pivot_root/mntns boundary, or remounting /sys/fs/cgroup). Consider such use incorrect, but give a helpful hint as two what is going on by wrapping the error in a more useful message. NB: this can potentially be fixed by reopening the cgroupFd once we detected that it's screwed, and retrying openat2. Alas I do not have a test case for this, so left this as a TODO suggestion. Signed-off-by: Kir Kolyshkin <[email protected]>
Possibly there was a specific reason to use a rune for this, but I noticed that there's various parts in the code that has to convert values from a string to this type. Using a string as type for this can simplify some of that code. Signed-off-by: Sebastiaan van Stijn <[email protected]>
The two exceptions I had to add to codespellrc are: - CLOS (used by intelrtd); - creat (syscall name used in tests/integration/testdata/seccomp_*.json). Signed-off-by: Kir Kolyshkin <[email protected]>
This fixes the behavior intended to avoid freezing containers/control groups without it being necessary. This is important for end users of libcontainer who rely on the behavior of no freeze. The previous implementation would always get error trying to get DevicePolicy from the Unit via dbus, since the Unit interface doesn't contain DevicePolicy. Signed-off-by: Odin Ugedal <[email protected]>
1. GetCgroupParamUint: drop strings.TrimSpace since it was already done by GetCgroupParamString. 2. GetCgroupParamInt: use GetCgroupParamString, drop strings.TrimSpace. Signed-off-by: Kir Kolyshkin <[email protected]>
Using strings.CutPrefix (available since Go 1.20) instead of strings.HasPrefix and/or strings.TrimPrefix makes the code a tad more straightforward. No functional change. Signed-off-by: Kir Kolyshkin <[email protected]>
We were using utils.ProcThreadSelf since commit 29aafee, which provides two things: 1. locking the OS tread; 2. fallback to /proc/self/task/$TID when /proc/thread-self is not available (kernel < 3.17). Now, (1) is not needed since we only call readlink and not perform any file data operation, and (2) is not needed here as this code is only running when openat2 syscall is available, meaning kernel >= v5.6. Also, check the error from readlink, so when it fails, we do not try to enhance the error message. Signed-off-by: Kir Kolyshkin <[email protected]>
The _defaultDirPath was only used for testing, and the test case is quite easy to adopt to defaultDirPath. Signed-off-by: Kir Kolyshkin <[email protected]>
The code which determines inner cgroup path from cgroup config is identical in fs and fs2 drivers, and it is using utils.CleanPath. In preparation to move libcontainer/cgroups to a separate repo, we have to get rid of libcontainer/utils dependency. So, - copy the utils.CleanPath implementation to internal/path; - consolidate the two innerPath implementations to internal/path. Signed-off-by: Kir Kolyshkin <[email protected]>
Instead, we can just do filepath.Clean("/"+path) here. While at it, add a comment telling why this is needed and important. Signed-off-by: Kir Kolyshkin <[email protected]>
This was needed for a test case only, but we can easily copy the data needed. The alternatives are: - keep things as is (and have cgroups depend on runc/libcontainer/specconv); - remove this test case; - move AllowedDevices to cgroups/devices/config. Signed-off-by: Kir Kolyshkin <[email protected]>
libct/cg: eliminate remaining libct deps
Tomasz Duda (1): support cgroup v1 mounted with noprefix LGTMs: kolyshkin cyphar
This modifies imports to reflect the move of github.com/opencontainers/runc/libcontainer/cgroups to github.com/opencontainers/cgroups Generated via find . -type f -name "*.go" -exec sed -i 's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' {} + Signed-off-by: Kir Kolyshkin <[email protected]>
Generated by: go mod init github.com/opencontainers/cgroups go mod edit -go=1.23.0 go mod tidy Signed-off-by: Kir Kolyshkin <[email protected]>
This is mostly result of the following command: git filter-repo --path libcontainer/cgroups/ --path-rename libcontainer/cgroups/: Signed-off-by: Kir Kolyshkin <[email protected]>
The list is taken from https://github.com/opencontainers/tob/blob/main/proposals/cgroups.md Signed-off-by: Kir Kolyshkin <[email protected]>
I feel like this could've been force-pushed to remove the fake template history, but I'm okay either way. |
@@ -0,0 +1,78 @@ | |||
package cgroups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way we can maintain git history while also adding this to pkg/cgroups
directory? it feels to me as being more idomatic modern go. I'm not terribly attached to it though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a way, but this repo is called github.com/opencontainers/cgroups, so what could be done is something like github.com/opencontainers/cgroups/pkg/cgroups which does not look good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I was unsure about that too. fine with me
@AkihiroSuda @mrunalp @thaJeztah @odinuge @dims PTAL. I'm going to merge it tomorrow or so (and address any concerns as followups), but would like more input before the merge if at all possible. |
LGTM |
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
post-merge LGTM (sorry, overflowing inbox and fighting some small fires here and there 😂)
This is mostly result of the following command:
See a few recent commits for some more details. The rest is exported from runc.
This is a step to implement https://github.com/opencontainers/tob/blob/main/proposals/cgroups.md, and is a part of opencontainers/runc#4618
ℹ️ this is being tested in opencontainers/runc#4638