Skip to content

Export libcontainer/cgroups from runc #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 976 commits into from
Feb 28, 2025
Merged

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Feb 26, 2025

This is mostly result of the following command:

        git filter-repo --path libcontainer/cgroups/ --path-rename libcontainer/cgroups/:

See a few recent commits for some more details. The rest is exported from runc.

This is a step to implement https://github.com/opencontainers/tob/blob/main/proposals/cgroups.md, and is a part of opencontainers/runc#4618

ℹ️ this is being tested in opencontainers/runc#4638

kolyshkin and others added 30 commits June 22, 2021 16:09
Signed-off-by: Kir Kolyshkin <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
Kir Kolyshkin (24):
  *: stop using pkg/errors
  libct/cg: stop using pkg/errors
  libct/cg/ebpf: stop using pkg/errors
  libct/cg/devices: stop using pkg/errors
  .golangci.yml: enable errorlint
  *: ignore errorlint warnings about unix.* errors
  *: use errors.As and errors.Is
  tty.go: don't use pkg/errors, use errors.Is
  libct/keys: stop using pkg/errors
  libct: fix errorlint warning about strconv.NumError
  *: fmt.Errorf: use %w when appropriate
  libct/rootfs: improve some errors
  libct: wrap unix.Mount/Unmount errors
  libct/cg/fs2: fix/unify parsing errors
  libct/cg/fs: fix/unify parsing errors
  libct/cg/fscommon: introduce and use ParseError
  libct/cg/fs[2]: simplify getting pid stats
  libct/cg/fs/stats_util_test: fix errors
  libct/StartInitialization: fix errors
  libct/cg/fs/*_test: simplify errors
  ...

LGTMs: AkihiroSuda cyphar
Closes #3011
Since device updates in cgroup v2 are atomic for systemd, there is no
need to freeze the processes before running the updates.

Signed-off-by: Odin Ugedal <[email protected]>
If a control group is frozen, all its descendants will report FROZEN
in freezer.state cgroup file.

OTOH cgroup v2 cgroup.freeze is not reporting the cgroup as frozen
unless it is frozen directly (i.e. not via an ancestor).

Fix the discrepancy between v1 and v2 drivers behavior by
looking into freezer.self_freezing cgroup file, which, according
to kernel documentation, will show 1 iff the cgroup was frozen directly.

Co-authored-by: Kir Kolyshkin <[email protected]>
Signed-off-by: Odin Ugedal <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
This test the issues fixed by the two preceding commits.

Co-Authored-By: Kir Kolyshkin <[email protected]>
Signed-off-by: Odin Ugedal <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
ci: enable unconvert linter, fix its warnings
Odin Ugedal (2):
  libct/cg/sd: Don't freeze cgroup on cgroup v2 Set
  Update device update tests

LGTMs: kolyshkin mrunalp cyphar
Closes #3067
This is necessary in order for runc to be able to configure device
cgroups with --systemd-cgroup on distributions that have very strict
SELinux policies such as openSUSE MicroOS[1].

The core issue here is that systemd is adding its own BPF policy that
has an SELinux label such that runc cannot interact with it. In order to
work around this, we can just ignore the policy -- in theory this
behaviour is not correct but given that the most obvious case
(--systemd-cgroup) will still handle updates correctly, this logic is
reasonable.

[1]: https://bugzilla.suse.com/show_bug.cgi?id=1182428

Fixes: ea58c90 ("cgroup2: devices: replace all existing filters when attaching")
Signed-off-by: Aleksa Sarai <[email protected]>
cgroupv2: ebpf: ignore inaccessible existing programs
m.Freeze method changes m.cgroups.Resources.Freezer field, which should
not be done while we're temporarily freezing the cgroup in Set. If this
field is changed, and r == m.cgroups.Resources (as it often happens),
this results in inability to freeze the container using Set().

To fix, add and use a method which does not change r.Freezer field.

A test case for the bug will be added separately.

Signed-off-by: Kir Kolyshkin <[email protected]>
Introduce freezeBeforeSet, which contains the logic of figuring out
whether we need to freeze/thaw around setting systemd unit properties.

In particular, if SkipDevices is set, and the current unit properties
allow all devices, there is no need to freeze and thaw, as systemd
won't write any device rules in this case.

Signed-off-by: Kir Kolyshkin <[email protected]>
This was initially added by commit 4a45e7a because Set (with
r.Freezer = Frozen) was not able to freeze a container.

Now (see a few previous commits) Set can do the freeze, so the explicit
Freeze is no longer needed.

Signed-off-by: Kir Kolyshkin <[email protected]>
TestPodSkipDevicesUpdate checks that updating a pod having SkipDevices: true
does not result in spurious "permission denied" errors in a container
running under the pod. The test is somewhat similar in nature to the
@test "update devices [minimal transition rules]" in tests/integration,
but uses a pod.

This tests the validity of freezeBeforeSet in v1.

Signed-off-by: Kir Kolyshkin <[email protected]>
cgroups: Set: fix freeze, avoid unnecessary freeze from systemd v1
This simplifies the code as no explicit cleanup is required.

Signed-off-by: Kir Kolyshkin <[email protected]>
Replace ioutil.TempDir (mostly) with t.TempDir, which require no
explicit cleanup.

While at it, fix incorrect usage of os.ModePerm in libcontainer/intelrdt
test. This is supposed to be a mask, not mode bits.

Signed-off-by: Kir Kolyshkin <[email protected]>
Drop Go 1.13 support, require go 1.15+
This is just moving the code around to ease the code review, no other
changes.

Signed-off-by: Kir Kolyshkin <[email protected]>
Since every cgroup directory is guaranteed to have cgroup.procs file,
we don't have to do filename comparison in GetAllPids() and just read
cgroup.procs in every directory.

While at it, switch readProcsFile to use our own OpenFile.

Signed-off-by: Kir Kolyshkin <[email protected]>
filepath.WalkDir function, introduced in Go 1.16, doesn't do stat(2)
on every entry, and is therefore somewhat faster (see below).

Since we have to support Go 1.15, keep the old version for backward
compatibility.

Add a quick benchmark, which shows approximately 3x improvement:

        $ go1.15.15 test -bench AllPid -run xxx .
	BenchmarkGetAllPids-4   	      48	  23528839 ns/op

        $ go version
        go version go1.16.6 linux/amd64
        $ go test -bench AllPid -run xxx .
	BenchmarkGetAllPids-4   	     147	   7700170 ns/op

(Unrelated but worth noting -- go 1.17rc2 is pushing it even further)

        $ go1.17rc2 test -bench AllPid -run xxx .
	BenchmarkGetAllPids-4   	     164	   6820994 ns/op

Signed-off-by: Kir Kolyshkin <[email protected]>
This was initially added by commits 41fdf95 and 950f2c6,
apparently to implement docker run --cgroup container:ID, which was
never merged. Therefore, this code is not and was never used.

It needs to be removed mainly because having it makes it much harder to
understand how cgroup manager works (because with this in place we have
not one or two but three sets of cgroup paths to think about).

Note if the paths are known and there is a need to add a PID to existing
cgroup, cgroup manager is not needed at all -- something like
cgroups.WriteCgroupProc or cgroups.EnterPid is sufficient (and the
latter is what runc exec uses in (*setnsProcess).start).

Signed-off-by: Kir Kolyshkin <[email protected]>
Fix reading cgroup files from the top cgroup directory, i.e.
/sys/fs/cgroup.

The code was working for for any subdirectory of /sys/fs/cgroup, but
for dir="/sys/fs/cgroup" a fallback (open and fstatfs) was used, because
of the way the function worked with the dir argument.

Fix those cases, and add unit tests to make sure they work. While at it,
make the rules for dir and name components more relaxed, and add test
cases for this, too.

While at it, improve OpenFile documentation, and remove a duplicated
doc comment for openFile.

Without these fixes, the unit test fails the following cases:

    file_test.go:67: case {path:/sys/fs/cgroup name:cgroup.controllers}: fallback
    file_test.go:67: case {path:/sys/fs/cgroup/ name:cgroup.controllers}: openat2 /sys/fs/cgroup//cgroup.controllers: invalid cross-device link
    file_test.go:67: case {path:/sys/fs/cgroup/ name:/cgroup.controllers}: openat2 /sys/fs/cgroup///cgroup.controllers: invalid cross-device link
    file_test.go:67: case {path:/ name:/sys/fs/cgroup/cgroup.controllers}: fallback
    file_test.go:67: case {path:/ name:sys/fs/cgroup/cgroup.controllers}: fallback
    file_test.go:67: case {path:/sys/fs/cgroup/cgroup.controllers name:}: openat2 /sys/fs/cgroup/cgroup.controllers/: not a directory

Here "fallback" means openat2-based implementation fails, and the fallback code
is used (and works).

Signed-off-by: Kir Kolyshkin <[email protected]>
Kir Kolyshkin (3):
  libct/cg: GetAllPids: optimize for go 1.16+
  libct/cg: improve GetAllPids and readProcsFile
  libct/cg: move GetAllPids out of utils.go

LGTMs: AkihiroSuda cyphar
Closes #3133
opencontainers/runc issue 3026 describes a scenario in which OpenFile
failed to open a legitimate existing cgroupfs file. Added debug
(similar to what this commit does) shown that cgroupFd is no longer
opened to "/sys/fs/cgroup", but to "/" (it's not clear what caused it,
and the source code is not available, but they might be using the same
process on the both sides of the container/chroot/pivot_root/mntns
boundary, or remounting /sys/fs/cgroup).

Consider such use incorrect, but give a helpful hint as two what is
going on by wrapping the error in a more useful message.

NB: this can potentially be fixed by reopening the cgroupFd once we
detected that it's screwed, and retrying openat2. Alas I do not have
a test case for this, so left this as a TODO suggestion.

Signed-off-by: Kir Kolyshkin <[email protected]>
Possibly there was a specific reason to use a rune for this, but I noticed
that there's various parts in the code that has to convert values from a
string to this type. Using a string as type for this can simplify some of
that code.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
The two exceptions I had to add to codespellrc are:
 - CLOS (used by intelrtd);
 - creat (syscall name used in tests/integration/testdata/seccomp_*.json).

Signed-off-by: Kir Kolyshkin <[email protected]>
This fixes the behavior intended to avoid freezing containers/control
groups without it being necessary. This is important for end users of
libcontainer who rely on the behavior of no freeze.

The previous implementation would always get error trying to get
DevicePolicy from the Unit via dbus, since the Unit interface doesn't
contain DevicePolicy.

Signed-off-by: Odin Ugedal <[email protected]>
kolyshkin and others added 13 commits February 6, 2025 19:42
1. GetCgroupParamUint: drop strings.TrimSpace since it was already
   done by GetCgroupParamString.

2. GetCgroupParamInt: use GetCgroupParamString, drop strings.TrimSpace.

Signed-off-by: Kir Kolyshkin <[email protected]>
Using strings.CutPrefix (available since Go 1.20) instead of
strings.HasPrefix and/or strings.TrimPrefix makes the code
a tad more straightforward.

No functional change.

Signed-off-by: Kir Kolyshkin <[email protected]>
We were using utils.ProcThreadSelf since commit 29aafee,
which provides two things:
1. locking the OS tread;
2. fallback to /proc/self/task/$TID when /proc/thread-self
   is not available (kernel < 3.17).

Now, (1) is not needed since we only call readlink and not perform any
file data operation, and (2) is not needed here as this code is
only running when openat2 syscall is available, meaning kernel >= v5.6.

Also, check the error from readlink, so when it fails, we do not try to
enhance the error message.

Signed-off-by: Kir Kolyshkin <[email protected]>
The _defaultDirPath was only used for testing, and the test case
is quite easy to adopt to defaultDirPath.

Signed-off-by: Kir Kolyshkin <[email protected]>
The code which determines inner cgroup path from cgroup config is
identical in fs and fs2 drivers, and it is using utils.CleanPath.

In preparation to move libcontainer/cgroups to a separate repo,
we have to get rid of libcontainer/utils dependency. So,
 - copy the utils.CleanPath implementation to internal/path;
 - consolidate the two innerPath implementations to internal/path.

Signed-off-by: Kir Kolyshkin <[email protected]>
Instead, we can just do filepath.Clean("/"+path) here.

While at it, add a comment telling why this is needed and important.

Signed-off-by: Kir Kolyshkin <[email protected]>
This was needed for a test case only, but we can easily copy the data
needed.

The alternatives are:
 - keep things as is (and have cgroups depend on
   runc/libcontainer/specconv);
 - remove this test case;
 - move AllowedDevices to cgroups/devices/config.

Signed-off-by: Kir Kolyshkin <[email protected]>
libct/cg: eliminate remaining libct deps
Tomasz Duda (1):
  support cgroup v1 mounted with noprefix

LGTMs: kolyshkin cyphar
This modifies imports to reflect the move of
 github.com/opencontainers/runc/libcontainer/cgroups
to
 github.com/opencontainers/cgroups

Generated via

  find . -type f -name "*.go" -exec sed -i     's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g'     {} +

Signed-off-by: Kir Kolyshkin <[email protected]>
Generated by:

    go mod init github.com/opencontainers/cgroups
    go mod edit -go=1.23.0
    go mod tidy

Signed-off-by: Kir Kolyshkin <[email protected]>
This is mostly result of the following command:

	git filter-repo --path libcontainer/cgroups/ --path-rename libcontainer/cgroups/:

Signed-off-by: Kir Kolyshkin <[email protected]>
@cyphar
Copy link
Member

cyphar commented Feb 26, 2025

I feel like this could've been force-pushed to remove the fake template history, but I'm okay either way.

@@ -0,0 +1,78 @@
package cgroups
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way we can maintain git history while also adding this to pkg/cgroups directory? it feels to me as being more idomatic modern go. I'm not terribly attached to it though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a way, but this repo is called github.com/opencontainers/cgroups, so what could be done is something like github.com/opencontainers/cgroups/pkg/cgroups which does not look good to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was unsure about that too. fine with me

@kolyshkin
Copy link
Contributor Author

@AkihiroSuda @mrunalp @thaJeztah @odinuge @dims PTAL.

I'm going to merge it tomorrow or so (and address any concerns as followups), but would like more input before the merge if at all possible.

@haircommander
Copy link
Contributor

LGTM

@cyphar cyphar merged commit afa9f8f into opencontainers:main Feb 28, 2025
1 check passed
@odinuge
Copy link
Contributor

odinuge commented Feb 28, 2025

LGTM

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post-merge LGTM (sorry, overflowing inbox and fighting some small fires here and there 😂)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.