Skip to content

cgroupv2 support #140

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 24, 2024
Merged

cgroupv2 support #140

merged 3 commits into from
Sep 24, 2024

Conversation

ehashman
Copy link
Contributor

@ehashman ehashman commented Aug 6, 2024

Closes #138.

Goals:

  • don't get rid of cgroupv1 support
  • add support for cgroupv2 for all existing metrics pulled out of cgroups
  • update the direct cgroup walking logic, the ctrstats code paths, and the prometheus metrics
    • as of 09/04, this PR updates the cgroup walking logic and ctrstats code paths but not the prometheus metrics (yet)

Design:

Details can be found at #138 (comment)

Summary of changes:

  • cgroup tree is now unified, no need to specify separate paths for cpu, cpuacct, memory, etc.
  • CPU:
    • cpuacct is gone
    • All cpu metrics can be gathered from cpu.stat file
    • cgroupv2 cpu metrics are now in microseconds, not nanoseconds
  • Memory:
    • An accurate measure of current memory usage is available at memory.current (unlike memory.usage_in_bytes on v1 which was an estimate)
    • We can now get all OOM information out of memory.events
    • If we need RSS, we can use anon field out of memory.stat file
  • For UMC, we need to update both the cgroup-walking code (this PR) and the ctrstats code that uses the CRI library (possibly a follow-up PR)

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 6, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ehashman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from campoy and ndipebot August 6, 2024 20:03
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 6, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 13, 2024
@ehashman ehashman marked this pull request as ready for review August 13, 2024 21:10
@ehashman
Copy link
Contributor Author

Taking this out of draft so CI runs.

I was able to get the basic cgroupv2 integration test (copied from the "basic" test) to mostly pass. There was a little bit of weirdness in the output values that I am not sure if I should just update the test case for or if there's an actual problem.

I noticed min.slice on cgroupv1 had a -1 value in the input samples. This did not work with cgroupv2 so I changed those results to 0s, but the test data isn't quite as expected. I'm not sure if there is a special meaning to the -1 CPU usage on cgroupv1.

Otherwise, this should not break any cgroupv1 functionality. I will work on ctrstats next but that should be more of a trivial change, so if folks want to take a look at this now, it should be ready for initial review.

I will also edit my first comment with the dictionary for updating v1 to v2.

@ehashman
Copy link
Contributor Author

/retest-required

flake #126

@ehashman ehashman changed the title [WIP] cgroupv2 support cgroupv2 support Aug 26, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 26, 2024
@ehashman
Copy link
Contributor Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 26, 2024
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 4, 2024
@ehashman
Copy link
Contributor Author

/retest-required

flake #126

@ehashman
Copy link
Contributor Author

/retest-required

@ehashman ehashman merged commit c46a044 into kubernetes-sigs:main Sep 24, 2024
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add cgroupv2 support
2 participants