Cache and print devices for debugging future outages #2097

Open
wants to merge 7 commits into
base: master
from

Conversation

julianKatz
Contributor

@julianKatz julianKatz commented May 21, 2025

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug

/kind cleanup # Is this right??

/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

This PR adds a cache that periodically (currently every minute) reads the /dev/disk/by-id/ directory and evaluates the symlinks there. It maintains a mapping from each symlink to the real path it points to.

This will help with debugging future filesystem issues. In a past OMG, we found that our lack of insight into symlink changes for specific disks hampered our ability to debug. Logging recorded the real path of the disk at mount and unmount, but any change in between couldn't be detected.

This PR prints those links every minute and also logs whenever an entry in the cache changes.

An example:

I0521 00:23:51.418261      12 cache.go:62] periodic symlink cache read: /dev/disk/by-id/google-persistent-disk-0 -> /dev/sda; /dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb; /dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda; /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb

The cache will also note if a symlink is broken.

NOTE: Currently this filters out anything in by-id/ whose name matches -part[0-9]*$. This removes partitions, which are noise. Mounting partitions directly isn't well supported in GKE, but we may want to test that in the future.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

None

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) and do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels.) labels May 21, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: julianKatz
Once this PR has been reviewed and has the lgtm label, please assign saikat-royc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) label May 21, 2025
@k8s-ci-robot requested a review from tonyzhc May 21, 2025 00:29
@k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label May 21, 2025
@julianKatz force-pushed the logs-for-device-mappings branch from 3a8aba1 to 8f34a1f May 21, 2025 18:01
@k8s-ci-robot added the release-note-none (Denotes a PR that doesn't merit a release note.) label and removed the do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels.) label May 21, 2025
@julianKatz
Contributor Author

/ok-to-test

@k8s-ci-robot added the ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.) label May 21, 2025
changes and printing the full list.

example: periodic symlink cache read:
/dev/disk/by-id/google-persistent-disk-0 -> /dev/sda;
/dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb;
/dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda;
/dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb
@julianKatz marked this pull request as ready for review May 21, 2025 20:49
@k8s-ci-robot removed the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) label May 21, 2025
@julianKatz changed the title from "Cache and print devices" to "Cache and print devices for debugging future outages" May 21, 2025

// TODO(juliankatz): To have certainty this works for all edge cases, we
// need to test this with a manually partitioned disk.
if partitionNameRegex.MatchString(entry.Name()) {
Contributor

For my understanding, could you provide more context on why we skip partitions?

Contributor Author

For the boot disks there will be something like 12 partitions. Those end up being noise when we are answering the question of "which PV maps to which path on the VM".

Contributor

Add it as a comment in the code to make it easier for future readers

return fmt.Errorf("failed to read directory %s: %w", l.dir, err)
}

var errs []error
Contributor

Maybe split out the rest of the function and make unit tests, so that you avoid dependency on os.ReadDir.

Contributor Author

Turns out you can just make a mock filesystem. Added unit test with that.

@julianKatz force-pushed the logs-for-device-mappings branch from d4bde0a to e38ef72 May 21, 2025 22:33
@tonyzhc
Contributor

tonyzhc commented May 21, 2025

/lgtm

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label May 21, 2025
@k8s-ci-robot removed the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label May 21, 2025
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Contributor

k8s-ci-robot commented May 22, 2025

@julianKatz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2022 | 5330d24 | link | false | /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2022 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


}
}

func (l *ListingCache) listAndUpdate() error {
Contributor

How do normal attach/detach flows show up with this? When we log, do we have a way to distinguish between a normal attach/detach sequence and a device mapping that unexpectedly changes post-mount?

I'm concerned about normal flows creating too much noise in the logs and drowning out the real issues.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants