Cache and print devices for debugging future outages #2097
/ok-to-test
changes and printing the full list. example: periodic symlink cache read: /dev/disk/by-id/google-persistent-disk-0 -> /dev/sda; /dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb; /dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda; /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb
// TODO(juliankatz): To have certainty this works for all edge cases, we
// need to test this with a manually partitioned disk.
if partitionNameRegex.MatchString(entry.Name()) {
For my understanding, could you provide more context on why we skip partitions?
For the boot disks there will be something like 12 partitions. Those end up being noise when we are answering the question of "which PV maps to which path on the VM".
Add it as a comment in the code to make it easier for future readers
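The partition filter discussed in this thread can be sketched as follows. This is a minimal illustration, not the PR's code: the pattern is copied from the PR description, and `filterPartitions` is a hypothetical helper introduced only for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// partitionNameRegex matches by-id entries that name a partition, such as
// "google-persistent-disk-0-part1". The pattern comes from the PR
// description; the regex actually used in the PR may differ.
var partitionNameRegex = regexp.MustCompile(`-part[0-9]*$`)

// filterPartitions is a hypothetical helper that drops partition entries.
// Boot disks expose many partitions, which are noise when answering
// "which PV maps to which path on the VM".
func filterPartitions(names []string) []string {
	var kept []string
	for _, name := range names {
		if partitionNameRegex.MatchString(name) {
			continue
		}
		kept = append(kept, name)
	}
	return kept
}

func main() {
	names := []string{
		"google-persistent-disk-0",
		"google-persistent-disk-0-part1",
		"google-persistent-disk-0-part14",
		"scsi-0Google_PersistentDisk_persistent-disk-0",
	}
	for _, name := range filterPartitions(names) {
		fmt.Println(name)
	}
}
```

Note that `[0-9]*` also matches a name ending in a bare `-part` with no digits, which is one of the edge cases the TODO about manually partitioned disks would need to cover.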
return fmt.Errorf("failed to read directory %s: %w", l.dir, err)
}

var errs []error
Maybe split out the rest of the function and make unit tests, so that you avoid dependency on os.ReadDir.
Turns out you can just make a mock filesystem. Added unit test with that.
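For readers unfamiliar with the approach, one way to test directory listing without touching `os.ReadDir` is to accept an `fs.FS` and pass an in-memory filesystem. The sketch below assumes the standard library's `testing/fstest.MapFS`; the PR may use a different mock. `listDiskNames` and `mockByID` are hypothetical names, and plain in-memory files stand in for symlinks, so link resolution is not exercised here.

```go
package main

import (
	"fmt"
	"io/fs"
	"regexp"
	"testing/fstest"
)

// Pattern taken from the PR description; assumed, not confirmed.
var partitionNameRegex = regexp.MustCompile(`-part[0-9]*$`)

// listDiskNames reads dir from any fs.FS and skips partition entries.
// Taking an fs.FS lets a test supply an in-memory filesystem.
func listDiskNames(fsys fs.FS, dir string) ([]string, error) {
	entries, err := fs.ReadDir(fsys, dir)
	if err != nil {
		return nil, fmt.Errorf("failed to read directory %s: %w", dir, err)
	}
	var names []string
	for _, entry := range entries {
		if partitionNameRegex.MatchString(entry.Name()) {
			continue
		}
		names = append(names, entry.Name())
	}
	return names, nil
}

// mockByID builds a hypothetical in-memory stand-in for /dev/disk/by-id.
// fs.FS paths are relative, so there is no leading slash.
func mockByID() fstest.MapFS {
	return fstest.MapFS{
		"dev/disk/by-id/google-persistent-disk-0":                            &fstest.MapFile{},
		"dev/disk/by-id/google-persistent-disk-0-part1":                      &fstest.MapFile{},
		"dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0":       &fstest.MapFile{},
		"dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0-part1": &fstest.MapFile{},
	}
}

func main() {
	names, err := listDiskNames(mockByID(), "dev/disk/by-id")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```

In production the same function could be called with `os.DirFS("/")`, while a unit test passes the map-backed filesystem.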
/lgtm
}
}

func (l *ListingCache) listAndUpdate() error {
How do normal attach/detach flows show up with this? When we log, do we have a way to distinguish a normal attach/detach sequence from a device mapping that unexpectedly changes post-mount?
I'm concerned about normal flows creating too much noise in the logs and drowning out the real issues.
What type of PR is this?
/kind cleanup # Is this right??
What this PR does / why we need it:
This PR adds a cache that periodically (currently every minute) looks at the /dev/disk/by-id/ directory and evaluates the symlinks there. It maintains a cache of each symlink and the real path it points to.
This will help with debugging future filesystem issues. In a past OMG, we found that our lack of insight into changes in symlinks for specific disks hampered our ability to debug. Logging marked the real path of the disk at mount and unmount, but a change in between couldn't be detected.
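The periodic behavior described above might look roughly like the loop below. This is a sketch under assumptions rather than the PR's implementation: `runPeriodically` and `refreshUntilCancelled` are hypothetical names, the refresh runs once immediately before the first tick, and errors are printed rather than sent through the driver's logger.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodically calls refresh once immediately and then again on every
// tick until ctx is cancelled. A failed refresh is reported but does not
// stop the loop, so a transient read error cannot silence later reads.
func runPeriodically(ctx context.Context, interval time.Duration, refresh func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := refresh(); err != nil {
			fmt.Printf("periodic symlink cache read failed: %v\n", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// refreshUntilCancelled is a small demo: it runs the loop with a short
// interval, cancels after n refreshes, and returns how many refreshes ran.
func refreshUntilCancelled(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	runPeriodically(ctx, 10*time.Millisecond, func() error {
		calls++
		if calls >= n {
			cancel()
		}
		return nil
	})
	return calls
}

func main() {
	fmt.Println("refreshes:", refreshUntilCancelled(3))
}
```

In the driver, the interval would be one minute and the refresh function would list /dev/disk/by-id/ and update the cache.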
This PR will print those links every minute, also logging when elements of the cache change.
An example:
The cache will also note if a symlink is broken.
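A rough sketch of how such a cache could resolve links, flag broken ones, and report changes between two reads follows. All names here (`readLinks`, `diffLinks`, `formatLinks`, `brokenTarget`) are hypothetical and the message wording is invented for illustration; only the "link -> target; ..." layout follows the periodic log line quoted in the commit message.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// brokenTarget is a hypothetical marker for links that cannot be resolved.
const brokenTarget = "<broken>"

// readLinks maps every entry in dir to the real path it points to.
func readLinks(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, fmt.Errorf("failed to read directory %s: %w", dir, err)
	}
	links := make(map[string]string, len(entries))
	for _, entry := range entries {
		path := filepath.Join(dir, entry.Name())
		target, err := filepath.EvalSymlinks(path)
		if err != nil {
			target = brokenTarget // e.g. a dangling symlink
		}
		links[path] = target
	}
	return links, nil
}

// diffLinks returns one message per link that was added, removed, or
// retargeted between two reads, sorted for stable output.
func diffLinks(old, cur map[string]string) []string {
	var changes []string
	for link, target := range cur {
		prev, ok := old[link]
		switch {
		case !ok:
			changes = append(changes, fmt.Sprintf("added: %s -> %s", link, target))
		case prev != target:
			changes = append(changes, fmt.Sprintf("changed: %s -> %s (was %s)", link, target, prev))
		}
	}
	for link, target := range old {
		if _, ok := cur[link]; !ok {
			changes = append(changes, fmt.Sprintf("removed: %s -> %s", link, target))
		}
	}
	sort.Strings(changes)
	return changes
}

// formatLinks renders the whole cache as "link -> target; link -> target".
func formatLinks(links map[string]string) string {
	parts := make([]string, 0, len(links))
	for link, target := range links {
		parts = append(parts, link+" -> "+target)
	}
	sort.Strings(parts)
	return strings.Join(parts, "; ")
}

func main() {
	old := map[string]string{"/dev/disk/by-id/google-persistent-disk-0": "/dev/sda"}
	cur := map[string]string{"/dev/disk/by-id/google-persistent-disk-0": "/dev/sdb"}
	for _, change := range diffLinks(old, cur) {
		fmt.Println(change)
	}
	// On a machine without /dev/disk/by-id this prints the read error.
	if links, err := readLinks("/dev/disk/by-id"); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("periodic symlink cache read:", formatLinks(links))
	}
}
```

Logging only the diff on change, alongside the full list every minute, keeps an unexpected retargeting visible next to routine attach and detach events.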
NOTE: Currently this filters out anything in by-id/ that ends with -part[0-9]*$. This removes partitions, which are noise. Mounting partitions directly isn't well supported in GKE, but we may want to test that in the future.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: