use xfs_repair to check and repair xfs filesystem in method formatAndMount #126

27149chen · 2019-12-06T07:40:25Z

This PR tries to fix the issue described in #125. The changes are as follows:

Move the general check-and-repair method (using fsck) into its own function.
Add a new function "checkAndRepairXfsFilesystem" to check and repair XFS filesystems.
In method formatAndMount, add a switch that selects the check function by filesystem type, and run the check tool only on formatted disks.
Fix some test cases and add some more.
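For readers following along, the switch described in the list above could look roughly like the sketch below. It is not the PR's code: `chooseChecker` and `checkFunc` are hypothetical names, and the real check functions are replaced by stand-ins so the snippet runs without touching a device.

```go
package main

import "fmt"

// checkFunc checks (and possibly repairs) the filesystem on a device.
// Hypothetical type, used only for this illustration.
type checkFunc func(device string) error

// chooseChecker picks the check routine for the filesystem that already
// exists on the disk. An empty fstype means the disk is unformatted, so no
// check tool should run and nil is returned.
func chooseChecker(fstype string, xfsCheck, genericCheck checkFunc) checkFunc {
	switch fstype {
	case "":
		return nil
	case "xfs":
		return xfsCheck
	default:
		return genericCheck
	}
}

func main() {
	// Stand-ins for the xfs_repair-based and fsck-based functions.
	xfs := func(device string) error {
		fmt.Println("would run the xfs_repair path on", device)
		return nil
	}
	generic := func(device string) error {
		fmt.Println("would run the fsck path on", device)
		return nil
	}

	for _, fstype := range []string{"", "xfs", "ext4"} {
		check := chooseChecker(fstype, xfs, generic)
		if check == nil {
			fmt.Println("unformatted disk: skipping the check")
			continue
		}
		if err := check("/dev/sdX"); err != nil {
			fmt.Println("check failed:", err)
		}
	}
}
```

Passing the two checkers in as arguments keeps the selection logic testable without executing any external command.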

Not included in this PR:

A clean log on a file system is required for xfs_repair to operate. If the file system was not cleanly unmounted, it should be mounted and unmounted prior to using xfs_repair.

References:

k8s-ci-robot · 2019-12-06T07:40:33Z

Welcome @27149chen!

It looks like this is your first PR to kubernetes/utils 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/utils has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

27149chen · 2019-12-10T07:30:03Z

/assign @jsafrane

27149chen · 2019-12-11T06:50:14Z

@jsafrane, do you have time to take a look at this PR? Thank you very much.

jsafrane · 2019-12-11T13:16:14Z

mount/mount_linux.go

+	checkArgs := []string{"-n", source}
+
+	// check-only using "xfs_repair -n", if the exit status is not 0, perform a "xfs_repair"
+	out, err := mounter.Exec.Command("xfs_repair", checkArgs...).CombinedOutput()


xfs_repair -n on an empty device searches the whole device for a superblock, which takes a really long time.

$ xfs_repair -n /dev/loop0
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
............................................................

In this case it would be better to run blkid first to check the format, and to run fsck / xfs_repair only when the right filesystem is detected.

See kubernetes/kubernetes#77982 for such attempt.
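A minimal sketch of the "blkid first" idea, assuming blkid is asked for export-style output (KEY=value lines, as printed with `blkid -o export`). The function names are hypothetical, and the blkid output is passed in as a string so the example does not need a real device.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBlkidType returns the value of the TYPE key from export-style blkid
// output, or "" when no filesystem type is reported.
func parseBlkidType(output string) string {
	for _, line := range strings.Split(output, "\n") {
		parts := strings.SplitN(strings.TrimSpace(line), "=", 2)
		if len(parts) == 2 && parts[0] == "TYPE" {
			return parts[1]
		}
	}
	return ""
}

// shouldCheck reports whether a check tool should run at all: only when a
// filesystem was actually detected on the device.
func shouldCheck(fstype string) bool {
	return fstype != ""
}

func main() {
	formatted := "DEVNAME=/dev/loop0\nTYPE=xfs\n"
	empty := ""

	for _, out := range []string{formatted, empty} {
		fstype := parseBlkidType(out)
		if !shouldCheck(fstype) {
			fmt.Println("no filesystem detected: skip fsck / xfs_repair")
			continue
		}
		fmt.Println("detected", fstype, "- safe to run the matching check tool")
	}
}
```

With this gate in place, xfs_repair -n is never pointed at an empty device, which avoids the long secondary-superblock scan shown above.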

@jsafrane, thank you for pointing that out. But if I add a GetDiskFormat call here, I think the whole mount flow will change, for example the order of operations, as in kubernetes/kubernetes#77982.
So what do you suggest? Should I merge 77982 into my change as one PR, or let 77982 merge first and then continue my work?

I would suggest contacting the original author to see if he's still interested in the PR and can re-post it in this repo. With a concrete use case where we really need to call blkid before mount, it should be easy to merge.

OK, thanks. I asked him in that PR.

Fixed: fsck/xfs_repair now run only when the device is formatted.

mount/safe_format_and_mount_test.go

27149chen · 2019-12-17T12:34:13Z

@jsafrane, I merged the latest master changes and updated the code. Please have another look, thank you very much.

mount/mount_linux.go

gnufied · 2019-12-18T03:02:07Z

mount/mount_linux.go

+		return nil
+	} else {
+		return fmt.Errorf(string(output))
+	}


I wonder if this could break existing behaviour. I am unsure about searching for the string "done" at the end of the message and using the combined output from xfs_repair. We have been burnt by such checks in the past, and the command output itself may change across versions.

I think maybe we should let CSI driver authors opt in to filesystem repair before mounting. We did not call xfs_repair before and now we do, and we are performing checks based on the output of a command (which can spew random stuff and provides no guarantees about its API). Could this cause breaking behaviour?

@gnufied, thank you for pointing that out. I checked several versions of xfs_repair and the output format is stable, so I think it is safe to search for "done" here, and I don't think it will break existing behaviour. Regarding "We have been burnt by such checks in past": do you remember the code and how you fixed it? Could you show me, so I can learn from it?

I don't think we should leave the fs check to the CSI plugins. It is a common task, and each type of filesystem has its own check/repair method. Another reason is that xfs_repair must be run on a formatted disk; if we let the CSI plugins run it, they would have to run blkid first, which is redundant.

Yes, for an XFS filesystem we did not call xfs_repair before and now we do. But it won't break anything:

If xfs_repair was already run, the filesystem is repaired before we enter mount, so xfs_repair -n will return OK and nothing will be repaired again.

If xfs_repair was not run before, we will try to repair the filesystem before mounting, which makes the mount method more robust.

What do you think?

It is okay to rely on the following outputs of a program called via exec:

the exit code

JSON or other machine-readable output, if the program supports it.

It is not okay to rely on human-readable text printed to the terminal for consumption by another program, because in that case the called program (xfs_repair here) makes no guarantees about a stable format.
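To illustrate the exit-code approach, here is a small sketch that pulls the exit code out of an error instead of inspecting printed text. It assumes the error exposes an `ExitCode() int` method, which `*exec.ExitError` from the standard `os/exec` package does; `exitCoder`, `exitCodeOf`, `fakeExitError` and `runCheck` are hypothetical names, and a fake error is used so that no external command is executed.

```go
package main

import (
	"errors"
	"fmt"
)

// exitCoder is any error that carries a process exit code.
type exitCoder interface {
	error
	ExitCode() int
}

// exitCodeOf returns the exit code carried by err. The boolean is false when
// err holds no exit code at all (for example, the program failed to start).
func exitCodeOf(err error) (int, bool) {
	if err == nil {
		return 0, true
	}
	var ec exitCoder
	if errors.As(err, &ec) {
		return ec.ExitCode(), true
	}
	return -1, false
}

// fakeExitError stands in for a real process exit error in this example.
type fakeExitError struct{ code int }

func (e fakeExitError) Error() string { return fmt.Sprintf("exit status %d", e.code) }
func (e fakeExitError) ExitCode() int { return e.code }

// runCheck simulates a check command that exits with the given code and
// wraps the failure the way a caller typically would.
func runCheck(code int) error {
	if code == 0 {
		return nil
	}
	return fmt.Errorf("running check: %w", fakeExitError{code: code})
}

func main() {
	for _, err := range []error{runCheck(0), runCheck(1), errors.New("executable not found")} {
		code, ok := exitCodeOf(err)
		fmt.Printf("code=%d known=%v\n", code, ok)
	}
}
```

The decision then depends only on the integer status, so a change in the wording of the tool's output cannot silently change behaviour.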

@gnufied, after checking the source code at git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git, I found that xfs_repair does return a non-zero status if it fails to repair the filesystem. This holds for all 4.x versions.
The code is in repair/xfs_repair.c.
In fact, the problem is the man page, which was wrong until v4.10.0 (unfortunately, the documentation I found through Google and on my CentOS 7 machine happened to be old). See the changelog of v4.10.0, which fixed the description: https://abi-laboratory.pro/?view=changelog&l=xfsprogs&v=5.0.0

So we don't need to analyze the human-readable output here; we can just check the return status.

I have attached the diff here to make it clearer. @gnufied @jsafrane

$ git diff v4.9.0 v4.10.0 man/man8/xfs_repair.8
diff --git a/man/man8/xfs_repair.8 b/man/man8/xfs_repair.8
index 1b4d9e3..85e4dc9 100644
--- a/man/man8/xfs_repair.8
+++ b/man/man8/xfs_repair.8
@@ -65,7 +65,9 @@ Forces
 .B xfs_repair
 to zero the log even if it is dirty (contains metadata changes).
 When using this option the filesystem will likely appear to be corrupt,
-and can cause the loss of user files and/or data.
+and can cause the loss of user files and/or data. See the
+.B "DIRTY LOGS"
+section for more information.
 .TP
 .BI \-l " logdev"
 Specifies the device special file where the filesystem's external
@@ -505,11 +507,38 @@ This message refers to a large directory.
 If the directory were small, the message would read "junking entry ...".
 .SH EXIT STATUS
 .B xfs_repair \-n
-(no modify node)
+(no modify mode)
 will return a status of 1 if filesystem corruption was detected and
 0 if no filesystem corruption was detected.
 .B xfs_repair
-run without the \-n option will always return a status code of 0.
+run without the \-n option will always return a status code of 0 if
+it completes without problems. If a runtime error is encountered
+during operation, it will return a status of 1. In this case,
+.B xfs_repair
+should be restarted. If
+.B xfs_repair is unable
+to proceed due to a dirty log, it will return a status of 2. See below.
+.SH DIRTY LOGS
+Due to the design of the XFS log, a dirty log can only be replayed
+by the kernel, on a machine having the same CPU architecture as the
+machine which was writing to the log.
+.B xfs_repair
+cannot replay a dirty log and will exit with a status code of 2
+when it detects a dirty log.
+.PP
+In this situation, the log can be replayed by mounting and immediately
+unmounting the filesystem on the same class of machine that crashed.
+Please make sure that the machine's hardware is reliable before
+replaying to avoid compounding the problems.
+.PP
+If mounting fails, the log can be erased by running
+.B xfs_repair
+with the -L option.
+All metadata updates in progress at the time of the crash will be lost,
+which may cause significant filesystem damage.
+This should
+.B only
+be used as a last resort.
 .SH BUGS
 The filesystem to be checked and repaired must have been
 unmounted cleanly using normal system administration procedures
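The exit statuses documented in the man page diff above could be mapped to outcomes roughly as follows. This is only a sketch of the documented v4.10.0 behaviour: `repairOutcome` and `classifyRepairStatus` are hypothetical names, and older xfs_repair versions may behave differently.

```go
package main

import "fmt"

// repairOutcome is a hypothetical classification of an xfs_repair run.
type repairOutcome string

const (
	outcomeOK           repairOutcome = "ok"
	outcomeCorrupt      repairOutcome = "corruption detected"
	outcomeRuntimeError repairOutcome = "runtime error, xfs_repair should be restarted"
	outcomeDirtyLog     repairOutcome = "dirty log, mount and unmount the filesystem first"
	outcomeUnknown      repairOutcome = "unknown status"
)

// classifyRepairStatus interprets an exit status according to the EXIT STATUS
// section of the man page. checkOnly is true for "xfs_repair -n".
func classifyRepairStatus(checkOnly bool, exitCode int) repairOutcome {
	switch {
	case exitCode == 0:
		return outcomeOK
	case checkOnly && exitCode == 1:
		return outcomeCorrupt
	case !checkOnly && exitCode == 1:
		return outcomeRuntimeError
	case !checkOnly && exitCode == 2:
		return outcomeDirtyLog
	default:
		return outcomeUnknown
	}
}

func main() {
	fmt.Println(classifyRepairStatus(true, 1))
	fmt.Println(classifyRepairStatus(false, 2))
}
```

Note that status 2 is the dirty-log case: xfs_repair cannot replay the log, so the filesystem has to be mounted and unmounted before a repair can proceed.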

27149chen · 2019-12-20T10:10:59Z

/assign saad-ali

27149chen · 2019-12-26T05:49:25Z

@jsafrane , please take another look at the pr, thanks.

jsafrane · 2020-01-09T09:50:07Z

lgtm-ish, can you please rebase?
/approve

Signed-off-by: Lou <[email protected]>

27149chen · 2020-01-09T11:22:04Z

@jsafrane , finished the squash and rebase, please approve.

jsafrane · 2020-01-09T11:42:32Z

mount/mount_linux.go

+			}
+
+			if err != nil {
+				return err


Rebase is not that simple in this case: you should return NewMountError(mountErrorValue, err.Error()), to propagate potential FilesystemMismatch.

Actually, you should return FilesystemMismatch, if the FS type is wrong, or HasFilesystemErrors if the FS type is OK, but the filesystem has errors.

HasFilesystemErrors is returned in checkAndRepairXfsFilesystem/checkAndRepairFilesystem, so it is OK to return err here directly.

I don't think we should return a FilesystemMismatch here. The only error that can happen here is a failure to repair the filesystem, and the fs passed in is the existing filesystem, not the requested one, so a filesystem mismatch cannot be the cause of this error.

I forgot to return HasFilesystemErrors in checkAndRepairXfsFilesystem; that is fixed now. @jsafrane PTAL.

ack, I missed checkAndRepairXfsFilesystem/checkAndRepairFilesystem

Signed-off-by: Lou <[email protected]>

jsafrane · 2020-01-09T14:19:04Z

/lgtm

k8s-ci-robot · 2020-01-09T14:19:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 27149chen, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~mount/OWNERS~~ [jsafrane]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes label Dec 6, 2019

k8s-ci-robot added the size/L label Dec 6, 2019

k8s-ci-robot requested review from andyzhangx and jingxu97 December 6, 2019 07:40

k8s-ci-robot assigned jsafrane Dec 10, 2019

jsafrane reviewed Dec 11, 2019

27149chen requested a review from jsafrane December 12, 2019 09:51

27149chen mentioned this pull request Dec 13, 2019

Validate the existence of filesystem on disk before attempting to mount it (linux) kubernetes/kubernetes#77982

Closed

mvisonneau mentioned this pull request Dec 13, 2019

Validate the existence of filesystem before attempting to mount it (linux) #127

Merged

gnufied reviewed Dec 18, 2019

gnufied reviewed Dec 18, 2019

k8s-ci-robot assigned saad-ali Dec 20, 2019

k8s-ci-robot added the needs-rebase and approved labels Jan 9, 2020

use xfs_repair to check and repair xfs filesystem

aa83e53

Signed-off-by: Lou <[email protected]>

27149chen force-pushed the add_xfs_repair branch from 623e4e0 to aa83e53 Compare January 9, 2020 11:19

k8s-ci-robot removed the needs-rebase label Jan 9, 2020

jsafrane reviewed Jan 9, 2020

update after review

08f269a

Signed-off-by: Lou <[email protected]>

k8s-ci-robot added the lgtm label Jan 9, 2020

k8s-ci-robot merged commit 94aeca2 into kubernetes:master Jan 9, 2020

27149chen mentioned this pull request Jan 16, 2020

fsck is not a suitable command to check and repair an XFS filesystem ,use xfs_repair instead #125

Closed

27149chen mentioned this pull request Feb 10, 2020

REQUEST: New membership for 27149chen kubernetes/org#1625

Closed

nktpro mentioned this pull request Feb 28, 2020

[mount] Addition of "checkAndRepairXfsFilesystem" inadvertently prevents XFS self-recovery via mounting #141

Closed

gnufied mentioned this pull request Mar 24, 2020

Do not perform xfs_repair on xfs filesystem #150

Merged

alejandrox1 mentioned this pull request Mar 25, 2020

bump k8s.io/utils package kubernetes/kubernetes#89444

Merged
