-
Notifications
You must be signed in to change notification settings - Fork 18k
x/tools/gopls: test failures with invalid character '\x00'
#59289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Found new dashboard test flakes for:
2023-03-27 15:26 darwin-amd64-12_0 tools@e0ecd1bc go@a0c9d153 x/tools/gopls/internal/lsp/cmd/test.TestLinks (log)
|
CC @adonovan, who might have ideas if this could be related to the filecache, but I suspect it may be my bug. There have been a few of these causing test failures. I need to greplogs to confirm, but I think they're all related to diagnostics, in which case it suggests I missed handling an error somewhere (since this looks like corrupt data in the filecache). |
invalid character '\x00'
Found new dashboard test flakes for:
2023-03-31 19:13 darwin-amd64-nocgo tools@65208701 go@99276aeb x/tools/gopls/internal/lsp/cmd/test.TestPrepareRename (log)
|
Strange. What we know:
The error message corresponds to a decode of a non-empty message starting with a NUL byte: https://go.dev/play/p/RQxCWnWZY90. But I can't see how the value in the cache can be anything but the successful result of encoding the correct data type. The filecache.Set and Get operations should be robust against incomplete writes. |
Ok, so some more thoughts on this:
All of these suggest to me that the fast termination of the cmd tests leads to filecache corruption when the process exits during a call to filecache.Set (or perhaps Gets occurring during shutdown get corrupted somehow?!). Since this has only occurred on Darwin thus far, maybe the filecache operations are not robust on Darwin. |
Ok, I think I understand this. Looking at the code, I don't think the lockedfile logic is actually robust to interrupted writes. It could certainly be terminated without finishing the write. It looks like the Go command has handling for this, by looking specifically for null bytes resulting from the truncation. So we need a similar form of validation. |
That's the purpose of writing the length before the payload in Set: and checking that the read is exactly as long as the written length in Get: Are you suggesting that the write in fact writes the length prefix and expands the file size to the correct length but somehow fails to write the complete payload into it? |
I was deducing from the presence of logic to handle this in the Go command (which also uses lockedfile) that it may be possible for a lockedfile write to result in a corrupted file that contains null bytes. That would mean we have a file of length N, where some byte in [0, N) is null. I don't a priori know why we could conclude that N != len(data). It may also be worth noting that almost all of these payloads are JSON "null", a byte slice of length 4. Given their short length, perhaps it is more likely that a corrupted write actually has the correct size. |
Note also that when using io.MultiReader, the header and payload will be written in separate writes. Furthermore, there could be many concurrent processes writing this file, so it seems likely that we are overwriting an existing copy with identical content. So I think the necessary conditions could be that the first write succeeds to write the length (4, in the common case), and the second write increases the file size but fails to write the data. I didn't know that could happen, but "once you eliminate the impossible...", etc. Either that, or the truncate partially succeeds?! |
The linked comment:
talks about null bytes written by the truncate operation, but the argument to ftruncate from the call in lockedfile is zero, meaning that it never extends a file with zeroes. I do not expect ftruncate(0) to cause reads through an existing descriptor to return zeros (they should return EOF). Nor would I expect there to be concurrent reads through existing descriptors since the whole point of filelock is that the operations should be mutually exclusive.
Yes, I think this issue is evidence of a bug in lockedfile, but I still don't understand it. Even in the presence of concurrent access I would not expect a file that has been truncated to length zero and then written with N bytes to be observable in a middle-state in which only M < N of the bytes have been written and the remaining bytes are zero. That would be a patent kernel bug.
We could easily implement checksums but I'd much rather understand and fix the bug in lockedfile than add yet another layer of workarounds. |
Yes, but that's ok. The integrity invariant is preserved at all times.
No, filelock should exclude such concurrency. That is its one job.
My point is: that should never happen. Perhaps @bcmills has more ideas. |
Neither would I.
I didn't mean to say that the writes are concurrent. I mean that the data previously stored would have had the same length, if that makes any difference.
But it does happen! |
Filesystem bugs on
Actually, come to think of it #52493 looks like an almost exact match for this failure mode, except that it's a different OS version. 🤔 |
I would suggest two mitigations:
|
Thanks, will do.
Not sure I agree with this one: our programs must be prepared for sudden death (i.e., be idempotent and transactional in their updates of persistent state) since they will have to deal with this in actual use. It's easier to detect bugs--such as this one--when we don't build in layers of imperfect mitigation such as "graceful shutdown". kill -KILL should be an acceptable way to terminate the process. |
Change https://go.dev/cl/482818 mentions this issue: |
This change appends a 32-bit CRC checksum to the end of each cache file in Set, and checks it in Get, and dumps the file content if it doesn't match the content. This is intended to help investigate golang/go#59289. Updates golang/go#59289 Change-Id: Ia439d915c37d33b0af5d0da11a8db63762cd80a7 Reviewed-on: https://go-review.googlesource.com/c/tools/+/482818 TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Robert Findley <[email protected]> Auto-Submit: Alan Donovan <[email protected]> Run-TryBot: Alan Donovan <[email protected]>
Fixed by https://go.dev/cl/492035. Not sure what else we can do. |
Issue created automatically to collect these failures.
Example (log):
— watchflakes
The text was updated successfully, but these errors were encountered: