Kernel bug running BOINC tasks, 3.12.18/19 #600
This post (http://www.raspberrypi.org/forums/viewtopic.php?f=29&t=75895&start=25#p552161) suggests that Wheezy running 3.12.y is okay, but updating to Jessie introduced the problem. Which are you using? I need to see the rest of the panic message (i.e. press enter for more). |
Thanks for the reply. I'm running Wheezy I believe, so I don't think updating to Jessie causes the issue. "uname -a" reports: Correct, keyboard/ssh no longer usable. Unfortunately I don't have a serial connection. Out of curiosity, do you expect that running Raspbian from the SD card under QEMU on a Linux x86_64 computer would provide a close-enough replication of the environment to extract the whole panic message? Or is there a way to lengthen the amount displayed on screen (a kernel command line argument or something)? |
If you remove the parameter kgdboc=ttyAMA0,115200 from /boot/cmdline.txt then the stacktrace won't end at the kdb prompt. The full stack trace should spew out on the screen. |
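A minimal sketch of the edit suggested above, assuming the kernel command line is a single line of space-separated parameters. The function name `strip_kgdboc` and the sample command line are hypothetical, for illustration only; back up the real /boot/cmdline.txt before changing it.

```python
def strip_kgdboc(cmdline: str) -> str:
    """Return the kernel command line with any kgdboc=... parameter removed."""
    tokens = cmdline.split()
    # Drop every token that sets kgdboc; keep all others in their original order.
    kept = [t for t in tokens if not t.startswith("kgdboc=")]
    return " ".join(kept)


# Hypothetical sample line; the parameters on a real system will differ.
sample = "console=ttyAMA0,115200 kgdboc=ttyAMA0,115200 console=tty1 rootwait"
print(strip_kgdboc(sample))
```

To apply this to the real file, one would read /boot/cmdline.txt, pass its contents through the function, write the result back as root, and reboot.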
Thanks for the hint P33M. That allowed me to extract the entire trace through dmesg (see below for further discussion). Here's the trace:
I can supply the entire dmesg output if you'd like. The super weird thing is that I was able to get this from dmesg, because it hasn't hung. It's now been running for over two days straight without freezing and without getting that error again. I'm not sure this isn't a fluke so I'm continuing to monitor it. However, is it possible that waiting for a response to the "more>" prompt on a nonexistent serial device ttyAMA0 was causing the RPi to hang and become unresponsive to other inputs? I know it was hung because the USB hard drive would turn off, which it wouldn't do if BOINC were still running tasks from it in the background. But it hadn't run for more than a few hours without hanging since updating to 3.12.y, so removing that kgdboc parameter seems to have changed something about how it handles this bug. Very keen to hear your thoughts on what's going on! UPDATE: Although the RPi itself seems not to hang any more, the BOINC task doesn't progress any further. This could be the reason the bug wasn't recurring. |
Although you hit an illegal instruction exception in kernel mode, it looks like the handler just killed the process. With kgdb loaded, this would have hit a trap to enter the debugger which is why you got the more> prompt - if you had a serial cable you could potentially resume execution and you would get the result as if you hadn't entered kgdb. This smells like an upstream bug. There's assembly magic in there that does the job of moving values out of the VFP register file. |
OK, thanks for looking into it. I'll further my discussions with the BOINC people and see how it goes. EDIT: Any idea why this would have started happening between 3.10.y and 3.12.y, as reported on the BOINC boards? Was anything deprecated between versions that may have led to this? |
Sounds like this was fixed in a BOINC update: |
Sorry if I may differ. From a security point of view, the kernel is king, and if userspace can crash the kernel, it is always ALWAYS ALWAYS the kernel's fault. The kernel is supposed to protect itself very well against ANY possible input from userspace, because it is encased in a security castle. This point is sometimes hard for many people to understand, but definitely NOT for the security hacker: many times in the past a simple hack/crack introduced through userspace has easily compromised the king, I mean the kernel. References (please don't irritate Linus, this simple logic is supposed to be well known to core kernel developers): https://felipec.wordpress.com/2013/10/07/the-linux-way/ https://lkml.org/lkml/2012/12/23/75 Mantra for the day: userspace never breaks the kernel, it is always the kernel's fault. Userspace bugs should at worst end in application crashes. Sorry, my point is: any kernel crash has to be reported as a kernel crash, since it is likely to persist. |
I have to agree with tthtlc, if you're hitting an illegal instruction in kernel mode, it's a kernel bug, not a userspace bug (with the possible exception of cases like BPF where userspace passes code into the kernel). BOINC may have updated such that they don't hit the bug, but it's still there regardless, and should be fixed. |
@hartacus has this issue been resolved? If yes, then please close this issue. |
Still the same issue with seti@boinc
after 10 min I've got
|
This is still an open issue on the Pi, from the 3.12.x kernels up to and including the 4.4.46+ kernel. On the Seti v8 app, the issue doesn't occur on Low Angle Range tasks (where the telescope is focused at a point in space). |
Here is my latest kern.log:
|
This patch might be relevant: http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/332633.html |
How do I get hold of a kernel with this patch applied? |
Did an rpi-update and rebooted, no change:
|
next try, no change:
|
This issue is now fixed with kernels 4.4.48+ 964 or 4.9.9+ 965 or later, see this issue for details: |
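As a rough illustration of checking whether a given kernel falls in the fixed range quoted above, here is a sketch that compares a release string (as printed by `uname -r`) against 4.4.48 and 4.9.9. The function names are hypothetical, the build numbers 964 and 965 are not checked, and treating every release newer than 4.9.9 as fixed is an assumption based only on the "or later" wording.

```python
import re


def parse_release(release: str) -> tuple:
    """Extract (major, minor, patch) from a release string such as '4.4.48+'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


def has_reported_fix(release: str) -> bool:
    """True if the release is at or beyond the versions reported as fixed."""
    version = parse_release(release)
    if version[:2] == (4, 4):
        # Within the 4.4 series, the fix is reported from 4.4.48 onwards.
        return version >= (4, 4, 48)
    # Assumption: anything at or after 4.9.9 carries the fix; the thread makes
    # no claim about 4.5 to 4.8, so those are reported as not fixed here.
    return version >= (4, 9, 9)


print(has_reported_fix("4.4.46+"))
print(has_reported_fix("4.9.9+"))
```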
Experiencing a "BUG: unsupported FP instruction in kernel mode" error when running various BOINC tasks (see image). It seems to occur at a random time after starting BOINC computation, usually after a few hours but sometimes after a few minutes. It has been observed with the 3.12.18 and 3.12.19 kernels running tasks from several BOINC projects. Reported by multiple users here:
http://www.raspberrypi.org/forums/viewtopic.php?p=547713#p547713
http://boinc.berkeley.edu/dev/forum_thread.php?id=9222
Some info from the above threads indicates this issue was not present in 3.10.y kernels.