[ML] process can hang if large forecast job fails to delete temporary storage #350
Thanks for opening this issue! I am available to dig into the "why" of the initial error when cleaning up the temporary folder. I get "permission denied" at this line which causes
Disclaimer: this happens on the Linux box that I use to run CI, which I have just reinstalled. This test fails consistently in the same way from within Jenkins as well as when run manually. I can't work out what is causing the permission denied error. I tried disabling SELinux but that does not help. I would expect the cause to be some subtle misconfiguration, but I haven't figured it out yet. The operating system is Fedora 29. I have Fedora 28 on my laptop, where the test runs fine.
One possible cause of the deletion problem: we install a system call filter on Linux: https://github.com/elastic/ml-cpp/blob/master/lib/seccomp/CSystemCallFilter_Linux.cc It might be that forecast overflow requires additional whitelisting in this filter on Fedora 29. Unfortunately this is not easy to find out: seccomp cannot be easily disabled, so you need a custom build of either ml-cpp or the Linux kernel. We could consider a special environment variable or command-line parameter to disable seccomp if my assumption is true and if we get more reports like this.
I don't think we should do this. If we did then any malware that wanted to use an exploit that seccomp is currently blocking would just switch it off, so we might as well just remove seccomp entirely. See also elastic/elasticsearch#27645 (comment)
I think we need to put effort into finding out. @davidkyle isn't there a way to get seccomp to report which system call it's blocking? I seem to remember you used something like that during initial development. Then we just need to hire a Fedora 29 VM in EC2 or GCP and copy over a test program that uses This is not desperately urgent as Fedora 29 is not a supported platform. But whatever changes are in its glibc will eventually be included in some supported platform, so we need to head off the problem before an affected platform is added to the support matrix.
I am here to help; as I said, I can reproduce this 100% of the time. I would love it if this could be given some kind of priority so I can run the full suite of tests on my server again, either the fix that Hendrik mentioned for the bug that's causing the test to hang, or digging deeper to find out what the problem is with
I believe the fix to the symptoms of the problem in #352 will allow the tests to pass again, and it should be possible to merge that the day @hendrikmuhs is back at work.
The only way I found to reliably track which syscalls are blocked across all Linux versions is to change the seccomp filter to return SECCOMP_RET_KILL instead of SECCOMP_RET_ERRNO. This kills the calling thread, which should be the main thread, so the process dies, and the kernel logs a message with the offending syscall number, which must then be looked up in syscall.h. I think this is worth investigating even if Fedora isn't officially supported, because a blocked syscall fails silently and is very hard to debug. It's a pretty easy change to make; I would have to go back to look at exactly what I did before and create a unit test for it.
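To illustrate the lookup step, here is a minimal sketch in Python that turns a logged syscall number back into a name by parsing `#define __NR_<name> <number>` lines. The function name `syscall_names` and the `SAMPLE` text are hypothetical; the three sample entries are the x86_64 numbers of the syscalls that turn up later in this thread. On a real machine the text would come from the kernel header that syscall.h pulls in, whose location varies by distribution and architecture.

```python
import re

# Matches lines such as "#define __NR_unlinkat 263".
_DEFINE = re.compile(r"^#define\s+__NR_(\w+)\s+(\d+)", re.MULTILINE)


def syscall_names(header_text):
    """Map syscall number -> name from '#define __NR_<name> <number>' lines."""
    return {int(number): name for name, number in _DEFINE.findall(header_text)}


# Three x86_64 entries, written the way the kernel headers define them.
# Other architectures use different numbers.
SAMPLE = """
#define __NR_gettimeofday 96
#define __NR_getdents64 217
#define __NR_unlinkat 263
"""

names = syscall_names(SAMPLE)
print(names.get(263, "unknown"))
```

Reading the header of the machine under test, rather than a hard-coded table, avoids mixing up numbers between architectures.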
Great, @droberts195, I had not seen that PR; thank you for pointing me to it. I can help test it if needed, though I'm not sure how :) Ping me if I can do anything.
My PR only fixes the hang, but not the issue that on Fedora 29 we very likely block a system call in the seccomp filter which we should not. I am pretty sure it is the seccomp filter because EACCES is the error we get reported. Killing the process would be quite dramatic, as we would sacrifice the process for a failed forecast. IMHO that is too much to make the default. I suggest that we do this only on snapshot builds but keep the current behavior on release builds. It requires some makefile and Note that the unit tests are not executed if you use the binary blobs. We get more coverage with the above. Last but not least, we have to check that the crash handler reports the root cause (blocked syscall) correctly.
I wasn't suggesting we change this for any official build, just as a one-off debug build someone does on their laptop to find the offending syscall number. This problem only affects us when one of the OS libraries we use is changed to use different kernel calls to implement an existing user land OS function. I don't imagine that happens very often.
I wasn't suggesting we change the response to kill the process, only that this was the hack I made to find which syscalls were being called as I was developing the PR. More recent kernels (after 4.14) have better debugging for BPF, but I needed to test this on all the versions we support. There is a unit test for CSystemCallFilter_Linux.cc that asserts which operations are allowed/disallowed after installing the seccomp filter. This test is by no means exhaustive, but the next step should be to add calls to
I'm pretty sure the missing syscall is In fact, given that code it's surprising that this worked on any version of Linux from the last decade or so. I guess older glibc versions must have had some sort of workaround, which I can't find, that fell back to
I noticed a couple of times that if a syscall was blocked, another was tried. For example, @hendrikmuhs found a bug where the high resolution timer was blocked, so it fell back to the 1 second timer. Regarding the unit test for We need to run the unit tests on Fedora 29; hopefully it will fail and hopefully adding
Fedora 29 uses different system calls to platforms we've previously tested on, and hence suffers from certain functionality failing due to the seccomp filter. This commit permits 3 additional system calls:
1. __NR_gettimeofday
2. __NR_unlinkat
3. __NR_getdents64
(It is likely that other Linux distributions using modern glibc would also hit one or more of these system calls. Non-fatal problems probably got progressively worse in the lead up to the fatal problem that surfaced in Fedora 29.)
Fixes elastic#350
Fedora 29 uses different system calls to platforms we've previously tested on, and hence suffers from certain functionality failing due to the seccomp filter. This commit permits 3 additional system calls:
1. __NR_gettimeofday
2. __NR_unlinkat
3. __NR_getdents64
(It is likely that other Linux distributions using modern glibc would also hit one or more of these system calls. Non-fatal problems probably got progressively worse in the lead up to the fatal problem that surfaced in Fedora 29.)
Fixes elastic#350
Backport of elastic#354
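As a rough illustration of what the commit changes, the sketch below models the allow-list decision in Python. It is not the real filter, which is a BPF program installed by CSystemCallFilter_Linux.cc. The names `filter_decision`, `before` and `after` are hypothetical, the `before` set is a tiny stand-in for the real allow-list, the syscall numbers are the x86_64 values, and the EACCES result is an assumption inferred from the "permission denied" errors reported in this thread.

```python
import errno

# x86_64 syscall numbers; other architectures use different values.
NR_GETTIMEOFDAY = 96
NR_GETDENTS64 = 217
NR_UNLINKAT = 263


def filter_decision(syscall_nr, allowed):
    """Return ("allow", 0) for permitted syscalls, ("errno", EACCES) otherwise."""
    if syscall_nr in allowed:
        return ("allow", 0)
    return ("errno", errno.EACCES)


# Hypothetical stand-in for the allow-list before the commit
# (the real list is much longer): read (0) and write (1).
before = frozenset({0, 1})
# The commit adds the three syscalls listed in its message.
after = before | {NR_GETTIMEOFDAY, NR_UNLINKAT, NR_GETDENTS64}

print(filter_decision(NR_UNLINKAT, before))
print(filter_decision(NR_UNLINKAT, after))
```

With the `before` set, a call to unlinkat is refused with an error code rather than killing the process, which is why the failure showed up as an ordinary "permission denied" during temporary storage cleanup.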
Thanks a lot to everybody involved!
Fix a race condition that occurs if a forecast job requires overflowing to disk but cleanup of temporary storage fails. This can cause the autodetect process to hang on exit if more forecast requests are in the queue. Relates to #350.
Found using integration tests:
If, for an unknown reason, forecasting fails to delete tmp storage[*], the worker thread returns, potentially leaving open jobs in the queue, which causes the process to hang on job close.
[*] Note: this is a bug in itself, but it should still not lead to the issue above.
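The failure mode described above can be sketched in a few lines of Python. This is not the ml-cpp implementation (which is C++); `run_forecasts` and the `cleanup` callback are hypothetical names. The point it illustrates is that the worker records a failed cleanup and keeps draining the queue instead of returning early, so waiting for the queue to empty cannot hang.

```python
import queue
import threading


def run_forecasts(jobs, cleanup):
    """Process every job, even if cleanup of one job's temp storage fails."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    pending.put(None)  # sentinel: no more work

    finished, failed = [], []

    def worker():
        while True:
            job = pending.get()
            try:
                if job is None:
                    return  # normal shutdown, queue fully drained
                try:
                    cleanup(job)
                    finished.append(job)
                except OSError as err:
                    # Record the failure and carry on; returning here would
                    # leave the remaining jobs in the queue forever.
                    failed.append((job, err.errno))
            finally:
                pending.task_done()

    thread = threading.Thread(target=worker)
    thread.start()
    pending.join()  # would block forever if the worker quit early
    thread.join()
    return finished, failed
```

Calling `run_forecasts(["a", "b", "c"], cleanup)` with a `cleanup` that raises `PermissionError` for `"b"` still completes, returning `"a"` and `"c"` as finished and `"b"` as failed.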