
[ML] Consider spawning processes from dedicated threads in controller #1503

Closed
droberts195 opened this issue Sep 23, 2020 · 3 comments

@droberts195
Contributor

We have observed that, when security software is running on a machine, spawning a new process can take a very long time: over 20 seconds has been observed between the controller receiving the command and the resulting posix_spawn call returning. This invalidates the assumption that commands issued to the controller by the JVM will be near-instantaneous. It causes a problem because the timeout for the named pipes to connect starts as soon as the command is issued, but the process may not actually start until considerably later.

Although spawn times of over 20 seconds have been observed, the security software might not be slowing down each individual spawn by that much. Currently the controller has a single thread for processing commands, so if 4 commands to spawn new processes are received in quick succession they are actioned sequentially. If the security software were adding 5 seconds to the time taken to spawn a process, the 4 processes would spawn 5, 10, 15 and 20 seconds after being requested. Effectively, the controller is serialising the work of the security software. If instead the controller created a new thread to process each spawn request, the security software could do its checks on all 4 requests in parallel.
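
To make the arithmetic above concrete, here is a rough sketch that compares handling spawn requests one after another with handling each on its own thread. It is not code from controller: `simulatedSlowSpawn`, `spawnSequentially` and `spawnInParallel` are hypothetical names, and the delay added by security software is faked with a sleep rather than a real posix_spawn call.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for a posix_spawn() call that security software
// delays by a fixed amount. No real process is started.
void simulatedSlowSpawn(Millis delay) {
    std::this_thread::sleep_for(delay);
}

// Models a single command-processing thread: each request waits for the
// previous one, so the total time is roughly requests * delay.
Millis spawnSequentially(std::size_t requests, Millis delay) {
    auto start = Clock::now();
    for (std::size_t i = 0; i < requests; ++i) {
        simulatedSlowSpawn(delay);
    }
    return std::chrono::duration_cast<Millis>(Clock::now() - start);
}

// Models one dedicated thread per request: the delays overlap, so the
// total time is roughly one delay regardless of the number of requests.
Millis spawnInParallel(std::size_t requests, Millis delay) {
    auto start = Clock::now();
    std::vector<std::thread> workers;
    workers.reserve(requests);
    for (std::size_t i = 0; i < requests; ++i) {
        workers.emplace_back(simulatedSlowSpawn, delay);
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return std::chrono::duration_cast<Millis>(Clock::now() - start);
}
```

Calling `spawnSequentially(4, Millis(5000))` would take at least 20 seconds, matching the 5, 10, 15 and 20 second pattern, whereas `spawnInParallel(4, Millis(5000))` should finish in a little over 5 seconds. This only holds if the delay really is per spawn call and the calls can overlap.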

@droberts195
Contributor Author

This is actually trickier than it seems because the CThreadTracker is locked for the duration of the spawn and while the PID is added to the tracker. Something will need to be changed there.
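
As a rough illustration of why the lock matters, the sketch below uses a hypothetical `SimpleTracker` class (not the real CThreadTracker, whose internals are not shown in this thread) that holds one mutex across both a simulated slow spawn and the PID registration. The assumption here is that the lock exists so the PID is always recorded before anything else can inspect the tracker. With that locking in place, calling from several threads gains nothing: the spawns still run one at a time.

```cpp
#include <chrono>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Hypothetical, heavily simplified tracker. It is NOT the real
// CThreadTracker; it only mimics "lock held during spawn and PID add".
class SimpleTracker {
public:
    void spawnAndTrack(int fakePid, Millis delay) {
        std::lock_guard<std::mutex> lock(m_Mutex);
        // Stand-in for a slow posix_spawn() call made while locked.
        std::this_thread::sleep_for(delay);
        // The PID is recorded before the lock is released.
        m_Pids.insert(fakePid);
    }

    std::size_t trackedCount() const {
        std::lock_guard<std::mutex> lock(m_Mutex);
        return m_Pids.size();
    }

private:
    mutable std::mutex m_Mutex;
    std::set<int> m_Pids;
};

// Issues each request from its own thread. Because the tracker's mutex
// is held for the whole spawn, the threads still queue behind each other.
Millis spawnFromThreads(SimpleTracker& tracker, int requests, Millis delay) {
    auto start = Clock::now();
    std::vector<std::thread> workers;
    for (int i = 0; i < requests; ++i) {
        workers.emplace_back([&tracker, i, delay] {
            tracker.spawnAndTrack(1000 + i, delay); // fake PIDs
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return std::chrono::duration_cast<Millis>(Clock::now() - start);
}
```

With 4 requests and a 5 second simulated delay, `spawnFromThreads` still takes at least 20 seconds, so dedicated threads alone would not help unless the locking around the spawn were also changed.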

@droberts195
Contributor Author

Further analysis suggests that it's not the actual posix_spawn() call that's being slowed down by the security software; rather, overall process startup is being slowed by how busy the machine is. This implies that starting new threads from which to call posix_spawn() may just add complexity to the code without helping the overall situation.

elastic/elasticsearch#62823 is probably a more useful change to avoid the problem.

I will leave this issue open as it might be useful to revisit in the future, but nobody should work on it in the short term.

droberts195 added a commit to droberts195/ml-cpp that referenced this issue Oct 12, 2020
By writing results in an array it's possible to reuse the Java
ProcessResultsParser class to read them.

Also, making the output thread safe will make implementing elastic#1503
easier if that's ever done.

@droberts195
Contributor Author

elastic/elasticsearch#62823 was implemented, so closing this one.
