[ML] Consider spawning processes from dedicated threads in controller #1503

droberts195 · 2020-09-23T12:14:55Z

We have observed that when security software is running on a machine, spawning a new process can take a very long time: over 20 seconds has been observed between the controller receiving the command and the resulting posix_spawn call returning. This invalidates the assumption that commands issued to the controller by the JVM are near-instantaneous. It causes a problem because the timeout for the named pipes to connect starts as soon as the command is issued, but the process may not actually start until considerably later.

Although spawn times over 20 seconds have been observed, the security software might not be slowing down each individual spawn by that much. Controller currently has a single thread for processing commands, so if 4 commands to spawn new processes are received in quick succession they are actioned sequentially. If the security software were adding 5 seconds to the time taken to spawn a process, the 4 processes would spawn 5, 10, 15 and 20 seconds after being requested. Effectively, controller is serialising the work of the security software. If controller instead created a new thread to process each spawn request, the security software could do its checks on all 4 requests in parallel.
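
To make the arithmetic concrete, here is a rough model of the two cases. It is only a sketch: it assumes all requests arrive at the same moment, that the security software adds a constant delay per spawn, and that parallel checks do not slow each other down. The function names and the 10 second pipe timeout are hypothetical, chosen purely for illustration.

```python
def serial_spawn_times(num_requests, spawn_delay_s):
    """Completion time of each spawn when a single thread handles them in order."""
    return [spawn_delay_s * (i + 1) for i in range(num_requests)]


def parallel_spawn_times(num_requests, spawn_delay_s):
    """Completion time of each spawn when every request gets its own thread."""
    return [spawn_delay_s] * num_requests


def missed_timeout(spawn_times, pipe_timeout_s):
    """Indices of requests whose process only started after the timeout expired."""
    return [i for i, t in enumerate(spawn_times) if t > pipe_timeout_s]


serial = serial_spawn_times(4, 5)
parallel = parallel_spawn_times(4, 5)

print(serial)    # [5, 10, 15, 20]
print(parallel)  # [5, 5, 5, 5]

# Hypothetical 10 second timeout that starts when the command is issued.
print(missed_timeout(serial, 10))    # [2, 3]
print(missed_timeout(parallel, 10))  # []
```

Under these assumptions the third and fourth processes cannot connect in time in the serial case, even though no single spawn took longer than 5 seconds.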

droberts195 · 2020-09-23T15:55:34Z

This is actually trickier than it seems because of the way the CThreadTracker is locked while the process is spawned and its PID is added to the tracker. Something will need to change there.

droberts195 · 2020-09-24T09:15:07Z

Further analysis suggests that the security software is not slowing down the posix_spawn() call itself; rather, overall process startup is being slowed by how busy the machine is. This implies that calling posix_spawn() from new threads may just add complexity to the code without helping the overall situation.

elastic/elasticsearch#62823 is probably a more useful change to avoid the problem.

I will leave this issue open as it might be useful to revisit in the future, but nobody should work on it in the short term.

By writing results in an array it's possible to reuse the Java ProcessResultsParser class to read them. Also, making the output thread safe will make implementing elastic#1503 easier if that's ever done.

droberts195 · 2024-01-11T10:24:16Z

elastic/elasticsearch#62823 was implemented, so closing this one.

droberts195 added the :ml label Sep 23, 2020

droberts195 mentioned this issue Sep 23, 2020

[ML] Processes that fail to connect to the JVM within a reasonable time should exit #1504

Closed

droberts195 added v7.11.0 v8.0.0 and removed v7.11.0 v8.0.0 labels Oct 6, 2020

droberts195 closed this as completed Jan 11, 2024
