NLP model deployments have a queue on each node to buffer inference requests while the work is processed. Once the queue is full, any new requests are rejected with the error message "inference process queue is full. Unable to execute command".
An enhancement would be to automatically detect this scenario and either adjust to accommodate the high input rate or suggest ways for the user to resolve the problem, perhaps in the UI.
Currently there are a number of strategies to resolve the problem:
1. Increase the number of inference_threads and/or model_threads
This increases the amount of CPU the PyTorch process can use. The value should not be greater than the number of physical CPU cores on the machine.
Use the _stop and _start APIs to update the setting.
POST _ml/trained_models/MODEL_NAME/deployment/_stop?force=true
POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=4
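If both settings need to change, the two parameters can be passed on the same _start call; the thread counts below are illustrative, not recommendations.
POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=2&model_threads=2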
2. Add another ML node.
The NLP model will automatically be deployed to new ML nodes when they join the cluster; the new node will take a share of the work, increasing throughput in the cluster.
3. Work with smaller batches
If using bulk upload, avoid large requests that will fill the queue or exceed its capacity.
If using reindex, use the size parameter to reduce the batch size.
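As a sketch, a reindex request that reduces the batch size via the size field in the source (the index and ingest pipeline names here are hypothetical placeholders):
POST _reindex
{
  "source": { "index": "my-source", "size": 100 },
  "dest": { "index": "my-dest", "pipeline": "my-nlp-pipeline" }
}
Smaller batches mean each scroll page sends fewer documents to the inference queue at once.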
4. Increase the queue size
If the input is bursty, increasing the queue size can help absorb those peaks. This will not help if a high input rate is sustained.
POST _ml/trained_models/MODEL_NAME/deployment/_start?queue_capacity=2000
As an end-user (who has, admittedly, not done a deep dive into the documentation), I find the fact that we have two "performance tuning knobs", model_threads and inference_threads, a bit confusing. Do I get better performance putting all of my vCPUs into one or the other, or maybe a combination of both? I'm assuming the answer is going to depend on the task, but it would be good to have some guidance.
The value should not be greater than the number of physical CPU cores on the machine.
I assume that model_threads + inference_threads = "total vCPUs"?
@Winterflower The two knobs model_threads and inference_threads improve throughput and latency respectively. The approach would be:
1. Check latency. If happy with the current latency, go to step 3.
2. If improved latency is needed, increase the number of inference_threads and restart the deployment. The improvement should be close to linear provided spare cores exist (which is not guaranteed if multiple models are deployed with the current way we do allocation, but we're working on it). Note that while inference_threads targets latency, decreased latency will probably improve throughput too. However, increasing model_threads will not affect latency at all.
3. Check throughput. If improved throughput is needed, increase model_threads and restart the deployment. In the future it will be possible to change model_threads without restarting the deployment (but not inference_threads).
A final note. Scaling latency has a ceiling. It cannot go beyond what a single node can offer in terms of CPU cores. Scaling throughput is unbounded, as long as more nodes are added.
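Putting the two knobs together as a sketch (the thread counts are illustrative, not recommendations):
To target latency (step 2), restart the deployment with more inference_threads:
POST _ml/trained_models/MODEL_NAME/deployment/_stop?force=true
POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=4
If throughput is still a bottleneck (step 3), restart again with more model_threads:
POST _ml/trained_models/MODEL_NAME/deployment/_stop?force=true
POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=4&model_threads=2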