[ML] Handling the NLP model 'the inference process queue is full' error #85319


Open
davidkyle opened this issue Mar 24, 2022 · 3 comments
Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team

Comments

@davidkyle
Member

davidkyle commented Mar 24, 2022

Description

NLP model deployments have a queue on each node to buffer inference requests while the work is processed. Once the queue is full, any new requests are rejected with the error message "inference process queue is full. Unable to execute command".

An enhancement would be to automatically detect this scenario and adjust to accommodate the high input rate, or to suggest ways for the user to resolve the problem, perhaps in the UI.

Currently there are a number of strategies to resolve the problem:

1. Increase the number of inference_threads and/or model_threads
This increases the amount of CPU the PyTorch process can use. The value should not be greater than the number of physical CPU cores on the machine.
Use the _stop and _start APIs to update the setting.

POST _ml/trained_models/MODEL_NAME/deployment/_stop?force=true

POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=4

2. Add another ML node.
The NLP model will automatically be deployed to new ML nodes when they join the cluster; each new node takes a share of the work, increasing throughput across the cluster.
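To confirm that a new node has taken a share of the deployment, the trained model statistics API can be checked; the per-node breakdown should appear under deployment_stats (MODEL_NAME is a placeholder, as in the examples above):

GET _ml/trained_models/MODEL_NAME/_stats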

3. Work with smaller batches
If using bulk upload, avoid large requests that will fill the queue or exceed the queue size.
If using reindex, use the size parameter to reduce the batch size.
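For example, a reindex that sends documents through an inference ingest pipeline could reduce the batch size via source.size (the index and pipeline names here are illustrative):

POST _reindex
{
  "source": {
    "index": "my-source-index",
    "size": 200
  },
  "dest": {
    "index": "my-dest-index",
    "pipeline": "my-inference-pipeline"
  }
}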

4. Increase the queue size
If the input is bursty, increasing the queue size can help absorb those peaks. This will not help if a high input rate is sustained.

POST _ml/trained_models/MODEL_NAME/deployment/_start?queue_capacity=2000
@davidkyle davidkyle added >enhancement :ml Machine learning labels Mar 24, 2022
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 24, 2022
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@Winterflower

As an end-user (who has, admittedly, not done a deep dive into the documentation), I find the fact that we have two "performance tuning knobs", model_threads and inference_threads, a bit confusing. Do I get better performance putting all of my vCPUs into one or the other, or maybe a combo of both? I'm assuming the answer to this is going to depend on the task, but it would be good to have some guidance.

The value should not be greater than the number of physical CPU cores on the machine.
I assume that model_threads + inference_threads = "total vCPUs"?

@dimitris-athanasiou
Contributor

dimitris-athanasiou commented Mar 24, 2022

@Winterflower The two knobs model_threads and inference_threads improve throughput and latency respectively. The approach would be:

  1. Check latency. If happy with the current latency, go to step 3.
  2. If improved latency is needed, increase the number of inference_threads. The improvement should be close to linear provided available cores exist (which is not guaranteed if multiple models are deployed with the current way we do allocation, but we're working on it). Restart the deployment. Note that while inference_threads targets latency, with decreased latency an improvement in throughput is probable too. However, increasing model_threads will not affect latency at all.
  3. Check throughput. If improved throughput is needed, increase model_threads. Restart the deployment. In the future it will be possible to change model_threads without restarting the deployment (but not inference_threads).

A final note. Scaling latency has a ceiling. It cannot go beyond what a single node can offer in terms of CPU cores. Scaling throughput is unbounded, as long as more nodes are added.
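Putting that together, a deployment could be restarted with both settings, keeping within the node's physical cores as noted above (the values below are illustrative, and this assumes both parameters are accepted by _start, as in the earlier examples):

POST _ml/trained_models/MODEL_NAME/deployment/_stop?force=true

POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=2&model_threads=2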
