NLP model deployments have a queue on each node to buffer inference requests while the work is processed. Once the queue is full, any new requests are rejected with the error message "inference process queue is full. Unable to execute command".
An enhancement would be to automatically detect this scenario and either adjust to accommodate the high input rate or suggest ways for the user to resolve the problem, perhaps in the UI.
Currently there are a number of strategies to resolve the problem:
1. Increase the number of inference_threads and/or model_threads
This increases the amount of CPU the PyTorch process can use. The value should not be greater than the number of physical CPU cores on the machine.
Use the _stop and _start APIs to update the setting.
POST _ml/trained_models/MODEL_NAME/deployment/_stop?force=true
POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=4
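If both settings need to change, the two parameters can be passed on the same _start call; the thread counts below are illustrative, not recommendations.
POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=2&model_threads=2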
2. Add another ML node.
The NLP model will automatically be deployed to new ML nodes when they join the cluster; the new node will take a share of the work, increasing throughput in the cluster.
3. Work with smaller batches
If using bulk upload, avoid large requests that will fill the queue or exceed its capacity.
If using reindex, use the size parameter to reduce the batch size.
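As a sketch, a reindex request that reduces the batch size via the size field in the source (the index and ingest pipeline names here are hypothetical placeholders):
POST _reindex
{
  "source": { "index": "my-source", "size": 100 },
  "dest": { "index": "my-dest", "pipeline": "my-nlp-pipeline" }
}
Smaller batches mean each scroll page sends fewer documents to the inference queue at once.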
4. Increase the queue size
If the input is bursty, increasing the queue size can help absorb those peaks. This will not help if a high input rate is sustained.
POST _ml/trained_models/MODEL_NAME/deployment/_start?queue_capacity=2000
As an end-user (who has, admittedly, not done a deep dive into the documentation), I find the fact that we have two "performance tuning knobs", model_threads and inference_threads, a bit confusing. Do I get better performance putting all of my vCPUs into one or the other, or maybe a combination of both? I'm assuming the answer is going to depend on the task, but it would be good to have some guidance.
The value should not be greater than the number of physical CPU cores on the machine.
I assume that model_threads + inference_threads = "total vCPUs"?
@Winterflower The two knobs model_threads and inference_threads improve throughput and latency respectively. The approach would be:
1. Check latency. If happy with the current latency, go to step 3.
2. If improved latency is needed, increase the number of inference_threads and restart the deployment. The improvement should be close to linear provided spare cores exist (which is not guaranteed if multiple models are deployed with the current way we do allocation, but we're working on it). Note that while inference_threads targets latency, decreased latency will probably improve throughput too. However, increasing model_threads will not affect latency at all.
3. Check throughput. If improved throughput is needed, increase model_threads and restart the deployment. In the future it will be possible to change model_threads without restarting the deployment (but not inference_threads).
A final note. Scaling latency has a ceiling. It cannot go beyond what a single node can offer in terms of CPU cores. Scaling throughput is unbounded, as long as more nodes are added.
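Putting the two knobs together as a sketch (the thread counts are illustrative, not recommendations):
To target latency (step 2), restart the deployment with more inference_threads:
POST _ml/trained_models/MODEL_NAME/deployment/_stop?force=true
POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=4
If throughput is still a bottleneck (step 3), restart again with more model_threads:
POST _ml/trained_models/MODEL_NAME/deployment/_stop?force=true
POST _ml/trained_models/MODEL_NAME/deployment/_start?inference_threads=4&model_threads=2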