@@ -30,20 +30,20 @@ in an ingest pipeline or directly in the <<infer-trained-model>> API.
Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.

- Increasing `threads_per_allocation` means more threads are used when
- an inference request is processed on a node. This can improve inference speed
- for certain models. It may also result in improvement to throughput.
+ Increasing `threads_per_allocation` means more threads are used when an
+ inference request is processed on a node. This can improve inference speed for
+ certain models. It may also improve throughput.

- Increasing `number_of_allocations` means more threads are used to
- process multiple inference requests in parallel resulting in throughput
- improvement. Each model allocation uses a number of threads defined by
+ Increasing `number_of_allocations` means more threads are used to process
+ multiple inference requests in parallel, which improves throughput.
+ Each model allocation uses a number of threads defined by
`threads_per_allocation`.

- Model allocations are distributed across {ml} nodes. All allocations assigned
- to a node share the same copy of the model in memory. To avoid
- thread oversubscription which is detrimental to performance, model allocations
- are distributed in such a way that the total number of used threads does not
- surpass the node's allocated processors.
+ Model allocations are distributed across {ml} nodes. All allocations assigned to
+ a node share the same copy of the model in memory. To avoid thread
+ oversubscription, which is detrimental to performance, model allocations are
+ distributed in such a way that the total number of used threads does not surpass
+ the node's allocated processors.

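As a worked illustration of how these two settings combine: a deployment with
two allocations of four threads each uses at most 2 × 4 = 8 threads in total,
and each allocation can only be scheduled on a node with at least four unused
allocated processors. The request below is a minimal sketch, assuming a model
with the placeholder ID `my_model` has already been imported and that both
settings are passed as query parameters, as described in the parameter
descriptions below:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=2&threads_per_allocation=4
--------------------------------------------------

A combined sketch covering the remaining parameters appears after the
parameter descriptions below.
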
[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}
@@ -57,33 +57,36 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

`cache_size`::
(Optional, <<byte-units,byte value>>)
- The inference cache size (in memory outside the JVM heap) per node for the model.
- The default value is the same size as the `model_size_bytes`. To disable the cache, `0b` can be provided.
+ The inference cache size (in memory outside the JVM heap) per node for the
+ model. The default value is the size of the model as reported by the
+ `model_size_bytes` field in the <<get-trained-models-stats>>. To disable the
+ cache, `0b` can be provided.

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
- Increasing this value generally increases the throughput.
- Defaults to 1.
+ Increasing this value generally increases the throughput. Defaults to 1.

`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every machine learning node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
- new requests are rejected with a 429 error. Defaults to 1024. Max allowed value is 1000000.
+ new requests are rejected with a 429 error. Defaults to 1024. Max allowed value
+ is 1000000.

`threads_per_allocation`::
(Optional, integer)
- Sets the number of threads used by each model allocation during inference. This generally increases
- the speed per inference request. The inference process is a compute-bound process;
- `threads_per_allocations` must not exceed the number of available allocated processors per node.
- Defaults to 1. Must be a power of 2. Max allowed value is 32.
+ Sets the number of threads used by each model allocation during inference. This
+ generally increases the speed per inference request. The inference process is a
+ compute-bound process; `threads_per_allocation` must not exceed the number of
+ available allocated processors per node. Defaults to 1. Must be a power of 2.
+ Max allowed value is 32.

`timeout`::
(Optional, time)
- Controls the amount of time to wait for the model to deploy. Defaults
- to 20 seconds.
+ Controls the amount of time to wait for the model to deploy. Defaults to 20
+ seconds.

`wait_for`::
(Optional, string)
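
Tying the descriptions above together, the following request is a hedged,
end-to-end sketch: the model ID `my_model` is again a placeholder, and the
`wait_for=started` value is an assumption about which allocation status to
wait for. It starts a deployment with the inference cache disabled, a doubled
queue, and a one-minute timeout:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?cache_size=0b&queue_capacity=2048&timeout=1m&wait_for=started
--------------------------------------------------

With these settings each eligible {ml} node queues up to 2048 requests before
rejecting new ones with a 429 error, and the call stops waiting if the
deployment has not started within one minute.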