
Commit c86e65f

Fix merge with main
1 parent 1da576b commit c86e65f

File tree

6 files changed: +1048 −17 lines changed


Diff for: src/deepsparse/benchmark/README.md

+172 −2
@@ -14,6 +14,176 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# DeepSparse Benchmarking

## 📜 Benchmarking ONNX Models

[Check out the DeepSparse Benchmarking User Guide for usage details](../../../docs/user-guide/deepsparse-benchmarking.md)

`deepsparse.benchmark` is a command-line (CLI) tool for benchmarking the DeepSparse Engine with ONNX models. The tool parses the arguments, downloads/compiles the network into the engine, generates input tensors, and executes the model according to the chosen scenario. By default, it chooses a multi-stream or asynchronous mode to optimize for throughput.

### Quickstart

After `pip install deepsparse`, the benchmark tool is available on your CLI. The model path is the only required input, so to benchmark a dense BERT ONNX model fine-tuned on the SST2 dataset, run:

```
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none
```
__ __

### Usage

In most cases, the default options deliver good performance, so it can be as simple as running the command with a SparseZoo model stub or your local ONNX model. However, if you prefer to customize benchmarking for your use case, run `deepsparse.benchmark -h` or `--help` to view the usage options:

CLI Arguments:
```
positional arguments:
  model_path            Path to an ONNX model file or SparseZoo model stub.

optional arguments:
  -h, --help            show this help message and exit.
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        The batch size to run the analysis for. Must be
                        greater than 0.
  -shapes INPUT_SHAPES, --input_shapes INPUT_SHAPES
                        Override the shapes of the inputs, i.e. -shapes
                        "[1,2,3],[4,5,6],[7,8,9]" results in input0=[1,2,3]
                        input1=[4,5,6] input2=[7,8,9].
  -ncores NUM_CORES, --num_cores NUM_CORES
                        The number of physical cores to run the analysis on,
                        defaults to all physical cores available on the system.
  -s {async,sync,elastic}, --scenario {async,sync,elastic}
                        Choose between using the async, sync and elastic
                        scenarios. Sync and async are similar to the single-
                        stream/multi-stream scenarios. Elastic is a newer
                        scenario that behaves similarly to the async scenario
                        but uses a different scheduling backend. Default value
                        is async.
  -t TIME, --time TIME  The number of seconds the benchmark will run. Default
                        is 10 seconds.
  -w WARMUP_TIME, --warmup_time WARMUP_TIME
                        The number of seconds the benchmark will warmup before
                        running. Default is 2 seconds.
  -nstreams NUM_STREAMS, --num_streams NUM_STREAMS
                        The number of streams that will submit inferences in
                        parallel using async scenario. Default is
                        automatically determined for given hardware and may be
                        sub-optimal.
  -pin {none,core,numa}, --thread_pinning {none,core,numa}
                        Enable binding threads to cores ('core' the default),
                        threads to cores on sockets ('numa'), or disable
                        ('none').
  -e {deepsparse,onnxruntime}, --engine {deepsparse,onnxruntime}
                        Inference engine backend to run eval on. Choices are
                        'deepsparse', 'onnxruntime'. Default is 'deepsparse'.
  -q, --quiet           Lower logging verbosity.
  -x EXPORT_PATH, --export_path EXPORT_PATH
                        Store results into a JSON file.
```

💡**PRO TIP**💡: save your benchmark results in a convenient JSON file!

Example CLI command for benchmarking an ONNX model from the SparseZoo and saving the results to a `benchmark.json` file:

```
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none -x benchmark.json
```

Output of the JSON file:

![alt text](./img/json_output.png)
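
Once exported, the results can be post-processed in Python. Here is a minimal sketch; it assumes the `benchmark.json` path from the command above and makes no assumptions about the exact field names, so inspect the file (or the image above) to see what is available:

```python
import json

# Load the exported benchmark results (path taken from the -x argument above).
with open("benchmark.json") as f:
    results = json.load(f)

# Pretty-print whatever fields the export contains.
print(json.dumps(results, indent=2))
```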

#### Sample CLI Argument Configurations

To run a sparse FP32 MobileNetV1 at batch size 16 for 10 seconds for throughput using 8 streams of requests:

```
deepsparse.benchmark zoo:cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned-moderate --batch_size 16 --time 10 --scenario async --num_streams 8
```

To run a sparse quantized INT8 6-layer BERT at batch size 1 for latency:

```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant_6layers-aggressive_96 --batch_size 1 --scenario sync
```
__ __

### ⚡ Inference Scenarios

#### Synchronous (Single-stream) Scenario

Set by the `--scenario sync` argument, the goal metric is latency per batch (ms/batch). This scenario submits a single inference request at a time to the engine, recording the time taken for a request to return an output. This mimics an edge deployment scenario.

The latency value reported is the mean of all latencies recorded during the execution period for the given batch size.
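
For a batch size of 1, throughput follows directly from the mean latency, since one item completes per request. A small illustrative check against the sync example output further below (the variable name is illustrative, not the tool's internals):

```python
# Illustrative check: for batch size 1, items/sec ≈ 1000 / mean latency (ms).
# The 16.0732 ms figure is taken from the sync example output below.
latency_mean_ms = 16.0732
print(1000 / latency_mean_ms)  # ≈ 62.2 items/sec, close to the reported 62.1568
```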

#### Asynchronous (Multi-stream) Scenario

Set by the `--scenario async` argument, the goal metric is throughput in items per second (i/s). This scenario submits `--num_streams` concurrent inference requests to the engine, recording the time taken for each request to return an output. This mimics a model server or bulk batch deployment scenario.

The throughput value reported comes from measuring the number of finished inferences within the execution time and the batch size.
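
In other words, throughput is roughly the number of completed iterations times the batch size, divided by the measurement window. A rough sketch using the numbers from the async example below (illustrative only; the engine's internal accounting may differ slightly because of warmup and in-flight requests):

```python
# Rough sanity check of the reported async throughput.
# Numbers taken from the async example output below.
iterations = 840    # completed inference requests
batch_size = 1      # items per request
run_seconds = 10    # --time value

print(iterations * batch_size / run_seconds)  # ≈ 84 items/sec vs. the reported 83.5
```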

#### Example Benchmarking Output of Synchronous vs. Asynchronous

**BERT 3-layer FP32 Sparse Throughput**

No need to add a *scenario* argument since `async` is the default option:
```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (9bba6971) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
    onnx_file_path: /home/mgoin/.cache/sparsezoo/c89f3128-4b87-41ae-91a3-eae8aa8c5a7c/model.onnx
    batch_size: 1
    num_cores: 18
    scheduler: Scheduler.multi_stream
    cpu_avx_type: avx512
    cpu_vnni: False
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 384]
[INFO benchmark_model.py:264 ] num_streams default value chosen of 9. This requires tuning and may be sub-optimal
[INFO benchmark_model.py:270 ] Starting 'async' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
Batch Size: 1
Scenario: multistream
Throughput (items/sec): 83.5037
Latency Mean (ms/batch): 107.3422
Latency Median (ms/batch): 107.0099
Latency Std (ms/batch): 12.4016
Iterations: 840
```

**BERT 3-layer FP32 Sparse Latency**

To select a *synchronous inference scenario*, add `-s sync`:

```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83 -s sync
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (9bba6971) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
    onnx_file_path: /home/mgoin/.cache/sparsezoo/c89f3128-4b87-41ae-91a3-eae8aa8c5a7c/model.onnx
    batch_size: 1
    num_cores: 18
    scheduler: Scheduler.single_stream
    cpu_avx_type: avx512
    cpu_vnni: False
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 384]
[INFO benchmark_model.py:270 ] Starting 'sync' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
Batch Size: 1
Scenario: singlestream
Throughput (items/sec): 62.1568
Latency Mean (ms/batch): 16.0732
Latency Median (ms/batch): 15.7850
Latency Std (ms/batch): 1.0427
Iterations: 622
```

Diff for: src/deepsparse/image_classification/README.md

+198 −2
@@ -1,3 +1,199 @@
# Image Classification Use Case

# Image Classification Inference Pipelines

[Check out DeepSparse Use Cases for usage details](../../../docs/use-cases/cv/image-classification.md)

[DeepSparse] Image Classification integration allows accelerated inference, serving, and benchmarking of sparsified image classification models. This integration allows for leveraging the DeepSparse Engine to run sparsified image classification inference with GPU-class performance directly on the CPU.

The DeepSparse Engine takes advantage of sparsity within neural networks to reduce compute as well as accelerate memory-bound workloads. The Engine is particularly effective when leveraging sparsification methods such as [pruning](https://neuralmagic.com/blog/pruning-overview/) and [quantization](https://arxiv.org/abs/1609.07061). These techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics.

## Getting Started

Before you start your adventure with the DeepSparse Engine, make sure that your machine is compatible with our [hardware requirements].

### Installation

```pip install deepsparse```

### Model Format

By default, to deploy image classification models using the DeepSparse Engine, the model should be supplied in the [ONNX] format. This grants the Engine the flexibility to serve any model in a framework-agnostic manner.

Below we describe two possibilities to obtain the required ONNX model.

#### Exporting the ONNX file from the contents of a local checkpoint

This pathway is relevant if you intend to deploy a model created using the [SparseML] library. For more information, refer to the appropriate integration documentation in [SparseML].

1. The output of the [SparseML] training is saved to an output directory `/{save_dir}` (e.g. `/trained_model`)
2. Depending on the chosen framework, the model files are saved to `model_path`=`/{save_dir}/{framework_name}/{model_tag}` (e.g. `/trained_model/pytorch/resnet50/`)
3. To generate an ONNX model, refer to the [script for image classification ONNX export](https://github.com/neuralmagic/sparseml/blob/main/src/sparseml/pytorch/image_classification/export.py).

Example:
```bash
sparseml.image_classification.export_onnx \
    --arch-key resnet50 \
    --dataset imagenet \
    --dataset-path ~/datasets/ILSVRC2012 \
    --checkpoint-path ~/checkpoints/resnet50_checkpoint.pth
```
This creates a `model.onnx` file in the parent directory of your `model_path`.

#### Directly using the SparseZoo stub

Alternatively, you can skip the ONNX model export by downloading all the required model data directly from Neural Magic's [SparseZoo](https://sparsezoo.neuralmagic.com/).
Example:
```python
import os

from sparsezoo import Model

# you can look up an appropriate model stub here: https://sparsezoo.neuralmagic.com/
model_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none"
model = Model(model_stub)

# directly download the model data to your local directory
model_path = model.path

# the onnx model file is there, ready for deployment
print(os.path.isfile(model.onnx_model.path))  # True
```

## Deployment APIs

DeepSparse provides both a Python Pipeline API and an out-of-the-box model server that can be used for end-to-end inference in either existing Python workflows or as an HTTP endpoint. Both options provide similar specifications for configurations and support a variety of Image Classification models.

### Python API

Pipelines are the default interface for running inference with the DeepSparse Engine.

Once a model is obtained, either through [SparseML] training or directly from the [SparseZoo], `deepsparse.Pipeline` can be used to easily facilitate end-to-end inference and deployment of the sparsified image classification model.

If no model is specified to the `Pipeline` for a given task, the `Pipeline` will automatically select a pruned and quantized model for the task from the `SparseZoo` that can be used for accelerated inference. Note that other models in the [SparseZoo] will have different tradeoffs between speed, size, and accuracy. A short example of this default selection is shown below.
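
The sketch below relies on that default model selection; it assumes network access to the SparseZoo, and `my_image.png` is a placeholder for a local image:

```python
from deepsparse import Pipeline

# No model_path given: the Pipeline pulls a default pruned + quantized
# image classification model for the task from the SparseZoo.
cv_pipeline = Pipeline.create(task="image_classification")

inference = cv_pipeline(images="my_image.png")  # path to a local input image
print(inference)
```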

To learn about sparsification in more detail, refer to the [SparseML docs](https://docs.neuralmagic.com/sparseml/).

### HTTP Server

As an alternative to the Python API, the DeepSparse inference server allows you to serve ONNX models and pipelines over HTTP. Both configuring and making requests to the server follow the same parameters and schemas as the Pipelines, enabling simple deployment. Once launched, a `/docs` endpoint is created with full endpoint descriptions and support for making sample requests.

An example deployment using a 95% pruned ResNet-50 is given below. For full documentation on deploying sparse image classification models with the DeepSparse Server, see the [documentation](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/server).

##### Installation

The deepsparse server requirements can be installed by specifying the `server` extra dependency when installing DeepSparse.

```bash
pip install deepsparse[server]
```

## Deployment Use Cases

The following section includes example usage of the Pipeline and server APIs for various image classification models.

[List of Image Classification SparseZoo Models](https://sparsezoo.neuralmagic.com/?domain=cv&sub_domain=classification&page=1)

#### Python Pipeline

```python
from deepsparse import Pipeline

cv_pipeline = Pipeline.create(
    task='image_classification',
    model_path='zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none',  # Path to checkpoint or SparseZoo stub
)

input_image = "my_image.png"  # path to input image
inference = cv_pipeline(images=input_image)
```

#### HTTP Server

Spinning up:
```bash
deepsparse.server \
    task image_classification \
    --model_path "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none" \
    --port 5543
```

Making a request:
```python
import requests

url = 'http://0.0.0.0:5543/predict/from_files'
path = ['goldfish.jpeg']  # just put the names of the images in here
files = [('request', open(img, 'rb')) for img in path]
resp = requests.post(url=url, files=files)
```
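
The classification results come back in the response body. A minimal, hedged way to inspect them (the exact response schema is documented at the server's `/docs` endpoint):

```python
# Inspect the raw response; see the /docs endpoint for the exact schema.
print(resp.status_code)
print(resp.text)
```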

### Benchmarking

The mission of Neural Magic is to enable GPU-class inference performance on commodity CPUs. Want to find out how fast our sparse ONNX models perform inference? You can quickly do benchmarking tests on your own with a single CLI command!

You only need to provide the model path of a SparseZoo ONNX model or your own local ONNX model to get started:
```bash
deepsparse.benchmark zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none
```
Output:
```bash
Original Model Path: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none
Batch Size: 1
Scenario: async
Throughput (items/sec): 299.2372
Latency Mean (ms/batch): 16.6677
Latency Median (ms/batch): 16.6748
Latency Std (ms/batch): 0.1728
Iterations: 2995
```

To learn more about benchmarking, refer to the appropriate documentation. Also, check out our [Benchmarking tutorial](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/benchmark)!

## Tutorials:
For a deeper dive into using image classification models within the Neural Magic ecosystem, refer to the detailed tutorials on our [website](https://neuralmagic.com/):
- [CV Use Cases](https://neuralmagic.com/use-cases/#computervision)

## Support
For Neural Magic Support, sign up or log in to our [Deep Sparse Community Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). Bugs, feature requests, or additional questions can also be posted to our [GitHub Issue Queue](https://github.com/neuralmagic/deepsparse/issues).

[DeepSparse]: https://github.com/neuralmagic/deepsparse
[hardware requirements]: https://docs.neuralmagic.com/deepsparse/source/hardware.html
[ONNX]: https://onnx.ai/
[SparseML]: https://github.com/neuralmagic/sparseml
[SparseML Image Classification Documentation]: https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/pytorch/image_classification/README_image_classification.md
[SparseZoo]: https://sparsezoo.neuralmagic.com/

Diff for: src/deepsparse/server/README.md

+1
@@ -152,3 +152,4 @@ All you need is to add `/docs` at the end of your host URL:
localhost:5543/docs

![alt text](./img/swagger_ui.png)
