
Commit c86e65f

Fix merge with main
1 parent 1da576b commit c86e65f

File tree

6 files changed: +1048 −17 lines changed


Diff for: src/deepsparse/benchmark/README.md

+172 −2
@@ -14,6 +14,176 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# DeepSparse Benchmarking

## 📜 Benchmarking ONNX Models

[Check out the DeepSparse Benchmarking User Guide for usage details](../../../docs/user-guide/deepsparse-benchmarking.md)

`deepsparse.benchmark` is a command-line (CLI) tool for benchmarking the DeepSparse Engine with ONNX models. The tool parses the arguments, downloads/compiles the network into the engine, generates input tensors, and executes the model according to the chosen scenario. By default, it chooses a multi-stream or asynchronous mode to optimize for throughput.

### Quickstart

After `pip install deepsparse`, the benchmark tool is available on your CLI. The model path is the only required input, so to benchmark a dense BERT ONNX model fine-tuned on the SST2 dataset, run:

```
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none
```
__ __

### Usage

In most cases, the default options deliver good performance, so it can be as simple as running the command with a SparseZoo model stub or your local ONNX model. However, if you prefer to customize benchmarking for your use case, run `deepsparse.benchmark -h` or `--help` to view the usage options:

CLI Arguments:
```
positional arguments:
  model_path            Path to an ONNX model file or SparseZoo model stub.

optional arguments:
  -h, --help            show this help message and exit.
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        The batch size to run the analysis for. Must be
                        greater than 0.
  -shapes INPUT_SHAPES, --input_shapes INPUT_SHAPES
                        Override the shapes of the inputs, i.e. -shapes
                        "[1,2,3],[4,5,6],[7,8,9]" results in input0=[1,2,3]
                        input1=[4,5,6] input2=[7,8,9].
  -ncores NUM_CORES, --num_cores NUM_CORES
                        The number of physical cores to run the analysis on,
                        defaults to all physical cores available on the system.
  -s {async,sync,elastic}, --scenario {async,sync,elastic}
                        Choose between using the async, sync and elastic
                        scenarios. Sync and async are similar to the single-
                        stream/multi-stream scenarios. Elastic is a newer
                        scenario that behaves similarly to the async scenario
                        but uses a different scheduling backend. Default value
                        is async.
  -t TIME, --time TIME  The number of seconds the benchmark will run. Default
                        is 10 seconds.
  -w WARMUP_TIME, --warmup_time WARMUP_TIME
                        The number of seconds the benchmark will warmup before
                        running. Default is 2 seconds.
  -nstreams NUM_STREAMS, --num_streams NUM_STREAMS
                        The number of streams that will submit inferences in
                        parallel using async scenario. Default is
                        automatically determined for given hardware and may be
                        sub-optimal.
  -pin {none,core,numa}, --thread_pinning {none,core,numa}
                        Enable binding threads to cores ('core' the default),
                        threads to cores on sockets ('numa'), or disable
                        ('none').
  -e {deepsparse,onnxruntime}, --engine {deepsparse,onnxruntime}
                        Inference engine backend to run eval on. Choices are
                        'deepsparse', 'onnxruntime'. Default is 'deepsparse'.
  -q, --quiet           Lower logging verbosity.
  -x EXPORT_PATH, --export_path EXPORT_PATH
                        Store results into a JSON file.
```

💡**PRO TIP**💡: save your benchmark results in a convenient JSON file!

Example CLI command for benchmarking an ONNX model from the SparseZoo and saving the results to a `benchmark.json` file:

```
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none -x benchmark.json
```

Output of the JSON file:

![alt text](./img/json_output.png)
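
Once exported, the results can be post-processed in Python. Here is a minimal sketch; it assumes the `benchmark.json` path from the command above and makes no assumptions about the exact field names, so inspect the file (or the image above) to see what is available:

```python
import json

# Load the exported benchmark results (path taken from the -x argument above).
with open("benchmark.json") as f:
    results = json.load(f)

# Pretty-print whatever fields the export contains.
print(json.dumps(results, indent=2))
```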

#### Sample CLI Argument Configurations

To run a sparse FP32 MobileNetV1 at batch size 16 for 10 seconds for throughput using 8 streams of requests:

```
deepsparse.benchmark zoo:cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned-moderate --batch_size 16 --time 10 --scenario async --num_streams 8
```

To run a sparse quantized INT8 6-layer BERT at batch size 1 for latency:

```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant_6layers-aggressive_96 --batch_size 1 --scenario sync
```
__ __

### ⚡ Inference Scenarios

#### Synchronous (Single-stream) Scenario

Set by the `--scenario sync` argument, the goal metric is latency per batch (ms/batch). This scenario submits a single inference request at a time to the engine, recording the time taken for a request to return an output. This mimics an edge deployment scenario.

The latency value reported is the mean of all latencies recorded during the execution period for the given batch size.
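
For a batch size of 1, throughput follows directly from the mean latency, since one item completes per request. A small illustrative check against the sync example output further below (the variable name is illustrative, not the tool's internals):

```python
# Illustrative check: for batch size 1, items/sec ≈ 1000 / mean latency (ms).
# The 16.0732 ms figure is taken from the sync example output below.
latency_mean_ms = 16.0732
print(1000 / latency_mean_ms)  # ≈ 62.2 items/sec, close to the reported 62.1568
```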

#### Asynchronous (Multi-stream) Scenario

Set by the `--scenario async` argument, the goal metric is throughput in items per second (i/s). This scenario submits `--num_streams` concurrent inference requests to the engine, recording the time taken for each request to return an output. This mimics a model server or bulk batch deployment scenario.

The throughput value reported comes from measuring the number of finished inferences within the execution time and the batch size.
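
In other words, throughput is roughly the number of completed iterations times the batch size, divided by the measurement window. A rough sketch using the numbers from the async example below (illustrative only; the engine's internal accounting may differ slightly because of warmup and in-flight requests):

```python
# Rough sanity check of the reported async throughput.
# Numbers taken from the async example output below.
iterations = 840    # completed inference requests
batch_size = 1      # items per request
run_seconds = 10    # --time value

print(iterations * batch_size / run_seconds)  # ≈ 84 items/sec vs. the reported 83.5
```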

#### Example Benchmarking Output of Synchronous vs. Asynchronous

**BERT 3-layer FP32 Sparse Throughput**

No need to add a *scenario* argument since `async` is the default option:
```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (9bba6971) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
    onnx_file_path: /home/mgoin/.cache/sparsezoo/c89f3128-4b87-41ae-91a3-eae8aa8c5a7c/model.onnx
    batch_size: 1
    num_cores: 18
    scheduler: Scheduler.multi_stream
    cpu_avx_type: avx512
    cpu_vnni: False
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 384]
[INFO benchmark_model.py:264 ] num_streams default value chosen of 9. This requires tuning and may be sub-optimal
[INFO benchmark_model.py:270 ] Starting 'async' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
Batch Size: 1
Scenario: multistream
Throughput (items/sec): 83.5037
Latency Mean (ms/batch): 107.3422
Latency Median (ms/batch): 107.0099
Latency Std (ms/batch): 12.4016
Iterations: 840
```

**BERT 3-layer FP32 Sparse Latency**

To select a *synchronous inference scenario*, add `-s sync`:

```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83 -s sync
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (9bba6971) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
    onnx_file_path: /home/mgoin/.cache/sparsezoo/c89f3128-4b87-41ae-91a3-eae8aa8c5a7c/model.onnx
    batch_size: 1
    num_cores: 18
    scheduler: Scheduler.single_stream
    cpu_avx_type: avx512
    cpu_vnni: False
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 384]
[INFO benchmark_model.py:270 ] Starting 'sync' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
Batch Size: 1
Scenario: singlestream
Throughput (items/sec): 62.1568
Latency Mean (ms/batch): 16.0732
Latency Median (ms/batch): 15.7850
Latency Std (ms/batch): 1.0427
Iterations: 622
```

Diff for: src/deepsparse/image_classification/README.md

+198 −2
@@ -1,3 +1,199 @@
# Image Classification Use Case

# Image Classification Inference Pipelines

[Check out DeepSparse Use Cases for usage details](../../../docs/use-cases/cv/image-classification.md)

[DeepSparse] Image Classification integration allows accelerated inference, serving, and benchmarking of sparsified image classification models. This integration allows for leveraging the DeepSparse Engine to run sparsified image classification inference with GPU-class performance directly on the CPU.

The DeepSparse Engine takes advantage of sparsity within neural networks to reduce compute as well as accelerate memory-bound workloads. The Engine is particularly effective when leveraging sparsification methods such as [pruning](https://neuralmagic.com/blog/pruning-overview/) and [quantization](https://arxiv.org/abs/1609.07061). These techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics.

## Getting Started

Before you start your adventure with the DeepSparse Engine, make sure that your machine is compatible with our [hardware requirements].

### Installation

```pip install deepsparse```

### Model Format

By default, to deploy image classification models using the DeepSparse Engine, the model should be supplied in the [ONNX] format. This grants the Engine the flexibility to serve any model in a framework-agnostic manner.

Below we describe two possibilities to obtain the required ONNX model.

#### Exporting the ONNX file from the contents of a local checkpoint

This pathway is relevant if you intend to deploy a model created using the [SparseML] library. For more information, refer to the appropriate integration documentation in [SparseML].

1. The output of the [SparseML] training is saved to an output directory `/{save_dir}` (e.g. `/trained_model`)
2. Depending on the chosen framework, the model files are saved to `model_path`=`/{save_dir}/{framework_name}/{model_tag}` (e.g. `/trained_model/pytorch/resnet50/`)
3. To generate an ONNX model, refer to the [script for image classification ONNX export](https://github.com/neuralmagic/sparseml/blob/main/src/sparseml/pytorch/image_classification/export.py).

Example:
```bash
sparseml.image_classification.export_onnx \
    --arch-key resnet50 \
    --dataset imagenet \
    --dataset-path ~/datasets/ILSVRC2012 \
    --checkpoint-path ~/checkpoints/resnet50_checkpoint.pth
```
This creates a `model.onnx` file in the parent directory of your `model_path`.

#### Directly using the SparseZoo stub

Alternatively, you can skip the ONNX model export by downloading all the required model data directly from Neural Magic's [SparseZoo](https://sparsezoo.neuralmagic.com/).
Example:
```python
import os

from sparsezoo import Model

# you can look up an appropriate model stub here: https://sparsezoo.neuralmagic.com/
model_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none"
model = Model(model_stub)

# directly download the model data to your local directory
model_path = model.path

# the onnx model file is there, ready for deployment
print(os.path.isfile(model.onnx_model.path))  # True
```

## Deployment APIs

DeepSparse provides both a Python Pipeline API and an out-of-the-box model server that can be used for end-to-end inference in either existing Python workflows or as an HTTP endpoint. Both options provide similar specifications for configurations and support a variety of Image Classification models.

### Python API

Pipelines are the default interface for running inference with the DeepSparse Engine.

Once a model is obtained, either through [SparseML] training or directly from the [SparseZoo], `deepsparse.Pipeline` can be used to easily facilitate end-to-end inference and deployment of the sparsified image classification model.

If no model is specified to the `Pipeline` for a given task, the `Pipeline` will automatically select a pruned and quantized model for the task from the `SparseZoo` that can be used for accelerated inference. Note that other models in the [SparseZoo] will have different tradeoffs between speed, size, and accuracy. A short example of this default selection is shown below.
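
The sketch below relies on that default model selection; it assumes network access to the SparseZoo, and `my_image.png` is a placeholder for a local image:

```python
from deepsparse import Pipeline

# No model_path given: the Pipeline pulls a default pruned + quantized
# image classification model for the task from the SparseZoo.
cv_pipeline = Pipeline.create(task="image_classification")

inference = cv_pipeline(images="my_image.png")  # path to a local input image
print(inference)
```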

To learn about sparsification in more detail, refer to the [SparseML docs](https://docs.neuralmagic.com/sparseml/).

### HTTP Server

As an alternative to the Python API, the DeepSparse inference server allows you to serve ONNX models and pipelines over HTTP. Both configuring and making requests to the server follow the same parameters and schemas as the Pipelines, enabling simple deployment. Once launched, a `/docs` endpoint is created with full endpoint descriptions and support for making sample requests.

An example deployment using a 95% pruned ResNet-50 is given below. For full documentation on deploying sparse image classification models with the DeepSparse Server, see the [documentation](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/server).

##### Installation

The deepsparse server requirements can be installed by specifying the `server` extra dependency when installing DeepSparse.

```bash
pip install deepsparse[server]
```

## Deployment Use Cases

The following section includes example usage of the Pipeline and server APIs for various image classification models.

[List of Image Classification SparseZoo Models](https://sparsezoo.neuralmagic.com/?domain=cv&sub_domain=classification&page=1)

#### Python Pipeline

```python
from deepsparse import Pipeline

cv_pipeline = Pipeline.create(
    task='image_classification',
    model_path='zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none',  # Path to checkpoint or SparseZoo stub
)

input_image = "my_image.png"  # path to input image
inference = cv_pipeline(images=input_image)
```

#### HTTP Server

Spinning up:
```bash
deepsparse.server \
    task image_classification \
    --model_path "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none" \
    --port 5543
```

Making a request:
```python
import requests

url = 'http://0.0.0.0:5543/predict/from_files'
path = ['goldfish.jpeg']  # just put the names of the images in here
files = [('request', open(img, 'rb')) for img in path]
resp = requests.post(url=url, files=files)
```
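
The classification results come back in the response body. A minimal, hedged way to inspect them (the exact response schema is documented at the server's `/docs` endpoint):

```python
# Inspect the raw response; see the /docs endpoint for the exact schema.
print(resp.status_code)
print(resp.text)
```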

### Benchmarking

The mission of Neural Magic is to enable GPU-class inference performance on commodity CPUs. Want to find out how fast our sparse ONNX models perform inference? You can quickly do benchmarking tests on your own with a single CLI command!

You only need to provide the model path of a SparseZoo ONNX model or your own local ONNX model to get started:
```bash
deepsparse.benchmark zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none
```
Output:
```bash
Original Model Path: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none
Batch Size: 1
Scenario: async
Throughput (items/sec): 299.2372
Latency Mean (ms/batch): 16.6677
Latency Median (ms/batch): 16.6748
Latency Std (ms/batch): 0.1728
Iterations: 2995
```

To learn more about benchmarking, refer to the appropriate documentation. Also, check out our [Benchmarking tutorial](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/benchmark)!

## Tutorials:
For a deeper dive into using image classification models within the Neural Magic ecosystem, refer to the detailed tutorials on our [website](https://neuralmagic.com/):
- [CV Use Cases](https://neuralmagic.com/use-cases/#computervision)

## Support
For Neural Magic Support, sign up or log in to our [Deep Sparse Community Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). Bugs, feature requests, or additional questions can also be posted to our [GitHub Issue Queue](https://github.com/neuralmagic/deepsparse/issues).

[DeepSparse]: https://github.com/neuralmagic/deepsparse
[hardware requirements]: https://docs.neuralmagic.com/deepsparse/source/hardware.html
[ONNX]: https://onnx.ai/
[SparseML]: https://github.com/neuralmagic/sparseml
[SparseML Image Classification Documentation]: https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/pytorch/image_classification/README_image_classification.md
[SparseZoo]: https://sparsezoo.neuralmagic.com/

Diff for: src/deepsparse/server/README.md

+1
@@ -152,3 +152,4 @@ All you need is to add `/docs` at the end of your host URL:
localhost:5543/docs

![alt text](./img/swagger_ui.png)
