
Commit 743bd95

update README instructions; fix linalg import
1 parent c637dbd commit 743bd95

File tree: 3 files changed (+59 / -8 lines changed)

- setup.py
- src/deepsparse/clip/README.md
- src/deepsparse/clip/zeroshot_pipeline.py

Diff for: setup.py (+5 / -1)

@@ -162,7 +162,11 @@ def _parse_requirements_file(file_path):
     "haystack_reqs.txt",
 )
 _haystack_integration_deps = _parse_requirements_file(_haystack_requirements_file_path)
-_clip_deps = ["open_clip_torch==2.20.0", "scipy==1.10.1"]
+_clip_deps = [
+    "open_clip_torch==2.20.0",
+    "scipy==1.10.1",
+    f"{'nm-transformers' if is_release else 'nm-transformers-nightly'}",
+]


 def _check_supported_system():
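
A note on the new `_clip_deps` entry: the added f-string collapses to a single package name depending on `is_release`, a flag defined elsewhere in setup.py and not shown in this hunk. A minimal standalone sketch of that selection, with `is_release` hard-coded purely for illustration:

```python
# Illustrative sketch, not the actual setup.py: shows how the added f-string
# picks between the release and nightly transformers package.
is_release = False  # assumption: in setup.py this flag is derived elsewhere

_clip_deps = [
    "open_clip_torch==2.20.0",
    "scipy==1.10.1",
    # resolves to "nm-transformers" for releases, "nm-transformers-nightly" otherwise
    f"{'nm-transformers' if is_release else 'nm-transformers-nightly'}",
]

print(_clip_deps[-1])  # -> nm-transformers-nightly
```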

Diff for: src/deepsparse/clip/README.md (+51 / -4)

@@ -4,6 +4,7 @@ DeepSparse allows inference on [CLIP](https://github.com/mlfoundations/open_clip
 
 The CLIP integration currently supports the following task:
 - **Zero-shot Image Classification** - Classifying images given possible classes
+- **Caption Generation** - Generate a caption given an image
 
 ## Getting Started
 
@@ -13,20 +14,34 @@ Before you start your adventure with the DeepSparse Engine, make sure that your
 ```pip install deepsparse[clip]```
 
 ### Model Format
-By default, to deploy CLIP models using the DeepSparse Engine, it is required to supply the model in the ONNX format. This grants the engine the flexibility to serve any model in a framework-agnostic environment. To see examples of pulling CLIP models and exporting them to ONNX, please see the [sparseml documentation](https://github.com/neuralmagic/sparseml/tree/main/integrations/clip). For the Zero-shot image classification workflow, two ONNX models are required, a visual model for CLIP's visual branch, and a text model for CLIP's text branch. Both of these model should be produced through the sparseml integration linked above.
+By default, to deploy CLIP models using the DeepSparse Engine, it is required to supply the model in the ONNX format. This grants the engine the flexibility to serve any model in a framework-agnostic environment. To see examples of pulling CLIP models and exporting them to ONNX, please see the [sparseml documentation](https://github.com/neuralmagic/sparseml/tree/main/integrations/clip).
+
+For the Zero-shot image classification workflow, two ONNX models are required, a visual model for CLIP's visual branch, and a text model for CLIP's text branch. Both of these models can be produced through the sparseml integration linked above. For caption generation, specific models called CoCa models are required and instructions on how to export CoCa models are also provided in the sparseml documentation above. The CoCa exporting pathway will generate one additional decoder model, along with the text and visual models.
 
 ### Deployment examples:
-The following example uses pipelines to run the CLIP models for inference. As input, the pipeline ingests a list of images and a list of possible classes. A class is returned for each of the provided images.
+The following example uses pipelines to run the CLIP models for inference. For Zero-shot prediction, the pipeline ingests a list of images and a list of possible classes. A class is returned for each of the provided images. For caption generation, only an image file is required.
 
 If you don't have images ready, pull down the sample images using the following commands:
 
 ```bash
 wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
+```
 
+```bash
 wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
 ```
 
-This will pull down two images, one with a happy dog and one with St.Peter's basilica.
+```bash
+wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
+```
+
+<img width="333" src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg">
+
+<img width="333" src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg">
+
+<img width="333" src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg">
+
+This will pull down 3 images, a happy dog, St.Peter's basilica, and two elephants.
 
 #### Zero-shot Prediction
 
@@ -43,7 +58,7 @@ from deepsparse.clip import (
 )
 
 possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
-images = ["basilica.jpg", "buddy.jpeg"]
+images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]
 
 model_path_text = "zeroshot_research/text/model.onnx"
 model_path_visual = "zeroshot_research/visual/model.onnx"
@@ -72,4 +87,36 @@ DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230727 C
 
 Image basilica.jpg is a picture of a church
 Image buddy.jpeg is a picture of a dog
+Image thailand.jpg is a picture of an elephant
+```
+
+#### Caption Generation
+Let's try a caption generation example. We'll leverage the `thailand.jpg` file that was pulled down earlier.
+
+```python
+from deepsparse import BasePipeline
+from deepsparse.clip import CLIPCaptionInput, CLIPVisualInput
+
+root = "caption_models"
+model_path_visual = f"{root}/clip_visual.onnx"
+model_path_text = f"{root}/clip_text.onnx"
+model_path_decoder = f"{root}/clip_text_decoder.onnx"
+
+kwargs = {
+    "visual_model_path": model_path_visual,
+    "text_model_path": model_path_text,
+    "decoder_model_path": model_path_decoder,
+}
+pipeline = BasePipeline.create(task="clip_caption", **kwargs)
+
+pipeline_input = CLIPCaptionInput(image=CLIPVisualInput(images="thailand.jpg"))
+output = pipeline(pipeline_input)
+print(output[0])
+```
+Running the code above, we get the following caption:
+
+```
+DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230727 COMMUNITY | (3cb4a3e5) (optimized) (system=avx2, binary=avx2)
+
+an adult elephant and a baby elephant .
+```
 ```
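
The hunks above only show fragments of the README's zero-shot example (the import block, the class list, and the two model paths). For context, here is a hedged sketch of how those pieces are typically wired together, mirroring the caption-generation example added in this diff. The task name `"clip_zeroshot"`, the `CLIPZeroShotInput` wrapper, and the `text_scores` output field are assumptions not shown in these hunks; `CLIPTextInput` and `CLIPVisualInput` do appear in the zeroshot_pipeline.py diff below.

```python
# Hedged sketch of the full zero-shot example (the hunks above show only fragments).
# Assumptions not visible in this diff: the "clip_zeroshot" task name, the
# CLIPZeroShotInput wrapper, and the text_scores field on the pipeline output.
import numpy as np

from deepsparse import BasePipeline
from deepsparse.clip import CLIPTextInput, CLIPVisualInput, CLIPZeroShotInput

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Model paths as shown in the hunk above
model_path_text = "zeroshot_research/text/model.onnx"
model_path_visual = "zeroshot_research/visual/model.onnx"

kwargs = {
    "visual_model_path": model_path_visual,
    "text_model_path": model_path_text,
}
pipeline = BasePipeline.create(task="clip_zeroshot", **kwargs)

pipeline_input = CLIPZeroShotInput(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
)
output = pipeline(pipeline_input).text_scores  # one row of class scores per image

for i, image in enumerate(images):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {image} is a picture of {prediction}")
```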

Diff for: src/deepsparse/clip/zeroshot_pipeline.py (+3 / -3)

@@ -15,7 +15,7 @@
 from typing import Any, List, Type
 
 import numpy as np
-from numpy import linalg as la
+from numpy import linalg as lingalg
 from pydantic import BaseModel, Field
 
 from deepsparse.clip import CLIPTextInput, CLIPVisualInput
@@ -82,8 +82,8 @@ def __call__(self, *args, **kwargs):
         visual_output = self.visual(pipeline_inputs.image).image_embeddings[0]
         text_output = self.text(pipeline_inputs.text).text_embeddings[0]
 
-        visual_output /= la.norm(visual_output, axis=-1, keepdims=True)
-        text_output /= la.norm(text_output, axis=-1, keepdims=True)
+        visual_output /= lingalg.norm(visual_output, axis=-1, keepdims=True)
+        text_output /= lingalg.norm(text_output, axis=-1, keepdims=True)
 
         output_product = 100.0 * visual_output @ text_output.T
         text_probs = softmax(output_product, axis=-1)
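
The second hunk is the pipeline's core scoring step: L2-normalize the image and text embeddings, scale the cosine similarities by 100.0 as in the hunk, and softmax over the candidate classes. A minimal standalone numpy sketch of that computation, with illustrative shapes; the `softmax` is assumed to come from scipy.special, since its import sits outside this hunk.

```python
import numpy as np
from numpy import linalg
from scipy.special import softmax  # assumption: the pipeline's softmax import is not shown in the hunk

# Illustrative shapes: 2 images and 5 candidate class prompts, 512-dim embeddings
rng = np.random.default_rng(0)
visual_output = rng.standard_normal((2, 512)).astype(np.float32)
text_output = rng.standard_normal((5, 512)).astype(np.float32)

# L2-normalize each embedding so the dot product below is a cosine similarity
visual_output /= linalg.norm(visual_output, axis=-1, keepdims=True)
text_output /= linalg.norm(text_output, axis=-1, keepdims=True)

# Scale the similarities by 100.0 (as in the hunk) and softmax over classes per image
output_product = 100.0 * visual_output @ text_output.T  # shape (2, 5)
text_probs = softmax(output_product, axis=-1)

print(text_probs.shape)         # (2, 5)
print(text_probs.sum(axis=-1))  # each row sums to 1
```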
