Commit 635b1aa

Llm docs 2 (#1313)
* 1) edited text generation pipeline
* fixed up pages
* Update text-generation-pipeline.md
* Update README.md
1 parent 92c6d54 commit 635b1aa

2 files changed: +73 −60 lines

docs/llms/text-generation-pipeline.md (+33 −23)
@@ -16,14 +16,14 @@ limitations under the License.
 
 # **Text Generation Pipelines**
 
-This user guide describes how to run inference of text generation models with DeepSparse.
+This user guide explains how to run inference of text generation models with DeepSparse.
 
 ## **Installation**
 
-DeepSparse support for LLMs is currently available on DeepSparse's nightly build on PyPi:
+DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPI:
 
 ```bash
-pip install -U deepsparse-nightly==1.6.0.20231007[transformers]
+pip install -U deepsparse-nightly[transformers]==1.6.0.20231007
 ```
 
 #### **System Requirements**
@@ -41,8 +41,8 @@ DeepSparse exposes a Pipeline interface called `TextGeneration`, which is used t
 from deepsparse import TextGeneration
 
 # construct a pipeline
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
+pipeline = TextGeneration(model=model_path)
 
 # generate text
 prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
@@ -52,27 +52,29 @@ print(output.generations[0].text)
 # >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
 ```
 
-> **Note:** The 7B model takes about 2 minutes to compile. Set `MODEL_PATH` to `hf:mgoin/TinyStories-33M-quant-deepsparse` to use a small TinyStories model for quick compilation if you are just experimenting.
+> **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"` to use a small TinyStories model for quick compilation if you are just experimenting.
+
 ## **Model Format**
 
 DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.
 
-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs from SparseZoo.***
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***For now, we suggest only using LLM ONNX graphs created by Neural Magic.***
+>
 ### **SparseZoo Stubs**
 
-SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none` identifes a 50% pruned-quantized MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
+SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
 
 ```python
-model_path = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=model_path)
+model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
+pipeline = TextGeneration(model=model_path)
 ```
 
 ### **Local Deployment Directory**
 
 Additionally, we can pass a local path to a deployment directory. Use the SparseZoo API to download an example deployment directory:
 ```python
-import sparsezoo
-sz_model = sparsezoo.Model("zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none", "./local-model")
+from sparsezoo import Model
+sz_model = Model("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized", "./local-model")
 sz_model.deployment.download()
 ```
 
@@ -84,8 +86,16 @@ ls ./local-model/deployment
 
 We can pass the local directory path to `TextGeneration`:
 ```python
-model_path = "./local-model/deployment"
-pipeline = TextGeneration(model_path=model_path)
+from deepsparse import TextGeneration
+pipeline = TextGeneration(model="./local-model/deployment")
+```
+
+### **Hugging Face Models**
+Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
+
+```python
+from deepsparse import TextGeneration
+pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant")
 ```
 
 ## **Input and Output Formats**
@@ -96,8 +106,7 @@ The following examples use a quantized 33M parameter TinyStories model for quick
 ```python
 from deepsparse import TextGeneration
 
-MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
 ```
 
 ### Input Format
@@ -112,13 +121,14 @@ for prompt_i, generation_i in zip(output.prompts, output.generations):
     print(f"{prompt_i}{generation_i.text}")
 
 # >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old
+
 # >> Mario ran into the castle and started to explore. He ran around the castle and climbed on the throne. He even tried to climb
 ```
 
 - `streaming`: Boolean determining whether to stream the response. If True, then the results are returned as a generator object which yields the results as they are generated.
 
 ```python
-prompt = "Princess peach jumped from the balcony"
+prompt = "Princess Peach jumped from the balcony"
 output_iterator = pipeline(prompt=prompt, streaming=True, max_new_tokens=20)
 
 print(prompt, end="")
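# Editor's sketch (not part of this commit): one plausible way to consume the
# streaming generator, assuming each yielded item exposes the same
# generations[0].text field as the non-streaming output.
for generation_chunk in output_iterator:
    print(generation_chunk.generations[0].text, end="")
print()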
@@ -172,8 +182,8 @@ The following examples use a quantized 33M parameter TinyStories model for quick
 ```python
 from deepsparse import TextGeneration
 
-MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
+pipeline = TextGeneration(model=model_id)
 ```
 
 ### **Creating A `GenerationConfig`**
@@ -213,7 +223,7 @@ We can pass a `GenerationConfig` to `TextGeneration.__init__` or `TextGeneration
 
 ```python
 # set generation_config during __init__
-pipeline_w_gen_config = TextGeneration(model_path=MODEL_PATH, generation_config={"max_new_tokens": 10})
+pipeline_w_gen_config = TextGeneration(model=model_id, generation_config={"max_new_tokens": 10})
 
 # generation_config is the default during __call__
 output = pipeline_w_gen_config(prompt=prompt)
@@ -225,7 +235,7 @@ print(f"{prompt}{output.generations[0].text}")
 
 ```python
 # no generation_config set during __init__
-pipeline_w_no_gen_config = TextGeneration(model_path=MODEL_PATH)
+pipeline_w_no_gen_config = TextGeneration(model=model_id)
 
 # generation_config is passed during __call__
 output = pipeline_w_no_gen_config(prompt=prompt, generation_config= {"max_new_tokens": 10})
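# Editor's sketch (not part of this commit): the guide passes generation_config
# as a dict; a transformers.GenerationConfig object is assumed to be accepted
# as well (an assumption, not confirmed by the diff).
from transformers import GenerationConfig
output = pipeline_w_no_gen_config(prompt=prompt, generation_config=GenerationConfig(max_new_tokens=10))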
@@ -295,7 +305,7 @@ import numpy
 # only 20 logits are not set to -inf == only 20 logits used to sample token
 output = pipeline(prompt=prompt, do_sample=True, top_k=20, max_new_tokens=15, output_scores=True)
 print(numpy.isfinite(output.generations[0].score).sum(axis=1))
-# >> array([20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20])
+# >> [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
 ```
 
 - `top_p`: Float to define the tokens that are considered with nucleus sampling. If `0.0`, `top_p` is turned off. Default is `0.0`
@@ -306,7 +316,7 @@ import numpy
 output = pipeline(prompt=prompt, do_sample=True, top_p=0.9, max_new_tokens=15, output_scores=True)
 print(numpy.isfinite(output.generations[0].score).sum(axis=1))
 
-# >> array([20, 15, 10, 5, 25, 3, 10, 7, 6, 6, 15, 12, 11, 3, 4, 4])
+# >> [ 5 119 18 14 204 6 7 367 191 20 12 7 46 6 2 35]
 ```
 - `repetition_penalty`: The more a token is used within generation the more it is penalized to not be picked in successive generation passes. If `0.0`, `repetition_penalty` is turned off. Default is `0.0`
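By analogy with the `top_k` and `top_p` examples above, a minimal sketch of enabling `repetition_penalty` (editor's addition, not part of this commit; the value `1.2` is illustrative only):

```python
output = pipeline(prompt=prompt, do_sample=True, repetition_penalty=1.2, max_new_tokens=15)
print(output.generations[0].text)
```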
research/mpt/README.md (+40 −37)
@@ -1,32 +1,37 @@
-# **Sparse Finetuned LLMs with DeepSparse**
-
-DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
+*LAST UPDATED: 10/11/2023*
 
-In this overview, we will discuss:
-1. [Current status of our sparse fine-tuning research](#sparse-fine-tuning-research)
-2. [How to try text generation with DeepSparse](#try-it-now)
+# **Sparse Finetuned LLMs with DeepSparse**
 
-For detailed usage instructions, [see the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md).
+DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
 
-![deepsparse_mpt_gsm_speedup](https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9)
+In this research overview, we will discuss:
+1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
+2. [How to try Text Generation with DeepSparse](#try-it-now)
 
 ## **Sparse Finetuning Research**
 
-Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop.
+We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
 
-We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization, without loss, using a technique called **Sparse Finetuning**, where we prune the network during the fine-tuning process.
+When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
 
 ### **Sparse Finetuning on Grade-School Math (GSM)**
 
-Open-source LLMs are typically fine-tuned onto downstream datasets for two reasons:
-* **Instruction Tuning**: show the LLM examples of how to respond to human input or prompts properly
-* **Domain Adaptation**: show the LLM examples with information it does not currently understand
+Training LLMs consists of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training with a much smaller, high-quality curated dataset. This second step is called finetuning.
+
+Fine-tuning is useful for two main reasons:
+1. It can teach the model *how to respond* to input (often called **instruction tuning**).
+2. It can teach the model *new information* (often called **domain adaptation**).
+
 
-An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B-base. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
+An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
 
-The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with limited accuracy drop on GSM8k runs 6.7x faster than the dense baseline with DeepSparse!
+The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!
 
-Paper: (link to paper)
+<div align="center">
+<img src="https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9" width="60%"/>
+</div>
+
+- [See the paper on Arxiv]() << UPDATE >>
 
 ### **How Is This Useful For Real World Use?**
 
@@ -37,18 +42,20 @@ While GSM is a "toy" math dataset, it serves as an example of how LLMs can be ad
 Install the DeepSparse Nightly build (requires Linux):
 
 ```bash
-pip install deepsparse-nightly[transformers]
+pip install deepsparse-nightly[transformers]==1.6.0.20231007
 ```
 
+The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
+
 ### MPT-7B on GSM
 
-We can run inference on the 60% sparse-quantized MPT-7B GSM model using DeepSparse's `TextGeneration` Pipeline:
+We can run inference on the models using DeepSparse's `TextGeneration` Pipeline:
 
 ```python
 from deepsparse import TextGeneration
 
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/gsm8k/pruned60_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"
+pipeline = TextGeneration(model=model)
 
 prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
 output = pipeline(prompt=prompt)
@@ -59,13 +66,13 @@ print(output.generations[0].text)
 ### >> #### 72
 ```
 
-It is also possible to run models directly from Hugging Face by prepending `"hf:"` to a model id, such as:
+It is also possible to run the models directly from Hugging Face by prepending `"hf:"` to a model id, such as:
 
 ```python
 from deepsparse import TextGeneration
 
-MODEL_PATH = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+hf_model_id = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant"
+pipeline = TextGeneration(model=hf_model_id)
 
 prompt = "Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?"
 output = pipeline(prompt=prompt)
@@ -76,26 +83,22 @@ print(output.generations[0].text)
 ### >> #### 5
 ```
 
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***For now, we suggest only using LLM ONNX graphs created by Neural Magic's team.***
+
+
 #### Other Resources
 - [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
 - [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
+- [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)
 
-### **MPT-7B on Dolly-HHRLHF**
+## **Roadmap**
 
-We have also made a 50% sparse-quantized MPT-7B fine-tuned on [Dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) available on SparseZoo. We can run inference with the following:
+Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including:
 
-```python
-from deepsparse import TextGeneration
-
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
-
-prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Kubernetes? ### Response:"
-output = pipeline(prompt=prompt)
-print(output.generations[0].text)
-
-### >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
-```
+- **Productizing Sparse Finetuning**: Enable external users to apply sparse finetuning to their own business datasets
+- **Expanding Model Support**: Apply sparse finetuning results to Llama 2 and Mistral models
+- **Pushing to Higher Sparsity**: Improve our pruning algorithms to reach higher sparsity
+- **Building General Sparse Models**: Create sparse models that perform well on general tasks like the OpenLLM leaderboard
 
 ## **Feedback / Roadmap Requests**
 