prompt ="Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
```
> **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"` to use a small TinyStories model for quick compilation if you are just experimenting.
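As a rough sketch of that swap (assuming the pipeline is built with `TextGeneration` and a `model` argument, as in the other examples in this guide):

```
from deepsparse import TextGeneration

# Small model that compiles quickly; handy for experimenting with the API
model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model=model_path)
```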
## **Model Format**
DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.
> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***For now, we suggest only using LLM ONNX graphs created by Neural Magic.***
### **SparseZoo Stubs**
SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized, pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
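For example, a minimal sketch (assuming `TextGeneration` accepts the stub via its `model` argument, as in DeepSparse's other examples):

```
from deepsparse import TextGeneration

# The stub is resolved against SparseZoo; the ONNX file is downloaded and cached on first use
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
print(pipeline(prompt=prompt).generations[0].text)
```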
Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
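A minimal sketch, reusing the import from the example above; the model id comes from the linked Hugging Face repository:

```
# Prepending hf: tells DeepSparse to fetch the deployment files from the Hugging Face Hub
pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant")
```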
When the pipeline is called with a list of prompts, we can iterate over the returned generations:

```
for prompt_i, generation_i in zip(output.prompts, output.generations):
    print(f"{prompt_i}{generation_i.text}")

# >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old

# >> Mario ran into the castle and started to explore. He ran around the castle and climbed on the throne. He even tried to climb
```
- `streaming`: Boolean determining whether to stream the response. If `True`, the results are returned as a generator object that yields results as they are generated.
- `repetition_penalty`: The more a token is used within a generation, the more it is penalized so it is less likely to be picked in successive generation passes. If `0.0`, `repetition_penalty` is turned off. Default is `0.0`. See the sketch below for how both parameters might be passed.
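A rough sketch of how these might be used together, assuming both are accepted as keyword arguments on the pipeline call and that each streamed item exposes the same `generations` field as a regular output (both are assumptions, not confirmed API details):

```
# Illustrative values only; `pipeline` and `prompt` are assumed to be defined as above
stream = pipeline(prompt=prompt, streaming=True, repetition_penalty=1.2)

# With streaming=True the call is assumed to return a generator of partial outputs
for partial in stream:
    print(partial.generations[0].text, end="", flush=True)
```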
*LAST UPDATED: 10/11/2023*

# **Sparse Finetuned LLMs with DeepSparse**

DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.

In this overview, we will discuss:
1. [Current status of our sparse fine-tuning research](#sparse-finetuning-research)
2. [How to try Text Generation with DeepSparse](#try-it-now)

For detailed usage instructions, [see the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md).
## **Sparse Finetuning Research**
Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop.

We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.

When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
### **Sparse Finetuning on Grade-School Math (GSM)**
Training LLMs consists of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training with a much smaller, high-quality curated dataset. This second step is called finetuning.

Fine-tuning is useful for two main reasons:
1. It can teach the model *how to respond* to input (often called **instruction tuning**).
2. It can teach the model *new information* (often called **domain adaptation**).

An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.

The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!
The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
### MPT-7B on GSM
We can run inference on the models using DeepSparse's `TextGeneration` Pipeline:
prompt ="Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
prompt ="Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?"
> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***For now, we suggest only using LLM ONNX graphs created by Neural Magic's team.***
#### Other Resources
- [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
- [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
- [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)
### **MPT-7B on Dolly-HHRLHF**

We have also made a 50% sparse-quantized MPT-7B fine-tuned on [Dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) available on SparseZoo. We can run inference with the following:

```
from deepsparse import TextGeneration

# Construct the pipeline for the 50% pruned-quantized Dolly model (SparseZoo stub shown earlier)
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Kubernetes? ### Response:"
output = pipeline(prompt=prompt)
print(output.generations[0].text)

# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
```

## **Roadmap**

Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including:

- **Productizing Sparse Fine Tuning**: Enable external users to apply the sparse fine-tuning to business datasets
- **Expanding Model Support**: Apply sparse fine-tuning results to Llama2 and Mistral models
- **Pushing to Higher Sparsity**: Improving our pruning algorithms to reach higher sparsity
- **Building General Sparse Models**: Create sparse models that can perform well on general tasks like the OpenLLM leaderboard