
Commit add4625

dbogunowicz authored and corey-nm committed
[CodeGen][Documentation] (#956)
* initial commit
* coreys simplifications
* finishing the second model static
* ready, time for beautification
* ready for review
* moved the code to examples
* fix eos logic
* add argument num_tokens_to_generate
* initial commit
* change order
* Update examples/codegen/README.md

Co-authored-by: corey-nm <[email protected]>

---------

Co-authored-by: corey-nm <[email protected]>
1 parent 0a3f48d commit add4625

File tree: 1 file changed (+75, −6 lines)


Diff for: examples/codegen/README.md

@@ -14,17 +14,86 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

## ONNX Export

First, we need to install the Hugging Face `optimum` library:

```bash
pip install optimum
```

### Patch the original PyTorch Model

First, apply the following modification to this file in your transformers installation:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/codegen/modeling_codegen.py#L212

```diff
-offset = layer_past[0].shape[-2]
+offset = (attention_mask[0] == 0.0).sum() - 1.0
```

We need to do this because the existing `with_past` implementations assume there is no padding in the inputs. With DeepSparse, we need to use a static sequence length, which means our offset for the embeddings will depend on how many non-padded inputs we receive.

The new line computes this from the `attention_mask`. At this point in the code, `attention_mask` has already been transformed from a tensor of 0s and 1s into a tensor of `float.min` and `0.0`. So when we compare `attention_mask == 0.0`, we are actually selecting every position where the original attention mask was 1.

We also need to subtract 1 from this count, because the attention mask is applied AFTER the kv cache is concatenated to the new token, which means the attention mask will actually cover sequence length + 1 items. Subtracting 1 gives the current sequence length.

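To make the patched line concrete, here is a small self-contained sketch (not part of the patch itself) that mimics the additive attention mask the model sees at that point and computes the same offset; the padding layout and sequence length are made up for illustration:

```python
import torch

# Static sequence of length 6: two left-padding positions, then three cached
# tokens, then the newly appended token (0 = padded, 1 = real token).
mask_01 = torch.tensor([[0, 0, 1, 1, 1, 1]], dtype=torch.float32)

# Inside the model the mask has already been converted to additive form:
# 0.0 where attention is allowed, float.min where it is masked out.
attention_mask = (1.0 - mask_01) * torch.finfo(torch.float32).min

# The patched line: count the allowed positions, then subtract 1 because the
# mask already includes the freshly concatenated token.
offset = (attention_mask[0] == 0.0).sum() - 1.0
print(offset)  # tensor(3.) -> three tokens are already in the kv cache
```
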
### Export the model to ONNX

```bash
optimum-cli export onnx --model Salesforce/codegen-350M-multi codegen-350M-multi
```

This saves the model to the directory `codegen-350M-multi`.

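As an optional sanity check (assuming the command above finished successfully), the exported directory should contain, among other files, the `decoder_with_past_model.onnx` model used in the next section:

```bash
ls codegen-350M-multi
# expect decoder_with_past_model.onnx among the exported files
```
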
### Updating the Model's Input and Output Dimension Sizes

TODO

## Running in the DeepSparse Pipeline

First, we need to rename `decoder_with_past_model.onnx` to `model.onnx` inside the `static-codegen-350M-multi` directory, to abide by the naming convention.

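A minimal way to do this, assuming the static export lives in the current working directory (adjust the path to wherever your model actually is):

```bash
# Rename the with-past decoder so the pipeline can find it as model.onnx
mv static-codegen-350M-multi/decoder_with_past_model.onnx \
   static-codegen-350M-multi/model.onnx
```
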
Finally, run the pipeline:

```python
from examples.codegen.text_generation import TextGenerationPipeline

codegen = TextGenerationPipeline(
    model_path="/network/damian/static-codegen-350M-multi",
    engine_type="onnxruntime",
    sequence_length=128)

out = codegen(sequences="def hello_world():")
print(out.sequences[0])
```

```bash
def hello_world():
    return 'Hello World!'

def hello_world_2():
    return 'Hello World!'

def hello_world_3():
    return 'Hello World!'

def hello_world_4():
    return 'Hello World!'

def hello_world_5():
    return 'Hello World!'

def hello_world_6():
    return 'Hello World!'

def hello_world_7():
    return 'Hello World!'

def hello_world_8():
    return 'Hello World!'

def hello
```

Modifying pipeline behaviour:

1. By adding the argument `deterministic=False`, the next token of the sequence will not be chosen deterministically (using argmax), but will be sampled from the probability distribution.
2. By setting `sampling_temperature` when `deterministic=False`, we allow more or less randomness in the sampling method (https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277).
3. By setting `num_tokens_to_generate`, we specify exactly how many tokens to generate per input. A combined sketch follows this list.
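
A minimal sketch of how these options might be combined, reusing the `codegen` pipeline constructed above; the argument names come from the list, while the values and the choice to pass them at call time are illustrative assumptions, not taken from the repository:

```python
# Illustrative only: argument names are from the list above, values are made up.
out = codegen(
    sequences="def fibonacci(x):",
    deterministic=False,        # sample the next token instead of taking argmax
    sampling_temperature=0.8,   # lower -> sharper distribution, higher -> more random
    num_tokens_to_generate=64,  # generate exactly 64 tokens for this input
)
print(out.sequences[0])
```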
