@@ -109,18 +109,18 @@ python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weig
# Benchmark language generation with 4-bit LLaMA-7B:

# Save compressed model
- CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --save llama7b-4bit.pt
+ CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save llama7b-4bit-128g.pt
# Or save compressed `.safetensors` model
- CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --save_safetensors llama7b-4bit.safetensors
+ CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors
# Benchmark generating a 2048 token sequence with the saved model
- CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --load llama7b-4bit.pt --benchmark 2048 --check
+ CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py ./llama-hf/llama-7b c4 --benchmark 2048 --check

# model inference with the saved model
- CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --load llama7b-4bit.pt --text "this is llama"
+ CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
# model inference with the saved model with offload (this is very slow; it is a simple implementation that could be improved with techniques like FlexGen: https://github.com/FMInference/FlexGen)
- CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ./llama-hf/llama-7b --wbits 4 --load llama7b-4bit.pt --text "this is llama" --pre_layer 16
+ CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16
It takes about 180 seconds to generate 45 tokens (5 -> 50 tokens) on a single RTX 3090 with LLaMA-65B and pre_layer set to 50.
```
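For context on the `--pre_layer` offload path: the idea is to keep only the first layers resident on the GPU and hold the rest in CPU RAM, moving them onto the GPU one at a time during the forward pass, which is why generation is so slow. The snippet below is a minimal, hypothetical PyTorch sketch of that scheme; `OffloadedStack` and its layer handling are illustrative only and are not the repository's actual implementation.

```python
# Hypothetical sketch of layer-by-layer CPU offload (not the repo's actual code):
# keep the first `pre_layer` blocks on the GPU permanently and stream the
# remaining blocks from CPU to GPU on demand during the forward pass.
import torch
import torch.nn as nn


class OffloadedStack(nn.Module):
    def __init__(self, layers: nn.ModuleList, pre_layer: int, device: str = "cuda"):
        super().__init__()
        self.layers = layers
        self.pre_layer = pre_layer
        self.device = device
        # Resident layers live on the GPU; the rest stay in CPU RAM.
        for i, layer in enumerate(self.layers):
            layer.to(device if i < pre_layer else "cpu")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.device)
        for i, layer in enumerate(self.layers):
            if i < self.pre_layer:
                x = layer(x)               # already on the GPU
            else:
                layer.to(self.device)      # stream the layer in
                x = layer(x)
                layer.to("cpu")            # stream it back out to free VRAM
        return x


if __name__ == "__main__":
    # Tiny stand-in for a transformer stack, just to exercise the offload loop.
    layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = OffloadedStack(layers, pre_layer=4, device=device)
    print(model(torch.randn(1, 64)).shape)
```

Streaming layers this way trades PCIe transfer time for VRAM, which is the trade-off behind the roughly 180-second figure reported above.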
Basically, 4-bit quantization with a groupsize of 128 is recommended.
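To make that recommendation concrete: with `--wbits 4 --groupsize 128`, each weight is stored as a 4-bit integer and every group of 128 consecutive weights shares one scale and zero point. The sketch below shows a plain round-to-nearest version of this group-wise format for a 2-D weight matrix; it deliberately omits GPTQ's error-compensating updates, and `quantize_rtn_groupwise` is a made-up name for illustration.

```python
# Minimal sketch of 4-bit, groupsize-128 weight storage (round-to-nearest only;
# GPTQ additionally minimizes layer-wise reconstruction error when choosing codes).
import torch


def quantize_rtn_groupwise(w: torch.Tensor, wbits: int = 4, groupsize: int = 128):
    """Group-wise asymmetric quantization of a (out_features, in_features) matrix."""
    out_features, in_features = w.shape
    assert in_features % groupsize == 0
    qmax = 2 ** wbits - 1
    # Each group of `groupsize` consecutive weights along a row shares scale/zero.
    wg = w.reshape(out_features, in_features // groupsize, groupsize)
    wmin = wg.min(dim=-1, keepdim=True).values
    wmax = wg.max(dim=-1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(wg / scale) + zero, 0, qmax)   # 4-bit integer codes
    dequant = (q - zero) * scale                                # what inference sees
    return q.reshape_as(w), dequant.reshape_as(w), scale, zero


if __name__ == "__main__":
    w = torch.randn(256, 512)
    q, w_hat, scale, zero = quantize_rtn_groupwise(w)
    print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```

A larger groupsize means fewer stored scales (smaller checkpoints) but coarser quantization; 128 is the middle ground recommended above.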