MPT support in llama.cpp #3417
Merged
Commits (17):
b49792b (jploski): CUDA: added support for ggml_clamp (see also: https://github.com/gger…
15236e8 (jploski): mpt : added an implementation based (mostly) on falcon integration, m…
84e30e8 (jploski): mpt : protect against "clip_qkv": null in mpt-7b
00e8c5c (jploski): mpt : quick fix to avoid "Strange model" warning when quantizing MPT …
1be89c4 (jploski): mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out f…
26c253e (jploski): mpt : standardized all tensor names to follow GGUF spec
df072d2 (jploski): mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GE…
90e7d6d (jploski): mpt : fixed comment s/gptneox/mpt/
4708012 (jploski): mpt : remove tabs, trailing whitespace
1364bcd (jploski): mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) a…
7d6a24a (jploski): mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to co…
292363e (cebtenzzre): Merge branch 'master' of https://github.com/ggerganov/llama.cpp into …
ad3c2f3 (cebtenzzre): comment out n_past instead of marking it unused
1a454eb (jploski): mpt : removed hardcoded +178 from convert script in favor of utilizin…
32172f1 (cebtenzzre): mpt : remove unused tokenizer_json in convert script
96cf3f5 (ggerganov): ggml : remove obsolete n_past assert in ggml_alibi
9b66378 (ggerganov): llama : print clam_kqv and max_alibi_bias hparams
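
The first commit adds ggml_clamp support to the CUDA backend, which the MPT graph needs whenever the clip_qkv hyperparameter (written as clamp_kqv by the convert script below) is set. As a minimal NumPy sketch of what that clamp does, not llama.cpp code, and using an illustrative threshold value:

import numpy as np

# Illustration of MPT's clip_qkv / clamp_kqv: the fused QKV projection output is
# clamped elementwise to [-clip_qkv, clip_qkv] before being split into Q, K and V.
clip_qkv = 8.0                      # illustrative value; in practice read from attn_config["clip_qkv"]
qkv = np.random.randn(4, 3 * 8)     # toy (n_tokens, 3 * d_model) projection output

qkv_clamped = np.clip(qkv, -clip_qkv, clip_qkv)   # the elementwise clamp that ggml_clamp provides

assert qkv_clamped.min() >= -clip_qkv and qkv_clamped.max() <= clip_qkv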
convert-mpt-hf-to-gguf.py (new file, 216 lines):

#!/usr/bin/env python3
# HF mpt --> gguf conversion

from __future__ import annotations

import argparse
import json
import os
import struct
import sys
from pathlib import Path
from typing import Any

import numpy as np
import torch
from transformers import AutoTokenizer  # type: ignore[import]

if 'NO_LOCAL_GGUF' not in os.environ:
    sys.path.insert(1, str(Path(__file__).parent / 'gguf-py' / 'gguf'))
import gguf


def count_model_parts(dir_model: Path) -> int:
    num_parts = 0
    for filename in os.listdir(dir_model):
        if filename.startswith("pytorch_model-"):
            num_parts += 1

    if num_parts > 0:
        print("gguf: found " + str(num_parts) + " model parts")
    return num_parts


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Convert an MPT model to a GGML compatible file")
    parser.add_argument(
        "--vocab-only", action="store_true",
        help="extract only the vocab",
    )
    parser.add_argument(
        "--outfile", type=Path,
        help="path to write to; default: based on input",
    )
    parser.add_argument(
        "model", type=Path,
        help="directory containing model file, or model file itself (*.bin)",
    )
    parser.add_argument(
        "ftype", type=int, choices=[0, 1], default=1, nargs='?',
        help="output format - use 0 for float32, 1 for float16",
    )
    return parser.parse_args()


args = parse_args()

dir_model = args.model
ftype = args.ftype
if not dir_model.is_dir():
    print(f'Error: {args.model} is not a directory', file = sys.stderr)
    sys.exit(1)

# possible tensor data types
#   ftype == 0 -> float32
#   ftype == 1 -> float16

# map from ftype to string
ftype_str = ["f32", "f16"]

if args.outfile is not None:
    fname_out = args.outfile
else:
    # output in the same directory as the model by default
    fname_out = dir_model / f'ggml-model-{ftype_str[ftype]}.gguf'

print("gguf: loading model " + dir_model.name)

with open(dir_model / "config.json", "r", encoding="utf-8") as f:
    hparams = json.load(f)

if hparams["architectures"][0] != "MPTForCausalLM":
    print("Model architecture not supported: " + hparams["architectures"][0])
    sys.exit()

# get number of model parts
num_parts = count_model_parts(dir_model)

ARCH = gguf.MODEL_ARCH.MPT
gguf_writer = gguf.GGUFWriter(fname_out, gguf.MODEL_ARCH_NAMES[ARCH])

print("gguf: get model metadata")

block_count = hparams["n_layers"]

gguf_writer.add_name(dir_model.name)
gguf_writer.add_context_length(hparams["max_seq_len"])
gguf_writer.add_embedding_length(hparams["d_model"])
gguf_writer.add_block_count(block_count)
gguf_writer.add_feed_forward_length(4 * hparams["d_model"])
gguf_writer.add_head_count(hparams["n_heads"])
gguf_writer.add_layer_norm_eps(1e-05)
if hparams["attn_config"]["clip_qkv"] is not None:
    gguf_writer.add_clamp_kqv(hparams["attn_config"]["clip_qkv"])
gguf_writer.add_max_alibi_bias(hparams["attn_config"]["alibi_bias_max"])

# TOKENIZATION

print("gguf: get tokenizer metadata")

tokens: list[bytearray] = []
scores: list[float] = []
toktypes: list[int] = []

# gpt2 tokenizer
gguf_writer.add_tokenizer_model("gpt2")

print("gguf: get gpt2 tokenizer vocab")

# MPT token embedding tensors have dimension 50432 (hparams["vocab_size"]), but
# there are only 50254 (len(tokenizer.vocab)) tokens in the vocab, presumably to
# accommodate some "reserved" tokens; this is causing problems down the line in
# llama.cpp, so we pad the vocab with dummy tokens:

vocab_size = hparams["vocab_size"]

# ref: https://github.com/cmp-nct/ggllm.cpp/blob/master/falcon_convert.py
tokenizer = AutoTokenizer.from_pretrained(dir_model)

reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}

for i in range(vocab_size):
    tokens.append(reverse_vocab[i] if i in reverse_vocab else f"[PAD{i}]")
    scores.append(0.0)  # dummy
    toktypes.append(gguf.TokenType.NORMAL)

gguf_writer.add_token_list(tokens)
gguf_writer.add_token_scores(scores)
gguf_writer.add_token_types(toktypes)

special_vocab = gguf.SpecialVocab(dir_model, load_merges = True)
special_vocab.add_to_gguf(gguf_writer)

# TENSORS

tensor_map = gguf.get_tensor_name_map(ARCH, block_count)

# tensor info
print("gguf: get tensor metadata")

if num_parts == 0:
    part_names = iter(("pytorch_model.bin",))
else:
    part_names = (
        f"pytorch_model-{n:05}-of-{num_parts:05}.bin" for n in range(1, num_parts + 1)
    )

for part_name in part_names:
    if args.vocab_only:
        break
    print("gguf: loading model part '" + part_name + "'")
    model_part = torch.load(f"{dir_model}/{part_name}", map_location="cpu")

    for name in model_part.keys():
        data = model_part[name]

        old_dtype = data.dtype

        # convert any unsupported data types to float32
        if data.dtype != torch.float16 and data.dtype != torch.float32:
            data = data.to(torch.float32)

        data = data.squeeze().numpy()

        # map tensor names
        new_name = tensor_map.get_name(name, try_suffixes = (".weight", ".bias"))
        if new_name is None:
            print("Cannot map tensor '" + name + "'")
            continue  # for the sake of compatibility with some old published models, don't quit

        n_dims = len(data.shape)
        data_dtype = data.dtype

        # if f32 desired, convert any float16 to float32
        if ftype == 0 and data_dtype == np.float16:
            data = data.astype(np.float32)

        # TODO: Why can't we use these float16 as-is? There should be no reason to store float16 as float32
        if ftype == 1 and data_dtype == np.float16 and n_dims == 1:
            data = data.astype(np.float32)

        # if f16 desired, convert any float32 2-dim weight tensors to float16
        if ftype == 1 and data_dtype == np.float32 and name.endswith(".weight") and n_dims == 2:
            data = data.astype(np.float16)

        print(new_name + ", n_dims = " + str(n_dims) + ", " + str(old_dtype) + " --> " + str(data.dtype))

        gguf_writer.add_tensor(new_name, data)

        # note: MPT output is tied to (same as) wte in original model;
        # for easier implementation in llama.cpp it's duplicated in GGUF, though :/
        if new_name == "token_embd.weight":
            gguf_writer.add_tensor("output.weight", data)

print("gguf: write header")
gguf_writer.write_header_to_file()
print("gguf: write metadata")
gguf_writer.write_kv_data_to_file()
if not args.vocab_only:
    print("gguf: write tensors")
    gguf_writer.write_tensors_to_file()

gguf_writer.close()

print(f"gguf: model successfully exported to '{fname_out}'")
print("")