Add support to ArcticForCausalLM #6877
I looked briefly into possible support of this model in llama.cpp. It's possible that only the following changes are needed to make the conversion work:
So a brave soul may try to apply this patch and make the quants, but someone with more expertise should verify this. I ran it on an Amazon EC2 instance and it worked up to writing the GGUF file. Good luck! |
I'm currently downloading the model weights (Instruct) and will see if conversion works on my local machine (I don't have access to cloud compute), but it'll take a while. If someone with cloud computing capabilities could give it a go, that would be much faster, but I'll try in the meantime. For what it's worth, it uses the ChatML prompt template and the LLaMA 2 tokenizer. I don't know anything else about the architecture yet. |
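For reference, ChatML wraps each turn in <|im_start|> / <|im_end|> markers. Below is a minimal sketch of building such a prompt; the exact template Arctic expects should be confirmed against the chat_template in its tokenizer_config.json rather than taken from this sketch.

```python
# Minimal sketch of a ChatML-style prompt, assuming the standard
# <|im_start|>/<|im_end|> markers. Arctic's exact template should be
# checked against the chat_template in its tokenizer_config.json.
def chatml_prompt(messages: list[dict[str, str]]) -> str:
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)


print(chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Arctic architecture in one sentence."},
]))
```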
Based on the image in this article, it's a hybrid combining a dense transformer with a residual MoE component. I guess a new LLM architecture will be necessary to support this in llama.cpp. |
The team behind it (very capable people!) is willing to help make this model more accessible. I have seen their PRs in DeepSpeed introducing new FP[6,8,12] formats:
If needed, we can reach out to them for some insight into how to go about this better? |
I was thinking of loading only the dense part onto the GPU and leaving the MoE part in CPU RAM. That way a lot of people could run this at decent speed. |
That's actually not a bad idea! Would it be possible to make this more dynamic in all MoE models, allowing users to select whether the experts are offloaded to GPU or CPU? |
Not an expert (haha) but I don't know if there's any way to determine which layers correspond to which experts. In the first place, the way MoE inference works shouldn't really allow this to be possible - the gate network sends different tokens to different experts, so you'd need all those experts to be on GPU for any increase in inference speed. I don't know how it works with the additional dense part in this particular architecture, but I assume it would get bottlenecked by the experts; I would be happy to be wrong, though. |
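To illustrate the routing problem described above, here is a schematic of top-k MoE gating (illustrative Python, not llama.cpp code): the gate scores every expert for every token and keeps the top k, so the set of experts touched changes from token to token, which is why all experts need to sit in reasonably fast memory.

```python
# Schematic of top-k MoE routing (illustrative only, not llama.cpp code).
# Each token goes to the k experts with the highest gate scores, so the
# experts that are needed change from token to token.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k = 4, 16, 128, 2

x = rng.standard_normal((n_tokens, d_model))        # token activations
w_gate = rng.standard_normal((d_model, n_experts))  # router weights

logits = x @ w_gate                                  # (n_tokens, n_experts)
chosen = np.argsort(-logits, axis=-1)[:, :top_k]     # top-k expert ids per token

for t, experts in enumerate(chosen):
    print(f"token {t} -> experts {experts.tolist()}")
```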
Thanks. |
Some quants for testing: https://huggingface.co/sszymczyk/snowflake-arctic-instruct-GGUF/ |
I think ignoring these tensors likely won't work after all. I just found that in https://huggingface.co/Snowflake/snowflake-arctic-instruct/blob/main/config.json they have parallel_attn_mlp_res set to true, so residual_layernorm and residual_mlp are created and used. Sorry for the confusion. |
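For orientation, here is my reading of what parallel_attn_mlp_res = true implies for a single decoder layer, pieced together from the tensor names mentioned in this thread (input_layernorm, residual_layernorm, residual_mlp, post_attention_layernorm, plus the MoE experts). Treat it as a reconstruction, not the reference implementation.

```python
# Sketch of one Arctic decoder layer with parallel_attn_mlp_res = True.
# Reconstructed from the tensor names discussed in this thread; the real
# modeling code in the HF repo is authoritative, not this sketch.
def arctic_layer_sketch(x, attn, dense_mlp, moe,
                        input_layernorm, residual_layernorm,
                        post_attention_layernorm):
    h = x + attn(input_layernorm(x))                   # attention + residual
    dense_out = h + dense_mlp(residual_layernorm(h))   # dense FFN branch
    moe_out = moe(post_attention_layernorm(h))         # routed MoE branch
    return dense_out + moe_out                         # branches are summed
```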
I got your Q4_K_M and stitched it into a single file. Does it output tokens on your end? I didn't have the required RAM, but was hoping for the model to run from disk.
GGML_ASSERT: /home/user/llama.cpp/llama.cpp:3728: hparams.n_expert <= LLAMA_MAX_EXPERTS
Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted. |
You do not need to merge the sharded model.
Probably we need to remove this limit, or at least increase it to 128. |
Yes, it loaded and generated some tokens (there is no need to join the files), but the output was nonsensical, like in @cpumaxx's screenshot. I think there's no chance of it working correctly without using these layers that we ignored. If someone is running the model with the HF transformers library, it would be interesting to see how the model behaves with parallel_attn_mlp_res set to false in config.json. |
You can try my branch: https://github.com/fairydreaming/llama.cpp/tree/snowflake-arctic |
I might. Keep us updated! |
@ggerganov can you offer any advice about how to map snowflake-arctic-instruct tensors in convert-hf-to-gguf.py? This model has attention + FFN and MoE running in parallel, as shown in this image. Since there are both FFN and MoE blocks in the model, I decided to map the FFN blocks to the usual FFN_GATE, FFN_DOWN and FFN_UP blocks, and leave the MoE mapping the same as in Mixtral. The problem is mainly in the mapping of the normalization blocks. There are three of them:
Currently in my branch I temporarily removed post_attention_layernorm from the FFN_NORM mappings and moved it to a newly created MODEL_TENSOR.FFN_NORM_EXP mapping (following the FFN_GATE_EXP, FFN_UP_EXP, FFN_DOWN_EXP convention). This of course breaks other architectures. Some ideas I had about fixing this situation:
Would be grateful for any ideas. |
@fairydreaming I have an idea for a (temporary?) workaround which would not break other models: Keep This way, you could use your newly-added
Yes, architecture-dependent mappings seem like a good idea to me. It seems worthwhile to explore in the future. To make it cleaner than making new classes for each architecture, it could possibly be a field in the base |
I'm going to try this once a couple of other experiments I have going finish. Thanks for getting something sensible out of this. I'm really excited to see performance of this model on some of my workloads! |
@fairydreaming Are you still using ./convert.py with --skip-unknown, or is there a different command to convert to gguf? |
@cpumaxx no, in my branch I added support for these tensors (they are vital to the model), so --skip-unknown is not needed. I used the convert-hf-to-gguf.py script. However, the official tokenizer.model file is broken (it has wrong BOS/EOS tokens), so I kind of forced the conversion script to use tokenizer.json instead by removing tokenizer.model from the model directory (some additional minor tweaks were needed for this). Then I set tokenizer.ggml.add_bos_token in the resulting GGUF to False with gguf-set-metadata.py. So unfortunately the model conversion process is still messy and long (like hours per iteration). :( Now I see that you can select the --vocab-type in convert.py, so maybe all this is not needed. I will try that the next time I convert the model. I'd try it now, but I have no disk space left. I thought that a 4TB SSD would be enough to play with LLMs, but after recent model releases I'm running out of space on all my disks. |
For now I settled on yet another solution. I added a new mapping table in TensorNameMap which is the same as block_mappings_cfg, but architecture-specific. Then, in the TensorNameMap init method, I merge it into the generic mappings and use the resulting block_mappings later. This seemed like the least intrusive and the cleanest solution. |
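A rough sketch of what such an architecture-specific table might look like is below; the names (arch_block_mappings_cfg, block_mappings) and the Arctic entries are my reconstruction, not code copied from the branch, and it assumes the branch's MODEL_ARCH.ARCTIC and MODEL_TENSOR.FFN_NORM_EXP constants exist.

```python
# Reconstruction of an architecture-specific tensor-name mapping table in the
# spirit of the change described above; not copied from the branch. The real
# generic table lives in gguf-py/gguf/tensor_mapping.py.
from gguf.constants import MODEL_ARCH, MODEL_TENSOR


class TensorNameMapSketch:
    # generic mappings shared by all architectures (heavily abbreviated)
    block_mappings_cfg: dict[MODEL_TENSOR, tuple[str, ...]] = {
        MODEL_TENSOR.FFN_NORM: ("model.layers.{bid}.post_attention_layernorm",),
    }

    # per-architecture overrides: Arctic uses residual_layernorm for the dense
    # FFN norm and post_attention_layernorm for the MoE-side norm
    arch_block_mappings_cfg: dict[MODEL_ARCH, dict[MODEL_TENSOR, tuple[str, ...]]] = {
        MODEL_ARCH.ARCTIC: {
            MODEL_TENSOR.FFN_NORM: ("model.layers.{bid}.residual_layernorm",),
            MODEL_TENSOR.FFN_NORM_EXP: ("model.layers.{bid}.post_attention_layernorm",),
        },
    }

    def __init__(self, arch: MODEL_ARCH, n_blocks: int):
        # start from the generic table, then let the architecture override it
        self.block_mappings = dict(self.block_mappings_cfg)
        if arch in self.arch_block_mappings_cfg:
            self.block_mappings.update(self.arch_block_mappings_cfg[arch])
        # ... the real class then expands {bid} for every block index ...
```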
@fairydreaming Before I dedicate time/space/compute to trying, does this patch mean that I can convert to a usable gguf in some form? |
@cpumaxx I uploaded corrected quants, you can try them with my snowflake-arctic branch. |
I usually like to do all my own conversions and quants from the official safetensors, but I'll make an exception to test this one out : ) |
@cpumaxx If you want to do your own quants, then convert-hf-to-gguf.py should now work correctly. The only remaining problem is that add_bos_token is unnecessarily (I think) set to true, but you can change that after conversion with:
I guess it was originally intended only for MODEL_ARCH.LLAMA, so I'm not going to commit these changes. |
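The exact change referred to above is not shown. As an illustration only, one way to have the converter write tokenizer.ggml.add_bos_token = False directly, instead of patching the GGUF afterwards, is via GGUFWriter's add_add_bos_token setter (that setter does exist in gguf-py); whether this resembles the uncommitted change is an assumption.

```python
# Illustrative only: have the conversion step emit add_bos_token = False
# instead of editing the GGUF afterwards. ArcticModelSketch is a hypothetical
# stand-in for a convert-hf-to-gguf.py Model subclass, not the author's code.
from gguf import GGUFWriter


class ArcticModelSketch:
    def __init__(self, gguf_writer: GGUFWriter):
        self.gguf_writer = gguf_writer

    def set_vocab(self):
        # ... load tokenizer.json and add tokens/scores/merges as usual ...
        # Skip the automatic BOS prepend, since the chat template already
        # supplies the special tokens (assumption based on the thread above).
        self.gguf_writer.add_add_bos_token(False)
```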
@fairydreaming I couldn't convert the downloaded instruct model (Transformers 4.40.1). I was running `python convert-hf-to-gguf.py /media/user/6/snowflake_instruct/Snowflake_snowflake-arctic-instruct --outtype f16`
|
You can simply comment out the assertion. |
@cebtenzzre can you tell us the story behind this assert? It looks like ArcticTokenizer, which has LlamaTokenizer as its base class (and there is no ArcticTokenizerFast), fails the assertion. In the convert.py file, a few lines above the assertion, there is: |
@BarfingLemurs I committed the assertion removal to my snowflake-arctic branch. Thanks for reporting the problem. I had this line removed locally (but not committed), so I didn't notice that it affected convert-hf-to-gguf.py as well. |
There was a change in the snowflake-arctic-instruct model a while ago: grouped query attention (GQA) is now used with 8 KV heads instead of 56. This reduces the KV buffer size sevenfold, and inference is a bit faster. Before (Epyc 9374F, Q5_K_M):
Now (Epyc 9374F, Q5_K_M):
I uploaded the updated model quants here: https://huggingface.co/sszymczyk/snowflake-arctic-instruct-gqa-GGUF
The updated model works just fine; no changes in the code were necessary. |
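The sevenfold figure follows directly from the head count: the per-token KV cache scales linearly with the number of KV heads, so going from 56 to 8 is a 7x reduction regardless of head size, layer count, or cache precision. A quick check with placeholder dimensions (not Arctic's real ones):

```python
# KV cache bytes per token scale with n_kv_heads; head_dim, n_layers and the
# element size below are placeholders, not Arctic's real dimensions.
def kv_bytes_per_token(n_kv_heads: int, head_dim: int = 128,
                       n_layers: int = 32, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x for K and V


before = kv_bytes_per_token(n_kv_heads=56)
after = kv_bytes_per_token(n_kv_heads=8)
print(before / after)  # -> 7.0, independent of the placeholder values
```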
First open LLM from @SnowflakeDB! Arctic is a 480B Dense-MoE with a 10B dense transformer model and a 128x3.66B MoE MLP designed specifically for enterprise AI.
TL;DR:
- 480B parameters with 17B active during generation
- 128 experts with 2 active in generation
- Instruct & Base versions released
- Focused on enterprise tasks (Code, SQL, Reasoning, Following)
- Released under Apache 2.0
- In fp16 ~900GB memory & in int4 ~240GB
- Available on @huggingface
- Trained with DeepSpeed-MoE
Blog: https://snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/
Models: https://huggingface.co/Snowflake/snowflake-arctic-instruct