fix: don't add space after special tokens in SPM #7697
Conversation
@jaime-m-p I found that this issue was introduced in #7375; I think this was a mistake.
@giladgd Indeed, this fixes the problem for this concrete issue, but unfortunately it breaks other models. I already have a PR trying to fix this: #7685. Also note that the extra space when detokenizing is the expected result:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/tokenizers/functionary-small-v2.2/")
text = "<|from|>system"
ids = tokenizer.encode(text, add_special_tokens=False)
re = tokenizer.decode(ids)
print(text)
print(re)
Output:
<|from|>system
<|from|> system
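For reference, inspecting the token pieces shows where that space comes from (same transformers API and local tokenizer path as above): SPM marks a leading space with the "▁" prefix, and the text segment after the special token is encoded with that dummy prefix.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/tokenizers/functionary-small-v2.2/")
ids = tokenizer.encode("<|from|>system", add_special_tokens=False)
# Expected pieces, e.g.: ['<|from|>', '▁system'] - the '▁' decodes to the extra space.
print(tokenizer.convert_ids_to_tokens(ids))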
@jaime-m-p I see what you mean.
@jaime-m-p I've just checked it again with the old
@giladgd, @snichols Sorry, I think I'm missing something. Maybe I'm making some mistake, but it seems the AutoTokenizer output is
I see. If AutoTokenizer is not our ground truth, do you know what the ground truth implementation is, so I can compare and try to match outputs? I generated vocab files from the link you provided.
Looking for ground truth is not the way. Theoretically, the prompt format/tokenization should match the one used for fine-tuning. That principle applies to space insertion too, and it may vary between fine-tunes of the same base model. But ultimately, what matters is which format works best with the given model; sometimes a different format works better than the model's original one. To get the best results, one has to test the model, preferably on their particular use case, with and without spaces in different places. So it is best to leave the choice to the user.

Now the problem is that if the space is inserted by llama.cpp but is not needed, it is difficult to get rid of it. I used to have a workaround to remove the unwanted space, but it was so ugly and annoying that at some point I just uprooted the piece of code that inserts the space, and now I add it to the prompt on the client side when needed, as sketched below.
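A minimal sketch of that client-side approach (illustrative names, not code from this PR): keep the space out of the tokenizer's hands entirely and let the prompt builder decide where it goes.

def build_prompt(turns, space_after_special=False):
    # turns is a list of (special_token, text) pairs; the caller decides
    # whether a space separates each special token from its text.
    sep = " " if space_after_special else ""
    return "".join(f"{tok}{sep}{text}" for tok, text in turns)

turns = [("<|from|>", "system"), ("<|content|>", "hello")]
print(build_prompt(turns))        # <|from|>system<|content|>hello
print(build_prompt(turns, True))  # <|from|> system<|content|> hello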
We want to match AutoTokenizer as much as possible - it seems to be the best choice in the current ecosystem. Of course, if some fine-tune chooses some arbitrary tokenization, then
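For what it's worth, one way to compare the two side by side is a sketch like the following, assuming the llama-cpp-python bindings and a local GGUF conversion of the same model (both paths are placeholders; vocab_only skips loading the weights):

from llama_cpp import Llama
from transformers import AutoTokenizer

hf = AutoTokenizer.from_pretrained("./models/tokenizers/functionary-small-v2.2/")
llm = Llama(model_path="./models/functionary-small-v2.2.gguf", vocab_only=True)

text = "<|from|>system"
hf_ids = hf.encode(text, add_special_tokens=False)
# special=True asks llama.cpp to parse special tokens in the input text.
cpp_ids = llm.tokenize(text.encode("utf-8"), add_bos=False, special=True)

# If the two tokenizers agree, the id sequences match exactly.
print(hf_ids)
print(cpp_ids)
print(hf_ids == cpp_ids)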
I originally considered the change to the default tokenization a breaking change since it affected the generated outputs of older models, but I now understand that the previous default tokenization behavior was indeed incorrect. @jaime-m-p Thanks for your help as I stumbled through this :)
There's an issue where tokenizing text with special tokens introduces an additional space after each special token. For example, using this model to tokenize <|from|>system with special tokens enabled and then detokenizing it back resulted in <|from|> system. This PR fixes that.
Fixes #7629
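A round-trip check for the behavior this PR fixes (a sketch using the llama-cpp-python bindings; the GGUF path is a placeholder):

from llama_cpp import Llama

llm = Llama(model_path="./models/functionary-small-v2.2.gguf", vocab_only=True)

text = "<|from|>system"
# special=True parses special tokens in the input text.
ids = llm.tokenize(text.encode("utf-8"), add_bos=False, special=True)
out = llm.detokenize(ids).decode("utf-8")

# With this fix, no space is injected after the special token.
assert out == text, f"round-trip mismatch: {out!r}"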