
fix: don't add space after special tokens in SPM #7697


Closed
wants to merge 2 commits

Conversation

giladgd
Contributor

@giladgd giladgd commented Jun 2, 2024

There's an issue where tokenizing a text with special tokens introduces an additional space after special tokens.

For example, using this model to tokenize <|from|>system with special tokens enabled and detokenizing it back
resulted in <|from|> system.
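
A minimal reproduction sketch of that round trip, assuming the llama-cpp-python bindings and a locally downloaded copy of the model (the model path below is hypothetical):

# Minimal reproduction sketch; assumes the llama-cpp-python bindings and a
# locally downloaded GGUF file (the model path below is hypothetical).
from llama_cpp import Llama

llm = Llama(model_path="functionary-small-v2.2.q4_0.gguf", vocab_only=True)

text = "<|from|>system"
tokens = llm.tokenize(text.encode("utf-8"), add_bos=False, special=True)
roundtrip = llm.detokenize(tokens).decode("utf-8")

print(repr(text))       # '<|from|>system'
print(repr(roundtrip))  # before this fix: '<|from|> system' (extra space)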

This PR fixes that.

Fixes #7629

@giladgd
Contributor Author

giladgd commented Jun 2, 2024

@jaime-m-p I found that this issue was introduced in #7375; I think this was a mistake.
Can you please check?

@giladgd changed the title from "fix: don't add space after special tokens when using SPM" to "fix: don't add space after special tokens in SPM" on Jun 2, 2024
@jaime-m-p
Collaborator

@giladgd Indeed, this fixes the problem for this specific issue, but unfortunately it breaks other models.

The problem is the rtrim property of some special tokens.
The functionary-small-v2.2 model has rtrim and ltrim set to false for all of its special tokens.
So I guess it will work if we just set this variable to false:
https://github.com/ggerganov/llama.cpp/blob/fb1fef99622c6c8004e974b6d2a765a758e96400/llama.cpp#L13364
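
For context, a rough illustration of what the rtrim/ltrim attributes are meant to do. This is an illustrative Python sketch, not llama.cpp code; the function name is mine, and the semantics roughly mirror the lstrip/rstrip flags on Hugging Face added tokens:

# Illustrative sketch of rtrim/ltrim semantics for special tokens (not llama.cpp code).
# A special token with ltrim consumes whitespace before it; one with rtrim consumes
# whitespace after it, before the surrounding text fragments are tokenized.
def splice_special(prev_text: str, special_token: str, next_text: str,
                   ltrim: bool, rtrim: bool) -> str:
    if ltrim:
        prev_text = prev_text.rstrip()
    if rtrim:
        next_text = next_text.lstrip()
    return prev_text + special_token + next_text

# functionary-small-v2.2 defines its special tokens with ltrim = rtrim = false,
# so no whitespace around "<|from|>" should be consumed or inserted.
print(splice_special("", "<|from|>", "system", ltrim=False, rtrim=False))  # <|from|>system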

I already have a PR trying to fix this: #7685.
I tested the functionary-small-v2.2 tokenizer and it seems to work correctly.

Also note that the extra space when detokenizing is the expected result:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/tokenizers/functionary-small-v2.2/")
text = "<|from|>system"
ids  = tokenizer.encode(text, add_special_tokens=False)
re   = tokenizer.decode(ids)
print(text)
print(re)

Output:
<|from|>system
<|from|> system

@giladgd
Contributor Author

giladgd commented Jun 2, 2024

@jaime-m-p I see what you mean.
I think preserving the original rtrim property of all special tokens inside a GGUF file and using it in the tokenizer is indeed the right solution. Ideally, though, it should be done in a non-breaking manner: when rtrim is not present in the GGUF file (or has a specific setting like in #7685), the default should be false so that older models continue functioning as they used to.
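
Something along the lines of this hypothetical fallback (the "rtrim" key and the attribute-lookup layout are illustrative, not the actual GGUF metadata fields):

# Hypothetical sketch of the backward-compatible fallback described above; the
# "rtrim" key and the per-token attrs dict are illustrative, not real GGUF fields.
def special_token_rtrim(token_attrs: dict, token_id: int) -> bool:
    # If the GGUF file carries no per-token rtrim information, default to False
    # so that older models keep behaving as they used to.
    return token_attrs.get(token_id, {}).get("rtrim", False)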

@mofosyne added the Review Complexity : Low label on Jun 3, 2024
Contributor

github-actions bot commented Jun 3, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 537 iterations 🚀

Expand details for performance-related PRs only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8692.47ms p(95)=21612.06ms fails=, finish reason: stop=479 truncated=58
  • Prompt processing (pp): avg=98.28tk/s p(95)=439.61tk/s
  • Token generation (tg): avg=32.03tk/s p(95)=46.25tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fixSpmTokenizer commit=67a43f94fad6abbcd2c18f0e2837a5b35d224e46

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

@jaime-m-p
Collaborator

> the default should be false so that older models will continue functioning as they used to

@giladgd Fixed: 3b38d48

@giladgd
Contributor Author

giladgd commented Jun 4, 2024

> > the default should be false so that older models will continue functioning as they used to
>
> @giladgd Fixed: 3b38d48

@jaime-m-p I've just checked it again with the old functionary-small-v2.2.q4_0.gguf model, and the issue is still there; tokenizing <|from|>system with special tokens enabled and detokenizing it back still results in <|from|> system.
The latest Functionary models use a pre-tokenizer, so this issue does not affect them; I use this old model as a test case for checking compatibility with old models.

@jaime-m-p
Collaborator

@giladgd, @snichols Sorry, I think I'm missing something.
When you say "it's wrong", do you mean "it's not the previous behavior"?
When I say "it's wrong", I mean the output does not match AutoTokenizer.

Maybe I'm making some mistake, but it seems the AutoTokenizer output is <|from|> system.
If this is not a valid assumption, then we need an extra variable telling whether the model is old or new, and reintroduce the previous behavior for old models.

> I use this old model as a test case for testing compatibility with old models

I see. If AutoTokenizer is not our ground truth, do you know what the ground truth implementation is, so I can compare and try to match outputs?

I generated vocab files from the link you provided (functionary-small-v2.2.q4_0.gguf), and the random brute-force test passes (I mean, it matches the AutoTokenizer outputs).
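
For reference, this is roughly the shape of such a brute-force comparison (an outline only; the paths are hypothetical, and the llama-cpp-python bindings stand in for calling the llama.cpp tokenizer directly):

# Outline of a random brute-force comparison against AutoTokenizer.
# Paths are hypothetical; llama-cpp-python stands in for the llama.cpp tokenizer.
import random
import string

from llama_cpp import Llama
from transformers import AutoTokenizer

hf = AutoTokenizer.from_pretrained("./models/tokenizers/functionary-small-v2.2/")
gg = Llama(model_path="functionary-small-v2.2.q4_0.gguf", vocab_only=True)

for _ in range(1000):
    text = "".join(random.choices(string.ascii_letters + " <|from|>", k=random.randint(1, 64)))
    hf_ids = hf.encode(text, add_special_tokens=False)
    gg_ids = gg.tokenize(text.encode("utf-8"), add_bos=False, special=True)
    assert hf_ids == gg_ids, f"mismatch for {text!r}: {hf_ids} vs {gg_ids}"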

@shibe2
Contributor

shibe2 commented Jun 5, 2024

Looking for ground truth is not the way. Theoretically, the prompt format/tokenization should match the one used for fine-tuning. That principle applies to space insertion too, and it may vary between fine-tunes of the same base model. But ultimately, what matters is which format works best with the given model. Sometimes a different format works better than the model's original format. To achieve the best results, one has to test the model, preferably on their particular use case, with and without spaces in different places. So it is best to leave the choice to the user. The problem now is that if the space is inserted by llama.cpp but is not needed, it is difficult to get rid of it.

I used to have a workaround to remove the unwanted space, but it was so ugly and annoying that at some point I just uprooted the piece of code that inserts the space, and I now add it to the prompt on the client side when needed.

@ggerganov
Member

We want to match AutoTokenizer as much as possible - it seems to be the best choice in the current ecosystem. Of course, if some fine-tune chooses some arbitrary tokenization, then llama.cpp (and maybe transformers) would not be correct, but that is another issue.

@giladgd
Contributor Author

giladgd commented Jun 9, 2024

I originally considered the change to the default tokenization a breaking change since it affected the generated outputs of older models, but I now understand that the previous default tokenization behavior was indeed incorrect.

@jaime-m-p Thanks for your help as I stumbled through this :)

@giladgd giladgd closed this Jun 9, 2024
Labels
examples, Review Complexity : Low, server
Development

Successfully merging this pull request may close these issues.

Bug: SPM tokenization breaks in at least one specific case.
5 participants