Closed
System Info
- `transformers` version: 4.29.0.dev0
- Platform: Linux-4.18.0-305.19.1.el8_4.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.7
- Huggingface_hub version: 0.13.3
- Safetensors version: 0.3.0
- PyTorch version (GPU?): 2.1.0.dev20230411+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
As mentioned in the title, the LLaMA tokenizer does not add the `eos_token` at the end of the inputs. This only happens with the fast version (`use_fast=True`).
Steps to reproduce the behaviour:
- Load the LLaMA tokenizer:

```python
tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)
```

- Tokenize something:

```python
simple_sentence = "This is a sentence to test if the tokenizer adds eos token."
simple_sentence_ids = tokenizer(
    simple_sentence, add_special_tokens=True
).input_ids
```
- Print the `input_ids` to check whether the `eos_token_id` (`2`) is added at the end:

```python
print(simple_sentence_ids)
```
- Output:

```
[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]
```
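Until this is fixed, a minimal workaround sketch is to append the `eos_token_id` manually after tokenizing. The helper name `ensure_eos` and the hard-coded default id `2` below are my own for illustration; in practice you would pass `tokenizer.eos_token_id`:

```python
def ensure_eos(input_ids, eos_token_id=2):
    # Workaround sketch (not part of transformers): append the eos token id
    # if the tokenizer did not already add it at the end.
    if not input_ids or input_ids[-1] != eos_token_id:
        return input_ids + [eos_token_id]
    return input_ids

# Applied to the fast tokenizer's output from above, this restores the
# expected trailing 2:
fast_ids = [1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]
print(ensure_eos(fast_ids))
```

The function is idempotent, so calling it on output from the slow tokenizer (which already ends in `2`) leaves the ids unchanged.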
Expected behavior
Expected output:

```
[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889, 2]
```