LLaMA FastTokenizer does not add eos_token_id at the end. #22794

Closed
@osainz59

Description

System Info

  • transformers version: 4.29.0.dev0
  • Platform: Linux-4.18.0-305.19.1.el8_4.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.7
  • Huggingface_hub version: 0.13.3
  • Safetensors version: 0.3.0
  • PyTorch version (GPU?): 2.1.0.dev20230411+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

As mentioned in the title, the LLaMA tokenizer does not add the eos_token at the end of the inputs. This only happens with the fast version (use_fast=True).

Steps to reproduce the behaviour:

  1. Load the LLaMA tokenizer:
     tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)
  2. Tokenize something:
     simple_sentence = "This is a sentence to test if the tokenizer adds eos token."
     simple_sentence_ids = tokenizer(
         simple_sentence, add_special_tokens=True
     ).input_ids
  3. Print the input_ids to check whether the eos_token_id (2) is added at the end:
     print(simple_sentence_ids)
  4. Output:
     [1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]

Expected behavior

Expected output

[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889, 2]
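Until this is fixed in the fast tokenizer, a minimal workaround is to append the eos_token_id to the encoded ids yourself. The helper below (ensure_eos is a hypothetical name, not part of transformers) sketches this, assuming eos_token_id is 2 as in the output above:

```python
def ensure_eos(input_ids, eos_token_id=2):
    """Append eos_token_id if the tokenizer did not add it.

    Hypothetical workaround helper; eos_token_id=2 matches the
    LLaMA tokenizer's </s> id shown in the reproduction above.
    """
    if not input_ids or input_ids[-1] != eos_token_id:
        return input_ids + [eos_token_id]
    return input_ids
```

Applied to the reproduction output, this turns [..., 29889] into the expected [..., 29889, 2], and it is a no-op once the fast tokenizer honors add_eos_token=True.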
