
Commit 2da8853

Authored by ArthurZucker, sgugger, amyeroberts, and LysandreJik
🚨🚨 🚨🚨 [Tokenizer] attemp to fix add_token issues🚨🚨 🚨🚨 (#23909)
* fix test for bart. Order is correct now let's skip BPEs * ouf * styling * fix bert.... * slow refactoring * current updates * massive refactoring * update * NICE! * update to see where I am at * updates * update * update * revert * updates * updates * start supporting legacy_save * styling * big update * revert some changes * nits * nniiiiiice * small fixes * kinda fix t5 with new behaviour * major update * fixup * fix copies * today's updates * fix byt5 * upfate * update * update * updates * update vocab size test * Barthez does not use not need the fairseq offset ids * super calll must be after * calll super * move all super init * move other super init * fixup * nits * more fixes * nits * more fixes * nits * more fix * remove useless files * ouch all of them are affected * and more! * small imporvements * no more sanitize token * more changes around unique no split tokens * partially fix more things * keep legacy save but add warning * so... more fixes * updates * guess deberta tokenizer could be nuked * fixup * fixup did some bad things * nuke it if it breaks * remove prints and pretrain fast from slow with new format. * fixups * Apply suggestions from code review Co-authored-by: Sylvain Gugger <[email protected]> * fiou * nit * by default specials should not be normalized? * update * remove brakpoint * updates * a lot of updates * fixup * fixes revert some changes to match fast * small nits * that makes it cleaner * fix camembert accordingly * update * some lest breaking changes * update * fixup * fix byt5 and whisper mostly * some more fixes, canine's byte vocab * fix gpt2 * fix most of the perceiver tests (4 left) * fix layout lmv3 * fixup * fix copies for gpt2 style * make sure to only warn once * fix perciever and gpt2 tests * some more backward compatibility: also read special tokens map because some ppl use it........////..... * fixup * add else when reading * nits * fresh updates * fix copies * will this make everything faster? * fixes * more fixes * update * more fixes * fixup * is the source of truth right? * sorry camembert for the troubles * current updates * fixup * update led * update * fix regression * fix single word * more model specific fixes * fix t5 tests * fixup * more comments * update * fix nllb * rstrip removed * small fixes * better handle additional_special_tokens and vocab sizes * fixing * styling * fix 4 / 21 * fixup * fix nlbb's tests * some fixes * fix t5 * fixes * style * fix canine tests * damn this is nice * nits * m2m100 nit * fixups * fixes! * fixup * stash * fix merge * revert bad change * fixup * correct order for code Llama * fix speecht5 post merge * styling * revert source of 11 fails * small nits * all changes in one go * fnet hack * fix 2 more tests * update based on main branch of tokenizers * fixup * fix VITS issues * more fixes * fix mgp test * fix camembert issues * oups camembert still has 2 failing tests * mluke fixes * decode fixes * small nits * nits * fix llama and vits * fix camembert * smal nits * more fixes when initialising a fast from a slow and etc * fix one of the last test * fix CPM tokenizer test * fixups * fix pop2piano * fixup * ⚠️ Change tokenizers required version ⚠️ * ⚠️ Change tokenizers required version ⚠️ * "tokenizers>=0.14,<0.15", don't forget smaller than * fix musicgen tests and pretraiendtokenizerfast * fix owlvit and all * update t5 * fix 800 red * fix tests * fix the fix of the fix of t5 * styling * documentation nits * cache _added_tokens_encoder * fixups * Nit * fix red tests * one last nit! 
* make eveything a lot simpler * Now it's over 😉 * few small nits * Apply suggestions from code review Co-authored-by: amyeroberts <[email protected]> * updates that work for now * tests that should no be skipped / changed and fixed next * fixup * i am ashamed * pushe the fix * update * fixups * nits * fix added_tokens_encoder * fix canine test * fix pegasus vocab * fix transfoXL * fixup * whisper needs to be fixed for train new * pegasus nits * more pegasus fixes * minor update * better error message in failed test * fix whisper failing test * fix whisper failing test * fix pegasus * fixup * fix **** pegasus * reset things * remove another file * attempts to fix the strange custome encoder and offset * nits here and there * update * fixup * nit * fix the whisper test * nits nits * Apply suggestions from code review Co-authored-by: amyeroberts <[email protected]> * updates based on review * some small update to potentially remove * nits * import rlu cache * Update src/transformers/tokenization_utils_base.py Co-authored-by: Lysandre Debut <[email protected]> * move warning to `from_pretrained` * update tests results now that the special tokens are always added --------- Co-authored-by: Sylvain Gugger <[email protected]> Co-authored-by: amyeroberts <[email protected]> Co-authored-by: Lysandre Debut <[email protected]>
1 parent 835b0a0 commit 2da8853

138 files changed: +2304 additions, -2053 deletions


.gitignore (+1 -1)

@@ -166,4 +166,4 @@ tags
 .DS_Store
 
 # ruff
-.ruff_cache
+.ruff_cache

setup.py (+1 -1)

@@ -172,7 +172,7 @@
     "tf2onnx",
     "timeout-decorator",
     "timm",
-    "tokenizers>=0.11.1,!=0.11.3,<0.14",
+    "tokenizers>=0.14,<0.15",
     "torch>=1.10,!=1.12.0",
     "torchaudio",
     "torchvision",

src/transformers/dependency_versions_table.py (+1 -1)

@@ -78,7 +78,7 @@
     "tf2onnx": "tf2onnx",
     "timeout-decorator": "timeout-decorator",
     "timm": "timm",
-    "tokenizers": "tokenizers>=0.11.1,!=0.11.3,<0.14",
+    "tokenizers": "tokenizers>=0.14,<0.15",
     "torch": "torch>=1.10,!=1.12.0",
     "torchaudio": "torchaudio",
     "torchvision": "torchvision",

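Editor's note (a sketch, not part of the commit): the two diffs above only move the pinned range for the `tokenizers` backend. Assuming the third-party `packaging` library is available, a quick way to check whether a local environment already satisfies the new pin is:

# Sketch only: verify the installed `tokenizers` matches the new requirement.
from importlib.metadata import PackageNotFoundError, version

from packaging.specifiers import SpecifierSet

REQUIRED = SpecifierSet(">=0.14,<0.15")  # range introduced by this commit

try:
    installed = version("tokenizers")
except PackageNotFoundError:
    print("tokenizers is not installed")
else:
    print(f"tokenizers {installed} satisfies '{REQUIRED}': {installed in REQUIRED}")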
src/transformers/models/albert/tokenization_albert.py (+10 -8)

@@ -159,6 +159,14 @@ def __init__(
 
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
+        self.do_lower_case = do_lower_case
+        self.remove_space = remove_space
+        self.keep_accents = keep_accents
+        self.vocab_file = vocab_file
+
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+
         super().__init__(
             do_lower_case=do_lower_case,
             remove_space=remove_space,
@@ -174,14 +182,6 @@ def __init__(
             **kwargs,
         )
 
-        self.do_lower_case = do_lower_case
-        self.remove_space = remove_space
-        self.keep_accents = keep_accents
-        self.vocab_file = vocab_file
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
-
     @property
     def vocab_size(self) -> int:
         return len(self.sp_model)
@@ -228,6 +228,8 @@ def _tokenize(self, text: str) -> List[str]:
         new_pieces = []
         for piece in pieces:
             if len(piece) > 1 and piece[-1] == str(",") and piece[-2].isdigit():
+                # Logic to handle special cases see https://github.com/google-research/bert/blob/master/README.md#tokenization
+                # `9,9` -> ['▁9', ',', '9'] instead of [`_9,`, '9']
                 cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, ""))
                 if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
                     if len(cur_pieces[0]) == 1:

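The reordering above is the pattern this commit applies to most slow tokenizers: the SentencePiece model and the instance attributes are set up before `super().__init__(...)` runs, because the base class now consults the subclass vocabulary when it registers special and added tokens. A minimal sketch with toy classes (not the transformers API) of why the call order matters:

# Toy sketch: the base constructor needs the subclass vocab, so the vocab
# must exist before super().__init__ is invoked.
class ToyBaseTokenizer:
    def __init__(self, unk_token="<unk>"):
        # The base class immediately consults the subclass vocab, e.g. to
        # register special/added tokens against existing ids.
        if unk_token not in self.get_vocab():
            raise ValueError(f"{unk_token!r} is missing from the vocabulary")
        self.unk_token = unk_token

    def get_vocab(self):
        raise NotImplementedError


class ToySentencePieceTokenizer(ToyBaseTokenizer):
    def __init__(self, pieces, unk_token="<unk>"):
        # 1. Load the model / build the vocab first ...
        self._vocab = {piece: idx for idx, piece in enumerate(pieces)}
        # 2. ... only then call the parent, which can now see it. Calling
        # super().__init__ first would fail here with AttributeError, which is
        # essentially why the commit moves the call to the end of __init__.
        super().__init__(unk_token=unk_token)

    def get_vocab(self):
        return dict(self._vocab)


tok = ToySentencePieceTokenizer(["<unk>", "▁hello", "▁world"])
print(tok.get_vocab())  # {'<unk>': 0, '▁hello': 1, '▁world': 2}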
src/transformers/models/bart/tokenization_bart.py (+15 -13)

@@ -204,21 +204,10 @@ def __init__(
         pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
 
         # Mask token behave like a normal word, i.e. include the space before it
+        # TODO seems like both slow and fast actually don't strip left and right soooooooo yeah. See `test_embeded_special_tokens`
+        # Also this not only will strip the spaces but any punctuation
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
 
-        super().__init__(
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            **kwargs,
-        )
-
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -235,6 +224,19 @@ def __init__(
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
 
+        super().__init__(
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            **kwargs,
+        )
+
     @property
     def vocab_size(self):
         return len(self.encoder)

src/transformers/models/bart/tokenization_bart_fast.py (+1)

@@ -170,6 +170,7 @@ def __init__(
         trim_offsets=True,
         **kwargs,
     ):
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
        super().__init__(
             vocab_file,
             merges_file,

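Both BART diffs above give the mask token `lstrip=True`, matching the comment that the mask token should include the space before it. A toy re-implementation (illustration only, not the `tokenizers` matching code) of what left-stripping means when splitting text on a special token:

# Toy sketch: with lstrip=True, whitespace directly to the left of the special
# token is absorbed into the match instead of being left on the previous chunk.
import re


def split_on_special(text, token="<mask>", lstrip=True):
    pattern = (r"\s*" if lstrip else "") + re.escape(token)
    parts, last = [], 0
    for match in re.finditer(pattern, text):
        parts.append(text[last:match.start()])
        parts.append(token)
        last = match.end()
    parts.append(text[last:])
    return [part for part in parts if part]


print(split_on_special("Hello <mask> world"))                # ['Hello', '<mask>', ' world']
print(split_on_special("Hello <mask> world", lstrip=False))  # ['Hello ', '<mask>', ' world']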
src/transformers/models/barthez/tokenization_barthez.py (+6 -16)

@@ -47,6 +47,8 @@
 
 SPIECE_UNDERLINE = "▁"
 
+# TODO this class is useless. This is the most standard sentencpiece model. Let's find which one is closest and nuke this.
+
 
 class BarthezTokenizer(PreTrainedTokenizer):
     """
@@ -141,6 +143,9 @@ def __init__(
 
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
+        self.vocab_file = vocab_file
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(str(vocab_file))
         super().__init__(
             bos_token=bos_token,
             eos_token=eos_token,
@@ -153,15 +158,6 @@ def __init__(
             **kwargs,
         )
 
-        self.vocab_file = vocab_file
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(str(vocab_file))
-
-        self.fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
-
-        self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) - 1
-        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
-
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
@@ -251,16 +247,10 @@ def _tokenize(self, text: str) -> List[str]:
 
     def _convert_token_to_id(self, token):
         """Converts a token (str) in an id using the vocab."""
-        if token in self.fairseq_tokens_to_ids:
-            return self.fairseq_tokens_to_ids[token]
-        spm_id = self.sp_model.PieceToId(token)
-
-        return spm_id if spm_id else self.unk_token_id
+        return self.sp_model.PieceToId(token)
 
     def _convert_id_to_token(self, index):
         """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.fairseq_ids_to_tokens:
-            return self.fairseq_ids_to_tokens[index]
         return self.sp_model.IdToPiece(index)
 
     def convert_tokens_to_string(self, tokens):

src/transformers/models/bartpho/tokenization_bartpho.py (+12 -12)

@@ -139,18 +139,6 @@ def __init__(
 
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
-
         self.vocab_file = vocab_file
         self.monolingual_vocab_file = monolingual_vocab_file
         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
@@ -174,6 +162,18 @@ def __init__(
 
         self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
 
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
+
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None

src/transformers/models/bert/tokenization_bert.py (+16 -15)

@@ -196,20 +196,6 @@ def __init__(
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -225,7 +211,22 @@ def __init__(
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
 
     @property
     def do_lower_case(self):

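In the BERT diff above, `WordpieceTokenizer` now receives `str(unk_token)` rather than `self.unk_token`, since `self.unk_token` only exists once `super().__init__` has run, and the raw argument may be an `AddedToken`-style object rather than a plain string. A toy illustration (hypothetical `ToyAddedToken`, not the real `tokenizers.AddedToken`):

# Toy sketch: a rich token object is not interchangeable with its string when
# doing plain dict lookups, hence the explicit str(...) cast in the diff.
class ToyAddedToken:
    def __init__(self, content, lstrip=False, rstrip=False):
        self.content, self.lstrip, self.rstrip = content, lstrip, rstrip

    def __str__(self):
        return self.content


vocab = {"[UNK]": 0, "hello": 1}
unk = ToyAddedToken("[UNK]")

print(unk in vocab)       # False: the object itself is not a key of the vocab dict
print(str(unk) in vocab)  # True: the underlying string is what the vocab stores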
src/transformers/models/bert_generation/tokenization_bert_generation.py (+5 -5)

@@ -96,6 +96,11 @@ def __init__(
     ) -> None:
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
+        self.vocab_file = vocab_file
+
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(vocab_file)
+
         # Add extra_ids to the special token list
         super().__init__(
             bos_token=bos_token,
@@ -107,11 +112,6 @@ def __init__(
             **kwargs,
         )
 
-        self.vocab_file = vocab_file
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(vocab_file)
-
     @property
     def vocab_size(self):
         return self.sp_model.get_piece_size()

src/transformers/models/bert_japanese/tokenization_bert_japanese.py (+21 -22)

@@ -160,25 +160,6 @@ def __init__(
         jumanpp_kwargs=None,
         **kwargs,
     ):
-        super().__init__(
-            spm_file=spm_file,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            do_lower_case=do_lower_case,
-            do_word_tokenize=do_word_tokenize,
-            do_subword_tokenize=do_subword_tokenize,
-            word_tokenizer_type=word_tokenizer_type,
-            subword_tokenizer_type=subword_tokenizer_type,
-            never_split=never_split,
-            mecab_kwargs=mecab_kwargs,
-            sudachi_kwargs=sudachi_kwargs,
-            jumanpp_kwargs=jumanpp_kwargs,
-            **kwargs,
-        )
-
         if subword_tokenizer_type == "sentencepiece":
             if not os.path.isfile(spm_file):
                 raise ValueError(
@@ -226,13 +207,31 @@ def __init__(
         self.subword_tokenizer_type = subword_tokenizer_type
         if do_subword_tokenize:
             if subword_tokenizer_type == "wordpiece":
-                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
             elif subword_tokenizer_type == "character":
-                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=str(unk_token))
             elif subword_tokenizer_type == "sentencepiece":
-                self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=self.unk_token)
+                self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=str(unk_token))
             else:
                 raise ValueError(f"Invalid subword_tokenizer_type '{subword_tokenizer_type}' is specified.")
+        super().__init__(
+            spm_file=spm_file,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            do_lower_case=do_lower_case,
+            do_word_tokenize=do_word_tokenize,
+            do_subword_tokenize=do_subword_tokenize,
+            word_tokenizer_type=word_tokenizer_type,
+            subword_tokenizer_type=subword_tokenizer_type,
+            never_split=never_split,
+            mecab_kwargs=mecab_kwargs,
+            sudachi_kwargs=sudachi_kwargs,
+            jumanpp_kwargs=jumanpp_kwargs,
+            **kwargs,
+        )
 
     @property
     def do_lower_case(self):

src/transformers/models/bertweet/tokenization_bertweet.py (+16 -17)

@@ -134,18 +134,6 @@ def __init__(
         mask_token="<mask>",
         **kwargs,
     ):
-        super().__init__(
-            normalization=normalization,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            unk_token=unk_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            **kwargs,
-        )
-
         try:
             from emoji import demojize
 
@@ -161,10 +149,10 @@ def __init__(
         self.merges_file = merges_file
 
         self.encoder = {}
-        self.encoder[self.bos_token] = 0
-        self.encoder[self.pad_token] = 1
-        self.encoder[self.eos_token] = 2
-        self.encoder[self.unk_token] = 3
+        self.encoder[bos_token] = 0
+        self.encoder[pad_token] = 1
+        self.encoder[eos_token] = 2
+        self.encoder[unk_token] = 3
 
         self.add_from_file(vocab_file)
 
@@ -178,9 +166,20 @@ def __init__(
 
         self.normalization = normalization
         self.tweetPreprocessor = TweetTokenizer()
-
         self.special_puncts = {"’": "'", "…": "..."}
 
+        super().__init__(
+            normalization=normalization,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            unk_token=unk_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            **kwargs,
+        )
+
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
