
Commit a0d057f
[ML] Update Deberta tokenizer (#116358) (#117195)
* Was using byte position for the end offset, but it seems like using char position is correct
* Update docs/changelog/116358.yaml
* Update UnigramTokenizer.java

Co-authored-by: Elastic Machine <[email protected]>
Parent: 626d100

2 files changed (+8, -1)

docs/changelog/116358.yaml (+5)

@@ -0,0 +1,5 @@
+pr: 116358
+summary: Update Deberta tokenizer
+area: Machine Learning
+type: bug
+issues: []

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/UnigramTokenizer.java (+3, -1)

@@ -367,8 +367,10 @@ List<DelimitedToken.Encoded> tokenize(CharSequence inputSequence, IntToIntFunction
                 new DelimitedToken.Encoded(
                     Strings.format("<0x%02X>", bytes[i]),
                     pieces[i],
+                    // even though we are changing the number of characters in the output, we don't
+                    // need to change the offsets. The offsets refer to the input characters
                     offsetCorrection.apply(node.startsAtCharPos),
-                    offsetCorrection.apply(startsAtBytes + i)
+                    offsetCorrection.apply(endsAtChars)
                 )
             );
         }
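The underlying issue is easier to see with a concrete multi-byte character. The sketch below is illustrative only (the class name and sample string are hypothetical, not from the Elasticsearch codebase): it shows how an end offset computed in UTF-8 byte space, like the removed `startsAtBytes + i`, diverges from the char-based position that offsets into the input CharSequence must use.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Elasticsearch.
public class ByteVsCharOffsets {
    public static void main(String[] args) {
        // "é" is one Java char but two UTF-8 bytes; "€" is one char but three bytes.
        String input = "café€";

        int charLength = input.length();                                // 5
        int byteLength = input.getBytes(StandardCharsets.UTF_8).length; // 8

        System.out.println("char length: " + charLength);
        System.out.println("byte length: " + byteLength);

        // An end offset computed in byte space can run past input.length() or
        // land inside a multi-byte sequence, producing invalid spans over the
        // original text. Offsets returned to callers index the original input
        // characters, which is why the fix reports the char-based end position
        // (endsAtChars) instead of the byte-based one.
    }
}
```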
