
Commit a0d057f
[ML] Update Deberta tokenizer (#116358) (#117195)
* Was using byte position for the end offset, but it seems like using char position is correct
* Update docs/changelog/116358.yaml
* Update UnigramTokenizer.java

Co-authored-by: Elastic Machine <[email protected]>
Parent: 626d100

2 files changed (+8, -1)

docs/changelog/116358.yaml (+5)

@@ -0,0 +1,5 @@
+pr: 116358
+summary: Update Deberta tokenizer
+area: Machine Learning
+type: bug
+issues: []

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/UnigramTokenizer.java (+3, -1)

@@ -367,8 +367,10 @@ List<DelimitedToken.Encoded> tokenize(CharSequence inputSequence, IntToIntFunction
                 new DelimitedToken.Encoded(
                     Strings.format("<0x%02X>", bytes[i]),
                     pieces[i],
+                    // even though we are changing the number of characters in the output, we don't
+                    // need to change the offsets. The offsets refer to the input characters
                     offsetCorrection.apply(node.startsAtCharPos),
-                    offsetCorrection.apply(startsAtBytes + i)
+                    offsetCorrection.apply(endsAtChars)
                 )
             );
         }
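The underlying issue is easier to see with a concrete multi-byte character. The sketch below is illustrative only (the class name and sample string are hypothetical, not from the Elasticsearch codebase): it shows how an end offset computed in UTF-8 byte space, like the removed `startsAtBytes + i`, diverges from the char-based position that offsets into the input CharSequence must use.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Elasticsearch.
public class ByteVsCharOffsets {
    public static void main(String[] args) {
        // "é" is one Java char but two UTF-8 bytes; "€" is one char but three bytes.
        String input = "café€";

        int charLength = input.length();                                // 5
        int byteLength = input.getBytes(StandardCharsets.UTF_8).length; // 8

        System.out.println("char length: " + charLength);
        System.out.println("byte length: " + byteLength);

        // An end offset computed in byte space can run past input.length() or
        // land inside a multi-byte sequence, producing invalid spans over the
        // original text. Offsets returned to callers index the original input
        // characters, which is why the fix reports the char-based end position
        // (endsAtChars) instead of the byte-based one.
    }
}
```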
