[ML] fix NLP tokenization never_split handling around punctuation #82982

Merged

benwtrent

When several consecutive characters could each be part of a never_split token, we tokenized them incorrectly.

This commit handles that scenario, so [[UNK] is now tokenized as [, [UNK]
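
For readers unfamiliar with the behavior, here is a minimal Python sketch of the intended result. It is not the Elasticsearch implementation (which is Java and not shown in this conversation); the function name `split_on_punctuation` and its scanning strategy are assumptions made purely for illustration. At each position it first checks whether a never_split token starts there, and only falls back to splitting off a single punctuation character when none does.

```python
import string


def split_on_punctuation(text, never_split):
    """Split on whitespace and punctuation, keeping never_split tokens whole.

    Hypothetical helper for illustration only; not the code from this PR.
    """
    # Ignore empty entries and try longer protected tokens first.
    protected = sorted((t for t in never_split if t), key=len, reverse=True)
    tokens = []
    word = []  # characters of the ordinary word currently being built

    def flush():
        # Emit the word collected so far, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    i = 0
    while i < len(text):
        # Does a protected token start exactly at this position?
        match = next((t for t in protected if text.startswith(t, i)), None)
        if match is not None:
            flush()
            tokens.append(match)
            i += len(match)
        elif text[i].isspace():
            flush()
            i += 1
        elif text[i] in string.punctuation:
            # Punctuation that does not begin a protected token stands alone.
            flush()
            tokens.append(text[i])
            i += 1
        else:
            word.append(text[i])
            i += 1
    flush()
    return tokens


print(split_on_punctuation("[[UNK]", {"[UNK]"}))  # ['[', '[UNK]']
```

Note that this sketch also matches a protected token in the middle of a word, which a real tokenizer would probably want to rule out.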

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jan 24, 2022
@elasticmachine

Pinging @elastic/ml-core (Team:ML)

@przemekwitek przemekwitek left a comment

LGTM

@benwtrent benwtrent merged commit b9aebc8 into elastic:master Jan 25, 2022
@benwtrent benwtrent deleted the bugfix/ml-fix-never-split-tokenization branch January 25, 2022 12:39
@elastic elastic deleted a comment from elasticsearchmachine Jan 25, 2022
@benwtrent

This should not be backported to v8.0.0, as the original code it fixes was not backported.

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jan 26, 2022
* upstream/master: (762 commits)
  [DOCS] Add note to that log4j customization is outside the support scope (elastic#82668)
  Batch Index Settings Update Requests (elastic#82896)
  [DOCS] Delete pipeline containing stored script (elastic#83102)
  Try again to fix changelog areas after reorg (elastic#83100)
  Bind to non-localhost for transport in some cases (elastic#82973)
  [DOCS] Reuse multi-level `join` warning (elastic#82976)
  Remove unnecessary CopyOnWriteHashMap class (elastic#83040)
  Adjust changelog categories after reorg (elastic#83087)
  [DOCS] Fix typo in `action.destructive_requires_name` breaking change (elastic#83085)
  Stack Monitoring: Add Enterprise Search monitoring index templates (elastic#82743)
  [DOCS] Fix stored script example snippet (elastic#83056)
  [DOCS] Re-add network traffic para to `term` query (elastic#83047)
  [DOCS] Rename example stored script (elastic#83054)
  [ML][DOCS] Add Trained model APIs to the REST APIs index (elastic#82791)
  [ML] Update running process when global calendar changes (elastic#83044)
  [Transform] Fix condition on which the transform stops processing buckets (elastic#82852)
  [DOCS] Fixes field names in ML sum functions. (elastic#83048)
  [ML] fix NLP tokenization never_split handling around punctuation (elastic#82982)
  Construct dynamic updates directly via object builders (elastic#81449)
  Emit trace.id into audit logs (elastic#82849)
  ...

# Conflicts:
#	client/rest-high-level/src/test/java/org/elasticsearch/client/IndicesClientIT.java
#	client/rest-high-level/src/test/java/org/elasticsearch/client/documentation/ILMDocumentationIT.java
#	server/src/main/java/org/elasticsearch/action/admin/indices/rollover/Condition.java
#	server/src/test/java/org/elasticsearch/action/admin/indices/rollover/ConditionTests.java
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/RolloverActionTests.java
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/TimeseriesLifecycleTypeTests.java
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/WaitForRolloverReadyStepTests.java
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team v8.1.0
3 participants