Fix bug in Longest Matching tokenizer to preprocess spaces consistently #1062

Merged: 3 commits, Jan 11, 2025

wannaphong
Member

Fixes #1061

Update the Longest Matching tokenizer to preprocess spaces consistently with the Multi-Cut tokenizer.

  • Modify pythainlp/tokenize/longest.py to group consecutive spaces into one token using regex.
  • Add test cases in tests/core/test_tokenize.py to verify consistent preprocessing of spaces between Longest Matching and Multi-Cut tokenizers.
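For illustration, grouping consecutive spaces into one token with a regex might look like the sketch below. This is not the code from `pythainlp/tokenize/longest.py`; the names `_SPACE_RUN` and `split_space_runs` are hypothetical, and the sketch only assumes that "spaces" means the plain space character.

```python
import re
from typing import List

# Hypothetical pattern: a run of one or more plain space characters.
# The capturing group makes re.split keep each run in its output.
_SPACE_RUN = re.compile(r"( +)")


def split_space_runs(text: str) -> List[str]:
    """Split text so that each run of consecutive spaces is one chunk.

    Non-space chunks are returned unchanged; a real tokenizer would
    pass them on to its dictionary-matching step.
    """
    # re.split can yield empty strings at the edges; drop them.
    return [chunk for chunk in _SPACE_RUN.split(text) if chunk]
```

Joining the returned chunks reproduces the input, so no characters are lost; only the grouping of spaces changes.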

@coveralls

Coverage Status

coverage: 52.712% (+0.04%) from 52.675%
when pulling 97dfe87 on wannaphong/fix-tokenizer
into 9a9d11f on dev.

@wannaphong wannaphong added the bug bugs in the library label Jan 11, 2025
@wannaphong wannaphong added this to the 5.1 milestone Jan 11, 2025
@wannaphong wannaphong merged commit cae175c into dev Jan 11, 2025
41 checks passed
@bact bact deleted the wannaphong/fix-tokenizer branch January 11, 2025 05:55
@bact
Member

bact commented Jan 11, 2025

the new _RE_SPACES regex has never been used

@wannaphong
Member Author

wannaphong commented Jan 11, 2025

> the new _RE_SPACES regex has never been used

Oh, it is an LLM hallucination. Fixed in 3aa57c6.

Successfully merging this pull request may close these issues.

bug: Why isn’t space preprocessing consistent between Longest Matching and Multi-Cut?