Fix bug in Longest Matching tokenizer to preprocess spaces consistently #1062

wannaphong · 2025-01-10T15:02:17Z

Fixes #1061

Update the Longest Matching tokenizer to preprocess spaces consistently with the Multi-Cut tokenizer.

* Modify `pythainlp/tokenize/longest.py` to group consecutive spaces into one token using regex.
* Add test cases in `tests/core/test_tokenize.py` to verify consistent preprocessing of spaces between Longest Matching and Multi-Cut tokenizers.
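For readers following along, here is a minimal sketch of what "group consecutive spaces into one token using regex" could look like. It is not the code from this PR: the names `SPACE_RUN` and `split_keep_space_runs` are hypothetical, and the sketch assumes only ASCII spaces need grouping.

```python
import re

# Hypothetical pattern (not taken from longest.py): a run of one or more
# ASCII spaces. The capturing group makes re.split keep the runs.
SPACE_RUN = re.compile(r"( +)")


def split_keep_space_runs(text: str) -> list[str]:
    """Split text into chunks, keeping each run of spaces as a single chunk."""
    # re.split returns empty strings at the edges when the text starts or
    # ends with a match, so those are filtered out.
    return [chunk for chunk in SPACE_RUN.split(text) if chunk]


if __name__ == "__main__":
    print(split_keep_space_runs("ทดสอบ  ทดสอบ   ทดสอบ"))
```

In a tokenizer, the non-space chunks would then go to the dictionary matcher while each space run is emitted unchanged as its own token, so two engines that share this preprocessing step should agree on how spaces are tokenized.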

sonarqubecloud · 2025-01-10T15:06:42Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coveralls · 2025-01-10T15:13:58Z

coverage: 52.712% (+0.04%) from 52.675%
when pulling 97dfe87 on wannaphong/fix-tokenizer
into 9a9d11f on dev.

bact · 2025-01-11T05:56:42Z

the new _RE_SPACES regex has never been used

wannaphong · 2025-01-11T06:35:24Z

> the new _RE_SPACES regex has never been used

Oh, that was an LLM hallucination. Fixed in 3aa57c6.

wannaphong added 3 commits January 10, 2025 22:02

Update longest.py

2ae9ba6

Update test_tokenize.py

97dfe87

wannaphong mentioned this pull request Jan 10, 2025

bug: Why isn’t space preprocessing consistent between Longest Matching and Multi-Cut? #1061

Closed

wannaphong added the bug bugs in the library label Jan 11, 2025

wannaphong added this to the 5.1 milestone Jan 11, 2025

wannaphong merged commit cae175c into dev Jan 11, 2025
41 checks passed

bact deleted the wannaphong/fix-tokenizer branch January 11, 2025 05:55

This was referenced Feb 19, 2025

PyThaiNLP 5.1 Change Log #900

Closed

PyThaiNLP v5.1.0 Released! #1079

Merged
