Add Thai word list from ICU BreakIterator dictionary #879

pavaris-pm · 2023-12-05T05:32:07Z

What does this change

@wannaphong @bact Following issue #877: since ICU is included in almost all web browsers, I've added the ICU dictionary to PyThaiNLP. The dictionary file is named icubrk_th.txt, and the Python file that loads the corpus is named thai_icu.py.

Will resolve #877

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

Passed code styles and structures
Passed code linting checks and unit test

pep8speaks · 2023-12-05T05:32:17Z

Hello @pavaris-pm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2023-12-06 10:17:06 UTC

wannaphong · 2023-12-05T06:06:58Z

Hello! Thank you for your pull request. Can you add a filter for words that start with #?

SPDX-License-Identifier: Unicode-DFS-2016

pavaris-pm · 2023-12-05T08:03:08Z

Sure. Do you mean adding a parameter that lets the user control whether the returned corpus includes entries that start with #? That is, True returns the corpus including words that start with #, and False returns the corpus with those words filtered out.

wannaphong · 2023-12-05T08:13:43Z

Yes 👍

bact

Once license info has moved to corpus_license.md AND the comment lines are properly discarded, I can merge this.

pythainlp/corpus/thai_icu.py

bact · 2023-12-05T11:44:03Z

I think we can do this in get_corpus().

Maybe add the boolean parameter discard_comments to get_corpus()?
The default is probably False.

Or, we can utilize the existing Python standard library shlex for this. shlex will ignore comment lines when it gets its input.

https://docs.python.org/3/library/shlex.html
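
A minimal sketch of the shlex idea mentioned above; it is an illustration, not code from this PR. It assumes one entry per line, and the sample lines are made up. `shlex.split()` with `comments=True` skips everything from `#` to the end of the line.

```python
import shlex

# Hypothetical sample lines in the style of a word list file.
lines = [
    "# SPDX-License-Identifier: Unicode-DFS-2016",
    "กิน",
    "นอน  # trailing comment",
]

words = []
for line in lines:
    # comments=True keeps '#' as a comment character, so the rest of
    # the line after '#' is ignored by the lexer.
    words.extend(shlex.split(line, comments=True))

print(words)
```

One caveat with this approach: shlex also splits on whitespace and gives quotes and backslashes special meaning, so an entry that contains those characters would be altered or raise `ValueError`.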

pavaris-pm · 2023-12-05T13:20:55Z

@bact @wannaphong I have added comment filtering through a new parameter named discard_comments, which defaults to False. You can review the code in the latest commit.

pythainlp/corpus/corpus_license.md

pavaris-pm

Thanks for helping me sort it alphabetically 👍🏻

pavaris-pm · 2023-12-05T15:02:03Z

@bact @wannaphong I ran some experiments to test the discard_comments parameter and fixed some bugs in it. It now works as intended, so feel free to review it. It's done 💯

wannaphong

It looks great to me.

sonarqubecloud · 2023-12-06T10:17:38Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

bact · 2023-12-06T10:19:20Z

pythainlp/corpus/core.py

-def get_corpus(filename: str, as_is: bool = False) -> Union[frozenset, list]:
+def get_corpus(filename: str,
+               as_is: bool = False,
+               comments: bool = True


I have changed this to comments instead of discard_comments (as I suggested earlier) to avoid double negation.

The semantics are now:

if comments = True, then keep comments

if comments = False, then discard comments

bact · 2023-12-06T10:20:12Z

pythainlp/corpus/core.py

    """
    path = path_pythainlp_corpus(filename)
    lines = []
    with open(path, "r", encoding="utf-8-sig") as fh:
        lines = fh.read().splitlines()

+    if not comments:
+        # take only text before character '#'
+        lines = [line.split("#", 1)[0] for line in lines]


This allows a comment to appear at any position in the line.
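
A minimal, self-contained sketch of the `comments` semantics described above. `strip_comments` is a hypothetical helper written for illustration, not the actual `get_corpus()` implementation, and the whitespace trimming and empty-line filtering at the end are assumptions added so the result is easy to inspect.

```python
from typing import List


def strip_comments(lines: List[str], comments: bool = True) -> List[str]:
    """Hypothetical helper mirroring the `comments` flag.

    comments=True keeps comments; comments=False discards them.
    """
    if not comments:
        # Keep only the text before the first '#', wherever it appears.
        lines = [line.split("#", 1)[0] for line in lines]
    # Assumption: trim whitespace and drop lines that end up empty.
    lines = [line.strip() for line in lines]
    return [line for line in lines if line]


# Made-up sample lines in the style of a word list file.
sample = [
    "# SPDX-License-Identifier: Unicode-DFS-2016",
    "กิน",
    "นอน  # trailing comment",
    "",
]

kept = strip_comments(sample, comments=True)
discarded = strip_comments(sample, comments=False)
```

With `comments=False`, the whole-line comment becomes an empty string and is dropped, while the line with a trailing comment keeps only the word before `#`. A side effect worth noting is that any entry that legitimately contains `#` would be truncated.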

bact

Approved.

I made a few modifications to get_corpus() to make the code more generic.

I have changed the module and function names:

corpus.icu instead of corpus.thai_icu - to make the module name more generic
thai_icu_words instead of thai_icu - to make the function name in line with thai_words and thai_orst_words

So the import will look like:

from pythainlp.corpus.icu import thai_icu_words

Note: After this, I will also rename the wikipedia (#869) and volubilis (#870) corpora to make them more consistent.

So instead of having:

from pythainlp.corpus.volubilis import volubilis
from pythainlp.corpus.wikipedia_titles import wikipedia_titles

we should have:

from pythainlp.corpus.volubilis import thai_volubilis_words
from pythainlp.corpus.wikipedia import thai_wikipedia_titles

bact · 2023-12-06T12:23:07Z

Merged, thank you.

add thai ICU corpus

3ebf721

pavaris-pm mentioned this pull request Dec 5, 2023

Add ICU wordbreak dictionary (Thai) #877

Closed

fix pep8

6338646

Add SPDX tags to thai_icu.txt

7677cf9

SPDX-License-Identifier: Unicode-DFS-2016

bact added the labels enhancement (enhance functionalities) and corpus (corpus/dataset-related issues) Dec 5, 2023

Sort imports in __init__.py

4328b3a

bact requested changes Dec 5, 2023

pythainlp/corpus/thai_icu.py (outdated, resolved)

bact added this to the 5.0 milestone Dec 5, 2023

bact changed the title ~~add Thai ICU Dict into PyThaiNLP corpus~~ Add Thai ICU wordbreak dictionary to PyThaiNLP corpus Dec 5, 2023

add comment filtering and update corpus license

42f60c4

pavaris-pm commented Dec 5, 2023

pythainlp/corpus/corpus_license.md (outdated, resolved)

pavaris-pm commented Dec 5, 2023

pavaris-pm added 4 commits December 5, 2023 13:35

fix pep8

ee85e1b

fix pep8 (trailing whitespaces)

b2473a0

fix bug in thai_icu

52ea875

fix typo

0bed068

pavaris-pm requested a review from bact December 5, 2023 15:02

Add more get_corpus docs

73378c6

wannaphong approved these changes Dec 5, 2023

bact added 3 commits December 6, 2023 09:43

Add license for Thai dict from ICU

bf7212a

Filename: icubrk_th.txt License: Unicode-DFS-2016

Rename thai_icu.txt to icubrk_th.txt

276be53

Adjust comment discard method in get_corpus()

f8ccc3a

bact added 4 commits December 6, 2023 10:12

Update and rename thai_icu.py to icu.py

171214c

Also rename `thai_icu()` to `thai_icu_words()` to make it more explicit and consistent with others, like: `thai_orst_words()`

Update __init__.py

5cf8809

Change thai_icu to thai_icu_words

Update core.py

3323b82

Update test_corpus.py

82322da

bact reviewed Dec 6, 2023

bact approved these changes Dec 6, 2023

bact merged commit 297aadc into PyThaiNLP:dev Dec 6, 2023

bact changed the title ~~Add Thai ICU wordbreak dictionary to PyThaiNLP corpus~~ Add Thai word list from ICU BreakIterator dictionary Dec 15, 2023

bact mentioned this pull request Dec 15, 2023

PyThaiNLP 5.0 Change Log #788

Closed
