-
Notifications
You must be signed in to change notification settings - Fork 277
Add Thai word list from ICU BreakIterator dictionary #879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @pavaris-pm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2023-12-06 10:17:06 UTC |
Hello! Thank you for your pull request. Can you add filter the word that start with |
SPDX-License-Identifier: Unicode-DFS-2016
Sure. Did you mean add a parameters for user to control whether to return a corpus with the text starts with # or not right? by True if you want a returned corpus including words starts with #, and returned the corpus with filtered out word starts with # (no word start with # in corpus) otherwise. |
Yes 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once license info has moved to corpus_license.md AND the comment lines are properly discarded, I can merge this.
I think we can do this in Maybe add the boolean parameter Or, we can utilize the existing Python standard library |
@bact @wannaphong i already add comment filtering by adding a new parameters named |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for help me sorting it alphabetically krub 👍🏻
@bact @wannaphong I've made some experiment to test the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It look great for me.
Filename: icubrk_th.txt License: Unicode-DFS-2016
Also rename `thai_icu()` to `thai_icu_words()` to make it more explicit and consistent with others, like: `thai_orst_words()`
Change thai_icu to thai_icu_words
Kudos, SonarCloud Quality Gate passed!
|
def get_corpus(filename: str, as_is: bool = False) -> Union[frozenset, list]: | ||
def get_corpus(filename: str, | ||
as_is: bool = False, | ||
comments: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed this to comments
instead of discard_comments
(as I suggested earlier) to avoid double negation.
The semantic now is:
- if
comments
= True, then keep comments - if
comments
= False, then discard comments
""" | ||
path = path_pythainlp_corpus(filename) | ||
lines = [] | ||
with open(path, "r", encoding="utf-8-sig") as fh: | ||
lines = fh.read().splitlines() | ||
|
||
if not comments: | ||
# take only text before character '#' | ||
lines = [line.split("#", 1)[0] for line in lines] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will allowed the comment to be at any position of the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved.
Few modification to get_corpus()
to make the code more generic.
I have changed the module/function name to
corpus.icu
instead ofcorpus.thai_icu
- to make the module name more genericthai_icu_words
instead ofthai_icu
- to make the function name inline withthai_words
andthai_orst_words
So when import, it will be like:
from pythainlp.corpus.icu import thai_icu_words
Note: I will also after this rename the wikipedia (#869) and volubilis (#870) corpora as well, to make them more consistent:
So instead of having:
from pythainlp.corpus.volubilis import volubilis
from pythainlp.corpus.wikipedia_titles import wikipedia_titles
we should have:
from pythainlp.corpus.volubilis import thai_volubilis_words
from pythainlp.corpus.wikipedia import thai_wikipedia_titles
Merged thank you. |
What does this changes
@wannaphong @bact from issue #877 since ICU are included to almost all web browser, i've added ICU dictionary to PyThaiNLP where file of ICU dictionary are named as
icubrk_th.txt
and their python file to load the corpus are named asthai_icu.py
krub.Will resolve #877
Your checklist for this pull request
🚨Please review the guidelines for contributing to this repository.