Add: remove_trailing_repeat_consonants() #862
Merged
Changes from 38 commits
Commits
39 commits
e39b622
Merge pull request #3 from PyThaiNLP/dev
konbraphat51 6ea4181
documentation
konbraphat51 be29c00
Add: implemation
konbraphat51 3e94234
Add: test code
konbraphat51 ca6cd94
Add: remove_repeat_consonants()
konbraphat51 181664c
Merge branch 'dev' of https://github.com/konbraphat51/pythainlp into dev
konbraphat51 702be9a
Fix: push miss
konbraphat51 130b1ec
Fix: divide the exceeding length code
konbraphat51 ef8ac0f
Refac: remove last white space
konbraphat51 2df4d37
Fix: restrict only to consonants
konbraphat51 16c3154
Refac: Remove unused import
konbraphat51 cc62a95
Refac: Use enumerate
konbraphat51 d74af32
Fix: add the function in init
konbraphat51 5bfa50d
Refac: use black
konbraphat51 28b6006
Refac: repeatedly used black
konbraphat51 c6b564d
Refac: resolve nested if
konbraphat51 8d09323
Fix test case
konbraphat51 946f59c
Refac: seperate function
konbraphat51 a5153e0
Refac: reduce line length
konbraphat51 43dfd25
Refac: seperate 2 functions
konbraphat51 d9ae534
Refac: use black
konbraphat51 844c21d
Refac: seperate match finding method
konbraphat51 1e1631f
Improve: save consonants repeaters for improve speed
konbraphat51 ceb9d76
Refac: make repeater checking function
konbraphat51 6509e0d
Refac: seperate function
konbraphat51 9c1a34c
Improve: Rename method
konbraphat51 24c3050
Refac: make names more clear
konbraphat51 13cf54a
Refac: reflect method name change
konbraphat51 a94fccb
Fix: argument inconsistence
konbraphat51 a2eca98
Merge: seperate + rename
konbraphat51 832d28c
Refac: revert to the first place
konbraphat51 95761ea
Refac: use black
konbraphat51 cefc4e7
Refac: reduce col length
konbraphat51 fd2896b
Refac: add last new line
konbraphat51 ee492f1
Update commentation
konbraphat51 4212ff3
Refac: clearify commentation
konbraphat51 abd4702
Refac: fix typi
konbraphat51 740c5e5
Add: __all__
konbraphat51 3315cb0
Sort export names in __all__
bact
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,256 @@ | ||
# -*- coding: utf-8 -*- | ||
# Copyright (C) 2016-2023 PyThaiNLP Project | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
""" | ||
Removal of repeated consonants at the end of words | ||
""" | ||
from pythainlp.corpus import thai_words | ||
from pythainlp.util.trie import Trie | ||
from pythainlp import thai_consonants as consonants | ||
from typing import Tuple, List | ||
|
||
# used by remove_trailing_repeat_consonants() | ||
# contains all words that have repeating consonants at the end, | ||
# grouped by consonant | ||
# when the dictionary is updated, this should be updated too | ||
# key: consonant | ||
# value: list of words that have repeating consonants at the end | ||
last_consonants_repeaters = {} | ||
|
||
|
||
def remove_trailing_repeat_consonants( | ||
text: str, dictionary: Trie = None, has_dictionary_updated: bool = True | ||
) -> str: | ||
""" | ||
Remove repeating consonants at the end of each word in the text. | ||
|
||
This function removes repeating consonants that appear | ||
before a space, before a newline, or at the end of the text, | ||
so that the last word matches a word in the given dictionary. | ||
If there is no match, the repeating consonants are | ||
reduced to one. | ||
If there are several matches, the longest word is used. | ||
Since this function uses a dictionary, the result may differ | ||
depending on the dictionary used. | ||
It is also recommended to use normalize() for better results. | ||
|
||
:param str text: input text | ||
:param Trie dictionary: Trie dictionary to check the last word. | ||
If None, pythainlp.corpus.thai_words() will be used | ||
:param bool has_dictionary_updated: Set this to True if the dictionary | ||
has been updated or if this is the first call in the current kernel. | ||
Otherwise, set this to False to save time. | ||
:return: text without repeating Thai consonants | ||
:rtype: str | ||
|
||
:Example: | ||
:: | ||
|
||
from pythainlp.util import remove_trailing_repeat_consonants | ||
from pythainlp.util import dict_trie | ||
|
||
# use default dictionary (pythainlp.corpus.thai_words()) | ||
remove_trailing_repeat_consonants('เริ่ดดดดดดดด') | ||
# output: เริ่ด | ||
|
||
remove_trailing_repeat_consonants('อืมมมมมมมมมมมมมมม') | ||
# output: อืมมม | ||
# "อืมมม" is in the default dictionary | ||
|
||
# use custom dictionary | ||
custom_dictionary = dict_trie(["อืมมมมม"]) | ||
remove_trailing_repeat_consonants('อืมมมมมมมมมมมมมมม', custom_dictionary) | ||
# output: อืมมมมม | ||
|
||
# long text | ||
remove_trailing_repeat_consonants('อืมมมมมมมมมมมมม คุณมีบุคลิกที่เริ่ดดดดด '\ | ||
'ฉันจะให้เกรดดีกับคุณณณ\nนี่เป็นความลับบบบบ') | ||
# output: อืมมม คุณมีบุคลิกที่เริ่ด ฉันจะให้เกรดดีกับคุณ | ||
# นี่เป็นความลับ | ||
""" | ||
# use default dictionary if not given | ||
if dictionary is None: | ||
dictionary = thai_words() | ||
|
||
# update repeaters dictionary if not updated | ||
if has_dictionary_updated: | ||
_update_consonant_repeaters(dictionary) | ||
|
||
# separate by newline | ||
modified_lines = [] | ||
for line in text.split("\n"): | ||
segments = line.split(" ") | ||
|
||
for cnt, segment in enumerate(segments): | ||
segments[cnt] = _remove_repeat_trailing_consonants_from_segment( | ||
segment | ||
) | ||
|
||
# restore spaces | ||
modified_line = " ".join(segments) | ||
modified_lines.append(modified_line) | ||
|
||
# restore newlines | ||
modified_text = "\n".join(modified_lines) | ||
|
||
return modified_text | ||
|
||
|
||
def _remove_repeat_trailing_consonants_from_segment(segment: str) -> str: | ||
""" | ||
Remove repeating consonants at the end of the segment. | ||
|
||
This function processes only the end of the given text. | ||
Details are the same as remove_trailing_repeat_consonants(). | ||
|
||
:param str segment: segment of text | ||
:return: segment without repeating Thai consonants | ||
:rtype: str | ||
""" | ||
# skip if the segment is not the target | ||
if not ( | ||
# the segment is long enough | ||
(len(segment) > 1) | ||
# last is Thai consonant | ||
and (segment[-1] in consonants) | ||
# has repetition | ||
and (segment[-1] == segment[-2]) | ||
): | ||
# no need to process | ||
return segment | ||
|
||
# duplicating character | ||
dup = segment[-1] | ||
|
||
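# find the words that have 2 or more repetitions of | ||
# this character at the end. | ||
# .get() falls back to an empty list when the cache has no entry | ||
# for this consonant (for example, if it has not been built yet) | ||
repeaters = last_consonants_repeaters.get(dup, []) | ||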
|
||
# remove all of the last repeating character | ||
segment_head = _remove_all_last_consonants(segment, dup) | ||
|
||
# find the longest word that matches the segment | ||
longest_word, repetition = _find_longest_consonant_repeaters_match( | ||
segment_head, repeaters | ||
) | ||
|
||
if len(longest_word) > 0: | ||
# if there is a match, use it | ||
segment = segment_head + (dup * repetition) | ||
else: | ||
# if none is found, | ||
# chances are that the correct form has a single character, | ||
# or the word is not in the dictionary. | ||
|
||
# reduce the repetition to one | ||
segment = segment_head + (dup * 1) | ||
|
||
return segment | ||
|
||
|
||
def _remove_all_last_consonants(text: str, dup: str) -> str: | ||
""" | ||
Reduce repeating characters at the end of the text. | ||
|
||
This function removes the repeating characters at the end. | ||
The text just before the repeating characters will be returned. | ||
|
||
:param str text: input text | ||
:param str dup: repeating character to be removed | ||
:return: text without repeating characters at the end | ||
:rtype: str | ||
""" | ||
removed = text | ||
while (len(removed) > 0) and (removed[-1] == dup): | ||
removed = removed[:-1] | ||
|
||
return removed | ||
|
||
|
||
def _update_consonant_repeaters(dictionary: Trie) -> None: | ||
""" | ||
Update the global cache of words from the dictionary | ||
that have repeating consonants at the end. | ||
|
||
Search the dictionary for all words that end with the same consonant | ||
repeated two or more times and store them in the global dictionary. | ||
|
||
:param Trie dictionary: Trie dictionary to search | ||
:rtype: None | ||
""" | ||
# initialize dictionary | ||
for consonant in list(consonants): | ||
last_consonants_repeaters[consonant] = [] | ||
|
||
# register | ||
for word in dictionary: | ||
if _is_last_consonant_repeater(word): | ||
last_consonants_repeaters[word[-1]].append(word) | ||
|
||
return | ||
|
||
|
||
def _is_last_consonant_repeater(word: str) -> bool: | ||
""" | ||
Check if the word has repeating consonants at the end. | ||
|
||
This function checks whether the word ends with | ||
the same consonant repeated two or more times. | ||
|
||
:param str word: word to be checked | ||
:return: True if the word has repeating consonants at the end. | ||
:rtype: bool | ||
""" | ||
return ( | ||
(len(word) > 1) and (word[-1] == word[-2]) and (word[-1] in consonants) | ||
) | ||
|
||
|
||
def _find_longest_consonant_repeaters_match( | ||
segment_head: str, repeaters: List[str] | ||
) -> Tuple[str, int]: | ||
""" | ||
Find the longest word that matches the segment. | ||
|
||
Find the longest word from the given repeaters list | ||
that matches the end of the segment. | ||
This returns the word and | ||
the number of times its last character is repeated. | ||
|
||
:param str segment_head: segment of text with the trailing | ||
repeating consonants removed | ||
:param List[str] repeaters: list of words | ||
that has repeating consonants at the end | ||
:return: tuple of the matched word and | ||
the number of times its last character is repeated. | ||
If there is no match, ("", 0) is returned. | ||
:rtype: Tuple[str, int] | ||
""" | ||
longest_word = "" # the longest word that matches the segment | ||
repetition = 0 # correct number of trailing repetitions | ||
for repeater in repeaters: | ||
# remove all of the last repeating character | ||
repeater_head = _remove_all_last_consonants(repeater, repeater[-1]) | ||
|
||
# check match | ||
if ( | ||
(len(segment_head) >= len(repeater_head)) | ||
and (segment_head[-len(repeater_head) :] == repeater_head) | ||
# match confirmed, check that it's longer | ||
and (len(repeater) > len(longest_word)) | ||
): | ||
longest_word = repeater | ||
repetition = len(repeater) - len(repeater_head) | ||
|
||
return longest_word, repetition |
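
For readers who want to try the idea without installing PyThaiNLP, the following is a simplified, self-contained sketch of the per-segment logic described in the docstrings above: strip the trailing run, look for the longest known word whose head matches, and otherwise keep a single consonant. It is not the code from this pull request. The names `DEMO_CONSONANTS` and `reduce_trailing_repeats` are hypothetical, the consonant string is a small stand-in for `pythainlp.thai_consonants`, and the word list is passed in directly instead of being cached per consonant.

```python
from typing import Iterable

# Assumption: a small stand-in for pythainlp.thai_consonants (demo only).
DEMO_CONSONANTS = "กขคงจชดตนบปมยรลวสหอ"


def reduce_trailing_repeats(segment: str, known_words: Iterable[str]) -> str:
    """Collapse a run of one repeated trailing consonant in a single segment."""
    # Nothing to do unless the segment ends in a doubled consonant.
    if (
        len(segment) < 2
        or segment[-1] not in DEMO_CONSONANTS
        or segment[-1] != segment[-2]
    ):
        return segment

    dup = segment[-1]
    head = segment.rstrip(dup)  # text before the repeated run

    best_len = 0  # length of the longest matching known word so far
    keep = 1  # fall back to a single consonant when nothing matches
    for word in known_words:
        # Only words that themselves end in a doubled `dup` are candidates.
        if len(word) < 2 or word[-1] != dup or word[-2] != dup:
            continue
        word_head = word.rstrip(dup)
        if head.endswith(word_head) and len(word) > best_len:
            best_len = len(word)
            keep = len(word) - len(word_head)

    return head + dup * keep


if __name__ == "__main__":
    print(reduce_trailing_repeats("อืมมมมมมม", ["อืมมม"]))
    print(reduce_trailing_repeats("เริ่ดดดด", []))
```

One difference worth noting: this sketch scans the whole word list on every call, whereas the pull request precomputes the candidate words per consonant in `last_consonants_repeaters` so that repeated calls can skip that scan.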