Detailed description
I suggest adding a dictionary-based method that removes repeated consonants.
For example: เริศศศศศศศศศศศศศศ -> เริศ
Context
I am doing text mining on Pantip. Quite a few people there write words like "เริศศศศศศศศศศศศศศ" to express emotion. The current pythainlp.util.normalize() removes only duplicated vowels, so nothing handles this case at the moment. Current tokenizers may split such a word into "เริศ / ศศศศศศศศศศศศศ", which adds noise to the analysis.
The implementation also turned out to be a little long, so I would like to have this method in the pythainlp library.
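To get a feel for how often this pattern shows up in a corpus, a rough detector could look like the sketch below. It uses only the standard library. The function name `trailing_consonant_run` and the threshold of three repetitions are assumptions made for illustration and are not part of pythainlp.

```python
import re

# Thai consonants occupy U+0E01 (ก) through U+0E2E (ฮ).
# Assumption: a run of 3 or more identical consonants at the end of a token
# is treated as emphatic repetition; the threshold is arbitrary.
TRAILING_RUN = re.compile(r"([\u0E01-\u0E2E])\1{2,}$")


def trailing_consonant_run(token: str) -> int:
    """Return the length of the repeated-consonant run ending `token`, or 0."""
    match = TRAILING_RUN.search(token)
    return len(match.group(0)) if match else 0
```

Tokens for which this returns a non-zero value are candidates for the clean-up proposed here.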
Possible implementation
My implementation looks like this:

    import pythainlp.util

    # Assumed context: this runs inside a loop such as
    # `for cnt, sentence in enumerate(sentences)`, and `dictio` is the word list.
    # Target input: เริศศศศศศศศศศศศศศ
    if (len(sentence) > 2) and pythainlp.util.isthaichar(sentence[-1]) and (sentence[-1] == sentence[-2]):
        # The sentence ends with a repeated character (repetition typically occurs at the end).
        dup = sentence[-1]

        # Find dictionary words that legitimately end with this character doubled.
        # Rebuilt here every time because words are added to dictio dynamically.
        repeaters = []
        for word in dictio:
            if (len(word) > 2) and (word[-1] == dup) and (word[-2] == dup):
                # Skip words that consist of nothing but the repeated character.
                if any(ch != dup for ch in word):
                    repeaters.append(word)

        # Strip the trailing run of dup from the sentence.
        sentence_head = sentence
        while sentence_head[-1] == dup:
            if len(sentence_head) == 1:
                break
            sentence_head = sentence_head[:-1]

        # Check whether the remaining head ends with the stem of a repeater.
        found = False
        for repeater in repeaters:
            rep_head = repeater
            repetition = 0
            while rep_head[-1] == dup:
                rep_head = rep_head[:-1]
                repetition += 1
            if sentence_head[-len(rep_head):] == rep_head:
                found = True
                break

        if found:
            # Keep as many repetitions as the dictionary word has.
            sentences[cnt] = sentence_head + (dup * repetition)
        else:
            # Otherwise keep a single occurrence.
            sentences[cnt] = sentence_head + dup
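For discussion, the same idea could be packaged as a standalone function, roughly as sketched below. This is not an existing pythainlp API: the name `collapse_trailing_repeats`, its signature, and the helper `_is_thai_char` (a standard-library stand-in that checks the Thai Unicode block, used here instead of `pythainlp.util.isthaichar` so the sketch has no dependencies) are all assumptions.

```python
from typing import Iterable


def _is_thai_char(ch: str) -> bool:
    # Hypothetical stand-in: checks the Thai Unicode block (U+0E00 to U+0E7F).
    return "\u0e00" <= ch <= "\u0e7f"


def collapse_trailing_repeats(text: str, words: Iterable[str]) -> str:
    """Collapse a run of one repeated Thai character at the end of `text`.

    If the text before the run ends with the stem of a dictionary word that
    itself ends in the same character repeated, that word's own number of
    repetitions is kept; otherwise a single occurrence is kept.
    """
    if len(text) < 3 or not _is_thai_char(text[-1]) or text[-1] != text[-2]:
        return text

    dup = text[-1]
    head = text.rstrip(dup)  # text with the whole trailing run removed
    if not head:
        return dup  # the text was nothing but the repeated character

    for word in words:
        stem = word.rstrip(dup)
        keep = len(word) - len(stem)  # repetitions the dictionary word has
        if keep >= 2 and stem and head.endswith(stem):
            return head + dup * keep

    return head + dup
```

With a word list containing "สรร", the input "สรรรรร" would be reduced to "สรร", while "เริศศศศ" would be reduced to "เริศ". As in the snippet above, the first matching dictionary word wins, so a very short stem can match more inputs than intended; preferring the longest matching stem might be a safer choice.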
If this plan seems good, I can open a PR.