[Suggestion] Add consonant-remover method #860

Closed
konbraphat51 opened this issue Nov 7, 2023 · 2 comments · Fixed by #862
konbraphat51 commented Nov 7, 2023

Detailed description

I suggest adding a dictionary-based method that removes duplicated consonants.
For example: เริศศศศศศศศศศศศศศ -> เริศ

Context

I am doing text mining on Pantip. Quite a few people write words like "เริศศศศศศศศศศศศศศ" to express emotion. The current pythainlp.util.normalize() only removes duplicated vowels, so no existing method handles this case. Current tokenizers may split the word as "เริศ / ศศศศศศศศศศศศศ", and the trailing run becomes noise in the analysis.
Also, my implementation turned out to be a bit long, so I would like to have this method in the pythainlp library.

Possible implementation

My implementation looks like this:

    # Handles trailing duplication such as เริศศศศศศศศศศศศศศ.
    # `sentence`, `sentences`, `cnt`, and `dictio` come from the surrounding code.
    if (len(sentence) > 2) and pythainlp.util.isthaichar(sentence[-1]) and (sentence[-1] == sentence[-2]):
        # The sentence ends with a duplicated character (duplication typically occurs at the end).
        dup = sentence[-1]

        # Find dictionary words that legitimately end with the duplicated character.
        # This is rebuilt here because dictio is updated dynamically.
        # Words made up entirely of dup are excluded.
        repeaters = [
            word for word in dictio
            if (len(word) > 2) and (word[-1] == dup) and (word[-2] == dup)
            and any(ch != dup for ch in word)
        ]

        # Strip the trailing run of dup from the sentence.
        sentence_head = sentence
        while sentence_head[-1] == dup:
            if len(sentence_head) == 1:
                break
            sentence_head = sentence_head[:-1]

        # Check whether the remaining head ends with the stem of a repeater.
        found = False
        for repeater in repeaters:
            rep_head = repeater

            # Count how many copies of dup the dictionary word ends with.
            repetition = 0
            while rep_head[-1] == dup:
                rep_head = rep_head[:-1]
                repetition += 1

            if sentence_head[-len(rep_head):] == rep_head:
                found = True
                break

        if found:
            # Keep as many copies as the matching dictionary word has.
            sentences[cnt] = sentence_head + (dup * repetition)
        else:
            # No dictionary match: collapse the run to a single character.
            sentences[cnt] = sentence_head + (dup * 1)
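
For discussion, the same idea could be packaged as a standalone function. This is only a minimal sketch under a few assumptions, not the code that was eventually merged: the names `collapse_trailing_repeats`, `is_thai_char`, and the `words` parameter are hypothetical, the Thai check is a plain Unicode-range test standing in for `pythainlp.util.isthaichar` so the snippet has no dependencies, and when several dictionary words match it keeps the longest run rather than the first match.

```python
from typing import Iterable


def is_thai_char(ch: str) -> bool:
    """Rough stand-in for pythainlp.util.isthaichar (Thai Unicode block)."""
    return "\u0e00" <= ch <= "\u0e7f"


def collapse_trailing_repeats(text: str, words: Iterable[str]) -> str:
    """Collapse a run of one repeated Thai character at the end of text.

    If text, after stripping the run, ends with the stem of a dictionary
    word that itself ends in a doubled character, that word's run is kept.
    Otherwise the run is reduced to a single character.
    """
    if len(text) < 2 or not is_thai_char(text[-1]) or text[-1] != text[-2]:
        return text  # nothing to collapse

    dup = text[-1]
    head = text.rstrip(dup)
    if not head:
        # The whole text is one repeated character; keep a single copy.
        return dup

    keep = 1  # number of trailing copies of dup to keep
    for word in words:
        stem = word.rstrip(dup)
        count = len(word) - len(stem)
        # Only words that end in at least two copies of dup are relevant.
        if count >= 2 and stem and head.endswith(stem):
            keep = max(keep, count)
    return head + dup * keep


# Hypothetical one-word dictionary containing a word that ends in a doubled consonant.
words = ["สรร"]
print(collapse_trailing_repeats("เริศศศศ", words))  # เริศ
print(collapse_trailing_repeats("สรรรรร", words))   # สรร
```

Passing the dictionary as an argument keeps the function independent of how the word list is maintained, which matters here because the dictionary changes at runtime.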

If this plan looks good, I can open a PR.

wannaphong commented Nov 8, 2023

It looks good. 👍

konbraphat51 commented

OK, I will handle this soon.
