[Suggestion] Add consonant-remover method #860

konbraphat51 · 2023-11-07T12:09:39Z

Detailed description

I suggest adding a dictionary-based method that removes repeated trailing consonants.
For example: เริศศศศศศศศศศศศศศ -> เริศ

Context

I am doing text mining on Pantip. I noticed that quite a few people write things like "เริศศศศศศศศศศศศศศ" to express their emotions. The current pythainlp.util.normalize() removes only duplicated vowels, so there is currently no method that handles this. Current tokenizers may split this as "เริศ / ศศศศศศศศศศศศศ", but the leftover run becomes noise in the analysis.
Also, my implementation turned out a little long, so I would like this method to be available in the pythainlp library.

Possible implementation

My implementation is shown below.

    import pythainlp.util

    # Handles input such as เริศศศศศศศศศศศศศศ.
    # Note: `sentences`, `cnt`, `sentence`, and `dictio` are defined in the
    # surrounding code (not shown).

    if (len(sentence) > 2) and pythainlp.util.isthaichar(sentence[-1]) and (sentence[-1] == sentence[-2]):
        # The sentence ends with a repeated character (repetition typically occurs at the end).
        dup = sentence[-1]

        # Find the dictionary words that legitimately end with the repeated character.
        # This is done here because dictio is updated dynamically.
        repeaters = []
        for word in dictio:
            if (len(word) > 2) and (word[-1] == dup) and (word[-2] == dup):
                # Skip words that consist of dup only.
                if any(ch != dup for ch in word):
                    repeaters.append(word)

        # Strip the trailing run of dup from the sentence (keep at least one character).
        sentence_head = sentence
        while sentence_head[-1] == dup:
            if len(sentence_head) == 1:
                break
            sentence_head = sentence_head[:-1]

        # Check whether the stripped sentence ends with the head of a repeater.
        found = False
        for repeater in repeaters:
            rep_head = repeater
            repetition = 0
            while rep_head[-1] == dup:
                rep_head = rep_head[:-1]
                repetition += 1

            if sentence_head.endswith(rep_head):
                found = True
                break

        if found:
            # Restore as many dup characters as the dictionary word has.
            sentences[cnt] = sentence_head + (dup * repetition)
        else:
            # No dictionary word matches: keep a single dup.
            sentences[cnt] = sentence_head + dup
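
For discussion, here is a minimal self-contained sketch of the same idea written as a function. The name `strip_trailing_repeats` and its parameters are hypothetical and not part of pythainlp. To avoid depending on the library, it checks the Thai consonant range (U+0E01 to U+0E2E) directly instead of calling `pythainlp.util.isthaichar`, and it takes the dictionary as a plain iterable of words. Unlike the snippet above, it reduces a text made up only of the repeated consonant to a single character.

```python
from typing import Iterable


def strip_trailing_repeats(text: str, dictionary: Iterable[str]) -> str:
    """Collapse a trailing run of one repeated Thai consonant.

    Hypothetical helper for illustration; not the pythainlp API.
    """
    # Nothing to do unless the last two characters are identical.
    if len(text) < 2 or text[-1] != text[-2]:
        return text

    dup = text[-1]
    # Assumption: only Thai consonants (ก to ฮ) are collapsed.
    if not ("\u0e01" <= dup <= "\u0e2e"):
        return text

    # Remove the whole trailing run of dup.
    head = text.rstrip(dup)
    if not head:
        return dup  # the text was dup only; keep one character

    # If the remaining text ends with the head of a dictionary word that
    # legitimately ends in two or more dup characters, restore that many.
    for word in dictionary:
        word_head = word.rstrip(dup)
        repetition = len(word) - len(word_head)
        if repetition >= 2 and word_head and head.endswith(word_head):
            return head + dup * repetition

    # Otherwise keep exactly one dup.
    return head + dup
```

With an empty dictionary, `strip_trailing_repeats("เริศศศศศศศศศศศศศศ", [])` returns "เริศ". With a dictionary containing a word that really ends in a doubled consonant, such as "สรร", the input "สรรรรร" is reduced to "สรร" rather than "สร".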

If this plan looks good, I can open a PR.


wannaphong · 2023-11-08T12:45:05Z

It looks good. 👍

konbraphat51 · 2023-11-08T12:45:52Z

Okay, I will handle this soon.

konbraphat51 mentioned this issue Nov 9, 2023

Add: remove_trailing_repeat_consonants() #862

Merged

2 tasks

bact closed this as completed in #862 Nov 13, 2023
