
Reduce reload word tokenizer engine in word_tokenize #973


Closed
wannaphong opened this issue Nov 2, 2024 · 6 comments · Fixed by #1064
Labels
enhancement enhance functionalities

Comments

@wannaphong
Member

Currently, the word_tokenize function loads the word tokenizer engine on every call because there is no global variable to store the engine, so the same engine is reloaded each time it is used.

We can fix this by storing the engine and the engine's name in global variables, then checking whether the requested engine is the same as the cached one; if so, we skip reloading the word tokenizer engine.
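A minimal sketch of this idea (the names `_engine`, `_engine_name`, and `_load_engine` are illustrative, not the actual implementation):

```python
_engine = None       # cached tokenizer engine
_engine_name = None  # name of the engine currently cached

def word_tokenize(text: str, engine: str = "newmm"):
    global _engine, _engine_name
    # Reload the engine only when a different one is requested.
    if _engine is None or _engine_name != engine:
        _engine = _load_engine(engine)  # hypothetical loader
        _engine_name = engine
    return _engine.tokenize(text)
```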

@bact bact added this to PyThaiNLP Nov 2, 2024
@bact bact added the enhancement enhance functionalities label Nov 2, 2024
@new5558
Contributor

new5558 commented Jan 10, 2025

Can you please elaborate on this behavior in more detail? I tried to reproduce the behavior you mentioned in this Colab Notebook, and I noticed the engine is already created only once, without relying on a Python global variable!

Please correct me if I am wrong, but I observed that an engine created outside the segment function does not need to be migrated to a global variable.

Example:

```python
_to_feature = Featurizer()

def segment(text: str) -> List[str]:
    x = _to_feature.featurize(text)["X"]
    y_pred = tagger.tag(x)
    list_cut = []
    for j, k in zip(list(text), y_pred):
        if k == "1":
            list_cut.append(j)
        else:
            list_cut[-1] += j
    return list_cut
```

However, there is also the case where the engine is created inside the segment function. In this case we need to move the engine creation logic out of segment (one possible shape of that fix is sketched after the example below).

Example:

```python
def segment(text: str, model: str = "attacut-sc") -> List[str]:
    """
    Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai

    :param str text: text to be tokenized to words
    :param str model: model of word tokenizer model
    :return: list of words, tokenized from the text
    :rtype: list[str]

    **Options for model**
        * *attacut-sc* (default) using both syllable and character features
        * *attacut-c* using only character feature
    """
    if not text or not isinstance(text, str):
        return []
    _tokenizer = AttacutTokenizer(model)
    return _tokenizer.tokenize(text)
```
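For illustration, one hypothetical way to move the construction out of `segment` while keeping the `model` argument is a module-level cache keyed by model name (illustrative only, not the actual PyThaiNLP code):

```python
from typing import List

_tokenizers = {}  # hypothetical cache: model name -> engine instance

def segment(text: str, model: str = "attacut-sc") -> List[str]:
    if not text or not isinstance(text, str):
        return []
    # Create each model's AttacutTokenizer at most once, then reuse it.
    if model not in _tokenizers:
        _tokenizers[model] = AttacutTokenizer(model)  # class from the module above
    return _tokenizers[model].tokenize(text)
```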

If it actually works this way, I can help this weekend to migrate all engines in the word tokenizer module out of their segment functions and open a PR. Does that sound good to you?

@wannaphong
Member Author

It sounds good. 👍


@new5558
Contributor

new5558 commented Jan 10, 2025

My mistake! 😅 After looking at the code again, I found that AttacutTokenizer(model) requires the argument model, which can be chosen dynamically at runtime.

I will use the original idea of a global variable instead.

Please ignore this and refer to the investigation below instead.

@new5558
Contributor

new5558 commented Jan 10, 2025

I just did a deeper dive and created a detailed benchmark on This New Colab.

* **Finding 1:** Using a singleton tokenizer engine can improve attacut's performance by about 4x (23.83 ms per call vs. 5.35 ms per call)!

* **Finding 2:** Using `global` or not did not affect the output or performance of the code, in either single-threaded or multi-threaded runs (as long as we only mutate the singleton object, without reassigning it). Still, using `global` is more intuitive, even though it brings no performance or safety improvement; see the short demo after this list.

* **Finding 3:** The best practice I found for handling a singleton in Python is to combine 1) a global variable with 2) a [threading Lock](https://github.com/bslatkin/effectivepython/blob/main/example_code/item_098/mycli/global_lock_perf.py). Ref: [Effective Python Book](https://effectivepython.com/)
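A hypothetical demo of Finding 2: mutating a module-level object needs no `global` statement, while rebinding the name does:

```python
_cache = {}  # module-level singleton

def put(key, value):
    # Mutating the existing dict: no `global` needed.
    _cache[key] = value

def reset_cache():
    # Rebinding the module-level name: `global` is required;
    # otherwise `_cache = {}` would just create a local variable.
    global _cache
    _cache = {}
```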

The benefit of using a Lock is to prevent a race condition in which more than one engine object is created when the tokenizer is used concurrently, e.g. behind API servers like FastAPI/Flask. (Reproduction in Colab)

I think this best-practice implementation may be overkill for a small tokenizer like this, but it may benefit bigger models that are tricky to handle in terms of memory/cold start, such as the LM in #1048. However, if you think implementing a threading lock for the tokenizer engine is more suitable, please feel free to notify me in this issue.
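A minimal sketch of that global-plus-Lock pattern, with `AttacutTokenizer` standing in for any engine (illustrative only, not the actual PyThaiNLP code):

```python
import threading

_tokenizer = None
_tokenizer_lock = threading.Lock()

def get_tokenizer(model: str = "attacut-sc"):
    global _tokenizer
    if _tokenizer is None:  # fast path: skip the lock once initialized
        with _tokenizer_lock:
            if _tokenizer is None:  # re-check inside the lock
                _tokenizer = AttacutTokenizer(model)  # assumed engine class
    return _tokenizer
```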

@wannaphong
Member Author


@bact What do you think? I think Finding 1 is good enough for the requirement.

@bact
Member

bact commented Jan 11, 2025

Thank you @new5558 for the very detailed research and evaluation. I agree with you and @wannaphong that Solution 1 is probably enough for general cases.

We may have to think about multithreading later, together with other commonly used functions.
