
Reduce reload word tokenizer engine in word_tokenize #973


Closed
wannaphong opened this issue Nov 2, 2024 · 6 comments · Fixed by #1064
Labels
enhancement enhance functionalities

Comments

@wannaphong
Member

Currently, the word_tokenize function loads the word tokenizer engine on every call because there is no global variable to store the engine, so the same engine is reloaded each time it is used.

We can fix this by storing the engine and the engine's name in global variables, then checking whether the requested engine is the same as the cached one; if so, we skip reloading the word tokenizer engine.
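A minimal sketch of this idea (the names `_engine`, `_engine_name`, and `_load_engine` are illustrative, not the actual implementation):

```python
_engine = None       # cached tokenizer engine
_engine_name = None  # name of the engine currently cached

def word_tokenize(text: str, engine: str = "newmm"):
    global _engine, _engine_name
    # Reload the engine only when a different one is requested.
    if _engine is None or _engine_name != engine:
        _engine = _load_engine(engine)  # hypothetical loader
        _engine_name = engine
    return _engine.tokenize(text)
```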

@bact bact added this to PyThaiNLP Nov 2, 2024
@bact bact added the enhancement enhance functionalities label Nov 2, 2024
@new5558
Contributor

new5558 commented Jan 10, 2025

Can you please elaborate on this behavior in more detail? I tried to reproduce the behavior you mentioned in this Colab Notebook, and I noticed the engine is already created only once, without relying on a Python global variable!

Please correct me if I am wrong, but I observed that an engine created outside the segment function does not need to be migrated to a global variable.

Example:

```python
_to_feature = Featurizer()

def segment(text: str) -> List[str]:
    x = _to_feature.featurize(text)["X"]
    y_pred = tagger.tag(x)
    list_cut = []
    for j, k in zip(list(text), y_pred):
        if k == "1":
            list_cut.append(j)
        else:
            list_cut[-1] += j
    return list_cut
```

However, there is also the case where the engine is created inside the segment function. In this case we need to move the engine creation logic out of segment (one possible shape of that fix is sketched after the example below).

Example:

```python
def segment(text: str, model: str = "attacut-sc") -> List[str]:
    """
    Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai

    :param str text: text to be tokenized to words
    :param str model: model of word tokenizer model
    :return: list of words, tokenized from the text
    :rtype: list[str]

    **Options for model**
        * *attacut-sc* (default) using both syllable and character features
        * *attacut-c* using only character feature
    """
    if not text or not isinstance(text, str):
        return []
    _tokenizer = AttacutTokenizer(model)
    return _tokenizer.tokenize(text)
```
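For illustration, one hypothetical way to move the construction out of `segment` while keeping the `model` argument is a module-level cache keyed by model name (illustrative only, not the actual PyThaiNLP code):

```python
from typing import List

_tokenizers = {}  # hypothetical cache: model name -> engine instance

def segment(text: str, model: str = "attacut-sc") -> List[str]:
    if not text or not isinstance(text, str):
        return []
    # Create each model's AttacutTokenizer at most once, then reuse it.
    if model not in _tokenizers:
        _tokenizers[model] = AttacutTokenizer(model)  # class from the module above
    return _tokenizers[model].tokenize(text)
```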

If it actually works this way, I can help this weekend to migrate all engines in the word tokenizer module out of their segment functions and open a PR. Does that sound good to you?

@wannaphong
Member Author

It sounds good. 👍


@new5558
Contributor

new5558 commented Jan 10, 2025

My mistake! 😅 After looking at the code again, I found that AttacutTokenizer(model) requires the argument model, which can be chosen dynamically at runtime.

I will use the original idea of a global variable instead.

Please ignore this and refer to the investigation below instead.

@new5558
Contributor

new5558 commented Jan 10, 2025

I just did a deeper dive and created a detailed benchmark on This New Colab.

* **Finding 1:** Using a singleton tokenizer engine can improve attacut's performance by about 4x (23.83 ms per call vs. 5.35 ms per call)!

* **Finding 2:** Using `global` or not did not affect the output or performance of the code, in either single-threaded or multi-threaded runs (as long as we only mutate the singleton object, without reassigning it). Still, using `global` is more intuitive, even though it brings no performance or safety improvement; see the short demo after this list.

* **Finding 3:** The best practice I found for handling a singleton in Python is to combine 1) a global variable with 2) a [threading Lock](https://github.com/bslatkin/effectivepython/blob/main/example_code/item_098/mycli/global_lock_perf.py). Ref: [Effective Python Book](https://effectivepython.com/)
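A hypothetical demo of Finding 2: mutating a module-level object needs no `global` statement, while rebinding the name does:

```python
_cache = {}  # module-level singleton

def put(key, value):
    # Mutating the existing dict: no `global` needed.
    _cache[key] = value

def reset_cache():
    # Rebinding the module-level name: `global` is required;
    # otherwise `_cache = {}` would just create a local variable.
    global _cache
    _cache = {}
```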

The benefit of using a Lock is to prevent a race condition in which more than one engine object is created when the tokenizer is used concurrently, e.g. behind API servers like FastAPI/Flask. (Reproduction in Colab)

I think this best-practice implementation may be overkill for a small tokenizer like this, but it may benefit bigger models that are tricky to handle in terms of memory/cold start, such as the LM in #1048. However, if you think implementing a threading lock for the tokenizer engine is more suitable, please feel free to notify me in this issue.
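A minimal sketch of that global-plus-Lock pattern, with `AttacutTokenizer` standing in for any engine (illustrative only, not the actual PyThaiNLP code):

```python
import threading

_tokenizer = None
_tokenizer_lock = threading.Lock()

def get_tokenizer(model: str = "attacut-sc"):
    global _tokenizer
    if _tokenizer is None:  # fast path: skip the lock once initialized
        with _tokenizer_lock:
            if _tokenizer is None:  # re-check inside the lock
                _tokenizer = AttacutTokenizer(model)  # assumed engine class
    return _tokenizer
```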

@wannaphong
Member Author


@bact What do you think? I think Finding 1 is good enough for the requirement.

@bact
Member

bact commented Jan 11, 2025

Thank you @new5558 for the very detailed research and evaluation. I agree with you and @wannaphong that Solution 1 is probably enough for general cases.

We may have to think about multithreading later, together with other commonly used functions.
