Reduce reload word tokenizer engine in word_tokenize #973
Comments
Can you please elaborate on this behavior in more detail? I tried to reproduce the behavior you mentioned using this Colab Notebook. I noticed the engine is already loaded only once, without relying on a Python global variable store! Please correct me if I am wrong, but I observed that an engine created outside of the `segment` function is initialized only once, at module import. Example: pythainlp/pythainlp/tokenize/han_solo.py, lines 119 to 131 at 9a9d11f.
However, there is also the case where the engine is created inside the `segment` function, so it is re-created on every call. Example: pythainlp/pythainlp/tokenize/attacut.py, lines 29 to 45 at 9a9d11f.
If it actually works this way, I can help this weekend to migrate all engines in the word tokenizer module out of the `segment` function, as in the sketch below.
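To make the contrast concrete, here is a minimal sketch of the two patterns; `SomeHeavyEngine` and the `segment_*` function names are hypothetical stand-ins, not PyThaiNLP's actual API:

```python
class SomeHeavyEngine:
    """Hypothetical stand-in for an expensive-to-construct tokenizer engine."""

    def __init__(self):
        print("engine loaded")  # visible marker of a (re)load

    def tokenize(self, text):
        return text.split()


# Pattern A (han_solo.py style): engine created at module level.
# Python caches imported modules, so this line runs once per process.
_ENGINE = SomeHeavyEngine()

def segment_module_level(text):
    return _ENGINE.tokenize(text)


# Pattern B (attacut.py style): engine created inside the function,
# so a fresh engine object is constructed on every call.
def segment_per_call(text):
    engine = SomeHeavyEngine()  # reloaded each call
    return engine.tokenize(text)
```

Calling `segment_module_level` repeatedly prints "engine loaded" once (at import time), while `segment_per_call` prints it on every call.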
It sounds good. 👍
Please ignore this and refer to the investigation below instead.
I just did a deeper dive and created a detailed benchmark in this new Colab notebook.
The benefit of using a Lock is that it prevents a race condition in which more than one engine object is created when the tokenizer is used concurrently, e.g. in API servers like FastAPI/Flask. (Reproduction in Colab.)
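As a rough sketch of that race condition fix, again using the hypothetical `SomeHeavyEngine` stand-in rather than the real PyThaiNLP internals: without a lock, two threads can both observe an uninitialized engine and each construct one; a double-checked lock makes initialization happen at most once.

```python
import threading


class SomeHeavyEngine:
    """Hypothetical stand-in for an expensive-to-load tokenizer engine."""

    def tokenize(self, text):
        return text.split()


_engine = None
_engine_lock = threading.Lock()

def get_engine():
    """Create the engine at most once, even with concurrent callers."""
    global _engine
    if _engine is None:            # fast path: skip the lock once initialized
        with _engine_lock:
            if _engine is None:    # re-check: another thread may have won the race
                _engine = SomeHeavyEngine()
    return _engine
```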
@bact What do you think? I think Finding 1 is good enough for the requirement.
Thank you @new5558 for the very detailed research and evaluation. I agree with you and @wannaphong that Solution 1 is probably already enough for general cases. We may have to think about multi-threading later, together with other commonly used functions.
Currently, the `word_tokenize` function loads the word tokenizer engine on every call because there is no global variable to store the engine, so it has to reload the same engine each time it is used. We can fix this by keeping the engine and the engine's name in global variables and checking whether the requested engine is the same one; if it is, we skip reloading the word tokenizer engine.
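A minimal sketch of that idea, with a hypothetical `_load_engine` loader and `_WhitespaceEngine` class standing in for the real engine machinery:

```python
class _WhitespaceEngine:
    """Hypothetical stand-in for a real tokenizer engine."""

    def tokenize(self, text):
        return text.split()


def _load_engine(name):
    # Stand-in for the real, expensive engine loading.
    print(f"loading engine: {name}")
    return _WhitespaceEngine()


_engine = None
_engine_name = None

def word_tokenize(text, engine="newmm"):
    """Reload the engine only when a different engine name is requested."""
    global _engine, _engine_name
    if _engine is None or _engine_name != engine:
        _engine = _load_engine(engine)
        _engine_name = engine
    return _engine.tokenize(text)
```

With this, repeated calls such as `word_tokenize("hello world")` print "loading engine" only once; a reload happens only when a different `engine` name is passed.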