In the Codewithkyrian\Transformers\PretrainedTokenizers\NllbTokenizer class, this regex is used to detect language codes: /^[a-z]{3}_[A-Z]{3}$/
However, some models, like Xenova/nllb-200-distilled-600M, use a format like eng_Latn (full list), which that regex does not match.
I would suggest something like /^[a-z]{3}_[a-zA-Z]{3,4}$/
Is there a big penalty for false positives here? Is this check required?
You're right, the current regex does not accommodate the format used by this particular model. I didn't get to test with it, so thank you for bringing this to my attention.
In this context, I don't see any significant penalty for false positives, so sure, your suggested regex /^[a-z]{3}_[a-zA-Z]{3,4}$/ would be more inclusive for different language code formats.
I appreciate your contribution and will incorporate this improvement. Thank you!
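To make the difference concrete, here is a minimal sketch comparing the two patterns quoted above. The constant names, the matchesPattern() helper, and the sample tokens abc_XYZ, foo_bar, and hello are made up for illustration and are not part of the library; only the two regexes and eng_Latn come from this thread.

```php
<?php
// Illustrative sketch only: these names are hypothetical, not the library's API.
const ORIGINAL_PATTERN  = '/^[a-z]{3}_[A-Z]{3}$/';
const SUGGESTED_PATTERN = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';

// Returns true when $token matches $pattern in full.
function matchesPattern(string $pattern, string $token): bool
{
    return preg_match($pattern, $token) === 1;
}

// eng_Latn is the code from the issue; the other tokens are made-up probes.
$samples = ['eng_Latn', 'abc_XYZ', 'foo_bar', 'hello'];

foreach ($samples as $token) {
    printf(
        "%-10s original=%-5s suggested=%s\n",
        $token,
        var_export(matchesPattern(ORIGINAL_PATTERN, $token), true),
        var_export(matchesPattern(SUGGESTED_PATTERN, $token), true)
    );
}
```

With these inputs, eng_Latn is rejected by the original pattern and accepted by the suggested one, while a token such as foo_bar is also accepted by the suggested pattern. That is the kind of false positive discussed above.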
System Info
Ubuntu, PHP 8.1.2
PHP Version
8.1.2
Environment/Platform
Reproduction