Regex for detecting language codes incorrect #43

Closed
1 of 4 tasks
Thorry84 opened this issue Jun 20, 2024 · 2 comments · Fixed by #49
Labels
bug Something isn't working

Comments

@Thorry84

System Info

Ubuntu, PHP 8.1.2

PHP Version

8.1.2

Environment/Platform

  • Command-line application
  • Web application
  • Serverless
  • Other (please specify)

Description

In the Codewithkyrian\Transformers\PretrainedTokenizers\NllbTokenizer class, this regex is used to detect language codes: /^[a-z]{3}_[A-Z]{3}$/
However, some models, such as Xenova/nllb-200-distilled-600M, use a format like eng_Latn (see the model's full list of language codes), which this regex rejects.

I would suggest something like /^[a-z]{3}_[a-zA-Z]{3,4}$/
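
To make the difference concrete, here is a minimal sketch that runs both patterns quoted above against a few codes. It uses only PHP's built-in preg_match and does not touch the library itself; 'eng_ENG' is a made-up code in the old three-uppercase shape, included only for contrast.

```php
<?php
// Pattern quoted in this issue as the one used by NllbTokenizer.
$current = '/^[a-z]{3}_[A-Z]{3}$/';
// Pattern suggested in this issue.
$suggested = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';

// 'eng_Latn' and 'deu_Latn' come from this issue;
// 'eng_ENG' is a hypothetical code in the older shape.
$codes = ['eng_Latn', 'deu_Latn', 'eng_ENG'];

$results = [];
foreach ($codes as $code) {
    $results[$code] = [
        'current'   => preg_match($current, $code) === 1,
        'suggested' => preg_match($suggested, $code) === 1,
    ];
}

// Print one line per code showing which pattern accepts it.
foreach ($results as $code => $r) {
    printf(
        "%-9s current=%s suggested=%s\n",
        $code,
        var_export($r['current'], true),
        var_export($r['suggested'], true)
    );
}
```

The current pattern fails on eng_Latn because the suffix has four letters and only the first is uppercase, while the suggested pattern accepts both shapes.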

Is there a big penalty to false positives here? Is this check required?

Reproduction

use function Codewithkyrian\Transformers\Pipelines\pipeline;

$trans = pipeline('translation', 'Xenova/nllb-200-distilled-600M');
$trans('Translation test', srcLang: 'eng_Latn', tgtLang: 'deu_Latn');

Thorry84 added the bug label on Jun 20, 2024

@CodeWithKyrian
Owner

You're right, the current regex does not accommodate the format used by this particular model. I didn't get to test with it, so thank you for bringing this to my attention.

In this context, I don't see any significant penalty for false positives, so sure, your suggested regex /^[a-z]{3}_[a-zA-Z]{3,4}$/ would be more inclusive for different language code formats.
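
As a rough illustration of what a false positive looks like here, the sketch below compares the suggested pattern with a stricter one. The stricter pattern is purely hypothetical: it is not from this thread and not necessarily what the fix uses. It assumes every code has the shape of three lowercase letters, an underscore, and a capitalized four-letter script tag, so it would reject any three-letter uppercase suffix.

```php
<?php
// Pattern suggested in this issue.
$suggested = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';
// Hypothetical stricter alternative (an assumption, not from the issue):
// three lowercase letters, underscore, capitalized four-letter script tag.
$strict = '/^[a-z]{3}_[A-Z][a-z]{3}$/';

// 'eng_Latn' is a code from this issue; the other inputs are made-up
// strings that are not language codes.
$inputs = ['eng_Latn', 'abc_xyz', 'eng_latn', 'hello_world'];

$matches = [];
foreach ($inputs as $input) {
    $matches[$input] = [
        'suggested' => preg_match($suggested, $input) === 1,
        'strict'    => preg_match($strict, $input) === 1,
    ];
}

// Dump the comparison table.
var_export($matches);
```

Strings such as abc_xyz and eng_latn pass the suggested pattern but not the stricter one. Whether that matters depends on how the tokenizer uses the match, which the thread indicates is low-risk.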

I appreciate your contribution and will incorporate this improvement. Thank you!

@Thorry84
Author

Thanks so much for your work! <3
