Regex for detecting language codes incorrect #43

Thorry84 · 2024-06-20T15:05:12Z

System Info

Ubuntu, PHP 8.1.2

PHP Version

8.1.2

Environment/Platform

Command-line application
Web application
Serverless
Other (please specify)

Description

In the Codewithkyrian\Transformers\PretrainedTokenizers\NllbTokenizer class, this regex is used to detect language codes: /^[a-z]{3}_[A-Z]{3}$/
However, some models, such as Xenova/nllb-200-distilled-600M, use a format like eng_Latn (full list), which this regex does not match.

I would suggest something like /^[a-z]{3}_[a-zA-Z]{3,4}$/
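
To make the difference concrete, here is a minimal sketch that runs both patterns quoted above against a few codes. It uses only PHP's built-in preg_match and does not touch the library; 'abc_XYZ' is a made-up sample in the old three-uppercase-letter shape, not a real language code.

```php
<?php
// Pattern quoted above as the one currently used in NllbTokenizer.
$current = '/^[a-z]{3}_[A-Z]{3}$/';
// Pattern suggested in this issue.
$suggested = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';

// 'eng_Latn' and 'deu_Latn' come from the reproduction below;
// 'abc_XYZ' is a hypothetical code in the old format.
$codes = ['eng_Latn', 'deu_Latn', 'abc_XYZ'];

foreach ($codes as $code) {
    // preg_match returns 1 on a match and 0 on no match.
    printf(
        "%-9s current=%d suggested=%d\n",
        $code,
        preg_match($current, $code),
        preg_match($suggested, $code)
    );
}
```

The four-letter script suffix (Latn) is what the current pattern rejects; the suggested pattern accepts it while still matching the old three-letter shape.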

Is there a big penalty to false positives here? Is this check required?

Reproduction

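use function Codewithkyrian\Transformers\Pipelines\pipeline;

// Both language codes use the xxx_Xxxx format that the current regex rejects.
$trans = pipeline('translation', 'Xenova/nllb-200-distilled-600M');
$trans('Translation test', srcLang: 'eng_Latn', tgtLang: 'deu_Latn');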

CodeWithKyrian · 2024-07-15T19:18:27Z

You're right, the current regex does not accommodate the format used by this particular model. I didn't get to test with it, so thank you for bringing this to my attention.

In this context, I don't see any significant penalty for false positives, so sure, your suggested regex /^[a-z]{3}_[a-zA-Z]{3,4}$/ would be more inclusive for different language code formats.
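
For reference, a rough sketch of what "false positives" means for the suggested pattern. The sample strings are made up for illustration and none of them are real NLLB codes; whether accepting them matters depends on how the tokenizer uses the match, which this thread does not spell out.

```php
<?php
// Pattern suggested in this issue.
$suggested = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';

// Hypothetical strings that are NOT valid language codes but still match,
// because the suffix class allows any mix of upper and lower case.
$accepted = ['foo_bar', 'abc_xYzW', 'eng_latn'];

// Strings the pattern still rejects: two-letter language part,
// hyphen instead of underscore, five-letter suffix.
$rejected = ['en_Latn', 'eng-Latn', 'eng_Latin'];

foreach (array_merge($accepted, $rejected) as $sample) {
    printf("%-10s => %d\n", $sample, preg_match($suggested, $sample));
}

// In PCRE, "$" without the D modifier also matches just before a
// trailing newline, so this input matches as well.
$withNewline = preg_match($suggested, "eng_Latn\n");
printf("%-10s => %d\n", 'eng_Latn\n', $withNewline);
```

If stricter matching were ever wanted, adding the D modifier would rule out the trailing-newline case, but that goes beyond what was agreed in this thread.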

I appreciate your contribution and will incorporate this improvement. Thank you!

Thorry84 · 2024-07-16T07:14:10Z

Thanks so much for your work! <3

Thorry84 added the bug label Jun 20, 2024

CodeWithKyrian linked a pull request Jul 15, 2024 that will close this issue

fix: improve regex for detecting language codes in NllbTokenizer #49

Merged

CodeWithKyrian closed this as completed in #49 Jul 15, 2024

martindewawd mentioned this issue Mar 25, 2025

Matlib version error #88

Open
