Series import: Sao Tome and Principe is detected as San Marino in Russian #1228

Open
php-coder opened this issue Jan 18, 2020 · 6 comments

@php-coder
Owner

Logs:

m.w.f.s.i.SeriesInfoExtractorServiceImpl : Determine country from 'Динозавры, Сан Томе и Принсипи 2010, 4 блока без зубцов'
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Possible candidates: [Динозавры, Сан, Томе, Принсипи, блока, без, зубцов]
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Found countries: []
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Possible candidate: 'Динозавры%'
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Possible candidate: 'Сан%'
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Found countries: [54, 104]
@php-coder
Owner Author

This only happens when the name is written without a hyphen, as "Сан Томе и Принсипи". The hyphenated form "Сан-Томе и Принсипи" is detected correctly.
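
One plausible reading of the logs above, sketched in Python for illustration only (the project's actual code is not shown in this thread): the title is split into words, and a prefix lookup such as 'Сан%' matches every country whose name starts with that word. The country list, the hyphenated sample title, and the helper names `candidates` and `prefix_lookup` are assumptions made for this sketch.

```python
import re

# Hypothetical sample of Russian country names; the real table is not shown in the thread.
COUNTRIES = ["Сан-Марино", "Сан-Томе и Принсипи"]

def candidates(title):
    """Split a title into words of 3+ letters, keeping hyphenated compounds whole."""
    words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", title)
    return [w for w in words if len(w) >= 3]

def prefix_lookup(word):
    """Emulate a SQL `LIKE 'word%'` query against the sample list."""
    return [name for name in COUNTRIES if name.startswith(word)]

spaced = candidates("Динозавры, Сан Томе и Принсипи 2010, 4 блока без зубцов")
hyphenated = candidates("Динозавры, Сан-Томе и Принсипи 2010, 4 блока без зубцов")

print(prefix_lookup("Сан"))       # both countries match, so the result is ambiguous
print(prefix_lookup("Сан-Томе"))  # only "Сан-Томе и Принсипи" matches
```

Under these assumptions, the unhyphenated spelling yields the short candidate "Сан", which cannot distinguish the two countries, while the hyphenated spelling yields "Сан-Томе", which can.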

@php-coder
Owner Author

The same issue occurs with "Saint Kitts and Nevis", which is mistakenly recognized as "Saint Vincent and the Grenadines":

m.w.f.s.i.SeriesInfoExtractorServiceImpl : Possible candidates: [Сент, Киттс, фауна, доисторическая, динозавры]
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Found countries: []
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Possible candidate: 'Сент%'
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Found countries: [100, 119]

@php-coder
Owner Author

m.w.f.s.i.SeriesInfoExtractorServiceImpl : Determine country from '1994 г. Экваториальная Гвинея (Equatorial Guinea). Фауна. Динозавры'
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Possible candidates: [Экваториальная, Гвинея, Фауна, Динозавры]
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Found countries: [105, 121]

mysql> select id,name from countries where id in (105,121);
+-----+-------------------+
| id  | name              |
+-----+-------------------+
| 105 | Guinea            |
| 121 | Equatorial Guinea |
+-----+-------------------+
2 rows in set (0.00 sec)

@php-coder
Owner Author

The current idea: when a lookup returns 2+ candidate countries, don't blindly pick the first one. Instead, append the next word to the search prefix and look up again. This way, longer names (with 2+ words) could be detected properly.
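
A minimal sketch of that idea in Python, for illustration only and not the project's implementation. It assumes that hyphens and letter case are normalized on both sides before comparing (an extra step the thread does not mention, but without it "Сан Томе" would never match "Сан-Томе и Принсипи"). The names `normalize` and `resolve` and the country list are hypothetical.

```python
def normalize(text):
    """Lowercase and treat hyphens as spaces so both spellings compare equal."""
    return " ".join(text.replace("-", " ").lower().split())

def resolve(words, names):
    """Return the matching country names for the first word that matches anything.

    When a prefix matches 2+ names, append the following word and retry,
    keeping the narrower result as long as it is not empty.
    """
    for i, word in enumerate(words):
        prefix = normalize(word)
        matches = [n for n in names if normalize(n).startswith(prefix)]
        j = i + 1
        while len(matches) > 1 and j < len(words):
            prefix = prefix + " " + normalize(words[j])
            narrowed = [n for n in names if normalize(n).startswith(prefix)]
            if not narrowed:
                break  # the next word does not help; stay ambiguous
            matches = narrowed
            j += 1
        if matches:
            return matches
    return []

# Hypothetical sample data; the real table is not shown in the thread.
NAMES = ["Сан-Марино", "Сан-Томе и Принсипи"]
WORDS = ["Динозавры", "Сан", "Томе", "Принсипи", "блока", "без", "зубцов"]

print(resolve(WORDS, NAMES))  # ['Сан-Томе и Принсипи']
```

The function returns a list rather than a single name so that a caller can still see when the ambiguity could not be resolved, for example when the title contains only "Сан".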

@php-coder php-coder added this to the 0.4.4 milestone Apr 10, 2020
@php-coder
Owner Author

This may be a different problem, but "Papua New Guinea" has been recognized as "Guinea":

m.w.f.s.i.SeriesInfoExtractorServiceImpl : Possible candidates: [Папуа, Новая, Гвинея, New, фауна, динозавры]
m.w.f.s.i.SeriesInfoExtractorServiceImpl : Found countries: [105]
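
In this case a single word already matches a country exactly, so a prefix-extension step would never run. One possible alternative, which is not the approach proposed earlier in this thread, is to test the longest word sequences first and fall back to shorter ones. The Python sketch below illustrates it under assumptions: the stored spelling "Папуа-Новая Гвинея", the country list, and the names `normalize` and `longest_exact_match` are all hypothetical.

```python
def normalize(text):
    """Lowercase and treat hyphens as spaces so spelling variants compare equal."""
    return " ".join(text.replace("-", " ").lower().split())

def longest_exact_match(words, names, max_len=4):
    """Return the country whose full name equals the longest run of adjacent words."""
    index = {normalize(name): name for name in names}
    for size in range(min(max_len, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            key = normalize(" ".join(words[start:start + size]))
            if key in index:
                return index[key]
    return None

# Hypothetical sample data; the real table is not shown in the thread.
NAMES = ["Гвинея", "Экваториальная Гвинея", "Папуа-Новая Гвинея"]

papua = ["Папуа", "Новая", "Гвинея", "New", "фауна", "динозавры"]
equatorial = ["Экваториальная", "Гвинея", "Фауна", "Динозавры"]

print(longest_exact_match(papua, NAMES))       # Папуа-Новая Гвинея
print(longest_exact_match(equatorial, NAMES))  # Экваториальная Гвинея
```

Because three-word and two-word sequences are checked before single words, "Гвинея" alone is only returned when no longer name fits.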

@php-coder php-coder modified the milestones: 0.4.4, next May 23, 2020
@php-coder
Owner Author

Accidentally closed because of a typo in a commit message :(

@php-coder php-coder reopened this May 30, 2020