-
Notifications
You must be signed in to change notification settings - Fork 11.6k
console: fix getwchar failing when LC_ALL undefined #3688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This does not fix the issue of Ctrl+C not stopping generation. Also note that LC_ALL environment variable and LC_ALL constant have different meanings. Normally, LC_ALL environment variable does not need to be set, and for me, non-ASCII input worked without it and without this change. |
Maybe we are thinking about separate problems with similar symptoms. I can reproduce this 100% of the time, if LC_ALL is unset ( This also causes infinite generation, because weof from getwchar is interpreted the same way as end of user input, and when this problem happens, console handler receives endless stream of weof from getwchar, continuously interpreting it as end of user input. Edit: If you can reproduce ctrl+C not stopping generation, I'd appreciate if you help me reproduce it, so maybe a better fix for both problems can be worked out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
common/console.cpp
Outdated
auto lang = getenv("LANG"); | ||
|
||
if (locale == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a reasonable fix to me. Since it's not often needed, it may be better to defer the retrieval of the "LANG" environment variable until it's needed, but it's good as is.
auto lang = getenv("LANG"); | |
if (locale == nullptr) { | |
if (locale == nullptr) { | |
auto lang = getenv("LANG"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found one more case, default debian 12 minimal image, and probably more default images for vm/lxc/docker can have a non utf LC_ALL
and non utf lang
, so just one more commit.
Ok, that fixes it for non unicode default locale too, tested on debian 12 minimal image. @DannyDaemonic |
The problems reported in #3638 are:
I can reproduce them all by having incorrect values for locale-related environment variables. # 1 may be expected even if there are not bugs in console code. # 3 is the strongest pointer to possibility of a bug. But the bug can be somewhere in the system libraries, or it can even be the expected behavior. After all, it is caused by misconfiguration.
Here is what I get without this change.
<...> == Running in interactive mode. == F: Welches Alphabet hat die Buchstaben Ä, Ö, Ü, ẞ? Generation does not start yet. When I enter When I press Ctrl+C, it stops, and I can type in green color again. So everything works as expected. You may get different results because of some difference in your system/environment. For example, incorrect value of LC_CTYPE environment variable can trigger the bug. Then if you override it with LC_ALL, the problem will go away. Also, it can be different terminal or whatever software. Overall, this looks to me like a workaround rather than fixing of root issue.
It requires investigation. You already looked into it more than I did. If the behavior can be fixed without changing locale, it would be the correct fix. I may look into it some time later, but now I'm working on something else. |
Co-authored-by: cebtenzzre <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is approach is solid. It checks if the default locale contains utf-8
, and we will have the fewest issues with a UTF 8 locale.
But this reminds me of the docker fix we implemented in #1673 where we had to use C.utf8
! NOTE: It's lowercase and there's no dash! I was surprised that C.utf8
worked and C.UTF-8
didn't. I tried looking into it and there's just no strict standard for this type of thing. But the two most common are C.UTF-8
and C.utf8
.
common/console.cpp
Outdated
auto locale = setlocale(LC_ALL, ""); | ||
|
||
if (locale == nullptr || strcasestr(locale, "utf-8") == nullptr) { | ||
auto lang = getenv("LANG"); | ||
if (lang != nullptr && strcasestr(lang, "utf-8") != nullptr) { | ||
setlocale(LC_ALL, lang); | ||
} else { | ||
setlocale(LC_ALL, "C.UTF-8"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're trying to cover all our bases, we probably want to try locales in this order: "", "LANG", "C.UTF-8", "C.utf8".
Unfortunately, this can get a bit complicated, but if we approach it with a loop, I think it looks the cleanest and lets us add additional sets in the future. Like perhaps we want to check en_US.UTF-8
if the other UTF 8 character sets fail to load since not all distros include the C UTF 8 set, or perhaps there's another function someone will recommend/write that will suggest a possible locale.
auto locale = setlocale(LC_ALL, ""); | |
if (locale == nullptr || strcasestr(locale, "utf-8") == nullptr) { | |
auto lang = getenv("LANG"); | |
if (lang != nullptr && strcasestr(lang, "utf-8") != nullptr) { | |
setlocale(LC_ALL, lang); | |
} else { | |
setlocale(LC_ALL, "C.UTF-8"); | |
} | |
} | |
using LocaleStr = const char*; | |
LocaleStr locales[] = { | |
"", | |
getenv("LANG"), | |
"C.UTF-8", | |
"C.utf8" | |
}; | |
const int numLocales = sizeof(locales) / sizeof(locales[0]); | |
for (int i = 0; i < numLocales; ++i) { | |
LocaleStr current = locales[i]; | |
if (current) { | |
LocaleStr result = setlocale(LC_ALL, current); | |
if (result && (strcasestr(result, "utf-8") || strcasestr(result, "utf8"))) { | |
break; | |
} | |
} | |
} |
I realize this is slightly less efficient, but it will cover the widest range of character sets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since setting incorrect/nonexistent locale will simply make setlocale
return nullptr, so I believe this could be simplified to
setlocale(LC_ALL, "C.UTF-8") || setlocale(LC_ALL, "C.utf8") || setlocale(LC_ALL, "");
in the else bracket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would probably work. We don't use any of the locale formatting functions or locale dependent functions such as isalpha
. But, we aren't the only ones adding code and although I don't see the need for such uses, it is possible someone wants to format the timing results or benchmarks in the future.
I think you had the right idea starting with setlocale(LC_ALL, "")
and checking what it returned for UTF 8. And if you want to try "LANG"
(which I believe is traditionally used as a fallback when the default fails), then we also need to test if that locale is UTF 8. But for the last two, the simpler check setlocale(LC_ALL, "C.UTF-8") || setlocale(LC_ALL, "C.utf8")
works, being that the utf8 test is redundant.
The other thing I thought of later was if setlocale(LC_ALL, "")
returned a string that didn't contain utf8, we could truncate it after the .
and try it with utf8
and UTF-8
. I don't know how elaborate we want to get. I googled it and C.utf8 C.UTF-8 isn't included in all distros by default, but all the systems I checked did have it. For example:
$ locale -a
C
C.utf8
en_US.utf8
POSIX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other thing I thought of later was if
setlocale(LC_ALL, "")
returned a string that didn't contain utf8, we could truncate it after the.
and try it withutf8
andUTF-8
. I don't know how elaborate we want to get. I googled it and C.utf8 C.UTF-8 isn't included in all distros by default, but all the systems I checked did have it.
Apparently there isn't a hard standard for locale (though obviously vast majority of Linux distros use the same "standard" )
GNU libc manual says " locale names are system-specific. "
So no matter what we do, this is always gonna be a best effort workaround and never a "proper" fix.
I believe that including setlocale(LC_ALL, "")
as a final fallback, would be sufficient to appropriately "cancel" this fix when it cannot succeed, so the previous behaviour will be kept if the workaround cannot be applied, and this will at least ensure this doesn't accidentally make things worse in case of a weird system configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been sitting on a big patch that adds some much needed functionality to console based editing until I had enough time to properly address any issues it may raise. I could try a more elaborate fix in that patch since it will need to be tested more thoroughly anyway and we could go with your much simpler fix in the meantime. I can't imagine using C
instead of language[_territory]
would create any issues, but I haven't tested it. If you have done so with Polish and everything looked fine, I'll approve that change. We just need to keep an eye out for any new reports that might be related.
I was able to trigger bad console behavior even after applying these changes up to 8396208. Let me restate my opinion on the issue. The issue is caused by system misconfiguration. It should be fixed by the user/operator of the system, e.g. by setting environment variables correctly. Our concern here may be that under such misconfiguration, examples are failing in a way that is worse than we would like. We can try to find out why exactly We should mess with locales only if there is no better solution. |
@shibe2 |
const char * cs = nl_langinfo(CODESET);
if (strcasecmp(cs, "UTF-8") && strcasecmp(cs, "utf8")) {
fprintf(stderr, "warning: character encodings other than UTF-8 are not supported, make sure you are using UTF-8 locale\n");
} This needs to be printed somewhere near "== Running in interactive mode. ==", otherwise it will be lost in the giant dump that is printed by default. Then hopefully the user can fix the locale in whatever way is appropriate for their system. (Edit: case-insensitive test) |
I do absolutely agree that the source cause is a system misconfiguration. But this still happens for couple of distros on clean default install |
I'd agree if it would be specifically for llama.cpp. I know that minimal Linux images for containers may not have locales configured the same way as in regular distributions. If someone wants to run interactive console generation in such environment, it is perfectly reasonable to require proper locale configuration that would be needed to support any other application that relies on locale. In other cases, I assumed that default locale would use UTF-8, and llama.cpp must work with any such locale. I definitely want to investigate it when I have time. |
Turns out Also, @DannyDaemonic Let me know if that looks acceptable now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow I always hit Ctrl-Enter
instead of enter when editing these things.
static bool locale_setverify( int c, const char * l ) | ||
{ | ||
setlocale(c, l); | ||
auto locale = setlocale(c, nullptr); | ||
|
||
return !strcmp( l, locale ); | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out setlocale(X,"") is not the same as setlocale(X,nullptr) as the former only returns nullptr if locale changes, whereas the latter returns but doesn't change.
You are right about passing in NULL for the second argument, however, you are wrong about how it works when you pass in a string. setlocale
will only return nullptr
if it fails to set the locale. It will also always return the current locale if it succeeds, and it will succeed even if you pass it in the current locale and nothing changes. So there is no need for a separate call to check the results. You can simply look at the return results.
locale_setverify
is unnecessary at best and wrong at worst. It won't work correctly when you pass in l
of empty string (""
) for example - which getenv("LANG")
could absolutely be.
Sorry about the review comment above. I was in the middle of editing my response when I accidentally posted it somehow. That said, I don't really like this latest set of changes. It overcomplicates the process. We only need to check if I prefer your proposed solution earlier where you just chain the |
Fixes #3638
It appears some Linux distributions can have
LC_ALL
undefined andLANG
is used instead, which causesgetwchar()
to fail unicode conversion, but more importantly, this "poisons" thestdin
(/dev/tty
) causing any subsequentgetwchar
to returnWEOF
indefinitely.This PR, addresses this issue, by adding a fallback which sets locale from
LANG
, and if that too is undefined, sets locale toC.UTF-8
which ensures IO in unicode compatible mode.