Skip to content

console: fix getwchar failing when LC_ALL undefined #3688

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

staviq
Copy link
Contributor

@staviq staviq commented Oct 19, 2023

Fixes #3638

It appears some Linux distributions can have LC_ALL undefined and LANG is used instead, which causes getwchar() to fail unicode conversion, but more importantly, this "poisons" the stdin (/dev/tty) causing any subsequent getwchar to return WEOF indefinitely.

This PR, addresses this issue, by adding a fallback which sets locale from LANG, and if that too is undefined, sets locale to C.UTF-8 which ensures IO in unicode compatible mode.

image

@staviq staviq changed the title fix getwchar failing when LC_ALL undefined console: fix getwchar failing when LC_ALL undefined Oct 19, 2023
@shibe2
Copy link
Contributor

shibe2 commented Oct 19, 2023

This does not fix the issue of Ctrl+C not stopping generation.

Also note that LC_ALL environment variable and LC_ALL constant have different meanings. Normally, LC_ALL environment variable does not need to be set, and for me, non-ASCII input worked without it and without this change.

@staviq
Copy link
Contributor Author

staviq commented Oct 20, 2023

This does not fix the issue of Ctrl+C not stopping generation.

Also note that LC_ALL environment variable and LC_ALL constant have different meanings. Normally, LC_ALL environment variable does not need to be set, and for me, non-ASCII input worked without it and without this change.

Maybe we are thinking about separate problems with similar symptoms.

I can reproduce this 100% of the time, if LC_ALL is unset ( setlocale(LC_ALL,"") returns nullptr ), getwchar fails and it's internal state gets corrupted, making it ignore all input. If you terminate the process when this happens, your input will pop up in the shell prompt, because it was not being consumed and it accumulated in the stdin buffer.

This also causes infinite generation, because weof from getwchar is interpreted the same way as end of user input, and when this problem happens, console handler receives endless stream of weof from getwchar, continuously interpreting it as end of user input.

Edit: If you can reproduce ctrl+C not stopping generation, I'd appreciate if you help me reproduce it, so maybe a better fix for both problems can be worked out.

Copy link
Contributor

@DannyDaemonic DannyDaemonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

Comment on lines 115 to 117
auto lang = getenv("LANG");

if (locale == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable fix to me. Since it's not often needed, it may be better to defer the retrieval of the "LANG" environment variable until it's needed, but it's good as is.

Suggested change
auto lang = getenv("LANG");
if (locale == nullptr) {
if (locale == nullptr) {
auto lang = getenv("LANG");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one more case, default debian 12 minimal image, and probably more default images for vm/lxc/docker can have a non utf LC_ALL and non utf lang, so just one more commit.

@staviq
Copy link
Contributor Author

staviq commented Oct 20, 2023

Ok, that fixes it for non unicode default locale too, tested on debian 12 minimal image.

@DannyDaemonic
It's super late and I'm half asleep :) so if you find anything wrong I'll deal with it tomorrow.

@shibe2
Copy link
Contributor

shibe2 commented Oct 20, 2023

Maybe we are thinking about separate problems with similar symptoms.

The problems reported in #3638 are:

  1. Non-ASCII characters do not appear when typed.
  2. Generation starts prematurely.
  3. Generation does not stop when Ctrl+C is pressed.

I can reproduce them all by having incorrect values for locale-related environment variables. # 1 may be expected even if there are not bugs in console code. # 3 is the strongest pointer to possibility of a bug. But the bug can be somewhere in the system libraries, or it can even be the expected behavior. After all, it is caused by misconfiguration.

I can reproduce this 100% of the time, if LC_ALL is unset

Here is what I get without this change.

printenv|grep -E "^(LANG|LC_)"
LANG=en_US.UTF-8
echo $LC_ALL

bin/main --interactive-first --multiline-input --color -m ggml-model-Q6_K.gguf
Log start

<...>

== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to LLaMa, end your input with ''.
- To return control without starting a new line, end your input with '/'.

F: Welches Alphabet hat die Buchstaben Ä, Ö, Ü, ẞ?
A:

Generation does not start yet. When I enter /, it writes:
Das deutsche Alphabet hat die Buchstaben Ä, Ö, Ü und ẞ. Diese Buchstaben <...>

When I press Ctrl+C, it stops, and I can type in green color again. So everything works as expected.

You may get different results because of some difference in your system/environment. For example, incorrect value of LC_CTYPE environment variable can trigger the bug. Then if you override it with LC_ALL, the problem will go away. Also, it can be different terminal or whatever software.

Overall, this looks to me like a workaround rather than fixing of root issue.

If you can reproduce ctrl+C not stopping generation, I'd appreciate if you help me reproduce it, so maybe a better fix for both problems can be worked out.

It requires investigation. You already looked into it more than I did. If the behavior can be fixed without changing locale, it would be the correct fix. I may look into it some time later, but now I'm working on something else.

Copy link
Contributor

@DannyDaemonic DannyDaemonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is approach is solid. It checks if the default locale contains utf-8, and we will have the fewest issues with a UTF 8 locale.

But this reminds me of the docker fix we implemented in #1673 where we had to use C.utf8! NOTE: It's lowercase and there's no dash! I was surprised that C.utf8 worked and C.UTF-8 didn't. I tried looking into it and there's just no strict standard for this type of thing. But the two most common are C.UTF-8 and C.utf8.

Comment on lines 115 to 124
auto locale = setlocale(LC_ALL, "");

if (locale == nullptr || strcasestr(locale, "utf-8") == nullptr) {
auto lang = getenv("LANG");
if (lang != nullptr && strcasestr(lang, "utf-8") != nullptr) {
setlocale(LC_ALL, lang);
} else {
setlocale(LC_ALL, "C.UTF-8");
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're trying to cover all our bases, we probably want to try locales in this order: "", "LANG", "C.UTF-8", "C.utf8".

Unfortunately, this can get a bit complicated, but if we approach it with a loop, I think it looks the cleanest and lets us add additional sets in the future. Like perhaps we want to check en_US.UTF-8 if the other UTF 8 character sets fail to load since not all distros include the C UTF 8 set, or perhaps there's another function someone will recommend/write that will suggest a possible locale.

Suggested change
auto locale = setlocale(LC_ALL, "");
if (locale == nullptr || strcasestr(locale, "utf-8") == nullptr) {
auto lang = getenv("LANG");
if (lang != nullptr && strcasestr(lang, "utf-8") != nullptr) {
setlocale(LC_ALL, lang);
} else {
setlocale(LC_ALL, "C.UTF-8");
}
}
using LocaleStr = const char*;
LocaleStr locales[] = {
"",
getenv("LANG"),
"C.UTF-8",
"C.utf8"
};
const int numLocales = sizeof(locales) / sizeof(locales[0]);
for (int i = 0; i < numLocales; ++i) {
LocaleStr current = locales[i];
if (current) {
LocaleStr result = setlocale(LC_ALL, current);
if (result && (strcasestr(result, "utf-8") || strcasestr(result, "utf8"))) {
break;
}
}
}

I realize this is slightly less efficient, but it will cover the widest range of character sets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since setting incorrect/nonexistent locale will simply make setlocale return nullptr, so I believe this could be simplified to
setlocale(LC_ALL, "C.UTF-8") || setlocale(LC_ALL, "C.utf8") || setlocale(LC_ALL, ""); in the else bracket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would probably work. We don't use any of the locale formatting functions or locale dependent functions such as isalpha. But, we aren't the only ones adding code and although I don't see the need for such uses, it is possible someone wants to format the timing results or benchmarks in the future.

I think you had the right idea starting with setlocale(LC_ALL, "") and checking what it returned for UTF 8. And if you want to try "LANG" (which I believe is traditionally used as a fallback when the default fails), then we also need to test if that locale is UTF 8. But for the last two, the simpler check setlocale(LC_ALL, "C.UTF-8") || setlocale(LC_ALL, "C.utf8") works, being that the utf8 test is redundant.

The other thing I thought of later was if setlocale(LC_ALL, "") returned a string that didn't contain utf8, we could truncate it after the . and try it with utf8 and UTF-8. I don't know how elaborate we want to get. I googled it and C.utf8 C.UTF-8 isn't included in all distros by default, but all the systems I checked did have it. For example:

$ locale -a
C
C.utf8
en_US.utf8
POSIX

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other thing I thought of later was if setlocale(LC_ALL, "") returned a string that didn't contain utf8, we could truncate it after the . and try it with utf8 and UTF-8. I don't know how elaborate we want to get. I googled it and C.utf8 C.UTF-8 isn't included in all distros by default, but all the systems I checked did have it.

Apparently there isn't a hard standard for locale (though obviously vast majority of Linux distros use the same "standard" )

GNU libc manual says " locale names are system-specific. "

So no matter what we do, this is always gonna be a best effort workaround and never a "proper" fix.

I believe that including setlocale(LC_ALL, "") as a final fallback, would be sufficient to appropriately "cancel" this fix when it cannot succeed, so the previous behaviour will be kept if the workaround cannot be applied, and this will at least ensure this doesn't accidentally make things worse in case of a weird system configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been sitting on a big patch that adds some much needed functionality to console based editing until I had enough time to properly address any issues it may raise. I could try a more elaborate fix in that patch since it will need to be tested more thoroughly anyway and we could go with your much simpler fix in the meantime. I can't imagine using C instead of language[_territory] would create any issues, but I haven't tested it. If you have done so with Polish and everything looked fine, I'll approve that change. We just need to keep an eye out for any new reports that might be related.

@shibe2
Copy link
Contributor

shibe2 commented Oct 20, 2023

I was able to trigger bad console behavior even after applying these changes up to 8396208. Let me restate my opinion on the issue.

The issue is caused by system misconfiguration. It should be fixed by the user/operator of the system, e.g. by setting environment variables correctly.

Our concern here may be that under such misconfiguration, examples are failing in a way that is worse than we would like. We can try to find out why exactly getwchar and Ctrl+C don't work and consider possible solutions.

We should mess with locales only if there is no better solution.

@DannyDaemonic
Copy link
Contributor

@shibe2
I agree this is likely a misconfiguration issue. I fixed a similar issue for the docker image in the past - by changing the configuration. That said, I don't see the harm in trying to work around the issue. It might help to warn the user that their default locale is being overriden, but these locale settings aren't being used for formatting numbers and we don't call functions like isalpha or isdigit. The only thing we really use the locale for is the UTF-8 encoding.

@shibe2
Copy link
Contributor

shibe2 commented Oct 20, 2023

const char * cs = nl_langinfo(CODESET);
if (strcasecmp(cs, "UTF-8") && strcasecmp(cs, "utf8")) {
     fprintf(stderr, "warning: character encodings other than UTF-8 are not supported, make sure you are using UTF-8 locale\n");
}

This needs to be printed somewhere near "== Running in interactive mode. ==", otherwise it will be lost in the giant dump that is printed by default. Then hopefully the user can fix the locale in whatever way is appropriate for their system.

(Edit: case-insensitive test)

@staviq
Copy link
Contributor Author

staviq commented Oct 20, 2023

@shibe2

I do absolutely agree that the source cause is a system misconfiguration.

But this still happens for couple of distros on clean default install
And I don't think we should encourage users to change their system configuration specifically for llama.cpp (imho)

@shibe2
Copy link
Contributor

shibe2 commented Oct 20, 2023

And I don't think we should encourage users to change their system configuration specifically for llama.cpp (imho)

I'd agree if it would be specifically for llama.cpp. I know that minimal Linux images for containers may not have locales configured the same way as in regular distributions. If someone wants to run interactive console generation in such environment, it is perfectly reasonable to require proper locale configuration that would be needed to support any other application that relies on locale. In other cases, I assumed that default locale would use UTF-8, and llama.cpp must work with any such locale. I definitely want to investigate it when I have time.

@staviq
Copy link
Contributor Author

staviq commented Oct 20, 2023

Turns out setlocale(X,"") is not the same as setlocale(X,nullptr) as the former only returns not nullptr if locale changes, whereas the latter returns but doesn't change.

Also, setlocale returns nullptr but getenv returns empty string, so i added that check too.

@DannyDaemonic Let me know if that looks acceptable now.

Copy link
Contributor

@DannyDaemonic DannyDaemonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow I always hit Ctrl-Enter instead of enter when editing these things.

Comment on lines +62 to +69
static bool locale_setverify( int c, const char * l )
{
setlocale(c, l);
auto locale = setlocale(c, nullptr);

return !strcmp( l, locale );
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out setlocale(X,"") is not the same as setlocale(X,nullptr) as the former only returns nullptr if locale changes, whereas the latter returns but doesn't change.

You are right about passing in NULL for the second argument, however, you are wrong about how it works when you pass in a string. setlocale will only return nullptr if it fails to set the locale. It will also always return the current locale if it succeeds, and it will succeed even if you pass it in the current locale and nothing changes. So there is no need for a separate call to check the results. You can simply look at the return results.

locale_setverify is unnecessary at best and wrong at worst. It won't work correctly when you pass in l of empty string ("") for example - which getenv("LANG") could absolutely be.

@DannyDaemonic
Copy link
Contributor

Sorry about the review comment above. I was in the middle of editing my response when I accidentally posted it somehow.

That said, I don't really like this latest set of changes. It overcomplicates the process. We only need to check if getenv("LANG") is null. My previous fixes checked if the LANG env results were null before using it, but perhaps I should have been more explicit there. We also don't need to call setlocale a second time to check the results. You can simply check the return value.

I prefer your proposed solution earlier where you just chain the setlocales with || over this. If you do want to try this order ("", LANG, "C.utf8", "C.UTF-8"), it can be done a lot more straightforwardly.

@staviq staviq marked this pull request as draft October 21, 2023 01:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[User] Hitting special characters keys on AZERTY keyboards starts infinite generation in interactive mode
4 participants