console: fix getwchar failing when LC_ALL undefined #3688

staviq · 2023-10-19T19:47:17Z

It appears that on some Linux distributions LC_ALL is undefined and LANG is used instead. This causes getwchar() to fail its Unicode conversion and, more importantly, "poisons" stdin (/dev/tty), so every subsequent getwchar() call returns WEOF indefinitely.

This PR addresses the issue by adding a fallback that sets the locale from LANG and, if that is also undefined, sets the locale to C.UTF-8, which ensures I/O in a Unicode-compatible mode.
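As a rough illustration of the fallback described above (not the PR's actual code), a minimal sketch might look like the following. The helper name `set_locale_with_fallback` is hypothetical, and the sketch assumes that a null return from `setlocale` is the only failure signal being handled.

```cpp
#include <clocale>
#include <cstdlib>

// Hypothetical helper, not taken from the PR: try the environment's locale,
// then LANG on its own, then C.UTF-8.
// Returns the name of the locale that was set, or nullptr if every attempt failed.
const char * set_locale_with_fallback() {
    // "" asks the C library to choose the locale from the environment.
    if (const char * loc = std::setlocale(LC_ALL, "")) {
        return loc;
    }
    // Retry with LANG alone, skipping an unset or empty value.
    const char * lang = std::getenv("LANG");
    if (lang != nullptr && *lang != '\0') {
        if (const char * loc = std::setlocale(LC_ALL, lang)) {
            return loc;
        }
    }
    // Last resort named in the description; not every system ships C.UTF-8.
    return std::setlocale(LC_ALL, "C.UTF-8");
}
```

If all three attempts fail, the process simply stays in whatever locale it already had.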

shibe2 · 2023-10-19T23:25:55Z

This does not fix the issue of Ctrl+C not stopping generation.

Also note that LC_ALL environment variable and LC_ALL constant have different meanings. Normally, LC_ALL environment variable does not need to be set, and for me, non-ASCII input worked without it and without this change.

staviq · 2023-10-20T01:22:35Z

> This does not fix the issue of Ctrl+C not stopping generation.
>
> Also note that LC_ALL environment variable and LC_ALL constant have different meanings. Normally, LC_ALL environment variable does not need to be set, and for me, non-ASCII input worked without it and without this change.

Maybe we are thinking about separate problems with similar symptoms.

I can reproduce this 100% of the time: if LC_ALL is unset (setlocale(LC_ALL, "") returns nullptr), getwchar fails and its internal state gets corrupted, making it ignore all input. If you terminate the process when this happens, your input will pop up in the shell prompt, because it was not being consumed and accumulated in the stdin buffer.

This also causes infinite generation, because WEOF from getwchar is interpreted the same way as end of user input. When this problem happens, the console handler receives an endless stream of WEOF from getwchar and keeps interpreting each one as end of user input.
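One way to avoid treating every WEOF as end of input is to check why the read failed. The sketch below is an assumption about how that could be done, not code from console.cpp: the `read_status` enum and `read_wide_char` function are hypothetical, and `fgetwc` is used in place of `getwchar` (which is equivalent to `fgetwc(stdin)`) so the stream can be passed in.

```cpp
#include <cerrno>
#include <cstdio>
#include <cwchar>

// Hypothetical result type for a single wide-character read.
enum class read_status { ok, end_of_input, encoding_error, io_error };

// Reads one wide character and reports why the read failed, if it did.
read_status read_wide_char(std::FILE * in, wchar_t & out) {
    errno = 0;
    const std::wint_t wc = std::fgetwc(in);
    if (wc != WEOF) {
        out = static_cast<wchar_t>(wc);
        return read_status::ok;
    }
    if (std::feof(in)) {
        return read_status::end_of_input; // genuine end of the stream
    }
    // On a conversion failure fgetwc sets errno to EILSEQ and the error indicator.
    const bool bad_sequence = (errno == EILSEQ);
    // Clear the error indicator so the caller can decide whether to retry;
    // this alone is not guaranteed to make the stream usable again.
    std::clearerr(in);
    return bad_sequence ? read_status::encoding_error : read_status::io_error;
}
```

A caller could then end the user's turn only on `end_of_input` and report the other two cases instead of looping on them.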

Edit: If you can reproduce Ctrl+C not stopping generation, I'd appreciate it if you could help me reproduce it, so that a better fix for both problems can be worked out.

DannyDaemonic

This looks good to me.

DannyDaemonic · 2023-10-20T01:53:27Z

common/console.cpp

+        auto lang = getenv("LANG");
+
+        if (locale == nullptr) {


This seems like a reasonable fix to me. Since it's not often needed, it may be better to defer the retrieval of the "LANG" environment variable until it's needed, but it's good as is.

Suggested change

-        auto lang = getenv("LANG");
-
-        if (locale == nullptr) {
+        if (locale == nullptr) {
+            auto lang = getenv("LANG");

I found one more case: the default Debian 12 minimal image, and probably other default images for VM/LXC/Docker, can have a non-UTF-8 LC_ALL and a non-UTF-8 LANG, so just one more commit.

staviq · 2023-10-20T02:22:03Z

OK, that fixes it for a non-Unicode default locale too; tested on the Debian 12 minimal image.

@DannyDaemonic
It's super late and I'm half asleep :) so if you find anything wrong I'll deal with it tomorrow.

shibe2 · 2023-10-20T02:25:22Z

> Maybe we are thinking about separate problems with similar symptoms.

The problems reported in #3638 are:

1. Non-ASCII characters do not appear when typed.
2. Generation starts prematurely.
3. Generation does not stop when Ctrl+C is pressed.

I can reproduce them all by setting incorrect values for the locale-related environment variables. #1 may be expected even if there are no bugs in the console code. #3 is the strongest pointer to the possibility of a bug. But the bug could be somewhere in the system libraries, or it could even be the expected behavior. After all, it is caused by misconfiguration.

> I can reproduce this 100% of the time, if LC_ALL is unset

Here is what I get without this change.

printenv|grep -E "^(LANG|LC_)"
LANG=en_US.UTF-8
echo $LC_ALL

bin/main --interactive-first --multiline-input --color -m ggml-model-Q6_K.gguf
Log start

<...>

== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to LLaMa, end your input with ''.
- To return control without starting a new line, end your input with '/'.

F: Welches Alphabet hat die Buchstaben Ä, Ö, Ü, ẞ?
A:

Generation does not start yet. When I enter /, it writes:
Das deutsche Alphabet hat die Buchstaben Ä, Ö, Ü und ẞ. Diese Buchstaben <...>

When I press Ctrl+C, it stops, and I can type in green color again. So everything works as expected.

You may get different results because of some difference in your system or environment. For example, an incorrect value of the LC_CTYPE environment variable can trigger the bug; if you then override it with LC_ALL, the problem goes away. It could also be a different terminal or some other software.

Overall, this looks to me like a workaround rather than a fix for the root issue.

> If you can reproduce ctrl+C not stopping generation, I'd appreciate if you help me reproduce it, so maybe a better fix for both problems can be worked out.

It requires investigation. You already looked into it more than I did. If the behavior can be fixed without changing locale, it would be the correct fix. I may look into it some time later, but now I'm working on something else.

common/console.cpp

Co-authored-by: cebtenzzre <[email protected]>

DannyDaemonic

This approach is solid. It checks whether the default locale contains utf-8, and we will have the fewest issues with a UTF-8 locale.

But this reminds me of the docker fix we implemented in #1673 where we had to use C.utf8! NOTE: It's lowercase and there's no dash! I was surprised that C.utf8 worked and C.UTF-8 didn't. I tried looking into it and there's just no strict standard for this type of thing. But the two most common are C.UTF-8 and C.utf8.

DannyDaemonic · 2023-10-20T09:34:52Z

common/console.cpp

+        auto locale = setlocale(LC_ALL, "");
+
+        if (locale == nullptr || strcasestr(locale, "utf-8") == nullptr) {
+            auto lang = getenv("LANG");
+            if (lang != nullptr && strcasestr(lang, "utf-8") != nullptr) {
+                setlocale(LC_ALL, lang);
+            } else {
+                setlocale(LC_ALL, "C.UTF-8");
+            }
+        }


If we're trying to cover all our bases, we probably want to try locales in this order: "", "LANG", "C.UTF-8", "C.utf8".

Unfortunately, this can get a bit complicated, but if we approach it with a loop, I think it looks the cleanest and lets us add additional sets in the future. Like perhaps we want to check en_US.UTF-8 if the other UTF 8 character sets fail to load since not all distros include the C UTF 8 set, or perhaps there's another function someone will recommend/write that will suggest a possible locale.

Suggested change

-        auto locale = setlocale(LC_ALL, "");
-        if (locale == nullptr || strcasestr(locale, "utf-8") == nullptr) {
-            auto lang = getenv("LANG");
-            if (lang != nullptr && strcasestr(lang, "utf-8") != nullptr) {
-                setlocale(LC_ALL, lang);
-            } else {
-                setlocale(LC_ALL, "C.UTF-8");
-            }
-        }
+        using LocaleStr = const char*;
+        LocaleStr locales[] = {
+            "",
+            getenv("LANG"),
+            "C.UTF-8",
+            "C.utf8"
+        };
+        const int numLocales = sizeof(locales) / sizeof(locales[0]);
+        for (int i = 0; i < numLocales; ++i) {
+            LocaleStr current = locales[i];
+            if (current) {
+                LocaleStr result = setlocale(LC_ALL, current);
+                if (result && (strcasestr(result, "utf-8") || strcasestr(result, "utf8"))) {
+                    break;
+                }
+            }
+        }

I realize this is slightly less efficient, but it will cover the widest range of character sets.

Since setting an incorrect or nonexistent locale simply makes setlocale return nullptr, I believe this could be simplified to
setlocale(LC_ALL, "C.UTF-8") || setlocale(LC_ALL, "C.utf8") || setlocale(LC_ALL, ""); in the else branch.
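Written out as a function, the chained form relies on `||` short-circuiting: evaluation stops at the first call that returns non-null, so a later, less specific attempt never overrides an earlier success. This is a sketch of that idea only; the wrapper name `set_utf8_or_environment_locale` is hypothetical.

```cpp
#include <clocale>

// Hypothetical wrapper around the chained calls: returns true if any of the
// three attempts succeeded in setting a locale.
bool set_utf8_or_environment_locale() {
    return std::setlocale(LC_ALL, "C.UTF-8") != nullptr   // preferred spelling
        || std::setlocale(LC_ALL, "C.utf8")  != nullptr   // alternative spelling
        || std::setlocale(LC_ALL, "")        != nullptr;  // fall back to the environment
}
```

Note that this version never inspects the name of the locale it ends up with, so the final fallback may leave a non-UTF-8 locale in place.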

That would probably work. We don't use any of the locale formatting functions or locale dependent functions such as isalpha. But, we aren't the only ones adding code and although I don't see the need for such uses, it is possible someone wants to format the timing results or benchmarks in the future.

I think you had the right idea starting with setlocale(LC_ALL, "") and checking whether what it returned is UTF-8. And if you want to try "LANG" (which I believe is traditionally used as a fallback when the default fails), then we also need to test whether that locale is UTF-8. But for the last two, the simpler check setlocale(LC_ALL, "C.UTF-8") || setlocale(LC_ALL, "C.utf8") works, since the UTF-8 test is redundant there.

The other thing I thought of later: if setlocale(LC_ALL, "") returned a string that didn't contain utf8, we could truncate it after the . and try it with utf8 and UTF-8. I don't know how elaborate we want to get. I googled it, and C.utf8 / C.UTF-8 isn't included in all distros by default, but all the systems I checked did have it. For example:

$ locale -a
C
C.utf8
en_US.utf8
POSIX
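The truncate-and-retry idea could be sketched as a small helper that derives UTF-8 candidates from a locale name. The function name `utf8_variants` is hypothetical, and the sketch assumes a simple name of the form language[_territory][.codeset][@modifier]; it drops any modifier and does not handle composite names that some C libraries return when categories differ.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: given a locale name, return the same
// language[_territory] with the two common UTF-8 codeset spellings.
std::vector<std::string> utf8_variants(const std::string & locale_name) {
    // Cut at the codeset ('.') or modifier ('@'); keep the whole name if neither is present.
    const std::string base = locale_name.substr(0, locale_name.find_first_of(".@"));
    return { base + ".UTF-8", base + ".utf8" };
}
```

Each candidate would still have to be passed to `setlocale` and checked, since there is no guarantee the system has the UTF-8 variant installed.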

> The other thing I thought of later was if setlocale(LC_ALL, "") returned a string that didn't contain utf8, we could truncate it after the . and try it with utf8 and UTF-8. I don't know how elaborate we want to get. I googled it and C.utf8 C.UTF-8 isn't included in all distros by default, but all the systems I checked did have it.

Apparently there isn't a hard standard for locale names (though obviously the vast majority of Linux distros use the same "standard").

The GNU libc manual says "locale names are system-specific."

So no matter what we do, this is always gonna be a best effort workaround and never a "proper" fix.

I believe that including setlocale(LC_ALL, "") as a final fallback, would be sufficient to appropriately "cancel" this fix when it cannot succeed, so the previous behaviour will be kept if the workaround cannot be applied, and this will at least ensure this doesn't accidentally make things worse in case of a weird system configuration.

I've been sitting on a big patch that adds some much needed functionality to console based editing until I had enough time to properly address any issues it may raise. I could try a more elaborate fix in that patch since it will need to be tested more thoroughly anyway and we could go with your much simpler fix in the meantime. I can't imagine using C instead of language[_territory] would create any issues, but I haven't tested it. If you have done so with Polish and everything looked fine, I'll approve that change. We just need to keep an eye out for any new reports that might be related.

shibe2 · 2023-10-20T10:22:01Z

I was able to trigger bad console behavior even after applying these changes up to 8396208. Let me restate my opinion on the issue.

The issue is caused by system misconfiguration. It should be fixed by the user/operator of the system, e.g. by setting environment variables correctly.

Our concern here may be that under such misconfiguration, examples are failing in a way that is worse than we would like. We can try to find out why exactly getwchar and Ctrl+C don't work and consider possible solutions.

We should mess with locales only if there is no better solution.

DannyDaemonic · 2023-10-20T11:18:52Z

@shibe2
I agree this is likely a misconfiguration issue. I fixed a similar issue for the docker image in the past by changing the configuration. That said, I don't see the harm in trying to work around the issue. It might help to warn the user that their default locale is being overridden, but these locale settings aren't being used for formatting numbers and we don't call functions like isalpha or isdigit. The only thing we really use the locale for is the UTF-8 encoding.

shibe2 · 2023-10-20T12:01:49Z

const char * cs = nl_langinfo(CODESET);
if (strcasecmp(cs, "UTF-8") && strcasecmp(cs, "utf8")) {
     fprintf(stderr, "warning: character encodings other than UTF-8 are not supported, make sure you are using UTF-8 locale\n");
}

This needs to be printed somewhere near "== Running in interactive mode. ==", otherwise it will be lost in the giant dump that is printed by default. Then hopefully the user can fix the locale in whatever way is appropriate for their system.

(Edit: case-insensitive test)

staviq · 2023-10-20T13:34:43Z

@shibe2

I do absolutely agree that the source cause is a system misconfiguration.

But this still happens for a couple of distros on a clean default install.
And I don't think we should encourage users to change their system configuration specifically for llama.cpp (imho).

shibe2 · 2023-10-20T14:32:18Z

> And I don't think we should encourage users to change their system configuration specifically for llama.cpp (imho)

I'd agree if it were specifically for llama.cpp. I know that minimal Linux images for containers may not have locales configured the same way as in regular distributions. If someone wants to run interactive console generation in such an environment, it is perfectly reasonable to require the proper locale configuration that would be needed to support any other application that relies on locale. In other cases, I assumed that the default locale would use UTF-8, and llama.cpp must work with any such locale. I definitely want to investigate it when I have time.

staviq · 2023-10-20T19:22:53Z

Turns out setlocale(X,"") is not the same as setlocale(X,nullptr), as the former only returns non-null if the locale changes, whereas the latter returns the current locale without changing it.

Also, setlocale returns nullptr but getenv returns an empty string, so I added that check too.

@DannyDaemonic Let me know if that looks acceptable now.

DannyDaemonic

Somehow I always hit Ctrl-Enter instead of enter when editing these things.

DannyDaemonic · 2023-10-21T00:15:52Z

common/console.cpp

+    static bool locale_setverify( int c, const char * l )
+    {
+        setlocale(c, l);
+        auto locale = setlocale(c, nullptr);
+
+        return !strcmp( l, locale );
+    }
+#endif


> Turns out setlocale(X,"") is not the same as setlocale(X,nullptr) as the former only returns nullptr if locale changes, whereas the latter returns but doesn't change.

You are right about passing in NULL for the second argument, however, you are wrong about how it works when you pass in a string. setlocale will only return nullptr if it fails to set the locale. It will also always return the current locale if it succeeds, and it will succeed even if you pass it in the current locale and nothing changes. So there is no need for a separate call to check the results. You can simply look at the return results.

locale_setverify is unnecessary at best and wrong at worst. It won't work correctly when you pass in l of empty string ("") for example - which getenv("LANG") could absolutely be.
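The point about the return value can be illustrated with a short sketch. The helper `set_and_report` is hypothetical; it copies the result into a `std::string` because the pointer returned by `setlocale` may be invalidated by a later call.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: sets the locale and returns its name, or an empty
// string if setlocale failed. No second setlocale(c, nullptr) query is needed.
std::string set_and_report(const char * name) {
    const char * result = std::setlocale(LC_ALL, name);
    // Copy immediately: the returned storage may be reused by later calls.
    return result != nullptr ? std::string(result) : std::string();
}
```

Calling `set_and_report("C")` twice in a row succeeds both times, even though the second call changes nothing, which is the behavior described above.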

DannyDaemonic · 2023-10-21T00:34:06Z

Sorry about the review comment above. I was in the middle of editing my response when I accidentally posted it somehow.

That said, I don't really like this latest set of changes. It overcomplicates the process. We only need to check if getenv("LANG") is null. My previous fixes checked if the LANG env results were null before using it, but perhaps I should have been more explicit there. We also don't need to call setlocale a second time to check the results. You can simply check the return value.

I prefer your proposed solution earlier where you just chain the setlocales with || over this. If you do want to try this order ("", LANG, "C.utf8", "C.UTF-8"), it can be done a lot more straightforwardly.
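For reference, one possible shape of the more straightforward version is sketched below. It is an assumption about what such code could look like, not what was eventually merged: the names `names_utf8` and `select_utf8_locale` are hypothetical, and the UTF-8 test is done on a lowercased copy of the name rather than with the GNU-specific `strcasestr`.

```cpp
#include <cctype>
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if the locale name mentions UTF-8 in either spelling.
static bool names_utf8(const char * name) {
    if (name == nullptr) {
        return false;
    }
    std::string lower(name);
    for (char & c : lower) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return lower.find("utf-8") != std::string::npos || lower.find("utf8") != std::string::npos;
}

// Tries "", LANG, "C.utf8", "C.UTF-8" in that order and stops at the first
// attempt whose resulting locale name looks like UTF-8.
bool select_utf8_locale() {
    const char * candidates[] = { "", std::getenv("LANG"), "C.utf8", "C.UTF-8" };
    for (const char * candidate : candidates) {
        if (candidate == nullptr) {
            continue; // LANG is not set
        }
        // The return value alone tells us whether the call succeeded and what was set.
        if (names_utf8(std::setlocale(LC_ALL, candidate))) {
            return true;
        }
    }
    return false; // the locale is left at whatever the last successful call set
}
```

An empty LANG value is passed through as "", which just repeats the first attempt and is harmless.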

fix getwchar failing when LC_ALL undefined

9eab8b6

staviq changed the title ~~fix getwchar failing when LC_ALL undefined~~ console: fix getwchar failing when LC_ALL undefined Oct 19, 2023

fmt

5616b43

DannyDaemonic approved these changes Oct 20, 2023

staviq added 2 commits October 20, 2023 04:03

fix non-utf locale

1e328f4

fix non-utf locale 2

1ad5224

cebtenzzre reviewed Oct 20, 2023

staviq and others added 2 commits October 20, 2023 03:04

Update common/console.cpp

a72c053

Co-authored-by: cebtenzzre <[email protected]>

remove redundant C locale check

8396208

DannyDaemonic suggested changes Oct 20, 2023

fix empty str vs nullptr for setlocale and getenv

226ed5f

fix win

6d126d0

DannyDaemonic reviewed Oct 21, 2023

staviq marked this pull request as draft October 21, 2023 01:53
