Skip to content

Commit bfaaa04

Browse files
committed
Fix langinfo(ALT_DIGITS)
This has never worked properly before in Perl. The code is returning the result of the libc function nl_langinfo(). The documentation for it that I have found (and presumably my predecessors) is very unclear. But what actually happens (from using gdb) is that the return is very C unfriendly. Instead of returning a NUL-terminated string, it returns 100 (perhaps fewer) NUL-terminated strings in a row. When it is fewer (given the few examples I've seen), the final one ends with two NULs in a row. (I can't think of a way for it to work and be otherwise). The 100th one doesn't necessarily have two terminating NULs. Prior to this commit, only the string for the zeroth digit was returned; now the entire ALT_DIGIT string sequence is returned, forcing a double NUL at the end of the final one. This information is accessible in several ways. Via XS, one can use any of several functions, including the newly introduced sv_langinfo(), returning an SV, which allows for easier handling of embedded NULs. (Otherwise in XS, using the functions that return a char*, one has to look for the double-NUL.) From Perl-space, the access is via I18N::Langinfo, which behind the scenes also uses an SV. The documentation added in this commit gives advice for how to turn the return into an @array for more convenient access.
1 parent fbdd09d commit bfaaa04

File tree

4 files changed

+277
-40
lines changed

4 files changed

+277
-40
lines changed

ext/I18N-Langinfo/Langinfo.pm

+49-9
Original file line numberDiff line numberDiff line change
@@ -70,13 +70,15 @@ our @EXPORT_OK = qw(
7070
YESSTR
7171
);
7272

73-
our $VERSION = '0.22';
73+
our $VERSION = '0.23';
7474

7575
XSLoader::load();
7676

7777
1;
7878
__END__
7979
80+
=encoding utf8
81+
8082
=head1 NAME
8183
8284
I18N::Langinfo - query locale information
@@ -155,16 +157,52 @@ For the character code set being used (such as "ISO8859-1", "cp850",
155157
156158
=item *
157159
158-
For an alternate representation of digits, for the
159-
radix character used between the integer and the fractional part
160-
of decimal numbers, the group separator string for large-ish floating point
161-
numbers (yes, the final two are redundant with
160+
For the radix character used between the integer and the fractional part of
161+
decimal numbers and the group separator string for large-ish floating point
162+
numbers (yes, these are redundant with
162163
L<POSIX::localeconv()|POSIX/localeconv>):
163164
164-
ALT_DIGITS RADIXCHAR THOUSEP
165+
RADIXCHAR THOUSEP
165166
166167
=item *
167168
169+
For any alternate digits in the locale
170+
171+
ALT_DIGITS
172+
173+
This returns a sequence of up to 100 strings, starting with the alternate
174+
representation of zero; then the same for one, two, ... ninety-nine.
175+
176+
To access this data conveniently, you could do something like
177+
178+
use I18N::Langinfo qw(langinfo ALT_DIGITS);
179+
my @alt_digits = split ';', langinfo(ALT_DIGITS);
180+
181+
The array C<@alt_digits> will contain 0 elements if the current locale doesn't
182+
have alternate digits specified for it. Otherwise, it will have as many
183+
elements as the locale defines, with C<[0]> containing the alternate digit for
184+
zero; C<[1]> for one; and so forth, up to potentially C<[99]> for the
185+
alternate representation of ninety-nine.
186+
187+
Most locales don't have alternate digits, so the array will be empty.
188+
189+
Be aware that the alternate representation in some locales for the numbers
190+
0..9 will have a leading alternate-zero, so would look like the equivalent of
191+
00..09.
192+
193+
Running this program
194+
195+
use I18N::Langinfo qw(langinfo ALT_DIGITS);
196+
my @alt_digits = split ';', langinfo(ALT_DIGITS);
197+
splice @alt_digits, 15;
198+
print join " ", @alt_digits, "\n";
199+
200+
on a Japanese locale yields
201+
202+
S<C<〇 一 二 三 四 五 六 七 八 九 十 十一 十二 十三 十四>>
203+
204+
= item *
205+
168206
For the affirmative and negative responses and expressions:
169207
170208
YESSTR YESEXPR NOSTR NOEXPR
@@ -235,9 +273,11 @@ differently, please file a report at L<https://github.com/Perl/perl5/issues>.
235273
236274
=item C<ALT_DIGITS>
237275
238-
Currently this gives the same results as Linux does. If you have examples of
239-
it needing to work differently, please file a report at
240-
L<https://github.com/Perl/perl5/issues>.
276+
This tries hard to return the same values as C<nl_langinfo()>. It uses the
277+
C<%O> formats that the C<libc> L<strftime(3)> function on some platforms (not
278+
Windows) understands. It returns as many consecutive alternate digits as it
279+
can find, starting with the one for zero; or the empty string if none are
280+
found.
241281
242282
=item C<ERA_D_FMT>
243283

lib/locale.t

+47-1
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,7 @@ BEGIN {
5858
}
5959

6060
use feature 'fc';
61-
use I18N::Langinfo qw(langinfo CODESET CRNCYSTR RADIXCHAR THOUSEP);
61+
use I18N::Langinfo qw(langinfo CODESET CRNCYSTR RADIXCHAR THOUSEP ALT_DIGITS);
6262

6363
# =1 adds debugging output; =2 increases the verbosity somewhat
6464
our $debug = $ENV{PERL_DEBUG_FULL_TEST} // 0;
@@ -1060,6 +1060,7 @@ foreach my $Locale (@Locale) {
10601060
debug "is utf8 locale? = $is_utf8_locale\n";
10611061
debug "radix = " . disp_str(langinfo(RADIXCHAR)) . "\n";
10621062
debug "numeric group separator = '" . disp_str(langinfo(THOUSEP)) . "'\n";
1063+
debug "alt_digits = " . disp_str(langinfo(ALT_DIGITS)) . "\n";
10631064
debug "currency = " . disp_str(langinfo(CRNCYSTR));
10641065
}
10651066

@@ -2465,6 +2466,51 @@ foreach my $Locale (@Locale) {
24652466
print "# failed $locales_test_number locale '$Locale' numbers @f\n"
24662467
}
24672468
}
2469+
2470+
{
2471+
my @f = ();
2472+
++$locales_test_number;
2473+
$test_names{$locales_test_number} =
2474+
'Verify ALT_DIGITS returns nothing, or else non-ASCII and'
2475+
. ' the single char digits evaluate to consecutive integers'
2476+
. ' starting at 0';
2477+
2478+
my $alts = langinfo(ALT_DIGITS);
2479+
if ($alts) {
2480+
my @alts = split ';', $alts;
2481+
my $prev = -1;
2482+
foreach my $num (@alts) {
2483+
if ($num =~ /[[:ascii:]]/) {
2484+
push @f, disp_str($num);
2485+
last;
2486+
}
2487+
2488+
# We only look at single character strings; likely locales
2489+
# that have alternate digits have a different mechanism for
2490+
# representing larger numbers. Japanese for example, has a
2491+
# single character for the number 10, which is prefixed to the
2492+
# '1' symbol for '11', etc. And 21 is represented by 3
2493+
# characters, the '2' symbol, followed by the '10' symbol,
2494+
# then the '1' symbol. (There is nothing to say that a locale
2495+
# even has to use base 10.)
2496+
last if length $num > 1;
2497+
2498+
use Unicode::UCD 'num';
2499+
my $value = num($num);
2500+
if ($value != $prev + 1) {
2501+
push @f, disp_str($num);
2502+
last;
2503+
}
2504+
2505+
$prev = $value;
2506+
}
2507+
}
2508+
2509+
report_result($Locale, $locales_test_number, @f == 0);
2510+
if (@f) {
2511+
print "# failed $locales_test_number locale '$Locale' numbers @f\n"
2512+
}
2513+
}
24682514
}
24692515

24702516
my $final_locales_test_number = $locales_test_number;

locale.c

+178-28
Original file line numberDiff line numberDiff line change
@@ -6118,13 +6118,66 @@ S_langinfo_sv_i(pTHX_
61186118

61196119
const char * retval = nl_langinfo(item);
61206120
Size_t total_len = strlen(retval);
6121+
char separator;
6122+
6123+
if (UNLIKELY(item == ALT_DIGITS) && total_len > 0) {
6124+
6125+
char * sep_pos =
6126+
(char *) strpbrk(retval, "!\"#$%&'()*+,-./.@[\\]^_`{|}~");
6127+
if (sep_pos) {
6128+
separator = retval[sep_pos - retval];
6129+
}
6130+
else {
6131+
separator = '\0';
6132+
6133+
/* Must be using NUL to separate the digits. There are up to
6134+
* 100 of them, ending in two NULs if fewer. Find the end */
6135+
const char * s = retval + total_len + 1;
6136+
6137+
for (unsigned int i = 1; i <= 99; i++) {
6138+
Size_t len = strlen(s) + 1;
6139+
total_len += len;
6140+
6141+
if (len == 1) { /* Only a NUL */
6142+
break;
6143+
}
6144+
6145+
s += len;
6146+
}
6147+
}
6148+
}
6149+
61216150
sv_setpvn(sv, retval, total_len);
61226151

61236152
gwLOCALE_UNLOCK;
61246153

6154+
/* Convert the ALT_DIGITS separator to a semi-colong if not already */
6155+
if (UNLIKELY(item == ALT_DIGITS) && total_len > 0 && separator != ';') {
6156+
char * digit_string = SvPVX(sv);
6157+
char * s = digit_string;
6158+
char * e = s + total_len;
6159+
6160+
while (s < e) {
6161+
char * this_end = (char *) memchr(s, separator, total_len);
6162+
if (! this_end) {
6163+
break;
6164+
}
6165+
6166+
*this_end = ';';
6167+
s = this_end;
6168+
}
6169+
}
6170+
61256171
SvUTF8_off(sv);
61266172
retval = SvPVX_const(sv);
61276173

6174+
/* Note that get_locale_string_utf8ness_i() is passed a char*, so stops
6175+
* looking at the first NUL, meaning it only looks at string [0] in the
6176+
* ALT_DIGITS case: alternate zero. One might think that you'd need to
6177+
* look at all the strings to determine utf8ness. But that is not true
6178+
* for this case; string [0] is sufficient. This is because there are
6179+
* no ASCII alternate digits, so [0] is enough to decide the utf8ness
6180+
* */
61286181
if (utf8ness) {
61296182
*utf8ness = get_locale_string_utf8ness_i(retval,
61306183
LOCALE_UTF8NESS_UNKNOWN,
@@ -6865,34 +6918,7 @@ S_emulate_langinfo(pTHX_ const int item,
68656918

68666919
restore_toggled_locale_c(LC_TIME, orig_TIME_locale);
68676920

6868-
/* If the item is 'ALT_DIGITS', '*retbuf' contains the alternate
6869-
* format for wday 0. If the value is the same as the normal 0,
6870-
* there isn't an alternate, so clear the buffer.
6871-
*
6872-
* (wday was chosen because its range is all a single digit.
6873-
* Things like tm_sec have two digits as the minimum: '00'.) */
6874-
if (item == ALT_DIGITS && strEQ(temp, "0")) {
6875-
retval = "";
6876-
Safefree(temp);
6877-
break;
6878-
}
6879-
6880-
/* ALT_DIGITS is problematic. Experiments on it showed that
6881-
* strftime() did not always work properly when going from alt-9 to
6882-
* alt-10. Only a few locales have this item defined, and in all
6883-
* of them on Linux that khw was able to find, nl_langinfo() merely
6884-
* returned the alt-0 character, possibly doubled. Most Unicode
6885-
* digits are in blocks of 10 consecutive code points, so that is
6886-
* sufficient information for such scripts, as we can infer alt-1,
6887-
* alt-2, .... But for a Japanese locale, a CJK ideographic 0 is
6888-
* returned, and the CJK digits are not in code point order, so you
6889-
* can't really infer anything. The localedef for this locale did
6890-
* specify the succeeding digits, so that strftime() works properly
6891-
* on them, without needing to infer anything. But the
6892-
* nl_langinfo() return did not give sufficient information for the
6893-
* caller to understand what's going on. So until there is
6894-
* evidence that it should work differently, this returns the alt-0
6895-
* string for ALT_DIGITS. */
6921+
if (LIKELY(item != ALT_DIGITS)) {
68966922

68976923
/* If to return what strftime() returns, are done */
68986924
if (! return_format) {
@@ -6926,6 +6952,130 @@ S_emulate_langinfo(pTHX_ const int item,
69266952

69276953
Safefree(temp);
69286954
break;
6955+
}
6956+
6957+
/* Here, the item is 'ALT_DIGITS' and temp contains the zeroth
6958+
* alternate digit. If empty or doesn't differ from regular digits,
6959+
* return that there aren't alternate digits */
6960+
if (temp[0] == '\0' || strchr(temp, '0')) {
6961+
Safefree(temp);
6962+
retval = "";
6963+
break;
6964+
}
6965+
6966+
/* ALT_DIGITS requires special handling because it requires up to 100
6967+
* values. Below we generate those by using the %O modifier to
6968+
* strftime() formats.
6969+
*
6970+
* We already have the alternate digit for zero in 'temp', generated
6971+
* using the %Ow format. That was used because it seems least likely
6972+
* to have a leading zero. But some locales return that anyway. If
6973+
* the first half of temp is identical to the second half, assume that
6974+
* is the case, and use just the first half */
6975+
const char * alt0 = temp; /* Clearer synonym */
6976+
Size_t alt0_len = strlen(alt0);
6977+
if ((alt0_len & 1) == 0) {
6978+
Size_t half_alt0_len = alt0_len / 2;
6979+
if (strnEQ(temp, temp + half_alt0_len, half_alt0_len)) {
6980+
alt0_len = half_alt0_len;
6981+
}
6982+
}
6983+
6984+
/* Save the 0 digit string */
6985+
sv_setpvn(sv, alt0, alt0_len);
6986+
sv_catpvn_nomg (sv, ";", 1);
6987+
6988+
/* Various %O formats can be used to derive the alternate digits. Only
6989+
* %Oy can go up to the full 100 values. If it doesn't work, we try
6990+
* various fallbacks in decreasing order of how many values they can
6991+
* deliver. maxes[] tells the highest value that the format applies
6992+
* to; offsets[] compensates for 0-based vs 1-based indices; and vars[]
6993+
* holds what field in the 'struct tm' to applies to the corresponding
6994+
* format */
6995+
int year, min, sec;
6996+
const char * fmts[] = {"%Oy", "%OM", "%OS", "%Od", "%OH", "%Om", "%Ow" };
6997+
const Size_t maxes[] = { 99, 59, 59, 31, 23, 11, 6 };
6998+
const int offsets[] = { 0, 0, 0, 1, 0, 1, 0 };
6999+
int * vars[] = {&year, &min, &sec, &mday, &hour, &mon, &mday };
7000+
Size_t j = 0; /* Current index into the above tables */
7001+
7002+
orig_TIME_locale = toggle_locale_c(LC_TIME, locale);
7003+
7004+
for (unsigned int i = 1; i <= 99; i++) {
7005+
struct tm mytm;
7006+
7007+
redo:
7008+
if (j >= C_ARRAY_LENGTH(fmts)) {
7009+
break; /* Exhausted formats early; can't continue */
7010+
}
7011+
7012+
if (i > maxes[j]) {
7013+
j++; /* Exhausted this format; try next one */
7014+
goto redo;
7015+
}
7016+
7017+
year = (strchr(fmts[j], 'y')) ? 1900 : 2011;
7018+
hour = 0;
7019+
min = 0;
7020+
sec = 0;
7021+
mday = 1;
7022+
mon = 0;
7023+
7024+
/* Change the variable corresponding to this format to the
7025+
* current time being run in 'i' */
7026+
*(vars[j]) += i - offsets[j];
7027+
7028+
/* Do the strftime. Once we have determined the UTF8ness (if
7029+
* we want it), assume the rest will be the same, and use
7030+
* strftime_tm(), which doesn't recalculate UTF8ness */
7031+
ints_to_tm(&mytm, sec, min, hour, mday, mon, year, 0, 0, 0);
7032+
char * temp;
7033+
if (utf8ness && is_utf8 != UTF8NESS_NO && is_utf8 != UTF8NESS_YES) {
7034+
temp = strftime8(fmts[j],
7035+
&mytm,
7036+
UTF8NESS_IMMATERIAL,
7037+
&is_utf8,
7038+
false /* not calling from sv_strftime */
7039+
);
7040+
}
7041+
else {
7042+
temp = strftime_tm(fmts[j], &mytm);
7043+
}
7044+
7045+
DEBUG_Lv(PerlIO_printf(Perl_debug_log,
7046+
"i=%d, format=%s, alt='%s'\n",
7047+
i, fmts[j], temp));
7048+
7049+
/* If no result (meaning this platform didn't recognize this
7050+
* format), or it returned regular digits, give up on this
7051+
* format, to try the next candidate one */
7052+
if (temp == NULL || strpbrk(temp, "0123456789")) {
7053+
Safefree(temp);
7054+
j++;
7055+
goto redo;
7056+
}
7057+
7058+
/* If there is a leading zero, skip past it, to get the second
7059+
* one in the string */
7060+
const char * current = temp;
7061+
if (strnEQ(temp, alt0, alt0_len)) {
7062+
current += alt0_len;
7063+
}
7064+
7065+
/* Append this number to the ongoing list, including the separator.
7066+
* */
7067+
sv_catpv_nomg (sv, current);
7068+
sv_catpvn_nomg (sv, ";", 1);
7069+
Safefree(temp);
7070+
} /* End of loop generating ALT_DIGIT strings */
7071+
7072+
Safefree(alt0);
7073+
7074+
restore_toggled_locale_c(LC_TIME, orig_TIME_locale);
7075+
7076+
retval_type = RETVAL_IN_sv;
7077+
break;
7078+
69297079
# endif
69307080

69317081
} /* End of braced group for outer switch 'default:' case */

0 commit comments

Comments
 (0)