Skip to content

Commit a6e6666

Browse files
committed
Fix langinfo(ALT_DIGITS)
This has never worked properly before in Perl. The code is returning the result of the libc function nl_langinfo(). The documentation for it that I have found (and presumably my predecessors) is very unclear. But what actually happens (from using gdb) is that the return is very C unfriendly. Instead of returning a NUL-terminated string, it returns 100 (perhaps fewer) NUL-terminated strings in a row. When it is fewer (given the few examples I've seen), the final one ends with two NULs in a row. (I can't think of a way for it to work and be otherwise). The 100th one doesn't necessarily have two terminating NULs. Prior to this commit, only the string for the zeroth digit was returned; now the entire ALT_DIGIT string sequence is returned, forcing a double NUL at the end of the final one. This information is accessible in several ways. Via XS, one can use any of several functions, including the newly introduced sv_langinfo(), returning an SV, which allows for easier handling of embedded NULs. (Otherwise in XS, using the functions that return a char*, one has to look for the double-NUL.) From Perl-space, the access is via I18N::Langinfo, which behind the scenes also uses an SV. The documentation added in this commit gives advice for how to turn the return into an @array for more convenient access.
1 parent fbdd09d commit a6e6666

File tree

4 files changed

+256
-40
lines changed

4 files changed

+256
-40
lines changed

ext/I18N-Langinfo/Langinfo.pm

+44-9
Original file line numberDiff line numberDiff line change
@@ -70,13 +70,15 @@ our @EXPORT_OK = qw(
7070
YESSTR
7171
);
7272

73-
our $VERSION = '0.22';
73+
our $VERSION = '0.23';
7474

7575
XSLoader::load();
7676

7777
1;
7878
__END__
7979
80+
=encoding utf8
81+
8082
=head1 NAME
8183
8284
I18N::Langinfo - query locale information
@@ -155,16 +157,47 @@ For the character code set being used (such as "ISO8859-1", "cp850",
155157
156158
=item *
157159
158-
For an alternate representation of digits, for the
159-
radix character used between the integer and the fractional part
160-
of decimal numbers, the group separator string for large-ish floating point
161-
numbers (yes, the final two are redundant with
160+
For the radix character used between the integer and the fractional part of
161+
decimal numbers and the group separator string for large-ish floating point
162+
numbers (yes, these are redundant with
162163
L<POSIX::localeconv()|POSIX/localeconv>):
163164
164-
ALT_DIGITS RADIXCHAR THOUSEP
165+
RADIXCHAR THOUSEP
165166
166167
=item *
167168
169+
For any alternate digits in the locale
170+
171+
ALT_DIGITS
172+
173+
This returns a sequence of up to 100 strings, starting with the alternate
174+
representation of zero; then the same for one, two, ... ninety-nine.
175+
176+
To access this data conveniently, you could do something like
177+
178+
use I18N::Langinfo qw(langinfo ALT_DIGITS);
179+
my @alt_digits = split '\0', langinfo(ALT_DIGITS);
180+
181+
The array C<@alt_digits> will contain 0 elements if the current locale doesn't
182+
have alternate digits specified for it. Otherwise, it will have as many
183+
elements as the locale defines, with C<[0]> containing the alternate digit for
184+
zero; C<[1]> for one; and so forth, up to potentially C<[99]> for the
185+
alternate representation of ninety-nine.
186+
187+
Most locales don't have alternate digits, so the array will be empty. Running
188+
this program
189+
190+
use I18N::Langinfo qw(langinfo ALT_DIGITS);
191+
my @alt_digits = split '\0', langinfo(ALT_DIGITS);
192+
splice @alt_digits, 15;
193+
print join " ", @alt_digits, "\n";
194+
195+
on a Japanese locale yields
196+
197+
S<C<〇 一 二 三 四 五 六 七 八 九 十 十一 十二 十三 十四>>
198+
199+
= item *
200+
168201
For the affirmative and negative responses and expressions:
169202
170203
YESSTR YESEXPR NOSTR NOEXPR
@@ -235,9 +268,11 @@ differently, please file a report at L<https://github.com/Perl/perl5/issues>.
235268
236269
=item C<ALT_DIGITS>
237270
238-
Currently this gives the same results as Linux does. If you have examples of
239-
it needing to work differently, please file a report at
240-
L<https://github.com/Perl/perl5/issues>.
271+
This tries hard to return the same values as C<nl_langinfo()>. It uses the
272+
C<%O> formats that the C<libc> L<strftime(3)> function on some platforms (not
273+
Windows) understands. It returns as many consecutive alternate digits as it
274+
can find, starting with the one for zero; or the empty string if none are
275+
found.
241276
242277
=item C<ERA_D_FMT>
243278

lib/locale.t

+47-1
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,7 @@ BEGIN {
5858
}
5959

6060
use feature 'fc';
61-
use I18N::Langinfo qw(langinfo CODESET CRNCYSTR RADIXCHAR THOUSEP);
61+
use I18N::Langinfo qw(langinfo CODESET CRNCYSTR RADIXCHAR THOUSEP ALT_DIGITS);
6262

6363
# =1 adds debugging output; =2 increases the verbosity somewhat
6464
our $debug = $ENV{PERL_DEBUG_FULL_TEST} // 0;
@@ -1060,6 +1060,7 @@ foreach my $Locale (@Locale) {
10601060
debug "is utf8 locale? = $is_utf8_locale\n";
10611061
debug "radix = " . disp_str(langinfo(RADIXCHAR)) . "\n";
10621062
debug "numeric group separator = '" . disp_str(langinfo(THOUSEP)) . "'\n";
1063+
debug "alt_digits = " . disp_str(langinfo(ALT_DIGITS)) . "\n";
10631064
debug "currency = " . disp_str(langinfo(CRNCYSTR));
10641065
}
10651066

@@ -2465,6 +2466,51 @@ foreach my $Locale (@Locale) {
24652466
print "# failed $locales_test_number locale '$Locale' numbers @f\n"
24662467
}
24672468
}
2469+
2470+
{
2471+
my @f = ();
2472+
++$locales_test_number;
2473+
$test_names{$locales_test_number} =
2474+
'Verify ALT_DIGITS returns nothing, or else non-ASCII and'
2475+
. ' the single char digits evaluate to consecutive integers'
2476+
. ' starting at 0';
2477+
2478+
my $alts = langinfo(ALT_DIGITS);
2479+
if ($alts) {
2480+
my @alts = split '\0', $alts;
2481+
my $prev = -1;
2482+
foreach my $num (@alts) {
2483+
if ($num =~ /[[:ascii:]]/) {
2484+
push @f, disp_str($num);
2485+
last;
2486+
}
2487+
2488+
# We only look at single character strings; likely locales
2489+
# that have alternate digits have a different mechanism for
2490+
# representing larger numbers. Japanese for example, has a
2491+
# single character for the number 10, which is prefixed to the
2492+
# '1' symbol for '11', etc. And 21 is represented by 3
2493+
# characters, the '2' symbol, followed by the '10' symbol,
2494+
# then the '1' symbol. (There is nothing to say that a locale
2495+
# even has to use base 10.)
2496+
last if length $num > 1;
2497+
2498+
use Unicode::UCD 'num';
2499+
my $value = num($num);
2500+
if ($value != $prev + 1) {
2501+
push @f, disp_str($num);
2502+
last;
2503+
}
2504+
2505+
$prev = $value;
2506+
}
2507+
}
2508+
2509+
report_result($Locale, $locales_test_number, @f == 0);
2510+
if (@f) {
2511+
print "# failed $locales_test_number locale '$Locale' numbers @f\n"
2512+
}
2513+
}
24682514
}
24692515

24702516
my $final_locales_test_number = $locales_test_number;

locale.c

+162-28
Original file line numberDiff line numberDiff line change
@@ -6118,13 +6118,46 @@ S_langinfo_sv_i(pTHX_
61186118

61196119
const char * retval = nl_langinfo(item);
61206120
Size_t total_len = strlen(retval);
6121+
6122+
if (UNLIKELY(item == ALT_DIGITS) && total_len > 0) {
6123+
6124+
/* The return of this item must end in 2 NULs if there are fewer
6125+
* than 100 strings */
6126+
const char * s = retval + total_len + 1;
6127+
6128+
for (unsigned int i = 1; i <= 99; i++) {
6129+
Size_t len = strlen(s) + 1;
6130+
total_len += len;
6131+
6132+
if (len == 1) { /* Only a NUL */
6133+
break;
6134+
}
6135+
6136+
s += len;
6137+
}
6138+
}
6139+
61216140
sv_setpvn(sv, retval, total_len);
61226141

61236142
gwLOCALE_UNLOCK;
61246143

6144+
/* Make sure the sequence ends in a double NUL to make it easier on
6145+
* down stream handlers; this is not guaranteed by nl_langinfo() itself
6146+
* */
6147+
if (UNLIKELY(item == ALT_DIGITS) && total_len > 0) {
6148+
sv_catpvn_nomg (sv, "\0", 1);
6149+
}
6150+
61256151
SvUTF8_off(sv);
61266152
retval = SvPVX_const(sv);
61276153

6154+
/* Note that get_locale_string_utf8ness_i() is passed a char*, so stops
6155+
* looking at the first NUL, meaning it only looks at string [0] in the
6156+
* ALT_DIGITS case: alternate zero. One might think that you'd need to
6157+
* look at all the strings to determine utf8ness. But that is not true
6158+
* for this case; string [0] is sufficient. This is because there are
6159+
* no ASCII alternate digits, so [0] is enough to decide the utf8ness
6160+
* */
61286161
if (utf8ness) {
61296162
*utf8ness = get_locale_string_utf8ness_i(retval,
61306163
LOCALE_UTF8NESS_UNKNOWN,
@@ -6865,34 +6898,7 @@ S_emulate_langinfo(pTHX_ const int item,
68656898

68666899
restore_toggled_locale_c(LC_TIME, orig_TIME_locale);
68676900

6868-
/* If the item is 'ALT_DIGITS', '*retbuf' contains the alternate
6869-
* format for wday 0. If the value is the same as the normal 0,
6870-
* there isn't an alternate, so clear the buffer.
6871-
*
6872-
* (wday was chosen because its range is all a single digit.
6873-
* Things like tm_sec have two digits as the minimum: '00'.) */
6874-
if (item == ALT_DIGITS && strEQ(temp, "0")) {
6875-
retval = "";
6876-
Safefree(temp);
6877-
break;
6878-
}
6879-
6880-
/* ALT_DIGITS is problematic. Experiments on it showed that
6881-
* strftime() did not always work properly when going from alt-9 to
6882-
* alt-10. Only a few locales have this item defined, and in all
6883-
* of them on Linux that khw was able to find, nl_langinfo() merely
6884-
* returned the alt-0 character, possibly doubled. Most Unicode
6885-
* digits are in blocks of 10 consecutive code points, so that is
6886-
* sufficient information for such scripts, as we can infer alt-1,
6887-
* alt-2, .... But for a Japanese locale, a CJK ideographic 0 is
6888-
* returned, and the CJK digits are not in code point order, so you
6889-
* can't really infer anything. The localedef for this locale did
6890-
* specify the succeeding digits, so that strftime() works properly
6891-
* on them, without needing to infer anything. But the
6892-
* nl_langinfo() return did not give sufficient information for the
6893-
* caller to understand what's going on. So until there is
6894-
* evidence that it should work differently, this returns the alt-0
6895-
* string for ALT_DIGITS. */
6901+
if (LIKELY(item != ALT_DIGITS)) {
68966902

68976903
/* If to return what strftime() returns, are done */
68986904
if (! return_format) {
@@ -6926,6 +6932,134 @@ S_emulate_langinfo(pTHX_ const int item,
69266932

69276933
Safefree(temp);
69286934
break;
6935+
}
6936+
6937+
/* Here, the item is 'ALT_DIGITS' and temp contains the zeroth
6938+
* alternate digit. If empty or doesn't differ from regular digits,
6939+
* return that there aren't alternate digits */
6940+
if (temp[0] == '\0' || strchr(temp, '0')) {
6941+
Safefree(temp);
6942+
retval = "";
6943+
break;
6944+
}
6945+
6946+
/* ALT_DIGITS requires special handling because it is not a simple
6947+
* string, but a sequence of up to 100 NUL-terminated strings. Below
6948+
* we generate those by using the %O modifier to strftime() formats.
6949+
*
6950+
* We already have the alternate digit for zero in 'temp', generated
6951+
* using the %Ow format. That was used because it seems least likely
6952+
* to have a leading zero. But some locales return that anyway. If
6953+
* the first half of temp is identical to the second half, assume that
6954+
* is the case, and use just the second half */
6955+
const char * alt0 = temp; /* Clearer synonym */
6956+
Size_t alt0_len = strlen(alt0);
6957+
if ((alt0_len & 1) == 0) {
6958+
Size_t half_alt0_len = alt0_len / 2;
6959+
if (strnEQ(temp, temp + half_alt0_len, half_alt0_len))
6960+
{
6961+
alt0_len = half_alt0_len;
6962+
}
6963+
}
6964+
6965+
/* Save the 0 digit string */
6966+
sv_setpvn(sv, alt0, alt0_len);
6967+
sv_catpvn_nomg (sv, "\0", 1);
6968+
6969+
/* Various %O formats can be used to derive the alternate digits. Only
6970+
* %Oy can go up to the full 100 values. If it doesn't work, we try
6971+
* various fallbacks in decreasing order of how many values they can
6972+
* deliver. maxes[] tells the highest value that the format applies
6973+
* to; offsets[] compensates for 0-based vs 1-based indices; and vars[]
6974+
* holds what field in the 'struct tm' to applies to the corresponding
6975+
* format */
6976+
int year, min, sec;
6977+
const char * fmts[] = {"%Oy", "%OM", "%OS", "%Od", "%OH", "%Om", "%Ow" };
6978+
const Size_t maxes[] = { 99, 59, 59, 31, 23, 11, 6 };
6979+
const int offsets[] = { 0, 0, 0, 1, 0, 1, 0 };
6980+
int * vars[] = {&year, &min, &sec, &mday, &hour, &mon, &mday };
6981+
Size_t j = 0; /* Current index into the above tables */
6982+
6983+
orig_TIME_locale = toggle_locale_c(LC_TIME, locale);
6984+
6985+
for (unsigned int i = 1; i <= 99; i++) {
6986+
struct tm mytm;
6987+
6988+
redo:
6989+
if (j >= C_ARRAY_LENGTH(fmts)) {
6990+
break; /* Exhausted formats early; can't continue */
6991+
}
6992+
6993+
if (i > maxes[j]) {
6994+
j++; /* Exhausted this format; try next one */
6995+
goto redo;
6996+
}
6997+
6998+
year = (strchr(fmts[j], 'y')) ? 1900 : 2011;
6999+
hour = 0;
7000+
min = 0;
7001+
sec = 0;
7002+
mday = 1;
7003+
mon = 0;
7004+
7005+
/* Change the variable corresponding to this format to the
7006+
* current time being run in 'i' */
7007+
*(vars[j]) += i - offsets[j];
7008+
7009+
/* Do the strftime. Once we have determined the UTF8ness (if
7010+
* we want it), assume the rest will be the same, and use
7011+
* strftime_tm(), which doesn't recalculate UTF8ness */
7012+
ints_to_tm(&mytm, sec, min, hour, mday, mon, year, 0, 0, 0);
7013+
char * temp;
7014+
if (utf8ness && is_utf8 != UTF8NESS_NO && is_utf8 != UTF8NESS_YES) {
7015+
temp = strftime8(fmts[j],
7016+
&mytm,
7017+
UTF8NESS_IMMATERIAL,
7018+
&is_utf8,
7019+
false /* not calling from sv_strftime */
7020+
);
7021+
}
7022+
else {
7023+
temp = strftime_tm(fmts[j], &mytm);
7024+
}
7025+
7026+
DEBUG_Lv(PerlIO_printf(Perl_debug_log,
7027+
"i=%d, format=%s, alt='%s'\n",
7028+
i, fmts[j], temp));
7029+
7030+
/* If no result (meaning this platform didn't recognize this
7031+
* format), or it returned regular digits, give up on this
7032+
* format, to try the next candidate one */
7033+
if (temp == NULL || strpbrk(temp, "0123456789")) {
7034+
Safefree(temp);
7035+
j++;
7036+
goto redo;
7037+
}
7038+
7039+
/* If there is a leading zero, skip past it, to get the second
7040+
* one in the string */
7041+
const char * current = temp;
7042+
if (strnEQ(temp, alt0, alt0_len)) {
7043+
current += alt0_len;
7044+
}
7045+
7046+
/* Append this number to the ongoing list, including a NUL
7047+
* separator */
7048+
sv_catpv_nomg (sv, current);
7049+
sv_catpvn_nomg (sv, "\0", 1);
7050+
Safefree(temp);
7051+
} /* End of loop generating ALT_DIGIT strings */
7052+
7053+
Safefree(alt0);
7054+
7055+
restore_toggled_locale_c(LC_TIME, orig_TIME_locale);
7056+
7057+
/* Make sure there is an empty string trailing everything, so
7058+
* it all ends with two consecutive NULs */
7059+
sv_catpvn_nomg (sv, "\0", 1);
7060+
retval_type = RETVAL_IN_sv;
7061+
break;
7062+
69297063
# endif
69307064

69317065
} /* End of braced group for outer switch 'default:' case */

pod/perldelta.pod

+3-2
Original file line numberDiff line numberDiff line change
@@ -125,9 +125,10 @@ XXX Remove this section if Porting/corelist-perldelta.pl did not add any content
125125

126126
=item *
127127

128-
L<XXX> has been upgraded from version A.xx to B.yy.
128+
L<I18N::Langinfo> has been upgraded from version 0.22 to 0.23.
129129

130-
XXX If there was something important to note about this change, include that here.
130+
This fixes what is returned for the C<ALT_DIGITS> item which has never
131+
before worked properly in Perl.
131132

132133
=back
133134

0 commit comments

Comments
 (0)