Implement upper, lower case conversion for char #12561

pzol · 2014-02-26T04:16:05Z

Added common and simple case folding, i.e. one-to-one character mapping. For more information see http://www.unicode.org/faq/casemap_charprop.html

Removed auto-generated dead code which wasn't used.

kud1ing · 2014-02-26T06:57:18Z

See also #9084

huonw · 2014-02-26T06:58:24Z

Removed auto-generated dead code which wasn't used.

Would it be possible to do this in a separate commit (in this PR), for ease of review and general good-git-practice?

pzol · 2014-02-26T10:04:03Z

Ok, will do that in two separate commits. Need to iron out an issue I have found first.

huonw · 2014-02-26T11:49:03Z

Thanks.

pzol · 2014-02-26T13:06:46Z

Done.

pzol · 2014-02-26T13:41:33Z

Docs updated.

pzol · 2014-03-01T06:44:56Z

@flaper87 removed code duplication.

Valloric · 2014-03-01T20:04:49Z

src/libstd/char.rs

@@ -486,6 +518,39 @@ fn test_to_digit() {
 }

 #[test]
+fn test_to_lowercase() {
+    assert_eq!('A'.to_lowercase(), 'a');


An important test case to have would be the infamous Turkish i.

It's an issue in many languages/frameworks:
http://blogs.msdn.com/b/deeptanshuv/archive/2004/09/04/225720.aspx
http://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/
https://groups.google.com/d/topic/golang-nuts/w8eZxT3dA48/discussion
http://stackoverflow.com/questions/16830570/qt-turkish-characters-case-conversion
http://wiki.tcl.tk/748
esamattis/underscore.string#252
nicolas-grekas/Patchwork-UTF8#2
http://lotusnotus.com/lotusnotus_en.nsf/dx/dotless-i-tolowercase-and-touppercase-functions-use-responsibly.htm

I could add more, but basically every language/framework gets this wrong and it has cost people's lives.

The problem appears when you compare strings in a language-sensitive context. If you just convert to upper case and then compare, without ever considering a locale, you cannot know that you are dealing with Turkish. A plain conversion to lower and to upper case works according to UnicodeData.txt, without the special language-sensitive context; i.e. the upper case İ (I with a dot) converts to a simple i.

So currently this would not pass:

assert_eq!('ı'.to_uppercase(), 'İ');
assert_eq!('İ'.to_lowercase(), 'i');
let tr_alphabet = "abcçdefgğhıijklmnoöprsştuüvyz";
let tr_upper = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ";
let tr_lower = "abcçdefgğhıijklmnoöprsştuüvyz";
for (a, e) in upper_chars(tr_alphabet).zip(tr_upper.chars()) {
    assert!(a == e, format!("actual {} != expected {}", a, e));
}

This should be tackled in a different lib. I will be proposing a libi18n that should have things like case insensitive comparison in a locale context.
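For illustration only, here is a minimal sketch of what a Turkish-aware conversion could look like on top of the current standard library. The names `to_upper_turkic` and `to_lower_turkic` are hypothetical and are not part of this PR or of any proposed libi18n. The sketch assumes only the dotted/dotless i pairs need special handling and ignores the rule for I followed by a combining dot above.

```rust
/// Hypothetical helper: uppercases a string using the Turkish/Azeri rule
/// that dotted `i` maps to U+0130 (capital I with dot above).
pub fn to_upper_turkic(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            'i' => out.push('\u{0130}'),
            // Everything else, including dotless U+0131 -> 'I', uses the
            // locale-independent mapping from the standard library.
            other => out.extend(other.to_uppercase()),
        }
    }
    out
}

/// Hypothetical helper: lowercases a string using the Turkish/Azeri rules
/// `I` -> U+0131 (dotless i) and U+0130 -> `i`.
pub fn to_lower_turkic(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            'I' => out.push('\u{0131}'),
            '\u{0130}' => out.push('i'),
            other => out.extend(other.to_lowercase()),
        }
    }
    out
}
```

The point of the sketch is that the caller has to say which language the text is in; nothing in the characters themselves tells `char` whether `i` should uppercase to `I` or to `İ`.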

flaper87 · 2014-03-02T21:49:50Z

Just a small nit from a partial review. It looks good to me! Thanks a lot!

flaper87 · 2014-03-03T21:29:05Z

@pzol could you squash the last 2 commits into the second one? This is looking good. Thanks

pzol · 2014-03-04T06:23:49Z

Squashed!

flaper87 · 2014-03-06T08:42:48Z

LGTM, @huonw mind taking a final look here?

alexcrichton · 2014-03-09T19:45:55Z

In the past I've found unicode case sensitivity to be a very tricky and hairy topic. I've heard things like it's based on locale, based on which variant of unicode you're using, it changes from revision to revision, etc.

My only worry about this is that the to_uppercase and to_lowercase functions are a little vague about what exactly they are doing. It would be nice for them to be as transparent as possible, with explicit references to any online standards or documentation explaining exactly how the case conversion is performed.

I'm a little worried to merge this as I'm certainly no unicode expert, but the code looks good to me and I'd be willing to r+ with more comprehensive comments.

pzol · 2014-03-10T06:25:57Z

@alexcrichton comments with references in the code or in the commit?

Case folding is described here: http://unicode.org/reports/tr21/tr21-3.html.
The conversion implemented here covers the so-called common (basically ASCII) and simple case folding, where one codepoint translates to one codepoint without locale-specific sensitivity, such as the Turkish special cases of i. The conversion is based on the UnicodeData.txt file ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt, which was already being used in std::unicode and is documented here: http://www.unicode.org/reports/tr44/.

The above-mentioned document states:

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

Because of the inclusion of certain composite characters for compatibility, such as 01F1 "DZ" capital dz, there is a third case, called titlecase, which is used where the first letter of a word is to be capitalized (e.g. Titlecase, vs. UPPERCASE, or lowercase).
For example, the title case of the example character is 01F2 "Dz" capital d with small z.
Case mappings may produce strings of different length than the original.
For example, the German character 00DF "ß" small letter sharp s expands when uppercased to the sequence of two characters "SS". This also occurs where there is no precomposed character corresponding to a case mapping, such as with 0149 "ŉ" latin small letter n preceded by apostrophe.
Characters may also have different case mappings, depending on the context.
For example, 03A3 "Σ" capital sigma lowercases to 03C3 "σ" small sigma if it is followed by another letter, but lowercases to 03C2 "ς" small final sigma if it is not.
Characters may have case mappings that depend on the locale.
For example, in Turkish the letter 0049 "I" capital letter i lowercases to 0131 "ı" small dotless i.
Case mappings are not, in general, reversible.
For example, once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.
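Several of the complications quoted above can be observed with the full mappings in today's standard library, which differ from the simple one-to-one mapping implemented in this PR. The following is a small sketch, assuming a current Rust toolchain where `char::to_uppercase` returns an iterator and `str::to_lowercase` applies the final-sigma rule; `full_upper` and `lower_str` are illustrative wrapper names.

```rust
/// Full uppercase mapping of a single char. The result may be longer than
/// one char, which is why it is returned as a String.
pub fn full_upper(c: char) -> String {
    c.to_uppercase().collect()
}

/// Lowercases a whole string. Working on the string (not char by char)
/// lets the standard library apply the context-dependent final-sigma rule.
pub fn lower_str(s: &str) -> String {
    s.to_lowercase()
}
```

With these, `full_upper('ß')` gives `"SS"`, a word-final capital sigma in `lower_str` becomes `ς` while a lone `'Σ'` lowercases to `σ`, and `"McGowan"` cannot be recovered after a round trip through upper and lower case.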

Go, Python and other languages use the same mechanism.
My recommendation would be to start with this approach and provide a more sophisticated and more complete implementation in a separate library, liblocale. I am currently working on one. An example of a locale library I like is http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/index.html

Currently all Rust character handling is based on the (naive) assumption that one Rust char, being one codepoint, maps to one letter, and thus the implemented simple case conversion seems appropriate.
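To make the data source concrete: in UnicodeData.txt each line is a semicolon-separated record, and fields 12 and 13 (counting from 0) hold the simple uppercase and simple lowercase mappings as hexadecimal codepoints. The sketch below reads one such line. `parse_simple_case` is a hypothetical helper written for this illustration and is not claimed to match how the tables in this PR are produced.

```rust
/// Hypothetical helper: parses one UnicodeData.txt line and returns
/// (codepoint, simple uppercase mapping, simple lowercase mapping).
/// Returns None if the line does not have the expected 15 fields or the
/// codepoint field is not valid hexadecimal.
pub fn parse_simple_case(line: &str) -> Option<(char, Option<char>, Option<char>)> {
    let fields: Vec<&str> = line.split(';').collect();
    if fields.len() < 15 {
        return None;
    }
    // An empty (or otherwise unparsable) field means "no mapping".
    let hex = |s: &str| u32::from_str_radix(s, 16).ok().and_then(char::from_u32);
    let c = hex(fields[0])?;
    Some((c, hex(fields[12]), hex(fields[13])))
}
```

For the record `0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041` this yields `('a', Some('A'), None)`: a simple uppercase mapping to `A` and no lowercase mapping, since the letter is already lowercase.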

alexcrichton · 2014-03-10T06:29:58Z

I'd be looking for comments on the functions themselves. The current documentation only states:

/// Convert a char to its uppercase equivalent
///
/// The case-folding performed is the common or simple mapping:
/// it only maps a codepoint to its equivalent if it is also a single codepoint
///
/// # Return value
///
/// Returns the char itself if no conversion if possible

This isn't very descriptive about how it's doing the uppercase/lowercase behind the scenes.

huonw · 2014-03-10T10:04:05Z

I agree with @alexcrichton: having references and citations to the canonical source of algorithms is really good so that everyone is on the same page with precisely what is implemented.

pzol · 2014-03-13T08:44:07Z

How about

/// Convert a char to its uppercase equivalent
///
/// The case-folding performed is the common or simple mapping:
/// it maps one unicode codepoint (one char in Rust) to its uppercase equivalent according
/// to the Unicode database at ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
/// The additional SpecialCasing.txt is not considered here, as it expands to multiple
/// codepoints in some cases.
///
/// A full reference can be found here
/// http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G33992
///
/// # Return value
///
/// Returns the char itself if no conversion was made
#[inline]
pub fn to_uppercase(c: char) -> char {
    conversions::to_upper(c)
}
/// Convert a char to its lowercase equivalent
///
/// The case-folding performed is the common or simple mapping
/// see `to_uppercase` for references and more information
///
/// # Return value
///
/// Returns the char itself if no conversion is possible
#[inline]
pub fn to_lowercase(c: char) -> char {
    conversions::to_lower(c)
}
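For readers on a current toolchain, the contract in these doc comments (one char in, one char out, the char itself if nothing applies) can be approximated on top of today's iterator-returning API. This is only a sketch: `simple_upper` and `simple_lower` are hypothetical names, and chars whose full mapping expands to several codepoints are simply returned unchanged, which is not identical to the UnicodeData.txt simple mapping in every case (U+0130 is one known difference when lowercasing).

```rust
/// Sketch: returns the uppercase form of `c` when it is a single char,
/// otherwise returns `c` itself.
pub fn simple_upper(c: char) -> char {
    let mut it = c.to_uppercase();
    match (it.next(), it.next()) {
        (Some(u), None) => u, // exactly one char produced
        _ => c,               // multi-char expansion: leave unchanged
    }
}

/// Sketch: returns the lowercase form of `c` when it is a single char,
/// otherwise returns `c` itself.
pub fn simple_lower(c: char) -> char {
    let mut it = c.to_lowercase();
    match (it.next(), it.next()) {
        (Some(l), None) => l,
        _ => c,
    }
}
```

So `simple_upper('a')` is `'A'`, while `simple_upper('ß')` stays `'ß'` because its full uppercase form is the two-char sequence "SS".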

huonw · 2014-03-13T10:42:00Z

Seems fine to me. (cc #12862 re the links, you don't have to do anything about them now, though.)

pzol · 2014-03-13T10:47:19Z

Sorry, should have updated them before, done now.

flaper87 · 2014-03-13T10:50:42Z

@pzol could you squash your last commit into the one that implements the upper, lower case conversion? With that and @huonw's comments addressed, it LGTM

huonw · 2014-03-13T10:59:46Z

(My comment is already addressed... as I said in it, there is nothing that needs work.)

flaper87 · 2014-03-13T11:01:40Z

(I wasn't referring to that one, anyway, looks fine)

pzol · 2014-03-13T11:25:29Z

Squashed the last commit!

pzol mentioned this pull request Feb 26, 2014

add unicode case folding for char/str #9084

Closed

pzol mentioned this pull request Feb 26, 2014

Unicode to_lower() and to_upper #9363

Closed

Valloric reviewed Mar 1, 2014

Piotr Zolnierek added 2 commits March 13, 2014 09:32

std::unicode: remove unused category tables

4a00211

Implement lower, upper case conversion for char

04170b0

Remove code duplication

dba5625

Remove whitespace Update documentation for to_uppercase, to_lowercase

bors merged commit dba5625 into rust-lang:master Mar 13, 2014

pzol deleted the char-case branch March 13, 2014 19:16
