
Implement upper, lower case conversion for char #12561


Merged
merged 3 commits into from
Mar 13, 2014

Conversation

pzol
Contributor

@pzol pzol commented Feb 26, 2014

Added common and simple case folding, i.e. one-to-one character mapping. For more information, see http://www.unicode.org/faq/casemap_charprop.html

Removed auto-generated dead code which wasn't used.

@kud1ing

kud1ing commented Feb 26, 2014

See also #9084

@huonw
Member

huonw commented Feb 26, 2014

Removed auto-generated dead code which wasn't used.

Would it be possible to do this in a separate commit (in this PR), for ease of review and general good-git-practice?

@pzol
Contributor Author

pzol commented Feb 26, 2014

Ok, will do that in two separate commits. Need to iron out an issue I have found first.

@huonw
Member

huonw commented Feb 26, 2014

Thanks.

@pzol
Contributor Author

pzol commented Feb 26, 2014

Done.

@pzol
Contributor Author

pzol commented Feb 26, 2014

Docs updated.

@pzol
Contributor Author

pzol commented Mar 1, 2014

@flaper87 removed code duplication.

@@ -486,6 +518,39 @@ fn test_to_digit() {
}

#[test]
fn test_to_lowercase() {
assert_eq!('A'.to_lowercase(), 'a');
Contributor

Contributor Author

The problem arises when you deal with string comparison in a language-sensitive context. If you just convert to upper case and then compare, without ever considering a locale, you cannot know that you are dealing with Turkish. A plain conversion to lower and to upper case works according to UnicodeData.txt, without the special language-sensitive context, i.e. the upper case I with a dot (İ) converts to a simple i.

So currently this would not pass:

  assert_eq!('ı'.to_uppercase(), 'İ');
  assert_eq!('İ'.to_lowercase(), 'i');

  let tr_alphabet = "abcçdefgğhıijklmnoöprsştuüvyz";
  let tr_upper    = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ";
  let tr_lower    = "abcçdefgğhıijklmnoöprsştuüvyz";

  for (a, e) in upper_chars(tr_alphabet).zip(tr_upper.chars()) {
    assert!(a == e, format!("actual {} != expected {}", a, e));

This should be tackled in a different lib. I will be proposing a libi18n that should provide things like case-insensitive comparison in a locale context.
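
To make the locale point concrete, here is a small sketch against today's Rust standard library rather than the 2014 API in this PR. That is an assumption worth flagging: `char::to_uppercase` and `char::to_lowercase` now return iterators instead of a single `char`. The helper names `upper`, `lower` and `check_locale_insensitive_i` are hypothetical. The sketch shows that, without a locale, the dotted and dotless i follow the default Unicode mappings and not the Turkish ones.

```rust
/// Collects the default (locale-independent) uppercase mapping of `c`.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

/// Collects the default (locale-independent) lowercase mapping of `c`.
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}

/// Hypothetical demo: the standard library never consults a locale,
/// so the Turkish-specific pairs i/İ and ı/I are not produced.
fn check_locale_insensitive_i() {
    // Turkish would expect 'i' -> 'İ' (U+0130); the default mapping gives 'I'.
    assert_eq!(upper('i'), "I");
    // Turkish would expect 'I' -> 'ı' (U+0131); the default mapping gives 'i'.
    assert_eq!(lower('I'), "i");
    // Dotless 'ı' (U+0131) uppercases to plain 'I'.
    assert_eq!(upper('\u{131}'), "I");
    // Upper-then-lower does not round-trip: 'ı' -> "I" -> "i".
    assert_eq!(upper('\u{131}').to_lowercase(), "i");
}
```

Calling `check_locale_insensitive_i()` from a test or from `main` exercises all four assertions. A locale-aware comparison would need extra input (the language), which is why it fits better in a separate library.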

On 1 Mar 2014, at 21:05, Val Markovic wrote:

In src/libstd/char.rs:

@@ -486,6 +518,39 @@ fn test_to_digit() {
}

#[test]
+fn test_to_lowercase() {
+    assert_eq!('A'.to_lowercase(), 'a');

An important test case to have would be the infamous Turkish i.

It's an issue in many languages/frameworks:
http://blogs.msdn.com/b/deeptanshuv/archive/2004/09/04/225720.aspx
http://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/
https://groups.google.com/d/topic/golang-nuts/w8eZxT3dA48/discussion
http://stackoverflow.com/questions/16830570/qt-turkish-characters-case-conversion
http://wiki.tcl.tk/748
esamattis/underscore.string#252
nicolas-grekas/Patchwork-UTF8#2
http://lotusnotus.com/lotusnotus_en.nsf/dx/dotless-i-tolowercase-and-touppercase-functions-use-responsibly.htm

I could add more, but basically every language/framework gets this wrong and it has cost people's lives.


@flaper87
Contributor

flaper87 commented Mar 2, 2014

Just a small nit from a partial review. It looks good to me! Thanks a lot!

@flaper87
Contributor

flaper87 commented Mar 3, 2014

@pzol could you squash the last 2 commits into the second one? This is looking good. Thanks

@pzol
Contributor Author

pzol commented Mar 4, 2014

Squashed!

@flaper87
Contributor

flaper87 commented Mar 6, 2014

LGTM, @huonw mind taking a final look here?

@alexcrichton
Member

In the past I've found unicode case sensitivity to be a very tricky and hairy topic. I've heard things like it's based on locale, based on which variant of unicode you're using, it changes from revision to revision, etc.

My only worry about this is that the to_uppercase and to_lowercase functions are a little vague about what exactly they are doing. It would be nice for them to be as transparent as possible, with explicit references to any online standards or documentation explaining exactly how the case conversion is performed.

I'm a little worried to merge this as I'm certainly no unicode expert, but the code looks good to me and I'd be willing to r+ with more comprehensive comments.

@pzol
Contributor Author

pzol commented Mar 10, 2014

@alexcrichton comments with references in the code or in the commit?

Case folding is described here: http://unicode.org/reports/tr21/tr21-3.html.
The conversion implemented here covers the so-called common (basically ASCII) and simple case folding, where one codepoint translates to one codepoint without locale-specific sensitivity such as the Turkish special cases of i. The conversion is based on the UnicodeData.txt file ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt, which was already being used in std::unicode and is documented here: http://www.unicode.org/reports/tr44/.

The above-mentioned document notes:

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

Because of the inclusion of certain composite characters for compatibility, such as 01F1 "DZ" capital dz, there is a third case, called titlecase, which is used where the first letter of a word is to be capitalized (e.g. Titlecase, vs. UPPERCASE, or lowercase).
For example, the title case of the example character is 01F2 "Dz" capital d with small z.
Case mappings may produce strings of different length than the original.
For example, the German character 00DF "ß" small letter sharp s expands when uppercased to the sequence of two characters "SS". This also occurs where there is no precomposed character corresponding to a case mapping, such as with 0149 "ʼn" latin small letter n preceded by apostrophe.
Characters may also have different case mappings, depending on the context.
For example, 03A3 "Σ" capital sigma lowercases to 03C3 "σ" small sigma if it is followed by another letter, but lowercases to 03C2 "ς" small final sigma if it is not.
Characters may have case mappings that depend on the locale.
For example, in Turkish the letter 0049 "I" capital letter i lowercases to 0131 "ı" small dotless i.
Case mappings are not, in general, reversible.
For example, once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.
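
Some of the complications quoted above can be observed with the full case mappings in current Rust's `str::to_uppercase` and `str::to_lowercase`. This is an assumption about context: the sketch uses today's standard library, not the one-to-one API added in this PR. The function name `check_case_mapping_complications` is made up for illustration.

```rust
/// Hypothetical demo of three complications: length change,
/// context dependence, and irreversibility.
fn check_case_mapping_complications() {
    // Length change: U+00DF "ß" expands to the two characters "SS".
    let sharp_s = "\u{DF}";
    assert_eq!(sharp_s.chars().count(), 1);
    assert_eq!(sharp_s.to_uppercase(), "SS");

    // Context dependence: capital sigma U+03A3 lowercases to final
    // sigma U+03C2 at the end of a word and to U+03C3 elsewhere.
    assert_eq!("\u{391}\u{3A3}".to_lowercase(), "\u{3B1}\u{3C2}"); // "ΑΣ" -> "ας"
    assert_eq!("\u{3A3}\u{391}".to_lowercase(), "\u{3C3}\u{3B1}"); // "ΣΑ" -> "σα"

    // Irreversibility: the original capitalization cannot be recovered.
    let original = "McGowan";
    assert_eq!(original.to_uppercase(), "MCGOWAN");
    assert_eq!(original.to_uppercase().to_lowercase(), "mcgowan");
    assert_ne!(original.to_uppercase().to_lowercase(), original);
}
```

None of these cases can be expressed by a function of type `char -> char`, which is the limitation the simple mapping accepts.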

Go, Python and other languages use the same mechanism.
My recommendation would be to start with this approach and provide a more sophisticated and more complete implementation in a separate library, liblocale. I am currently working on one. An example of a locale library I like is http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/index.html

Currently all Rust character handling is based on the (naive) assumption that one Rust char, being one codepoint, maps to one letter, and thus the implemented simple case conversion seems appropriate.
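
As a rough illustration of where the one-to-one mappings come from: each line of UnicodeData.txt has 15 semicolon-separated fields, and fields 12 and 13 (counting from zero) hold the simple uppercase and simple lowercase mapping, as documented in tr44. The sketch below parses a single line. The type `SimpleCase` and the functions `parse_hex_char`, `parse_simple_case` and `check_parse_simple_case` are hypothetical names, not the generator used by this PR, and the sample lines are written out by hand in the UnicodeData.txt format.

```rust
/// Simple (one-to-one) case mappings for one code point.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
struct SimpleCase {
    code_point: char,
    upper: Option<char>,
    lower: Option<char>,
}

/// Parses a hex code point field such as "0041"; an empty field yields `None`.
fn parse_hex_char(field: &str) -> Option<char> {
    if field.is_empty() {
        return None;
    }
    u32::from_str_radix(field, 16).ok().and_then(char::from_u32)
}

/// Parses one UnicodeData.txt line and extracts the simple case mappings.
/// Returns `None` if the line does not have exactly 15 fields.
fn parse_simple_case(line: &str) -> Option<SimpleCase> {
    let fields: Vec<&str> = line.split(';').collect();
    if fields.len() != 15 {
        return None;
    }
    Some(SimpleCase {
        code_point: parse_hex_char(fields[0])?,
        upper: parse_hex_char(fields[12]), // Simple_Uppercase_Mapping
        lower: parse_hex_char(fields[13]), // Simple_Lowercase_Mapping
    })
}

/// Hypothetical demo on two sample lines.
fn check_parse_simple_case() {
    let small_a = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041";
    let parsed = parse_simple_case(small_a).unwrap();
    assert_eq!(parsed.upper, Some('A'));
    assert_eq!(parsed.lower, None);

    // Capital I with dot above: the simple lowercase is a plain 'i'.
    let dotted_i = "0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;";
    let parsed = parse_simple_case(dotted_i).unwrap();
    assert_eq!(parsed.code_point, '\u{130}');
    assert_eq!(parsed.lower, Some('i'));
}
```

An empty mapping field means the character has no simple mapping, which matches the documented behaviour of returning the char itself.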

@alexcrichton
Member

I'd be looking for comments on the functions themselves. The current documentation only states:

/// Convert a char to its uppercase equivalent
///
/// The case-folding performed is the common or simple mapping:
/// it only maps a codepoint to its equivalent if it is also a single codepoint
///
/// # Return value
///
/// Returns the char itself if no conversion is possible

This isn't very descriptive about how it's doing the uppercase/lowercase behind the scenes.

@huonw
Member

huonw commented Mar 10, 2014

I agree with @alexcrichton: having references and citations to the canonical source of algorithms is really good so that everyone is on the same page with precisely what is implemented.

@pzol
Contributor Author

pzol commented Mar 13, 2014

How about

/// Convert a char to its uppercase equivalent
///
/// The case-folding performed is the common or simple mapping:
/// it maps one unicode codepoint (one char in Rust) to its uppercase equivalent according
/// to the Unicode database at ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
/// The additional SpecialCasing.txt is not considered here, as it expands to multiple
/// codepoints in some cases.
///
/// A full reference can be found here
/// http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G33992
///
/// # Return value
///
/// Returns the char itself if no conversion was made
#[inline]
pub fn to_uppercase(c: char) -> char {
    conversions::to_upper(c)
}
/// Convert a char to its lowercase equivalent
///
/// The case-folding performed is the common or simple mapping;
/// see `to_uppercase` for references and more information.
///
/// # Return value
///
/// Returns the char itself if no conversion is possible
#[inline]
pub fn to_lowercase(c: char) -> char {
    conversions::to_lower(c)
}
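
A minimal sketch of how a one-to-one mapping like the one documented above could be backed by a table, assuming a sorted slice of `(from, to)` pairs searched with binary search. The table is a tiny hand-written excerpt and the names `LOWER_TO_UPPER` and `simple_to_upper` are hypothetical. This is not the generated table or the `conversions` module from the PR.

```rust
/// Hand-picked excerpt of simple lowercase -> uppercase pairs.
/// Must stay sorted by the first element for the binary search to work.
static LOWER_TO_UPPER: &[(char, char)] = &[
    ('a', 'A'),
    ('z', 'Z'),
    ('\u{e9}', '\u{c9}'),   // é -> É
    ('\u{131}', 'I'),       // ı -> I
    ('\u{3c3}', '\u{3a3}'), // σ -> Σ
];

/// Returns the simple uppercase of `c`, or `c` itself when the table
/// has no entry for it.
fn simple_to_upper(c: char) -> char {
    match LOWER_TO_UPPER.binary_search_by(|&(from, _)| from.cmp(&c)) {
        Ok(index) => LOWER_TO_UPPER[index].1,
        Err(_) => c,
    }
}
```

With this excerpt, `simple_to_upper('a')` yields `'A'`, while `simple_to_upper('ß')` returns `'ß'` unchanged, since a one-to-one table cannot hold its two-character expansion.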

@huonw
Member

huonw commented Mar 13, 2014

Seems fine to me. (cc #12862 re the links, you don't have to do anything about them now, though.)

@pzol
Contributor Author

pzol commented Mar 13, 2014

Sorry, should have updated them before, done now.

@flaper87
Contributor

@pzol could you squash your last commit into the one that implements the upper, lower case conversion? With that and @huonw comments addressed, it LGTM

@huonw
Member

huonw commented Mar 13, 2014

(My comment is already addressed... as I said in it, there is nothing that needs work.)

@flaper87
Contributor

(I wasn't referring to that one, anyway, looks fine)

Remove whitespace

Update documentation for to_uppercase, to_lowercase
@pzol
Contributor Author

pzol commented Mar 13, 2014

Squashed the last commit!

bors added a commit that referenced this pull request Mar 13, 2014
@bors bors merged commit dba5625 into rust-lang:master Mar 13, 2014
@pzol pzol deleted the char-case branch March 13, 2014 19:16

7 participants