-
Notifications
You must be signed in to change notification settings - Fork 13.4k
Implement upper, lower case conversion for char #12561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
See also #9084 |
Would it be possible to do this in a separate commit (in this PR), for ease of review and general good-git-practice? |
Ok, will do that in two separate commits. Need to iron out an issue I have found first. |
Thanks. |
Done. |
Docs updated. |
@flaper87 removed code duplication. |
@@ -486,6 +518,39 @@ fn test_to_digit() { | |||
} | |||
|
|||
#[test] | |||
fn test_to_lowercase() { | |||
assert_eq!('A'.to_lowercase(), 'a'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In important test case to have would be the infamous Turkish i
.
It's an issue in many languages/frameworks:
http://blogs.msdn.com/b/deeptanshuv/archive/2004/09/04/225720.aspx
http://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/
https://groups.google.com/d/topic/golang-nuts/w8eZxT3dA48/discussion
http://stackoverflow.com/questions/16830570/qt-turkish-characters-case-conversion
http://wiki.tcl.tk/748
esamattis/underscore.string#252
nicolas-grekas/Patchwork-UTF8#2
http://lotusnotus.com/lotusnotus_en.nsf/dx/dotless-i-tolowercase-and-touppercase-functions-use-responsibly.htm
I could add more, but basically every language/framework gets this wrong and it has cost people's lives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem comes out when you deal with string comparison in a language sensitive context. If you just convert to upper case and then compare, without ever considering a locale, you cannot know we're dealing with turkish. A plain conversion to lower and to upper works according to UnicodeData.txt - without the special language sensitive context, i.e. the dotted upper case I with a dot, convers to a simple i.
So currently this would not pass:
assert_eq!('ı'.to_uppercase(), 'İ');
assert_eq!('İ'.to_lowercase(), 'i');
let tr_alphabet = "abcçdefgğhıijklmnoöprsştuüvyz";
let tr_upper = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ";
let tr_lower = "abcçdefgğhıijklmnoöprsştuüvyz";
for (a, e) in upper_chars(tr_alphabet).zip(tr_upper.chars()) {
assert!(a == e, format!("actual {} != expected {}", a, e));
This should be tackled in a different lib. I will be proposing a libi18n that should have things like case insensitive comparison in a locale context.
On 1 mar 2014, at 21:05, Val Markovic [email protected] wrote:
In src/libstd/char.rs:
@@ -486,6 +518,39 @@ fn test_to_digit() {
}#[test]
+fn test_to_lowercase() {
- assert_eq!('A'.to_lowercase(), 'a');
In important test case to have would be the infamous Turkish i.It's an issue in many languages/frameworks:
http://blogs.msdn.com/b/deeptanshuv/archive/2004/09/04/225720.aspx
http://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/
https://groups.google.com/d/topic/golang-nuts/w8eZxT3dA48/discussion
http://stackoverflow.com/questions/16830570/qt-turkish-characters-case-conversion
http://wiki.tcl.tk/748
esamattis/underscore.string#252
nicolas-grekas/Patchwork-UTF8#2
http://lotusnotus.com/lotusnotus_en.nsf/dx/dotless-i-tolowercase-and-touppercase-functions-use-responsibly.htmI could add more, but basically every language/framework gets this wrong and it has cost people's lives.
—
Reply to this email directly or view it on GitHub.
Just a small nit from a partial review. It looks good to me! Thanks a lot! |
@pzol could you squash the last 2 commits into the second one? This is looking good. Thanks |
Squashed! |
LGTM, @huonw mind taking a final look here? |
In the past I've found unicode case sensitivity to be a very tricky and hairy topic. I've heard things like it's based on locale, based on which variant of unicode you're using, it changes from revision to revision, etc. My only worry about this is that the I'm a little worried to merge this as I'm certainly no unicode expert, but the code looks good to me and I'd be willing to r+ with more comprehensive comments. |
@alexcrichton comments with references in the code or in the commit? Case folding is decribed here http://unicode.org/reports/tr21/tr21-3.html. The above mentioned documented mentions:
Go, Python and other languages use the same mechanism. Currently all Rust character handling is based on the the (naive) assumption, that one Rust char being one codepoint maps to one letter, and thus the implemented simple case conversion seems appropriate. |
I'd be looking for comments on the functions themselves. The current documentation only states:
This isn't very descriptive about how it's doing the uppercase/lowercase behind the scenes. |
I agree with @alexcrichton: having references and citations to the canonical source of algorithms is really good so that everyone is on the same page with precisely what is implemented. |
How about /// Convert a char to its uppercase equivalent
///
/// The case-folding performed is the common or simple mapping:
/// it maps one unicode codepoint (one char in Rust) to its uppercase equivalent according
/// to the Unicode database at ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
/// The additional SpecialCasing.txt is not considered here, as it expands to multiple
/// codepoints in some cases.
///
/// A full reference can be found here
/// http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G33992
///
/// # Return value
///
/// Returns the char itself if no conversion was made
#[inline]
pub fn to_uppercase(c: char) -> char {
conversions::to_upper(c)
}
/// Convert a char to its lowercase equivalent
///
/// The case-folding performed is the common or simple mapping
/// see `to_uppercase` for references and more information
///
/// # Return value
///
/// Returns the char itself if no conversion if possible
#[inline]
pub fn to_lowercase(c: char) -> char {
conversions::to_lower(c)
} |
Seems fine to me. (cc #12862 re the links, you don't have to do anything about them now, though.) |
Sorry, should have updated them before, done now. |
(My comment is already addressed... as I said in it, there is nothing that needs work.) |
(I wasn't referring to that one, anyway, looks fine) |
Remove whitespace Update documentation for to_uppercase, to_lowercase
Squashed the last commit! |
Added common and simple case folding, i.e. mapping one to one character mapping. For more information see http://www.unicode.org/faq/casemap_charprop.html Removed auto-generated dead code which wasn't used.
Added common and simple case folding, i.e. mapping one to one character mapping. For more information see http://www.unicode.org/faq/casemap_charprop.html
Removed auto-generated dead code which wasn't used.