Add lint on `&[char]` #5598

SoniEx2 · 2020-05-14T15:52:28Z

Since first writing this issue, all the following have come to our attention. Originally only the first point was written here:

&[char] doesn't provide O(1) indexing on characters, but on codepoints. a single character can be multiple codepoints (see also char.to_uppercase, which can output one or more chars!), in this case &[&str] (or equivalent) is more appropriate.
&[char] also doesn't provide O(1) indexing on columns, for some definition of "column". a single column can be multiple codepoints.
There is confusion about what &[char] actually means in a Pattern. As an example "___abc___".split(&['a', 'b', 'c'][..]).collect::<Vec<_>>() makes ["___", "", "", "___"] but some may expect it to make ["___", "___"].
1. One way someone may expect this is as Pattern for &[char] being UTF-32. It is not. It is well-documented that Rust doesn't support UTF-32 in std.
2. Another way is "any combination of these chars". It is also not, as that would break str.strip_prefix.
Even if there wasn't such confusion, &[char] may be slower than a hypothetical naive &[&str] for a small enough n, as the former requires encoding/decoding UTF-8 whereas the latter can rely on UTF-8 being a self-synchronizing prefix-free code and just match bytes.
If you actually wanted to use an &[char] as a Pattern, but you wanted to uppercase the chars for some reason... well, see (1).
For large n, &[char]: Pattern is slow. You should under no circumstances use ('a'..='z').collect::<Vec<_>>().as_slice() as a Pattern. Use |c| matches!(c, 'a'..='z') instead.

As such, there are a whole lot of issues with &[char] in practice. We think it's a good idea to lint against it.

The text was updated successfully, but these errors were encountered:

flip1995 · 2020-05-15T08:00:34Z

&[char] is practically a &str, not a &[&str], where it is impossible to accidentally index in between a char, but not as memory efficient. But on &str you index on codepoints (with ranges), whereas with &[char] you index on characters.

Note, that a char in Rust is not a u8, but a 4 byte character encoding.

Am I missing something?

SoniEx2 · 2020-05-15T13:37:05Z

char is a unicode scalar value (USV). which is a subset of codepoint.

a character like á can be made of multiple codepoints.

flip1995 · 2020-05-16T20:09:45Z

Ah TIL, thanks for the clarification.

llogiq · 2020-05-26T20:12:47Z

As a counterpoint, sometimes using [char] is the way to go when interfacing with scripting languages, though.

SoniEx2 · 2020-05-26T20:48:23Z

@llogiq I'd suggest using a proper wrapper around those, and turn off the lint in the wrapper crate.

esamudera · 2020-07-05T03:59:19Z

Hello, I'd like to try working on this story.

So from what I can gather, the requirement is simply to create a new lint that detects &[char_type_var] and suggest converting it into &[&str_type_var] instead?

brightly-salty · 2020-12-31T00:38:42Z

I'm willing to attempt to implement this and submit a PR if @esamudera isn't still working on it.

esamudera · 2020-12-31T03:57:33Z

Hi @brightly-salty, yes please take over this issue. Thanks!

brightly-salty · 2020-12-31T03:59:13Z

@rustbot claim

thomcc · 2021-02-10T09:25:37Z

This feels a bit odd because adding a lint against a method like this is effectively deprecating it on behalf of stdlib, without going through the libs team or a wider community discussion process. The discussions on it (here and IRLO) seems far from conclusive on the subject too...

SoniEx2 · 2021-02-10T12:50:50Z

Fwiw there is a comment in the stdlib which strongly hints that &[char] Patterns should be deprecated.

tesuji · 2021-02-10T14:01:42Z

Fwiw there is a comment in the stdlib which strongly hints that &[char] Patterns should be deprecated.

@SoniEx2 Could you linking the specific code that hints exactly that ?
I tried to grep with &[char] but cannot find the relevant parts .

SoniEx2 · 2021-02-10T14:45:01Z

https://github.com/rust-lang/rust/blob/0d97f7a96877a96015d70ece41ad08bb7af12377/library/core/src/str/pattern.rs#L759

As you may be able to tell, even the examples are bad:

/// Searches for chars that are equal to any of the [`char`]s in the slice.
///
/// # Examples
///
/// ```
/// assert_eq!("Hello world".find(&['l', 'l'] as &[_]), Some(2));
/// assert_eq!("Hello world".find(&['l', 'l'][..]), Some(2));
/// ```

thomcc · 2021-02-10T18:05:53Z

rust-lang/rust@0d97f7a/library/core/src/str/pattern.rs#L759
// Todo: Change / Remove due to ambiguity in meaning.

Unless I'm mistaken (maybe the author of that comment, @Kimundi, is around and can say definitively if this is what it meant), this is saying it's ambiguous because it's not clear if it's searching sequentially vs looking for the first match of the list.

The thing is, I think that issue applies to &[&str] too, which we're considering suggesting as the recommended replacement.

SoniEx2 · 2021-02-10T20:34:42Z

There is no support for &[&str] currently, but an array of strings is slightly different from an array of characters, especially for anyone coming from C.

SoniEx2 · 2021-02-22T14:26:37Z

Oh, sorry, we misunderstood what you meant. However, the examples provided are pretty clear about whoever added those examples thinking it was actually an UTF-32 search, and it somehow getting through a code review, with the relevant PR being rust-lang/rust#71097

llogiq · 2021-03-03T19:11:40Z

Another counterexample I recently stumbled upon is compiler/rustc_errors/src/styled_buffer.rs in which the StyledBuffer type keeps a Vec<Vec<char>>, and for good reason.

SoniEx2 · 2021-03-03T19:34:31Z

We'd actually consider that a prime example for why we should have this lint, for at least 2 reasons:

Assumes chars are fixed-width.
Is slow due to encoding/decoding of char (altho this is a lousy argument because taking care to check actual character/grapheme cluster width would likely make it slower).

Personally we would recommend instead using a smallstr::SmallString for that.

llogiq · 2021-03-03T21:25:21Z

How would a SmallString benefit this code? A char is 4 bytes, and each one has its own Style. A SmallString is at least 24 bytes, IIRC, and this code only works the way it does because of O(1) addressing.

SoniEx2 · 2021-03-03T21:41:21Z

But it should be operating on grapheme clusters/columns, not chars. A char isn't the correct abstraction for this. A SmallString would handle your average grapheme cluster without allocation, only falling back to allocation for edge cases.

thomcc · 2021-03-04T00:54:18Z

Displayed width isn't in units of grapheme clusters. It's also not in units of char — it's font dependent (yes, even for "monospace").

Edit: clarifications below

UAX #11 (e.g. unicode-width) defines about as close as we have to "displayed width" — In particular, while that UAX explicitly says terminal emulators shouldn't use it for this, many many do (since it goes into wcwidth)

It will still return the wrong value in some obvious cases though, but graphemes tend to also be pretty wrong. (Note that "ᄀᄀᄀ각ᆨᆨ" is a single grapheme — explicitly required to be too http://www.unicode.org/reports/tr29/#Hangul_Syllable_Boundary_Determination)

(Only reliable way to know is to print something out, and query the terminal about the current position afterwards. This isn't even perfect, but if it fails the terminal isn't self-consistent with its rendering and theres nothing you can do. See https://github.com/thomcc/term-width-example/ (bash testcases.sh after making your terminal window relatively large will show different approaches). Unfortunately, this approach doesn't really work for bidi, but it's about the best approach I know of. I suspect the "width" of a character (or perhaps the notion of a character itself) doesn't really exist in a way that works well here...)

Anyway, CCing @Manishearth who I'd trust more on this sort of question more than anybody else

llogiq · 2021-03-04T06:37:35Z

Please remember we're talking about displaying code in the example I gave. The worst that might happen is that the formatting in an error message may be off, so an 80% correct approach is completely acceptable here.

Being dogmatic and overlooking the cost of the 100% solution is not a tenet of Rust design.

thomcc · 2021-03-04T06:59:14Z

Ah, to be clear: I definitely wasn't arguing that the code as is was bad. I don't think there's a meaningfully better approach, actually. The one I describe (query after writing) is only really applicable for terminal UI programs. Something like rust error output shouldn't be tied to the actual outputting that way.

Manishearth · 2021-03-04T07:41:23Z

UAX #11 (e.g. unicode-width) defines about as close as we have to "displayed width" — In particular, while that UAX explicitly says terminal emulators shouldn't use it for this, many many do (since it goes into wcwidth)

I'd also like to point out that this is specifically for east asian text only, this doesn't really handle emoji/etc. Grapheme clusters are often a good approximation for displayed width in monospace fonts, but also not really.

The only reasonable thing to do here is look at your specific use case and see what you need. As you say, querying the font/terminal is often the right option.

As for this lint, I think &[char] can be used for many, many legit reasons. You might be interoperating with UTF-32 code, or otherwise need UTF-32. You might just need an array of characters for a bloom filter or something. I really don't agree with the motivation in the issue body; a lot of these complaints apply to &str as well, and ultimately this stuff is extremely use case specific and it's not possible to write a good general lint for them.

SoniEx2 · 2021-03-04T13:25:34Z

Why do these complaints apply to &[&str]? Or do you mean a bare &str? Because as far as we know, you can make an &[&str] where the index is the actual physical column and the value is all the codepoints that make up that column (or empty if carrying over from a previous column).

Sure, we don't know how to bring up all these issues in the error message for this one lint, because this is a lot of stuff. For UTF-32 use an actual abstraction layer that isn't gonna have you trying to use "UTF-32" as a Pattern and getting weird results. For grapheme clusters and columns and figuring out how much left/backspace should move the cursor use a &[&str] or Vec<String> or so on. For using [char] as an actual Pattern make sure you know what you're doing (it is easy to get wrong) and turn off the lint for that section. While each use-case requires a different solution, &[char] is almost never the appropriate solution.

Manishearth · 2021-03-04T17:25:30Z

Yes, on a bare &str.

I don't understand your focus on Pattern, your issue is asking for &[char] to be linted against in general. If you want something specifically for Pattern please clearly spell it out, ideally in the separate issue. As it stands the bulk of the ask here is too broad and IMO not correct.

SoniEx2 · 2021-03-04T18:06:54Z

&[char] is the problem, which includes Pattern.

and yes the problem IS broad. that's the whole point. this is "internationalization 101" stuff and everyone gets it wrong.

Manishearth · 2021-03-04T18:20:00Z

In my many years of writing Rust code I have never seen &[char] used in this way. People just do not use that type. When I've seen it be used it's usually for UTF32 compat or something.

SoniEx2 · 2021-03-04T18:22:19Z

&[char], Vec<char>, same difference really.

Kimundi · 2021-03-30T19:02:11Z

rust-lang/rust@0d97f7a/library/core/src/str/pattern.rs#L759
// Todo: Change / Remove due to ambiguity in meaning.
Unless I'm mistaken (maybe the author of that comment, @Kimundi, is around and can say definitively if this is what it meant), this is saying it's ambiguous because it's not clear if it's searching sequentially vs looking for the first match of the list.

Yes that is what I meant with that comment.

flip1995 added good-first-issue These issues are a good way to get started with Clippy A-lint Area: New lints labels May 16, 2020

ebroto added the hacktoberfest label Oct 1, 2020

rustbot assigned brightly-salty Dec 31, 2020

brightly-salty mentioned this issue Dec 31, 2020

Wip: Add new lint "char_slices" #6529

Closed

Manishearth closed this as completed Mar 4, 2021

Luro02 mentioned this issue Sep 4, 2021

Suggest usage of array.as_slice() (instead of &array[..]) and array.as_mut_slice() #7633

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add lint on `&[char]` #5598

Add lint on `&[char]` #5598

SoniEx2 commented May 14, 2020 •

edited

Loading

flip1995 commented May 15, 2020 •

edited

Loading

SoniEx2 commented May 15, 2020

flip1995 commented May 16, 2020

llogiq commented May 26, 2020

SoniEx2 commented May 26, 2020

esamudera commented Jul 5, 2020 •

edited

Loading

brightly-salty commented Dec 31, 2020

esamudera commented Dec 31, 2020

brightly-salty commented Dec 31, 2020

thomcc commented Feb 10, 2021

SoniEx2 commented Feb 10, 2021 •

edited

Loading

tesuji commented Feb 10, 2021

SoniEx2 commented Feb 10, 2021 •

edited

Loading

thomcc commented Feb 10, 2021

SoniEx2 commented Feb 10, 2021

SoniEx2 commented Feb 22, 2021

llogiq commented Mar 3, 2021

SoniEx2 commented Mar 3, 2021 •

edited

Loading

llogiq commented Mar 3, 2021

SoniEx2 commented Mar 3, 2021 •

edited

Loading

thomcc commented Mar 4, 2021 •

edited

Loading

llogiq commented Mar 4, 2021

thomcc commented Mar 4, 2021

Manishearth commented Mar 4, 2021

SoniEx2 commented Mar 4, 2021 •

edited

Loading

Manishearth commented Mar 4, 2021

SoniEx2 commented Mar 4, 2021

Manishearth commented Mar 4, 2021

SoniEx2 commented Mar 4, 2021

Kimundi commented Mar 30, 2021

Add lint on &[char] #5598

Add lint on &[char] #5598

Comments

SoniEx2 commented May 14, 2020 • edited Loading

flip1995 commented May 15, 2020 • edited Loading

SoniEx2 commented May 15, 2020

flip1995 commented May 16, 2020

llogiq commented May 26, 2020

SoniEx2 commented May 26, 2020

esamudera commented Jul 5, 2020 • edited Loading

brightly-salty commented Dec 31, 2020

esamudera commented Dec 31, 2020

brightly-salty commented Dec 31, 2020

thomcc commented Feb 10, 2021

SoniEx2 commented Feb 10, 2021 • edited Loading

tesuji commented Feb 10, 2021

SoniEx2 commented Feb 10, 2021 • edited Loading

thomcc commented Feb 10, 2021

SoniEx2 commented Feb 10, 2021

SoniEx2 commented Feb 22, 2021

llogiq commented Mar 3, 2021

SoniEx2 commented Mar 3, 2021 • edited Loading

llogiq commented Mar 3, 2021

SoniEx2 commented Mar 3, 2021 • edited Loading

thomcc commented Mar 4, 2021 • edited Loading

llogiq commented Mar 4, 2021

thomcc commented Mar 4, 2021

Manishearth commented Mar 4, 2021

SoniEx2 commented Mar 4, 2021 • edited Loading

Manishearth commented Mar 4, 2021

SoniEx2 commented Mar 4, 2021

Manishearth commented Mar 4, 2021

SoniEx2 commented Mar 4, 2021

Kimundi commented Mar 30, 2021

Add lint on `&[char]` #5598

Add lint on `&[char]` #5598

SoniEx2 commented May 14, 2020 •

edited

Loading

flip1995 commented May 15, 2020 •

edited

Loading

esamudera commented Jul 5, 2020 •

edited

Loading

SoniEx2 commented Feb 10, 2021 •

edited

Loading

SoniEx2 commented Feb 10, 2021 •

edited

Loading

SoniEx2 commented Mar 3, 2021 •

edited

Loading

SoniEx2 commented Mar 3, 2021 •

edited

Loading

thomcc commented Mar 4, 2021 •

edited

Loading

SoniEx2 commented Mar 4, 2021 •

edited

Loading