Skip to content

Add lint on &[char] #5598

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
SoniEx2 opened this issue May 14, 2020 · 30 comments
Closed

Add lint on &[char] #5598

SoniEx2 opened this issue May 14, 2020 · 30 comments
Assignees
Labels
A-lint Area: New lints good-first-issue These issues are a good way to get started with Clippy

Comments

@SoniEx2
Copy link

SoniEx2 commented May 14, 2020

Since first writing this issue, all the following have come to our attention. Originally only the first point was written here:

  1. &[char] doesn't provide O(1) indexing on characters, but on codepoints. a single character can be multiple codepoints (see also char.to_uppercase, which can output one or more chars!), in this case &[&str] (or equivalent) is more appropriate.
  2. &[char] also doesn't provide O(1) indexing on columns, for some definition of "column". a single column can be multiple codepoints.
  3. There is confusion about what &[char] actually means in a Pattern. As an example "___abc___".split(&['a', 'b', 'c'][..]).collect::<Vec<_>>() makes ["___", "", "", "___"] but some may expect it to make ["___", "___"].
    1. One way someone may expect this is as Pattern for &[char] being UTF-32. It is not. It is well-documented that Rust doesn't support UTF-32 in std.
    2. Another way is "any combination of these chars". It is also not, as that would break str.strip_prefix.
  4. Even if there wasn't such confusion, &[char] may be slower than a hypothetical naive &[&str] for a small enough n, as the former requires encoding/decoding UTF-8 whereas the latter can rely on UTF-8 being a self-synchronizing prefix-free code and just match bytes.
  5. If you actually wanted to use an &[char] as a Pattern, but you wanted to uppercase the chars for some reason... well, see (1).
  6. For large n, &[char]: Pattern is slow. You should under no circumstances use ('a'..='z').collect::<Vec<_>>().as_slice() as a Pattern. Use |c| matches!(c, 'a'..='z') instead.

As such, there are a whole lot of issues with &[char] in practice. We think it's a good idea to lint against it.

@flip1995
Copy link
Member

flip1995 commented May 15, 2020

&[char] is practically a &str, not a &[&str], where it is impossible to accidentally index in between a char, but not as memory efficient. But on &str you index on codepoints (with ranges), whereas with &[char] you index on characters.

Note, that a char in Rust is not a u8, but a 4 byte character encoding.

Am I missing something?

@SoniEx2
Copy link
Author

SoniEx2 commented May 15, 2020

char is a unicode scalar value (USV). which is a subset of codepoint.

a character like á can be made of multiple codepoints.

@flip1995
Copy link
Member

Ah TIL, thanks for the clarification.

@flip1995 flip1995 added good-first-issue These issues are a good way to get started with Clippy A-lint Area: New lints labels May 16, 2020
@llogiq
Copy link
Contributor

llogiq commented May 26, 2020

As a counterpoint, sometimes using [char] is the way to go when interfacing with scripting languages, though.

@SoniEx2
Copy link
Author

SoniEx2 commented May 26, 2020

@llogiq I'd suggest using a proper wrapper around those, and turn off the lint in the wrapper crate.

@esamudera
Copy link
Contributor

esamudera commented Jul 5, 2020

Hello, I'd like to try working on this story.

So from what I can gather, the requirement is simply to create a new lint that detects &[char_type_var] and suggest converting it into &[&str_type_var] instead?

@brightly-salty
Copy link
Contributor

I'm willing to attempt to implement this and submit a PR if @esamudera isn't still working on it.

@esamudera
Copy link
Contributor

Hi @brightly-salty, yes please take over this issue. Thanks!

@brightly-salty
Copy link
Contributor

@rustbot claim

@thomcc
Copy link
Member

thomcc commented Feb 10, 2021

This feels a bit odd because adding a lint against a method like this is effectively deprecating it on behalf of stdlib, without going through the libs team or a wider community discussion process. The discussions on it (here and IRLO) seems far from conclusive on the subject too...

@SoniEx2
Copy link
Author

SoniEx2 commented Feb 10, 2021

Fwiw there is a comment in the stdlib which strongly hints that &[char] Patterns should be deprecated.

@tesuji
Copy link
Contributor

tesuji commented Feb 10, 2021

Fwiw there is a comment in the stdlib which strongly hints that &[char] Patterns should be deprecated.

@SoniEx2 Could you linking the specific code that hints exactly that ?
I tried to grep with &[char] but cannot find the relevant parts .

@SoniEx2
Copy link
Author

SoniEx2 commented Feb 10, 2021

https://github.com/rust-lang/rust/blob/0d97f7a96877a96015d70ece41ad08bb7af12377/library/core/src/str/pattern.rs#L759

As you may be able to tell, even the examples are bad:

/// Searches for chars that are equal to any of the [`char`]s in the slice.
///
/// # Examples
///
/// ```
/// assert_eq!("Hello world".find(&['l', 'l'] as &[_]), Some(2));
/// assert_eq!("Hello world".find(&['l', 'l'][..]), Some(2));
/// ```

@thomcc
Copy link
Member

thomcc commented Feb 10, 2021

rust-lang/rust@0d97f7a/library/core/src/str/pattern.rs#L759

// Todo: Change / Remove due to ambiguity in meaning.

Unless I'm mistaken (maybe the author of that comment, @Kimundi, is around and can say definitively if this is what it meant), this is saying it's ambiguous because it's not clear if it's searching sequentially vs looking for the first match of the list.

The thing is, I think that issue applies to &[&str] too, which we're considering suggesting as the recommended replacement.

@SoniEx2
Copy link
Author

SoniEx2 commented Feb 10, 2021

There is no support for &[&str] currently, but an array of strings is slightly different from an array of characters, especially for anyone coming from C.

@SoniEx2
Copy link
Author

SoniEx2 commented Feb 22, 2021

Oh, sorry, we misunderstood what you meant. However, the examples provided are pretty clear about whoever added those examples thinking it was actually an UTF-32 search, and it somehow getting through a code review, with the relevant PR being rust-lang/rust#71097

@llogiq
Copy link
Contributor

llogiq commented Mar 3, 2021

Another counterexample I recently stumbled upon is compiler/rustc_errors/src/styled_buffer.rs in which the StyledBuffer type keeps a Vec<Vec<char>>, and for good reason.

@SoniEx2
Copy link
Author

SoniEx2 commented Mar 3, 2021

We'd actually consider that a prime example for why we should have this lint, for at least 2 reasons:

  1. Assumes chars are fixed-width.
  2. Is slow due to encoding/decoding of char (altho this is a lousy argument because taking care to check actual character/grapheme cluster width would likely make it slower).

Personally we would recommend instead using a smallstr::SmallString for that.

@llogiq
Copy link
Contributor

llogiq commented Mar 3, 2021

How would a SmallString benefit this code? A char is 4 bytes, and each one has its own Style. A SmallString is at least 24 bytes, IIRC, and this code only works the way it does because of O(1) addressing.

@SoniEx2
Copy link
Author

SoniEx2 commented Mar 3, 2021

But it should be operating on grapheme clusters/columns, not chars. A char isn't the correct abstraction for this. A SmallString would handle your average grapheme cluster without allocation, only falling back to allocation for edge cases.

@thomcc
Copy link
Member

thomcc commented Mar 4, 2021

Displayed width isn't in units of grapheme clusters. It's also not in units of char — it's font dependent (yes, even for "monospace").

Edit: clarifications below

UAX #11 (e.g. unicode-width) defines about as close as we have to "displayed width" — In particular, while that UAX explicitly says terminal emulators shouldn't use it for this, many many do (since it goes into wcwidth)

It will still return the wrong value in some obvious cases though, but graphemes tend to also be pretty wrong. (Note that "ᄀᄀᄀ각ᆨᆨ" is a single grapheme — explicitly required to be too http://www.unicode.org/reports/tr29/#Hangul_Syllable_Boundary_Determination)

(Only reliable way to know is to print something out, and query the terminal about the current position afterwards. This isn't even perfect, but if it fails the terminal isn't self-consistent with its rendering and theres nothing you can do. See https://github.com/thomcc/term-width-example/ (bash testcases.sh after making your terminal window relatively large will show different approaches). Unfortunately, this approach doesn't really work for bidi, but it's about the best approach I know of. I suspect the "width" of a character (or perhaps the notion of a character itself) doesn't really exist in a way that works well here...)

Anyway, CCing @Manishearth who I'd trust more on this sort of question more than anybody else

@llogiq
Copy link
Contributor

llogiq commented Mar 4, 2021

Please remember we're talking about displaying code in the example I gave. The worst that might happen is that the formatting in an error message may be off, so an 80% correct approach is completely acceptable here.

Being dogmatic and overlooking the cost of the 100% solution is not a tenet of Rust design.

@thomcc
Copy link
Member

thomcc commented Mar 4, 2021

Ah, to be clear: I definitely wasn't arguing that the code as is was bad. I don't think there's a meaningfully better approach, actually. The one I describe (query after writing) is only really applicable for terminal UI programs. Something like rust error output shouldn't be tied to the actual outputting that way.

@Manishearth
Copy link
Member

UAX #11 (e.g. unicode-width) defines about as close as we have to "displayed width" — In particular, while that UAX explicitly says terminal emulators shouldn't use it for this, many many do (since it goes into wcwidth)

I'd also like to point out that this is specifically for east asian text only, this doesn't really handle emoji/etc. Grapheme clusters are often a good approximation for displayed width in monospace fonts, but also not really.

The only reasonable thing to do here is look at your specific use case and see what you need. As you say, querying the font/terminal is often the right option.


As for this lint, I think &[char] can be used for many, many legit reasons. You might be interoperating with UTF-32 code, or otherwise need UTF-32. You might just need an array of characters for a bloom filter or something. I really don't agree with the motivation in the issue body; a lot of these complaints apply to &str as well, and ultimately this stuff is extremely use case specific and it's not possible to write a good general lint for them.

@SoniEx2
Copy link
Author

SoniEx2 commented Mar 4, 2021

Why do these complaints apply to &[&str]? Or do you mean a bare &str? Because as far as we know, you can make an &[&str] where the index is the actual physical column and the value is all the codepoints that make up that column (or empty if carrying over from a previous column).

Sure, we don't know how to bring up all these issues in the error message for this one lint, because this is a lot of stuff. For UTF-32 use an actual abstraction layer that isn't gonna have you trying to use "UTF-32" as a Pattern and getting weird results. For grapheme clusters and columns and figuring out how much left/backspace should move the cursor use a &[&str] or Vec<String> or so on. For using [char] as an actual Pattern make sure you know what you're doing (it is easy to get wrong) and turn off the lint for that section. While each use-case requires a different solution, &[char] is almost never the appropriate solution.

@Manishearth
Copy link
Member

Yes, on a bare &str.

I don't understand your focus on Pattern, your issue is asking for &[char] to be linted against in general. If you want something specifically for Pattern please clearly spell it out, ideally in the separate issue. As it stands the bulk of the ask here is too broad and IMO not correct.

@SoniEx2
Copy link
Author

SoniEx2 commented Mar 4, 2021

&[char] is the problem, which includes Pattern.

and yes the problem IS broad. that's the whole point. this is "internationalization 101" stuff and everyone gets it wrong.

@Manishearth
Copy link
Member

In my many years of writing Rust code I have never seen &[char] used in this way. People just do not use that type. When I've seen it be used it's usually for UTF32 compat or something.

@SoniEx2
Copy link
Author

SoniEx2 commented Mar 4, 2021

&[char], Vec<char>, same difference really.

@Kimundi
Copy link
Member

Kimundi commented Mar 30, 2021

rust-lang/rust@0d97f7a/library/core/src/str/pattern.rs#L759

// Todo: Change / Remove due to ambiguity in meaning.

Unless I'm mistaken (maybe the author of that comment, @Kimundi, is around and can say definitively if this is what it meant), this is saying it's ambiguous because it's not clear if it's searching sequentially vs looking for the first match of the list.

Yes that is what I meant with that comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-lint Area: New lints good-first-issue These issues are a good way to get started with Clippy
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants