Add lint on &[char]
#5598
Comments
Note, that a Am I missing something? |
a character like |
Ah TIL, thanks for the clarification. |
As a counterpoint, sometimes using |
@llogiq I'd suggest using a proper wrapper around those, and turn off the lint in the wrapper crate. |
Hello, I'd like to try working on this story. So from what I can gather, the requirement is simply to create a new lint that detects |
I'm willing to attempt to implement this and submit a PR if @esamudera isn't still working on it. |
Hi @brightly-salty, yes please take over this issue. Thanks! |
@rustbot claim |
This feels a bit odd because adding a lint against a method like this is effectively deprecating it on behalf of stdlib, without going through the libs team or a wider community discussion process. The discussions on it (here and IRLO) seem far from conclusive on the subject too... |
Fwiw there is a comment in the stdlib which strongly hints that |
@SoniEx2 Could you link the specific code that hints at exactly that? |
As you may be able to tell, even the examples are bad:
|
Unless I'm mistaken (maybe the author of that comment, @Kimundi, is around and can say definitively if this is what it meant), this is saying it's ambiguous because it's not clear if it's searching sequentially vs looking for the first match of the list. The thing is, I think that issue applies to |
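To make the ambiguity concrete, here is a minimal sketch of the two readings. The helper names `find_any_of` and `find_sequence` are hypothetical and only wrap `str::find`; they are not part of std or Clippy. A `&[char]` pattern matches any single one of the listed chars, while a `&str` pattern matches the whole sequence.

```rust
// Hypothetical helpers for illustration only; both just call `str::find`.

/// Byte index of the first char that equals ANY ONE of the chars in `set`.
fn find_any_of(haystack: &str, set: &[char]) -> Option<usize> {
    haystack.find(set)
}

/// Byte index of the first occurrence of `seq` as a contiguous sequence.
fn find_sequence(haystack: &str, seq: &str) -> Option<usize> {
    haystack.find(seq)
}

fn main() {
    let text = "xbxab";

    // The char slice matches the lone 'b' at byte index 1.
    println!("{:?}", find_any_of(text, &['a', 'b'])); // Some(1)

    // The string pattern only matches the full "ab" at byte index 3.
    println!("{:?}", find_sequence(text, "ab")); // Some(3)

    // The same difference shows up in `strip_prefix`:
    // the char slice strips one matching char, the string strips the sequence.
    println!("{:?}", "abc".strip_prefix(&['a', 'b'][..])); // Some("bc")
    println!("{:?}", "abc".strip_prefix("ab")); // Some("c")
}
```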
There is no support for |
Oh, sorry, we misunderstood what you meant. However, the examples provided make it pretty clear that whoever added them thought it was actually a UTF-32 search, and that it somehow got through code review, with the relevant PR being rust-lang/rust#71097 |
Another counterexample I recently stumbled upon is |
We'd actually consider that a prime example for why we should have this lint, for at least 2 reasons:
Personally we would recommend instead using a |
How would a SmallString benefit this code? A |
But it should be operating on grapheme clusters/columns, not |
Displayed width isn't in units of grapheme clusters. It's also not in units of Edit: clarifications below UAX #11 (e.g. It will still return the wrong value in some obvious cases though, but graphemes tend to also be pretty wrong. (Note that "ᄀᄀᄀ각ᆨᆨ" is a single grapheme, explicitly required to be too http://www.unicode.org/reports/tr29/#Hangul_Syllable_Boundary_Determination) (Only reliable way to know is to print something out, and query the terminal about the current position afterwards. This isn't even perfect, but if it fails the terminal isn't self-consistent with its rendering and there's nothing you can do. See https://github.com/thomcc/term-width-example/ ( Anyway, CCing @Manishearth who I'd trust on this sort of question more than anybody else |
Please remember we're talking about displaying code in the example I gave. The worst that might happen is that the formatting in an error message may be off, so an 80% correct approach is completely acceptable here. Being dogmatic and overlooking the cost of the 100% solution is not a tenet of Rust design. |
Ah, to be clear: I definitely wasn't arguing that the code as is was bad. I don't think there's a meaningfully better approach, actually. The one I describe (query after writing) is only really applicable for terminal UI programs. Something like rust error output shouldn't be tied to the actual outputting that way. |
I'd also like to point out that this is specifically for East Asian text only; this doesn't really handle emoji/etc. Grapheme clusters are often a good approximation for displayed width in monospace fonts, but also not really. The only reasonable thing to do here is look at your specific use case and see what you need. As you say, querying the font/terminal is often the right option. As for this lint, I think |
Why do these complaints apply to Sure, we don't know how to bring up all these issues in the error message for this one lint, because this is a lot of stuff. For UTF-32 use an actual abstraction layer that isn't gonna have you trying to use "UTF-32" as a Pattern and getting weird results. For grapheme clusters and columns and figuring out how much left/backspace should move the cursor use a |
Yes, on a bare I don't understand your focus on |
and yes the problem IS broad. that's the whole point. this is "internationalization 101" stuff and everyone gets it wrong. |
In my many years of writing Rust code I have never seen |
|
Yes that is what I meant with that comment. |
Since first writing this issue, all the following have come to our attention. Originally only the first point was written here:
1. `&[char]` doesn't provide O(1) indexing on characters, but on codepoints. a single character can be multiple codepoints (see also `char.to_uppercase`, which can output one or more chars!), in this case `&[&str]` (or equivalent) is more appropriate.
2. `&[char]` also doesn't provide O(1) indexing on columns, for some definition of "column". a single column can be multiple codepoints.
3. […] `&[char]` actually means in a `Pattern`. As an example `"___abc___".split(&['a', 'b', 'c'][..]).collect::<Vec<_>>()` makes `["___", "", "", "___"]` but some may expect it to make `["___", "___"]`.
4. […] `&[char]` being UTF-32. It is not. It is well-documented that Rust doesn't support UTF-32 in std. […] `str.strip_prefix`.
5. `&[char]` may be slower than a hypothetical naive `&[&str]` for a small enough n, as the former requires encoding/decoding UTF-8 whereas the latter can rely on UTF-8 being a self-synchronizing prefix-free code and just match bytes.
6. […] `&[char]` as a `Pattern`, but you wanted to uppercase the chars for some reason... well, see (1).
7. `&[char]: Pattern` is slow. You should under no circumstances use `('a'..='z').collect::<Vec<_>>().as_slice()` as a `Pattern`. Use `|c| matches!(c, 'a'..='z')` instead.

As such, there are a whole lot of issues with `&[char]` in practice. We think it's a good idea to lint against it.
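The `split` example, the `to_uppercase` expansion and the closure pattern mentioned above can be sketched as follows. This is a minimal illustration, not a proposed lint implementation. All function names here are hypothetical and only wrap std calls, and `'ß'` is an assumed example of a char whose uppercase form is more than one `char`.

```rust
// Hypothetical helpers for illustration only; each wraps a single std call.

/// Splits on ANY ONE of the chars in `seps`, so adjacent separators
/// produce empty pieces.
fn split_on_any<'a>(s: &'a str, seps: &[char]) -> Vec<&'a str> {
    s.split(seps).collect()
}

/// Splits on the whole sequence `seq`, which is what some readers expect
/// a char slice to do.
fn split_on_sequence<'a>(s: &'a str, seq: &str) -> Vec<&'a str> {
    s.split(seq).collect()
}

/// Number of `char`s produced by uppercasing a single `char`.
fn uppercase_len(c: char) -> usize {
    c.to_uppercase().count()
}

/// Splits on ASCII lowercase letters with a closure pattern instead of
/// collecting the range into a `Vec<char>` first.
fn split_on_ascii_lowercase(s: &str) -> Vec<&str> {
    s.split(|c: char| matches!(c, 'a'..='z')).collect()
}

fn main() {
    // Each of 'a', 'b', 'c' is its own separator.
    println!("{:?}", split_on_any("___abc___", &['a', 'b', 'c']));
    // ["___", "", "", "___"]

    // The string "abc" is one separator.
    println!("{:?}", split_on_sequence("___abc___", "abc"));
    // ["___", "___"]

    // One char in, two chars out: 'ß' uppercases to "SS".
    println!("{}", uppercase_len('ß')); // 2

    // No intermediate Vec<char> is needed for a range of chars.
    println!("{:?}", split_on_ascii_lowercase("12ab34"));
    // ["12", "", "34"]
}
```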