Remove duplicate impl of string unescape from parse_format #137995

Open
wants to merge 1 commit into
base: master
from

hkBst
Member

@hkBst hkBst commented Mar 4, 2025

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 4, 2025
@rustbot
Collaborator

rustbot commented Mar 4, 2025

rust-analyzer is developed in its own repository. If possible, consider making this change to rust-lang/rust-analyzer instead.

cc @rust-lang/rust-analyzer

@nnethercote
Contributor

This is a large (+447/−547) change with zero explanation. I need some context and motivation, please! Also, from skimming it I think there might be multiple distinct changes in the single commit? If so, it would be easier to review if they were separate.

@hkBst
Member Author

hkBst commented Mar 5, 2025

This is a large (+447/−547) change with zero explanation. I need some context and motivation, please!

Ah, sorry about that. Let me try and fix that:

The idea for this comes from this code at the bottom of rustc_parse_format/lib.rs:

fn unescape_string(string: &str) -> Option<String> {
    let mut buf = String::new();
    let mut ok = true;
    unescape::unescape_unicode(string, unescape::Mode::Str, &mut |_, unescaped_char| {
        match unescaped_char {
            Ok(c) => buf.push(c),
            Err(_) => ok = false,
        }
    });

    ok.then_some(buf)
}

This function does the unescaping but throws away all span information from the original string (via the _ in &mut |_, unescaped_char|). It is called in fn find_width_map_from_snippet:

    let Some(unescaped) = unescape_string(snippet) else {
        return InputStringKind::NotALiteral;
    };

find_width_map_from_snippet then does its own light version of string unescaping to build a width map (if the unescaped string matches the input string). The width map is basically a list of expansions from the unescaped string back to the original string, and it has to be traversed to determine the old position (this happens in fn remap_pos, fn to_span_index, fn to_span_width, and fn span). Doing a Vec traversal for each original position that is needed is quadratic behavior, as it is linear in both the length of the width map Vec and the number of such translations from new to old position.
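
To make that cost concrete, here is a minimal sketch of a width-map style translation. The Expansion struct and this remap_pos are simplified, hypothetical stand-ins written for illustration, not the actual rustc_parse_format definitions. They assume the map is sorted by position and that every escape is at least as long as the character it produces.

```rust
/// Hypothetical, simplified width-map entry: one escape sequence that took
/// `source_len` bytes in the snippet but `unescaped_len` bytes after unescaping.
#[derive(Debug, Clone, Copy)]
struct Expansion {
    /// Byte offset of the unescaped character in the unescaped string.
    unescaped_pos: usize,
    source_len: usize,
    unescaped_len: usize,
}

/// Translate a byte offset in the unescaped string back to the snippet by
/// walking the width map from the start. Each call is linear in the map length.
fn remap_pos(width_map: &[Expansion], pos: usize) -> usize {
    let mut shift = 0;
    for e in width_map {
        if e.unescaped_pos >= pos {
            break; // entries are sorted; later escapes do not affect `pos`
        }
        shift += e.source_len - e.unescaped_len;
    }
    pos + shift
}

fn main() {
    // Snippet contents as typed: a, backslash, n, b, backslash, t, c (7 bytes).
    // Unescaped: 'a', newline, 'b', tab, 'c' (5 bytes).
    let width_map = [
        Expansion { unescaped_pos: 1, source_len: 2, unescaped_len: 1 },
        Expansion { unescaped_pos: 3, source_len: 2, unescaped_len: 1 },
    ];
    // Every translation rescans the map, so k lookups cost O(k * map length).
    let remapped: Vec<usize> = (0..5).map(|p| remap_pos(&width_map, p)).collect();
    println!("{:?}", remapped);
}
```

With many escapes and many positions to translate, the repeated scan is where the quadratic behavior comes from.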

The new code does the unescaping in Parser::new, while collecting the position information into a Vec, and checking that the unescaped string matches the input string like so:

                // snippet is not a raw string
                if snippet.starts_with('"') {
                    // snippet looks like an ordinary string literal
                    // check whether it is the escaped version of input
                    let without_quotes = &snippet[1..snippet.len() - 1];
                    let (mut ok, mut vec) = (true, vec![]);
                    let mut chars = input.chars();
                    unescape::unescape_unicode(
                        without_quotes,
                        unescape::Mode::Str,
                        &mut |range, res| match res {
                            Ok(ch) if ok && chars.next().is_some_and(|c| ch == c) => {
                                vec.push((range, ch));
                            }
                            _ => {
                                ok = false;
                                vec = vec![];
                            }
                        },
                    );

Here we're feeling some pain from the callback-based nature of unescaping, which forces the collection of span info into a Vec (at least I could not see a good alternative). Basically, the Parser needs to know the position of each character both in input (Peekable<CharIndices<'a>> in the current code) and in the original string as typed, snippet (width_map plus translation functions in the current code). This new code ultimately collects a Vec<(original span, char byte pos in input, char in input)> for the same purpose. It is thus probably using more memory.
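
The shape of that table can be sketched in a self-contained way. In the example below, unescape_simple is a hypothetical stand-in for the real callback-based unescaper (it only knows \n, \t, \\ and \"), and build_table is an illustrative name rather than anything in the PR. The sketch assumes the snippet has already had its quotes stripped.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a callback-based unescaper: it handles only
/// `\n`, `\t`, `\\` and `\"`, and reports each result with its byte range in `src`.
fn unescape_simple(src: &str, mut callback: impl FnMut(Range<usize>, Result<char, ()>)) {
    let mut chars = src.char_indices();
    while let Some((start, c)) = chars.next() {
        if c != '\\' {
            callback(start..start + c.len_utf8(), Ok(c));
            continue;
        }
        match chars.next() {
            Some((i, e)) => {
                let res = match e {
                    'n' => Ok('\n'),
                    't' => Ok('\t'),
                    '\\' => Ok('\\'),
                    '"' => Ok('"'),
                    _ => Err(()),
                };
                callback(start..i + e.len_utf8(), res);
            }
            // A trailing lone backslash is an error.
            None => callback(start..src.len(), Err(())),
        }
    }
}

/// Build (span in snippet, byte offset in `input`, char) entries, or `None`
/// if the snippet does not unescape to exactly `input`.
fn build_table(snippet: &str, input: &str) -> Option<Vec<(Range<usize>, usize, char)>> {
    let mut expected = input.char_indices();
    let mut ok = true;
    let mut table = Vec::new();
    unescape_simple(snippet, |range, res| match (res, expected.next()) {
        (Ok(ch), Some((pos, c))) if ok && ch == c => table.push((range, pos, ch)),
        _ => ok = false,
    });
    // Any leftover characters in `input` also mean a mismatch.
    (ok && expected.next().is_none()).then_some(table)
}

fn main() {
    let snippet = r"a\n{}"; // contents of the literal as typed, quotes removed
    let input = "a\n{}"; // what the parser sees after unescaping
    let table = build_table(snippet, input).expect("snippet matches input");
    // Entry 2 is '{': one indexing step recovers its span in the snippet.
    println!("{:?}", table[2]);
}
```

Once the table exists, mapping a parser position back to the typed source is a single index operation instead of a scan, at the price of one entry per character.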

Most of the other changes are because of this change of span info representation.

Also, from skimming it I think there might be multiple distinct changes in the single commit?

It is possible, but this code went through a lot of iterations, as I came to understand the exact constraints imposed by the ui tests, and this is basically the first version that passes all those ui tests.

There are a few minor things that come to mind:

  • I ended up inlining fn self.suggest_format_parameter into its single use, since most of the code was just duplicate work. I'm not sure if that is also possible with the old way of handling the span info.
  • I also ended up inlining err and err_with_note, one of which had a single use, and the other two or three uses, since they did not seem to be carrying their weight.

I'm hoping this is enough to get the broad idea, such that you can start asking about specific bits of this change, but let me know if there is more you need or that I can do to clarify.

@hkBst
Member Author

hkBst commented Mar 5, 2025

Given the changes in work done and the probable increase in memory use:
@bors try @rust-timer queue

@bors
Collaborator

bors commented Mar 5, 2025

@hkBst: 🔑 Insufficient privileges: not in try users

@nnethercote
Contributor

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 5, 2025
@bors
Collaborator

bors commented Mar 5, 2025

⌛ Trying commit 94fb87a with merge fe03ab0...

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 5, 2025
…=<try>

Remove duplicate impl of string unescape from parse_format

r? `@nnethercote`
@bors
Collaborator

bors commented Mar 5, 2025

☀️ Try build successful - checks-actions
Build commit: fe03ab0 (fe03ab008f4474bd7092d967e9eb28cfe09d0664)

@rust-timer
Collaborator

Finished benchmarking commit (fe03ab0): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 777.245s -> 777.886s (0.08%)
Artifact size: 362.11 MiB -> 362.15 MiB (0.01%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 5, 2025
Contributor

@nnethercote nnethercote left a comment

Sorry for the slow review. This looks good. A few nits to address. I didn't follow every single detail, but it looks like a clear simplification. The removal of InnerSpan, InputStringKind and InnerWidthMapping in particular is good.

You inlined a few functions; it would have been good to do them in separate commits. Also I wonder if the InnerSpan-to-Range change could have been done in its own commit, before the other changes. (Just thinking out loud; it's always a good idea to split up changes into multiple commits where possible, to make life easier for the reviewer.)

};
let Some(argument_binding) = ty.kind.is_simple_path() else {
    continue;
};
Contributor

This change is unnecessary, AFAICT.

Member Author

Indeed, change removed.

@@ -90,24 +44,24 @@ pub enum Piece<'a> {
}

/// Representation of an argument specification.
#[derive(Copy, Clone, Debug, PartialEq)]
Contributor

If you use #![feature(new_range_api)] you can use the new experimental core::range::Range type, which implements Copy. That would avoid some clone calls you've had to add.

Member

We are using this crate in rust-analyzer, so we'd appreciate if it kept building on stable.

Member Author

Given @lnicola's objection, I'll leave this for now.
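
The trade-off in this thread can be illustrated on stable Rust. The sketch below uses a hypothetical ArgSpan struct and span_len helper (neither is from the PR) to show why storing a std::ops::Range in a struct means deriving Clone rather than Copy, and where the extra clone calls come from.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a parsed argument that stores a span as a range.
/// `std::ops::Range` is not `Copy`, so this struct can derive `Clone` only.
#[derive(Clone, Debug, PartialEq)]
struct ArgSpan {
    name: &'static str,
    span: Range<usize>,
}

/// Takes the range by value, as an API accepting a span might.
fn span_len(span: Range<usize>) -> usize {
    span.end - span.start
}

fn main() {
    let arg = ArgSpan { name: "width", span: 3..8 };
    // An explicit clone is needed to keep using `arg.span` afterwards.
    let len = span_len(arg.span.clone());
    println!("{} spans {} bytes at {:?}", arg.name, len, arg.span);
}
```

A Copy range type would remove the clone calls, but as noted above it currently requires an unstable feature, which conflicts with building the crate on stable.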

        havewidth = true;
    } else {
        spec.zero_pad = true;
    }
}

if !havewidth {
    let start = self.current_pos();
    spec.width = self.count(start);
    let start_ix = self.index;
Contributor

start_idx or start_index would be more idiomatic for this code base.

Member Author

Changed

@@ -234,91 +188,90 @@ pub enum Suggestion {
pub struct Parser<'a> {
    mode: ParseMode,
    input: &'a str,
    cur: std::iter::Peekable<std::str::CharIndices<'a>>,
    input_vec: Vec<(Range<usize>, usize, char)>,
    index: usize,
Contributor

This could have a better name, one that indicates what it indexes into. Is it input_vec? If so, input_vec_index would be appropriate.

Contributor

Comments on these new fields (and maybe input) would also be helpful.

Member Author

Name changed and comments added.

@nnethercote
Contributor

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 20, 2025
@hkBst
Member Author

hkBst commented Mar 21, 2025

Sorry for the slow review. This looks good. A few nits to address. I didn't follow every single detail, but it looks like a clear simplification. The removal of InnerSpan, InputStringKind and InnerWidthMapping in particular is good.

No problem. Glad we're in agreement here.

You inlined a few functions; it would have been good to do them in separate commits. Also I wonder if the InnerSpan-to-Range change could have been done in its own commit, before the other changes. (Just thinking out loud; it's always a good idea to split up changes into multiple commits where possible, to make life easier for the reviewer.)

Thanks for the advice. It is good to get your perspective. I'll try to be more mindful of the reviewer's job. Thanks for reviewing!

@rust-cloud-vms rust-cloud-vms bot force-pushed the parse_format_reuse_unescape branch from 94fb87a to 4711153 Compare March 21, 2025 14:57
@hkBst
Member Author

hkBst commented Mar 21, 2025

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 21, 2025
@nnethercote
Contributor

@bors r+

@bors
Collaborator

bors commented Mar 24, 2025

📌 Commit 4711153 has been approved by nnethercote

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 24, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 24, 2025
…=nnethercote

Remove duplicate impl of string unescape from parse_format

r? `@nnethercote`
@bors
Collaborator

bors commented Mar 24, 2025

⌛ Testing commit 4711153 with merge 4652ee6...

@jieyouxu
Member

Sorry, perf and CI LLVM are a bit broken atm. Please re-approve once bootstrap & perf are fixed.
@bors retry r-

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Mar 24, 2025
@hkBst
Member Author

hkBst commented Apr 4, 2025

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Apr 4, 2025
@bors
Collaborator

bors commented Apr 6, 2025

☔ The latest upstream changes (presumably #139452) made this pull request unmergeable. Please resolve the merge conflicts.

Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
7 participants