Skip to content

Correctly map between UTF-8 and UTF-16 positions #227

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 16, 2018
Merged

Correctly map between UTF-8 and UTF-16 positions #227

merged 3 commits into from
Nov 16, 2018

Conversation

aochagavia
Copy link
Contributor

Fixes #202

@aochagavia
Copy link
Contributor Author

aochagavia commented Nov 13, 2018

Current concerns:

  • ColIndex is always used together with LineIndex. Though I had to write quite some boilerplate to keep them separate, in the end I would rather make ColIndex a field of LineIndex. Any objections to this?
  • It would be great to have integration tests for this, but I have no idea where they should go. Any suggestions are welcome.

@aochagavia
Copy link
Contributor Author

Note: the LSP supports three line ending styles: \r, \n and \r\n. The current implementation supports the last two, but not \r. This should be fixed at some point.

@kjeremy
Copy link
Contributor

kjeremy commented Nov 14, 2018

  • ColIndex is always used together with LineIndex. Though I had to write quite some boilerplate to keep them separate, in the end I would rather make ColIndex a field of LineIndex. Any objections to this?

Yeah that's a lot of duplicated boilerplate. I would rather have one "file_line_index" type method instead of both file_line_index and file_utf16_line_index.

@matklad
Copy link
Member

matklad commented Nov 15, 2018

Yeah, let's make utf16_lines: FxHashMap<u32, Vec<Utf16Char>>, a field of LineIndex!

let mut utf16_chars = Vec::new();
let mut line = 0;
let mut curr = 0.into();
for c in text.chars() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this loop could work better as

for line_idx, line in text.lines().enumerate() {
}

That way, we don't need to worry about lines interfering.

ColIndex { utf16_lines }
}

pub fn utf8_to_utf16_col(&self, mut line_col: LineCol) -> LineCol {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this LineCol -> LineCol API seems a bit error prone, because it uses the same type for different units of measure. This can lead to errors: just yesterday my bank showed me the amount of money on my account in rubles, while using euro as a currency sign :D

I think a lower-level API might be safer:

pub fn col_as_utf16(&self, line_col: LineCol) -> usize {...}

Note that, by definition, TextUnit is always a utf_8 length, so using it for utf-16 is not correct.

assert!(col_index.utf16_to_utf8_col(line_col) == line_col);

// UTF-16 to UTF-8
assert!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's assert_eq! macro

@aochagavia
Copy link
Contributor Author

bors r+

bors bot added a commit that referenced this pull request Nov 15, 2018
227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia

Fixes #202 

Co-authored-by: Adolfo Ochagavía <[email protected]>
@matklad
Copy link
Member

matklad commented Nov 15, 2018

bors r-

@bors
Copy link
Contributor

bors bot commented Nov 15, 2018

Canceled

@matklad
Copy link
Member

matklad commented Nov 15, 2018

I think it's important to mark somehow that col is utf16, it's not obvouls from looking at the type definition. I think a doc comment would be fine, but probably just naming field col_utf16 works best?

@matklad
Copy link
Member

matklad commented Nov 15, 2018

Otherwise, LGTM! 👍

@aochagavia
Copy link
Contributor Author

@matklad In the values of the utf16_lines hashmap we could use SmallVec instead of Vec to store instances of Utf16Char. We don't expect too much UTF16 chars per line, so we might as well store them inline. Do you think it is worth it or should we keep the Vec?

@matklad
Copy link
Member

matklad commented Nov 16, 2018

@aochagavia I was thinking about SmallVec, but I think that's a premature optimization: the number of lines with UTF-16 is small, so it should not matter if storage for a particular line is big.

@aochagavia
Copy link
Contributor Author

Just pushed a commit to update col to col_utf16

@aochagavia
Copy link
Contributor Author

bors r+

bors bot added a commit that referenced this pull request Nov 16, 2018
227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia

Fixes #202 

Co-authored-by: Adolfo Ochagavía <[email protected]>
Co-authored-by: Adolfo Ochagavía <[email protected]>
@bors
Copy link
Contributor

bors bot commented Nov 16, 2018

Canceled

@aochagavia
Copy link
Contributor Author

bors r+

bors bot added a commit that referenced this pull request Nov 16, 2018
227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia

Fixes #202 

Co-authored-by: Adolfo Ochagavía <[email protected]>
Co-authored-by: Adolfo Ochagavía <[email protected]>
@bors
Copy link
Contributor

bors bot commented Nov 16, 2018

Build succeeded

@bors bors bot merged commit acd51cb into rust-lang:master Nov 16, 2018
@aochagavia aochagavia deleted the utf16-mapping branch December 7, 2022 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants