Correctly map between UTF-8 and UTF-16 positions #227

aochagavia · 2018-11-13T14:19:05Z

Fixes #202

aochagavia · 2018-11-13T14:21:56Z

Current concerns:

- ColIndex is always used together with LineIndex. Though I had to write quite some boilerplate to keep them separate, in the end I would rather make ColIndex a field of LineIndex. Any objections to this?
- It would be great to have integration tests for this, but I have no idea where they should go. Any suggestions are welcome.

aochagavia · 2018-11-13T14:27:11Z

Note: the LSP supports three line ending styles: \r, \n and \r\n. The current implementation supports the last two, but not \r. This should be fixed at some point.
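
For illustration, here is a minimal sketch of line splitting that treats all three terminators the same way. It is not the code in this PR; the function name `line_starts` is hypothetical, and it works on plain byte offsets rather than the crate's own offset types.

```rust
/// Returns the byte offset at which each line starts, treating `\n`,
/// `\r\n`, and a lone `\r` as line terminators.
/// (Hypothetical helper, not part of this PR.)
fn line_starts(text: &str) -> Vec<usize> {
    // Scanning bytes is safe here: `\r` and `\n` are ASCII and never
    // occur inside a multi-byte UTF-8 sequence.
    let bytes = text.as_bytes();
    let mut starts = vec![0];
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\n' => starts.push(i + 1),
            b'\r' => {
                // `\r\n` counts as a single terminator, so skip the `\n`.
                if bytes.get(i + 1) == Some(&b'\n') {
                    i += 1;
                }
                starts.push(i + 1);
            }
            _ => {}
        }
        i += 1;
    }
    starts
}
```

Note that the standard library's `str::lines` only recognizes `\n` and `\r\n`, so supporting a lone `\r` probably requires a hand-written scan along these lines.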

kjeremy · 2018-11-14T14:52:18Z

> ColIndex is always used together with LineIndex. Though I had to write quite some boilerplate to keep them separate, in the end I would rather make ColIndex a field of LineIndex. Any objections to this?

Yeah that's a lot of duplicated boilerplate. I would rather have one "file_line_index" type method instead of both file_line_index and file_utf16_line_index.

matklad · 2018-11-15T11:10:41Z

Yeah, let's make utf16_lines: FxHashMap<u32, Vec<Utf16Char>>, a field of LineIndex!

matklad · 2018-11-15T10:25:24Z

crates/ra_editor/src/col_index.rs

+        let mut utf16_chars = Vec::new();
+        let mut line = 0;
+        let mut curr = 0.into();
+        for c in text.chars() {


I think this loop could work better as

for (line_idx, line) in text.lines().enumerate() { }

That way, we don't need to worry about lines interfering.
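
A rough sketch of what the per-line loop could look like, assuming a plain `std::collections::HashMap` instead of `FxHashMap` and a hypothetical `WideChar` record standing in for the PR's `Utf16Char` (whose actual fields are not shown in this thread):

```rust
use std::collections::HashMap;

/// Hypothetical record for one char that is not a single byte in UTF-8.
#[derive(Debug, PartialEq, Eq)]
struct WideChar {
    /// Byte offset of the char from the start of its line.
    start: usize,
    /// Length in UTF-8 bytes.
    len_utf8: usize,
    /// Length in UTF-16 code units.
    len_utf16: usize,
}

/// Maps a line number to the non-ASCII chars on that line.
/// Lines that are pure ASCII get no entry at all.
fn index_wide_chars(text: &str) -> HashMap<u32, Vec<WideChar>> {
    let mut lines = HashMap::new();
    for (line_idx, line) in text.lines().enumerate() {
        // Offsets from `char_indices` are relative to this line only,
        // so state from earlier lines cannot leak in.
        let wide: Vec<WideChar> = line
            .char_indices()
            .filter(|(_, c)| !c.is_ascii())
            .map(|(start, c)| WideChar {
                start,
                len_utf8: c.len_utf8(),
                len_utf16: c.len_utf16(),
            })
            .collect();
        if !wide.is_empty() {
            lines.insert(line_idx as u32, wide);
        }
    }
    lines
}
```

One caveat: `str::lines` splits on `\n` and `\r\n` only, so a lone `\r` would not start a new line in this sketch.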

matklad · 2018-11-15T11:07:22Z

crates/ra_editor/src/col_index.rs

+        ColIndex { utf16_lines }
+    }
+
+    pub fn utf8_to_utf16_col(&self, mut line_col: LineCol) -> LineCol {


Hm, this LineCol -> LineCol API seems a bit error prone, because it uses the same type for different units of measure. This can lead to errors: just yesterday my bank showed me the amount of money on my account in rubles, while using euro as a currency sign :D

I think a lower-level API might be safer:

pub fn col_as_utf16(&self, line_col: LineCol) -> usize {...}

Note that, by definition, TextUnit is always a UTF-8 length, so using it for UTF-16 is not correct.
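
To make the idea concrete, here is a sketch of such a lower-level conversion as free functions over a single line of text. The signatures are assumptions for illustration: they take a `&str` and plain `usize` columns rather than `&self` and `LineCol`, and `col_as_utf8` is a hypothetical counterpart that is not proposed in this review.

```rust
/// Converts a UTF-8 byte offset within `line` to a UTF-16 code unit offset.
/// Returns `None` if `col_utf8` is out of range or not on a char boundary.
fn col_as_utf16(line: &str, col_utf8: usize) -> Option<usize> {
    let prefix = line.get(..col_utf8)?;
    Some(prefix.chars().map(char::len_utf16).sum())
}

/// Converts a UTF-16 code unit offset within `line` to a UTF-8 byte offset.
/// Returns `None` if `col_utf16` is out of range or splits a surrogate pair.
fn col_as_utf8(line: &str, col_utf16: usize) -> Option<usize> {
    let mut units = 0;
    for (byte_idx, c) in line.char_indices() {
        if units == col_utf16 {
            return Some(byte_idx);
        }
        if units > col_utf16 {
            // The requested offset falls inside a surrogate pair.
            return None;
        }
        units += c.len_utf16();
    }
    // The offset may also point just past the last char of the line.
    if units == col_utf16 {
        Some(line.len())
    } else {
        None
    }
}
```

Because the two directions return bare integers with different names, a caller has to say which unit it wants at each call site, which is the point of avoiding a LineCol -> LineCol signature.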

matklad · 2018-11-15T11:07:54Z

crates/ra_editor/src/col_index.rs

+        assert!(col_index.utf16_to_utf8_col(line_col) == line_col);
+
+        // UTF-16 to UTF-8
+        assert!(


There's an assert_eq! macro

aochagavia · 2018-11-15T16:37:30Z

bors r+

227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia Fixes #202 Co-authored-by: Adolfo Ochagavía <[email protected]>

matklad · 2018-11-15T16:38:34Z

bors r-

bors · 2018-11-15T16:38:35Z

Canceled

matklad · 2018-11-15T16:39:29Z

I think it's important to mark somehow that col is UTF-16; it's not obvious from looking at the type definition. I think a doc comment would be fine, but probably just naming the field col_utf16 works best?
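
As a sketch of how that could read, assuming a two-field `LineCol` with `u32` members (the real definition is not shown in this thread) and a hypothetical helper `line_col_of_end`:

```rust
/// A position in a text document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LineCol {
    /// Zero-based line number.
    line: u32,
    /// Column measured in UTF-16 code units, not bytes or chars.
    col_utf16: u32,
}

/// Hypothetical helper: the position just past the last char of a line.
fn line_col_of_end(line: u32, line_text: &str) -> LineCol {
    LineCol {
        line,
        // `encode_utf16` yields one item per UTF-16 code unit.
        col_utf16: line_text.encode_utf16().count() as u32,
    }
}
```

With the unit in the field name, every construction site and every read spells out `col_utf16`, so mixing it up with a byte offset is harder to do silently.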

matklad · 2018-11-15T16:39:41Z

Otherwise, LGTM! 👍

aochagavia · 2018-11-15T21:52:00Z

@matklad In the values of the utf16_lines hashmap we could use SmallVec instead of Vec to store instances of Utf16Char. We don't expect too many UTF-16 chars per line, so we might as well store them inline. Do you think it is worth it or should we keep the Vec?

matklad · 2018-11-16T08:49:11Z

@aochagavia I was thinking about SmallVec, but I think that's a premature optimization: the number of lines with UTF-16 is small, so it should not matter if storage for a particular line is big.

aochagavia · 2018-11-16T11:12:28Z

Just pushed a commit to update col to col_utf16

aochagavia · 2018-11-16T11:12:41Z

bors r+

227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia Fixes #202 Co-authored-by: Adolfo Ochagavía <[email protected]> Co-authored-by: Adolfo Ochagavía <[email protected]>

bors · 2018-11-16T11:16:54Z

Canceled

aochagavia · 2018-11-16T11:22:26Z

bors r+

227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia Fixes #202 Co-authored-by: Adolfo Ochagavía <[email protected]> Co-authored-by: Adolfo Ochagavía <[email protected]>

bors · 2018-11-16T11:27:40Z

Build succeeded

continuous-integration/travis-ci/push

matklad reviewed Nov 15, 2018

bors bot added a commit that referenced this pull request Nov 15, 2018

Merge #227

a985096

227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia Fixes #202 Co-authored-by: Adolfo Ochagavía <[email protected]>

bors bot added a commit that referenced this pull request Nov 16, 2018

Merge #227

5ea4ef4

227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia Fixes #202 Co-authored-by: Adolfo Ochagavía <[email protected]> Co-authored-by: Adolfo Ochagavía <[email protected]>

aochagavia and others added 3 commits November 16, 2018 12:15

Support UTF-16 chars in LineIndex

136d186

Rename col to col_utf16

bccbee5

cargo format

acd51cb

bors bot added a commit that referenced this pull request Nov 16, 2018

Merge #227

97532c8

227: Correctly map between UTF-8 and UTF-16 positions r=aochagavia a=aochagavia Fixes #202 Co-authored-by: Adolfo Ochagavía <[email protected]> Co-authored-by: Adolfo Ochagavía <[email protected]>

bors bot merged commit acd51cb into rust-lang:master Nov 16, 2018

aochagavia deleted the utf16-mapping branch December 7, 2022 14:47

DavisVaughan mentioned this pull request Jan 8, 2024

Draft hacky implementation of diagnostic refresh on code execution posit-dev/ark#83
