Not able to convert between byte index and UTF indices #12216

yegappan · 2023-04-01T15:42:18Z

The language server protocol supports specifying offsets in text documents using UTF-8 or UTF-16 or UTF-32 code units.
The UTF-16 code unit is the default.

https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments

Different language servers have different levels of support for using the different code units. Vim uses the UTF-32
code units for the offsets. This makes it difficult to support different language servers from a Vim LSP plugin.

The following changes are introduced in this PR:

Add the utf16idx() function to return the UTF-16 offset in a string given either a byte or a character offset.
Add a UTF-16 flag to the byteidx(), byteidxcomp() and charidx() functions so that they accept a UTF-16 offset and return the corresponding byte or character offset.
Add the strutf16len() function to return the length of a string in UTF-16 code units.
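
As a rough illustration of why the three units differ, the following Python sketch (not Vim script, and not the code added by this PR) measures one string in UTF-8 bytes, characters, and UTF-16 code units. The helper names are hypothetical; `utf16_len` only approximates what strutf16len() is described as returning.

```python
def utf8_len(text: str) -> int:
    """Length of text in UTF-8 code units (bytes)."""
    return len(text.encode("utf-8"))


def char_len(text: str) -> int:
    """Length of text in characters (Unicode code points)."""
    return len(text)


def utf16_len(text: str) -> int:
    """Length of text in UTF-16 code units (2 bytes per unit)."""
    return len(text.encode("utf-16-le")) // 2


# "a", an emoji outside the Basic Multilingual Plane, then "b".
sample = "a\U0001F60Ab"
lengths = (utf8_len(sample), char_len(sample), utf16_len(sample))
print(lengths)
```

The emoji takes four bytes in UTF-8, one character, and two UTF-16 code units, so the same position in the string has a different offset in each unit.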

codecov · 2023-04-01T15:53:53Z

Codecov Report

Merging #12216 (5ffc5ba) into master (f39d9e9) will decrease coverage by 0.09%.
The diff coverage is 82.10%.

❗ Current head 5ffc5ba differs from pull request most recent head 67ea267. Consider uploading reports for the commit 67ea267 to get more accurate results

@@            Coverage Diff             @@
##           master   #12216      +/-   ##
==========================================
- Coverage   82.04%   81.96%   -0.09%     
==========================================
  Files         160      164       +4     
  Lines      193181   194254    +1073     
  Branches    43367    43869     +502     
==========================================
+ Hits       158505   159229     +724     
- Misses      21807    22184     +377     
+ Partials    12869    12841      -28

Flag	Coverage Δ
huge-clang-none	`82.68% <80.00%> (-0.04%)`	⬇️
huge-gcc-none	`53.88% <80.00%> (?)`
huge-gcc-testgui	`51.97% <80.00%> (?)`
huge-gcc-unittests	`0.29% <0.00%> (?)`
linux	`82.40% <80.00%> (-0.32%)`	⬇️
mingw-x64-HUGE	`76.56% <80.00%> (+0.01%)`	⬆️
mingw-x86-HUGE	`77.02% <80.00%> (+0.01%)`	⬆️
windows	`78.15% <80.00%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
src/evalfunc.c	`90.38% <ø> (+0.09%)`	⬆️
src/strings.c	`92.26% <82.10%> (-0.55%)`	⬇️

... and 121 files with indirect coverage changes

brammool · 2023-04-12T17:35:37Z

Yegappan wrote:

The language server protocol supports specifying offsets in text documents using UTF-8 or UTF-16 or UTF-32 code units. The UTF-16 code unit is the default. https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments Different language servers have different levels of support for using the different code units. Vim uses the UTF-32 code units for the offsets. This makes it difficult to support different language servers from a Vim LSP plugin. Port the strutfindex() and strbyteindex() functions from Neovim to support this.

I find the function names hard to read and confusing. We might be able to think of better names when the exact functionality is described.

The terminology is confusing. "UTF-32 byte index" contradicts itself, since each character is four bytes. I think what is meant is "UTF-32 encoded character index", which is equal to "character index", since there is no Unicode character that takes more than one UTF-32 code point.

In Vim all Unicode characters are internally encoded with UTF-8. Thus the "{string}" argument of strbyteindex() will be UTF-8 encoded. This is also confusing. The help should be clearer about what this means exactly. I'm not sure how, saying something like "the character index of "{string}" if it would be encoded with UTF-32" makes it complex. I think that instead of using "UTF-32 index" we can just use "character index", and somewhere mention that "UTF-32" can be considered the same (if we need to mention this at all, since the term "UTF-32" isn't widely used).

For "UTF-16" it gets more complicated, we can't avoid mentioning that the index applies to "{string}" encoded as UTF-16. Looking back UTF-16 should have never been made a standard IMHO, but it exists and it is used (especially on MS-Windows), thus we need to support it.

Conversion between UTF-8 and character index already exists, you can use charidx() and byteidx()/byteidxcomp(). Possibly we only need to add functions to convert between UTF-8 and UTF-16 indexes? Or between character (UTF-32) and UTF-16 indexes? The latter makes more sense.

It should also be possible to specify the handling of composing characters. Either as an argument, like with charidx(), or using separate functions, as with byteidx()/byteidxcomp().
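
The point that a UTF-32 index is the same as a character index, and the question of composing characters, can be checked with a small Python sketch. This is an illustration only: the helper names are hypothetical, and it assumes that a composing character is any code point with a non-zero Unicode combining class, which may not match Vim's own definition.

```python
import unicodedata


def utf32_len(text: str) -> int:
    """Length of text in UTF-32 code units (4 bytes per unit)."""
    return len(text.encode("utf-32-le")) // 4


def char_count(text: str, count_composing: bool = True) -> int:
    """Count characters, optionally folding composing characters
    into the preceding base character (assumption: composing means
    a non-zero Unicode combining class)."""
    if count_composing:
        return len(text)
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "e" + combining acute accent, followed by an emoji outside the BMP.
sample = "e\u0301\U0001F60A"
print(utf32_len(sample), len(sample))
print(char_count(sample, count_composing=False))
```

Every code point occupies exactly one UTF-32 code unit, so the first two numbers always agree; only the treatment of composing characters changes the count.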

DominiquePelle-TomTom · 2023-04-12T18:06:50Z

This feature looks related to one of my earlier posts at https://groups.google.com/g/vim_dev/c/AVpp8DT2_Vc/m/L_p6gzATBQAJ

I will probably find it useful to have this feature for my vim-LanguageTool plugin.

DominiquePelle-TomTom · 2023-04-12T18:07:31Z

runtime/doc/builtin.txt

@@ -604,6 +606,7 @@ strptime({format}, {timestring})
 strridx({haystack}, {needle} [, {start}])
 				Number	last index of {needle} in {haystack}
 strtrans({expr})		String	translate string to make it printable
+strutfindex({expr} [, {index}])	List	byte index to utf-32 and ut-16 indices


ut-16? I assume you meant utf-16.

DominiquePelle-TomTom · 2023-04-12T18:11:33Z

runtime/doc/builtin.txt

@@ -8975,8 +8978,22 @@ str2nr({string} [, {base} [, {quoted}]])			*str2nr()*

 		Can also be used as a |method|: >
 			GetText()->str2nr()
+<
+strbyteindex({string} [, {index} [, {use_utf16}])	*strbyteindex()*
+		Convert a UTF-32 or UTF-16 {index} to a byte index. If


Sometimes the doc in the PR uses "UTF-16" and sometimes "utf-16".
Let's be consistent (the capitalized one is better IMO).

brammool · 2023-04-12T19:11:07Z

This feature looks related to one of my earlier posts at https://groups.google.com/g/vim_dev/c/AVpp8DT2_Vc/m/L_p6gzATBQAJ

The essential part of that post is to count characters from the start of the file. This PR is about an index relative to the start of a string. And also about conversion to/from UTF-16 index. Looking from the implementation side there is not much in common.

Shane-XB-Qian · 2023-04-12T23:23:39Z

This feature looks related to one of my earlier posts at https://groups.google.com/g/vim_dev/c/AVpp8DT2_Vc/m/L_p6gzATBQAJ

This is for LSP implementations. The default position encoding of an LSP server is UTF-16, so without functions such as those in this PR, a symbol may be located incorrectly on the client when the line contains characters whose UTF-16 and UTF-32 offsets differ.
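
A minimal Python sketch of that mismatch, assuming a server that reports columns in UTF-16 code units and a client that works in character columns. The function name is hypothetical, and mapping an offset that falls inside a surrogate pair to the character containing it is a choice made for this sketch, not something taken from the PR.

```python
def utf16_to_char_index(line: str, utf16_index: int) -> int:
    """Map a UTF-16 code unit offset to a character offset.

    An offset inside a surrogate pair maps to the character that
    contains it; an offset at or past the end maps to len(line).
    """
    units = 0
    for char_index, ch in enumerate(line):
        # Characters above U+FFFF need two UTF-16 code units.
        width = 2 if ord(ch) > 0xFFFF else 1
        if utf16_index < units + width:
            return char_index
        units += width
    return len(line)


# The emoji occupies two UTF-16 code units, so "foo" starts at
# UTF-16 column 3 but at character column 2.
line = "\U0001F60A foo"
column = utf16_to_char_index(line, 3)
print(column, line[column:])
```

Using the server's column 3 directly as a character column would point at the "o" instead of the start of "foo".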

vim-ml · 2023-04-13T04:52:06Z

Hi Bram,

On Wed, Apr 12, 2023 at 10:36 AM Bram Moolenaar wrote:

> Conversion between UTF-8 and character index already exists, you can use charidx() and byteidx()/byteidxcomp(). Possibly we only need to add functions to convert between UTF-8 and UTF-16 indexes? Or between character (UTF-32) and UTF-16 indexes? The latter makes more sense.

What about introducing a function that converts a character index in a string to a UTF-16 index?

utf16idx({string}, {idx} [, {countcc}])

This is similar to the existing charidx() function. The "idx" here specifies the character index in {string} and this function returns the corresponding UTF-16 index.

To convert from a UTF-16 index to a character index, we can either introduce a new function or modify the existing charidx() function to accept an additional boolean argument. If this argument is specified, then {idx} is a UTF-16 index instead of a byte index. If we are going with a new function for this, what do you think about naming the function as utf16tocharidx()?

- Yegappan
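
The conversion proposed in this message, from a character index to a UTF-16 index, can be sketched in Python as follows. This follows the proposal as worded here, where the index is a character index; it is not the Vim implementation, the function names are hypothetical, and the handling of composing characters ({countcc}) is left out.

```python
def utf16_width(ch: str) -> int:
    """UTF-16 code units needed for one character."""
    return 2 if ord(ch) > 0xFFFF else 1


def char_to_utf16_index(text: str, char_index: int) -> int:
    """UTF-16 offset of the character at char_index: the total
    width of all characters that come before it."""
    return sum(utf16_width(ch) for ch in text[:char_index])


sample = "a\U0001F60Ab"
offsets = [char_to_utf16_index(sample, i) for i in range(len(sample) + 1)]
print(offsets)
```

The offsets only diverge from the character index after the first character that needs a surrogate pair.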

vim-ml · 2023-04-14T04:56:23Z

Hi Bram,

I have updated the PR to add the utf16idx() function and introduced an optional UTF-16 flag to the byteidx() and byteidxcomp() functions. - Yegappan

vim-ml · 2023-04-14T19:55:40Z

Yegappan wrote:

> What about introducing a function that converts a character index in a string to a UTF-16 index? utf16idx({string}, {idx} [, {countcc}]) This is similar to the existing charidx() function. The "idx" here specifies the character index in {string} and this function returns the corresponding UTF-16 index.

charidx() converts a byte index of a UTF-8 encoded string to a character index. This can't simply be changed to UTF-16, since we don't support UTF-16 encoded strings. We could (pretend to) convert the string to UTF-16 and then apply {idx}. But that is doing the opposite of what you suggested.

> To convert from a UTF-16 index to a character index, we can either introduce a new function or modify the existing charidx() function to accept an additional boolean argument. If this argument is specified, then {idx} is a UTF-16 index instead of a byte index. If we are going with a new function for this, what do you think about naming the function as utf16tocharidx()?

The function still returns a character index, thus using "charidx" with something appended works better. At least then they sort next to each other. For the other direction we need an equivalent of byteidx(); that could perhaps be utf16idx().
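
For reference, the existing conversion between a UTF-8 byte index and a character index, which charidx() and byteidx() are described as providing above, can be sketched in Python. The function names are hypothetical and composing characters are not treated specially; this only illustrates the two directions and is not how Vim implements them.

```python
def byte_to_char_index(text: str, byte_index: int) -> int:
    """Character index of the character containing the given UTF-8
    byte offset. A truncated trailing sequence is dropped, so an
    offset in the middle of a character maps to that character."""
    prefix = text.encode("utf-8")[:byte_index]
    return len(prefix.decode("utf-8", errors="ignore"))


def char_to_byte_index(text: str, char_index: int) -> int:
    """UTF-8 byte offset at which the given character starts."""
    return len(text[:char_index].encode("utf-8"))


# "a" is 1 byte, "é" is 2 bytes, the emoji is 4 bytes.
sample = "a\u00e9\U0001F60A"
print(byte_to_char_index(sample, 3), char_to_byte_index(sample, 2))
```

A UTF-16 variant of either direction has to walk the same string while counting two units for every character above U+FFFF, since the string itself stays UTF-8 encoded.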

Problem: no functions for converting from/to UTF-16 index.
Solution: Add UTF-16 flag to existing functions and add strutf16len() and utf16idx(). (Yegappan Lakshmanan, closes vim/vim#12216)
vim/vim@67672ef
Co-authored-by: Christian Brabandt

Problem: no functions for converting from/to UTF-16 index.
Solution: Add UTF-16 flag to existing functions and add strutf16len() and utf16idx(). (Yegappan Lakshmanan, closes vim/vim#12216)
vim/vim@67672ef
Co-authored-by: Yegappan Lakshmanan

vim-ml · 2023-05-02T23:39:13Z

[resend, picky postmaster refused the message]

yegappan force-pushed the vimlsp branch from efb120a to 9604b6f Compare April 2, 2023 02:14

yegappan mentioned this pull request Apr 6, 2023

Inform language servers that we really only supports utf-32 yegappan/lsp#197

Merged

yegappan force-pushed the vimlsp branch from 9604b6f to a91fe4b Compare April 6, 2023 02:29

DominiquePelle-TomTom reviewed Apr 12, 2023

yegappan force-pushed the vimlsp branch 2 times, most recently from ab0ac01 to 51281f4 Compare April 14, 2023 04:52

yegappan force-pushed the vimlsp branch 10 times, most recently from bf8424a to 61cbea7 Compare April 22, 2023 01:40

Add the utf16idx() and strutf16len() functions and add UTF-16 flag to the byteidx(), byteidxcomp() and charidx() functions (67ea267)

yegappan force-pushed the vimlsp branch from 61cbea7 to 67ea267 Compare April 23, 2023 14:26

brammool closed this in 67672ef Apr 24, 2023

zeertzjq mentioned this pull request Apr 26, 2023

vim-patch:9.0.1485: no functions for converting from/to UTF-16 index neovim/neovim#23318

Merged
