Not able to convert between byte index and UTF indices #12216
Codecov Report
@@ Coverage Diff @@
## master #12216 +/- ##
==========================================
- Coverage 82.04% 81.96% -0.09%
==========================================
Files 160 164 +4
Lines 193181 194254 +1073
Branches 43367 43869 +502
==========================================
+ Hits 158505 159229 +724
- Misses 21807 22184 +377
+ Partials 12869 12841 -28
... and 121 files with indirect coverage changes |
Yegappan wrote:
> The language server protocol supports specifying offsets in text
> documents using UTF-8 or UTF-16 or UTF-32 code units.
> The UTF-16 code unit is the default.
> https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
> Different language servers have different levels of support for using
> the different code units. Vim uses the UTF-32 code units for the
> offsets. This makes it difficult to support different language
> servers from a Vim LSP plugin.
> Port the strutfindex() and strbyteindex() functions from Neovim to
> support this.
I find the function names hard to read and confusing. We might be able
to think of better names when the exact functionality is described.
The terminology is confusing. "UTF-32 byte index" contradicts itself,
since each character is four bytes. I think what is meant is "UTF-32
encoded character index", which is equal to "character index", since
there is no Unicode character that takes more than one UTF-32 code
unit.
In Vim all Unicode characters are internally encoded with UTF-8. Thus
the "{string}" argument of strbyteindex() will be UTF-8 encoded. This
is also confusing. The help should be clearer about what this means
exactly. I'm not sure how, saying something like "the character index
of "{string}" if it would be encoded with UTF-32" makes it complex. I
think that instead of using "UTF-32 index" we can just use "character
index", and somewhere mention that "UTF-32" can be considered the same
(if we need to mention this at all, since the term "UTF-32" isn't widely
used).
For "UTF-16" it gets more complicated, we can't avoid mentioning that
the index applies to "{string}" encoded as UTF-16. Looking back UTF-16
should have never been made a standard IMHO, but it exists and it is
used (especially on MS-Windows), thus we need to support it.
Conversion between UTF-8 and character index already exists, you can use
charidx() and byteidx()/byteidxcomp(). Possibly we only need to add
functions to convert between UTF-8 and UTF-16 indexes? Or between
character (UTF-32) and UTF-16 indexes? The latter makes more sense.
It should also be possible to specify the handling of composing
characters. Either as an argument, like with charidx(), or using
separate functions, as with byteidx()/byteidxcomp().
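To illustrate why a character index and a UTF-32 index can be treated as the same thing while a UTF-16 index cannot, here is a small Python sketch. It is not part of the PR; the helper name code_units() and the sample characters are made up for illustration.

```python
def code_units(ch):
    """Return (UTF-8, UTF-16, UTF-32) code-unit counts for one character."""
    return (
        len(ch.encode("utf-8")),           # UTF-8 code units are 1 byte each
        len(ch.encode("utf-16-le")) // 2,  # UTF-16 code units are 2 bytes each
        len(ch.encode("utf-32-le")) // 4,  # UTF-32 code units are 4 bytes each
    )

# ASCII letter, precomposed e-acute, euro sign, and an emoji outside the BMP.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The UTF-32 count is always 1, the UTF-16 count is 1 or 2 (a surrogate pair for characters above U+FFFF), and the UTF-8 count ranges from 1 to 4.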
…--
My girlfriend told me I should be more affectionate.
So I got TWO girlfriends.
/// Bram Moolenaar -- ***@***.*** -- http://www.Moolenaar.net \\\
/// \\\
\\\ sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///
|
This feature looks related to one of my earlier posts at https://groups.google.com/g/vim_dev/c/AVpp8DT2_Vc/m/L_p6gzATBQAJ and I will probably find it useful for my vim-LanguageTool plugin. |
runtime/doc/builtin.txt
Outdated
@@ -604,6 +606,7 @@ strptime({format}, {timestring})
strridx({haystack}, {needle} [, {start}])
Number last index of {needle} in {haystack}
strtrans({expr}) String translate string to make it printable
strutfindex({expr} [, {index}]) List byte index to utf-32 and ut-16 indices
ut-16? I assume you meant utf-16.
runtime/doc/builtin.txt
Outdated
@@ -8975,8 +8978,22 @@ str2nr({string} [, {base} [, {quoted}]]) *str2nr()*
Can also be used as a |method|: >
GetText()->str2nr()
<
strbyteindex({string} [, {index} [, {use_utf16}]) *strbyteindex()*
Convert a UTF-32 or UTF-16 {index} to a byte index. If
Sometimes the doc in the PR uses "UTF-16" and sometimes "utf-16".
Let's be consistent (the capitalized one is better IMO).
> This feature looks related to one of my earlier posts at
> https://groups.google.com/g/vim_dev/c/AVpp8DT2_Vc/m/L_p6gzATBQAJ
The essential part of that post is to count characters from the start of
the file. This PR is about an index relative to the start of a string.
And also about conversion to/from UTF-16 index. Looking from the
implementation side there is not much in common.
|
This is for LSP implementations. The default position encoding of an LSP server is UTF-16, so without functions like the ones in this PR provided by Vim itself, some symbols (for example, characters that take more than one UTF-16 code unit) may be located incorrectly on the client. |
Hi Bram,
On Wed, Apr 12, 2023 at 10:36 AM Bram Moolenaar ***@***.***> wrote:
> Conversion between UTF-8 and character index already exists, you can use
> charidx() and byteidx()/byteidxcomp(). Possibly we only need to add
> functions to convert between UTF-8 and UTF-16 indexes? Or between
> character (UTF-32) and UTF-16 indexes? The latter makes more sense.
What about introducing a function that converts a character index in a
string to a UTF-16 index?
utf16idx({string}, {idx} [, {countcc}])
This is similar to the existing charidx() function. The "idx" here
specifies the character index in {string} and this function returns
the corresponding UTF-16 index.
To convert from a UTF-16 index to a character index, we can either introduce
a new function or modify the existing charidx() function to accept an
additional boolean argument. If this argument is specified, then
{idx} is a UTF-16 index instead of a byte index. If we are going with
a new function for this, what do you think about naming the function
as utf16tocharidx()?
- Yegappan
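As a rough illustration of the proposal in this message (a character index in, a UTF-16 index out, with an optional {countcc} flag), here is a Python sketch. It only models the idea as described here and should not be read as the function that was eventually merged. The name utf16idx_sketch() is hypothetical, and unicodedata.combining() is used as an approximation of what Vim treats as a composing character.

```python
import unicodedata

def utf16idx_sketch(text, idx, countcc=False):
    """Map character index `idx` to a UTF-16 code-unit index, or -1 if out of range.

    With countcc=False a base character and its composing characters count
    as one character; with countcc=True each composing character is counted
    separately. Composing detection is an approximation (assumption).
    """
    utf16 = 0   # UTF-16 code units consumed so far
    seen = 0    # characters counted so far
    for pos, ch in enumerate(text):
        is_composing = pos > 0 and unicodedata.combining(ch) != 0
        if countcc or not is_composing:
            if seen == idx:
                return utf16
            seen += 1
        # Characters above U+FFFF need a surrogate pair in UTF-16.
        utf16 += 2 if ord(ch) > 0xFFFF else 1
    # An index just past the last character maps to the UTF-16 length.
    return utf16 if seen == idx else -1

sample = "e\u0301\U0001F600x"   # "e" + combining acute, an emoji, "x"
print(utf16idx_sketch(sample, 2))        # "x" when composing chars are merged
print(utf16idx_sketch(sample, 2, True))  # the emoji when they are counted
```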
|
ab0ac01 to 51281f4
Hi Bram,
I have updated the PR to add the utf16idx() function and introduced an
optional UTF-16 flag to the byteidx() and byteidxcomp() functions.
- Yegappan
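For readers unfamiliar with what a UTF-16 flag on byteidx() has to do, here is a Python sketch of the conversion from a UTF-16 index to a UTF-8 byte index. The name byteidx_utf16_sketch() is hypothetical, composing characters are not treated specially, and the choice to map an index that falls inside a surrogate pair to the start of that character is an assumption of this sketch, not a statement about Vim's behavior.

```python
def byteidx_utf16_sketch(text, utf16_index):
    """Map a UTF-16 code-unit index to a UTF-8 byte index, or -1 if out of range."""
    if utf16_index < 0:
        return -1
    utf16 = 0   # UTF-16 code units before the current character
    byte = 0    # UTF-8 bytes before the current character
    for ch in text:
        units = 2 if ord(ch) > 0xFFFF else 1
        # The index lands on (or inside) this character: return where it starts.
        if utf16_index < utf16 + units:
            return byte
        utf16 += units
        byte += len(ch.encode("utf-8"))
    # An index equal to the UTF-16 length maps to the byte length.
    return byte if utf16_index == utf16 else -1

line = "a\U0001F600b"
print(byteidx_utf16_sketch(line, 3))   # byte index of "b"
```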
|
Yegappan wrote:
> What about introducing a function that converts a character index in a
> string to a UTF-16 index?
> utf16idx({string}, {idx} [, {countcc}])
> This is similar to the existing charidx() function. The "idx" here
> specifies the character index in {string} and this function returns
> the corresponding UTF-16 index.
charidx() converts a byte index of a UTF-8 encoded string to a
character index. This can't simply be changed to UTF-16, since we don't
support UTF-16 encoded strings. We could (pretend to) convert the
string to UTF-16 and then apply {idx}. But that is doing the opposite
of what you suggested.
> To convert from a UTF-16 index to a character index, we can either introduce
> a new function or modify the existing charidx() function to accept an
> additional boolean argument. If this argument is specified, then
> {idx} is a UTF-16 index instead of a byte index. If we are going with
> a new function for this, what do you think about naming the function
> as utf16tocharidx()?
The function still returns a character index, thus using "charidx" with
something appended works better. At least then they sort next to each
other.
For the other direction we need an equivalent to byteidx(). That could be
utf16idx() perhaps.
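The "pretend to convert the string to UTF-16 and then apply {idx}" idea can be sketched in Python as follows. This is only an illustration under assumptions: the name charidx_sketch() and its utf16 parameter are hypothetical, composing characters are counted as separate characters, and an index inside a multi-unit character maps to that character.

```python
def charidx_sketch(text, idx, utf16=False):
    """Map a UTF-8 byte index (or a UTF-16 index when utf16=True) to a character index.

    Returns -1 when idx is out of range.
    """
    if idx < 0:
        return -1
    offset = 0   # code units (bytes or UTF-16 units) before the current character
    for char_index, ch in enumerate(text):
        if utf16:
            width = 2 if ord(ch) > 0xFFFF else 1   # surrogate pair above U+FFFF
        else:
            width = len(ch.encode("utf-8"))
        if idx < offset + width:
            return char_index
        offset += width
    # An index just past the end maps to the character count.
    return len(text) if idx == offset else -1

line = "a\U0001F600b"
print(charidx_sketch(line, 5))               # byte index 5 is "b"
print(charidx_sketch(line, 3, utf16=True))   # UTF-16 index 3 is also "b"
```

No UTF-16 string is ever built; the loop only tracks how wide each character would be in the chosen encoding.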
|
bf8424a to 61cbea7
… the byteidx(), byteidxcomp() and charidx() functions
Problem: no functions for converting from/to UTF-16 index. Solution: Add UTF-16 flag to existing functions and add strutf16len() and utf16idx(). (Yegappan Lakshmanan, closes vim/vim#12216) vim/vim@67672ef Co-authored-by: Christian Brabandt <[email protected]>
Problem: no functions for converting from/to UTF-16 index. Solution: Add UTF-16 flag to existing functions and add strutf16len() and utf16idx(). (Yegappan Lakshmanan, closes vim/vim#12216) vim/vim@67672ef Co-authored-by: Yegappan Lakshmanan <[email protected]>
…23318) Problem: no functions for converting from/to UTF-16 index. Solution: Add UTF-16 flag to existing functions and add strutf16len() and utf16idx(). (Yegappan Lakshmanan, closes vim/vim#12216) vim/vim@67672ef Co-authored-by: Yegappan Lakshmanan <[email protected]>
The language server protocol supports specifying offsets in text documents using UTF-8 or UTF-16 or UTF-32 code units.
The UTF-16 code unit is the default.
https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
Different language servers have different levels of support for using the different code units. Vim uses the UTF-32
code units for the offsets. This makes it difficult to support different language servers from a Vim LSP plugin.
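To make the mismatch concrete, here is a small Python sketch (not part of the PR) that computes the offset of one position in UTF-8, UTF-16 and UTF-32 code units. The function name offsets_at() and the sample line are made up for illustration.

```python
def offsets_at(text, char_index):
    """Return the offset of character `char_index` in three code-unit schemes."""
    prefix = text[:char_index]
    return {
        "utf-8": len(prefix.encode("utf-8")),            # 1 byte per code unit
        "utf-16": len(prefix.encode("utf-16-le")) // 2,  # 2 bytes per code unit
        "utf-32": len(prefix.encode("utf-32-le")) // 4,  # 4 bytes per code unit
    }

line = "a\U0001F600b"          # "a", an emoji outside the BMP, "b"
print(offsets_at(line, 2))     # → {'utf-8': 5, 'utf-16': 3, 'utf-32': 2}
```

A server that reports the position of "b" as 3 (UTF-16) and a client that interprets it as a character index would disagree by one column on this line.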
The following changes are introduced in this PR: