Switch AssemblyScript to UTF-8 by default? #1653
Comments
You know my opinion. I agree with everything except the point that it should be the default, and suggest creating a new string class using UTF-8 with a stricter / more limited API and a shorter name |
The problem I see with maintaining two string classes is that we are essentially making WebAssembly's problems ours, as part of the language. Of course providing say a separate |
Apologies if anything I say is irrelevant to this, or if I've misunderstood anything. WebAssembly/interface-types#13 (comment) According to the interface-types proposal itself, UTF-16 will not be at any performance disadvantage compared to UTF-8 or even ASCII. Btw, WTF-8 shouldn't be a viable target, as out of range Unicode characters would trap, which UTF-8 does not enforce.
This could simply be made an O(n) operation by iterating. Unfortunate, but it would keep TS/JS compatibility. (also, unindexable strings lose a whole lot of value) |
The performance disadvantage I worry about is when chaining modules. Say for instance you have a webapp written in Rust, a form validator written in AssemblyScript, a RegEx engine written in C++ and a database written in Java. Then a user inputs "something".
Now that's kinda a constructed example, but it helps to illustrate the general problem we'll eventually be dealing with. At any point, in any module graph, there may be an AssemblyScript module in between two Rust modules or vice-versa, just passing a string forward while not even necessarily doing something with it, producing two unnecessary re-encodings on the string's way through. My expectation is that this will only become worse once Interface Types become available, as it makes small modules more viable while currently most binaries are monoliths because it isn't available. To fully solve the problems of true polyglot WebAssembly, we'd either need
|
That is not the case: because of laziness and adapter fusing, it is possible to optimize away the encoding. This technique is already used extensively in Haskell and Rust, so it should be possible to apply it to Wasm engines as well. Note that because interface adapters work lazily, they are basically like a JavaScript Iterator, so they do not require intermediate memory. Encoding might still be necessary if the intermediate modules use different encodings and need to actually use the strings... but that is not a Wasm-specific problem... that is just unavoidable in general unless all the modules use the same encoding. Interface types smooth things over by providing a universal encoding-agnostic exchange format ( Note that the most recent interface types proposal is far more general than your Universal Strings proposal, since |
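To make the "lazy adapter" idea above a bit more concrete, here is a rough sketch, not taken from the interface-types spec, of decoding UTF-8 bytes into code points on demand, iterator-style, so that a consumer can pull values without an intermediate buffer. It assumes well-formed input and does no validation; `codePointsFromUtf8` is an illustrative name.

```ts
// Sketch only: lazily decode well-formed UTF-8, one code point at a time,
// roughly the way a fused adapter could hand values to a consumer.
function* codePointsFromUtf8(bytes: Uint8Array): Generator<number> {
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    if (b < 0x80) {
      yield b;
      i += 1;
    } else if (b < 0xE0) {
      yield ((b & 0x1f) << 6) | (bytes[i + 1] & 0x3f);
      i += 2;
    } else if (b < 0xF0) {
      yield ((b & 0x0f) << 12) | ((bytes[i + 1] & 0x3f) << 6) | (bytes[i + 2] & 0x3f);
      i += 3;
    } else {
      yield ((b & 0x07) << 18) | ((bytes[i + 1] & 0x3f) << 12) |
            ((bytes[i + 2] & 0x3f) << 6) | (bytes[i + 3] & 0x3f);
      i += 4;
    }
  }
}
```

A consumer that re-encodes to UTF-16 (or does anything else with the code points) could pull from such a generator directly, which is the sense in which no intermediate memory is required.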
I do agree that the current state of IT is based on a clever idea, I'd just not agree that it leads to significant improvements in practice. For instance, as soon as a string goes into a linear-memory-based language, the string goes into memory more often than not. So, unfortunately, whenever that happens, or the string is otherwise "used", there's re-encoding, leading us to
as the common case (even more so in the future, because the proposed canonical encoding effectively biases towards linear memory languages), which is unfortunate for AssemblyScript and is what this issue is about. Let me know if you'd agree, but I'd appreciate it if we could discuss the more general challenges within the other issue. |
Yes, but unfortunately there's no way around that; I am unaware of any technical solution other than "languages must agree on the same representation format if they want to pass strings without any performance penalty". So there are only two solutions:
This problem isn't just limited to strings, it also happens when passing records/tuples/lists between languages. It's unfeasible to require all languages to have the same string/list/tuple/record representation, since Wasm is not an abstract runtime like the JVM, and it does not try to be like the JVM. Also don't forget that Wasm is intentionally designed around a shared-nothing model, where modules do not share their memory with each other, so each module has a separate linear memory. So a certain amount of copying is unavoidable with that model (even with GC: GC fixes that copying problem, but only for languages which choose to use GC and which have the same GC representation). So non-GC languages will still need to use the adapter system forever. And so interface types gives a solution which works for linear memory now, and which can be upgraded to GC later, and which supports interop between GC and non-GC languages.
Why do you say that? Yes interface types are intentionally designed so they can be used without a GC (and thus work well for "linear memory languages"), but interface types also work with GC, and GC strings can be added in the future, they are orthogonal features. If anything, it's biased in the opposite way: GC languages will be able to use GC strings (and thus can pass strings with no performance penalty, since they're just passing pointers), whereas "linear memory languages" are always forced to copy the full string into linear memory (even with |
GC has the same problem if languages use different encodings, or somehow different object layouts, or otherwise require side-effects when constructing a string, yeah. The story just continues forever, unless we start to at least think about alternatives. Universal Strings is my attempt to come up with an idea that minimizes the problem by introducing an encoding-agnostic universal string type. It has a graceful fallback for non-GC languages (plus a graceful outlook to define a Other than that, I cannot express how tired I am of repeating myself while achieving exactly nothing. Probably not an unexpected outcome. |
That proposal is basically just GC strings, which are great, but that doesn't replace interface types, and GC strings can be layered on top of interface types. Since AssemblyScript is supposed to mirror TypeScript, I think it makes sense for it to use GC strings (when they become available). And in the meantime I don't think it matters too much what encoding is used. The only benefit of choosing UTF-8 is that it can be slightly more efficient (if You could also choose to make AssemblyScript encoding-agnostic (like Python), so the encoding would be an internal implementation detail which is hidden from the user. The idea is that (from the user's perspective) strings would just be arrays of code points.
Converging to a single GC string type (e.g. Universal Strings) is also not an option. You're never going to convince all languages to converge toward a single string representation. Languages will have different string representations, that's just a fact we have to accept, a JVM-style universal representation is not feasible. This isn't about Rust, or C++, it's about all languages. There are dozens of ways of handling strings, and each language makes a different choice:
This problem is a lot trickier than choosing between UTF-8 and UTF-16 (or choosing between linear memory and GC). Which is why interface types are designed to be shared-nothing, to support all encodings, and to support different representations. This has some performance costs, but it's the only feasible way to support different languages. |
Perhaps we can agree on the observation that Interface Types is a beautiful solution to a subset of the problem, while Universal Strings could or could not be the basis of yours truly's messy compromise. On that premise, we are all right, and can use this issue to discuss AS switching to UTF-8 by default again. What do you think? Is AS joining the UTF-8 bandwagon, for example for efficient interop with Rust, something you'd welcome? |
Like I said, I don't think it matters too much. I think the performance gain from So I think the choice should be primarily determined by what makes the most sense in terms of functionality / user API. So I see three choices:
For AssemblyScript I would personally choose list-of-codepoints, since it is the nicest user API, it has |
Note that the authors of UTF-8 everywhere disagree with me, they believe that UTF-8 is superior to list-of-codepoints. They make some good points, and I agree with them that grapheme clusters is really what most people should be using, and so I think Rust's approach (UTF-8 + iterator of grapheme clusters) is ideal. The difference is whether you care a lot about However, JavaScript programmers expect |
Here are some anecdotes about UTF-8 usage:
So there could be advantages to UTF-8 if you want to create an HTTP server or UWP application with AssemblyScript. And there could be performance gains to using UTF-8 when using WASI. This is largely a choice between "good user API" (list-of-codepoints), "maximum performance" (UTF-8), and "strict compatibility and familiarity" (WTF-16). All three options have merits, so it's up to the AssemblyScript community to decide which of the three options it wants. |
Perhaps a couple additional observations, given AssemblyScript's goals:

Performance: With multiple string representations, and abstracting over them, we may end up in a situation where re-encoding happens implicitly within the language's standard library. Depending on the use case that may happen more or less often. More complex solutions may also be slower in general due to doing more work. As such, picking the most reasonable default and sticking with it has certain advantages.

Binary size: One difference from non-Wasm Java, Python and JS is, for example, that AS typically ships its string functions (and related standard library) with each Wasm binary, while in traditional VMs these functions are installed on the machine. This may relativize when binaries become bigger, but it doesn't necessarily make the "just a simple and small Wasm module" use case less important. The more complex our string handling becomes, the more these small modules will tank.

The list-of-codepoints idea may be related to both, as, if I understood correctly, it for example requires maintaining a separate lookup structure of "breadcrumbs" (once read something over at Swift about it, essentially caching the stride one knows already) to minimize the impact of random access. Of course I agree that this would be nice API-wise, as it would kinda just "fix" WTF-16 by eliminating surrogate pairs (incl. making the underlying encoding irrelevant), but then there remains the question of how far we can or want to go for a nice API without re-targeting to a different audience. Given that, Rust's approach may be a more reasonable choice. As would be staying with WTF-16 for JS compat.

That's the unfortunate reality we are looking at. AS is essentially torn towards either of the approaches for one or another reason, and much of the heated discussion so far originated in Wasm not (yet) helping. Some sort of universal strings, not necessarily mine, could for example move all the heavy machinery into the VM instead of shipping it with every binary, to give just one example. I still hope we can get something like that eventually (for all languages); it's just that GC strings aren't currently moving in that direction either, equally motivating this issue. |
Perhaps an improvement to universal strings could be exactly that. Make them an encoding-independent list of codepoints, with an encoding-specific Swift-like breadcrumb mechanism to enable fast random access. Hmm... Interestingly, AS could then opt to just fix WTF-16 and expose strings as a list of codepoints, with some performance cost but without any binary size cost. Perhaps a nice compromise. |
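For readers unfamiliar with the cached-stride ("breadcrumbs") idea being discussed, here is one possible minimal variant, purely illustrative and not Swift's actual implementation (which maps UTF-16 offsets for Objective-C interop): record the byte offset of every Nth code point of a UTF-8 buffer so random access only has to scan a bounded stride. The stride and names are assumptions.

```ts
const STRIDE = 64;

// Build a table mapping code point index STRIDE*k -> UTF-8 byte offset.
// Assumes well-formed UTF-8 input.
function buildBreadcrumbs(utf8: Uint8Array): number[] {
  const crumbs: number[] = [0];
  let codePoints = 0;
  for (let i = 0; i < utf8.length; ) {
    // advance one whole UTF-8 sequence
    i += utf8[i] < 0x80 ? 1 : utf8[i] < 0xE0 ? 2 : utf8[i] < 0xF0 ? 3 : 4;
    if (++codePoints % STRIDE === 0) crumbs.push(i);
  }
  return crumbs;
}

// Map a code point index to its byte offset: jump to the nearest breadcrumb,
// then walk forward at most STRIDE - 1 code points.
function byteOffsetOf(utf8: Uint8Array, crumbs: number[], index: number): number {
  let offset = crumbs[(index / STRIDE) | 0];
  for (let n = index % STRIDE; n > 0; --n) {
    offset += utf8[offset] < 0x80 ? 1 : utf8[offset] < 0xE0 ? 2 : utf8[offset] < 0xF0 ? 3 : 4;
  }
  return offset;
}
```

The tradeoff is exactly the one discussed above: random access becomes "amortized cheap" instead of O(1), at the cost of extra memory for the table and extra machinery shipped with (or provided to) every module.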
No, it should not require a separate lookup structure. It just requires strings to have 2 bits at the beginning which mark whether it is ASCII, UTF-16, or UTF-32. When concatenating strings together, it always uses the largest encoding (e.g. concatenating an ASCII string with a UTF-32 string returns a UTF-32 string). When indexing into a string, it looks at the 2 bits and then multiplies the index depending on the bit:
Then it just does a byte lookup with the index, so this is all very fast. When sending strings to other modules, it checks the 2 bits. If it's ASCII, it uses When receiving strings from other modules, it determines whether the string is ASCII/UTF-16/UTF-32 and prepends the 2 bits accordingly. The only cost to this is looking up the 2 bits when indexing, and also some code bloat because it supports 3 encodings.
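A minimal sketch of the indexing scheme described above, assuming a layout where the first byte stores the code-unit width; the one-byte header placement and the names are illustrative, not Python's or AssemblyScript's actual internals:

```ts
// Tagged fixed-width ("list of code points") indexing sketch.
function codePointAt(str: Uint8Array, index: number): number {
  const width = str[0];               // 1 = ASCII/latin1, 2 = UCS-2, 4 = UTF-32
  const offset = 1 + index * width;   // header byte + fixed-width scaling
  let cp = 0;
  for (let i = 0; i < width; ++i) {
    cp |= str[offset + i] << (8 * i); // assemble a little-endian code unit
  }
  return cp;
}
```

Indexing stays O(1) because every code unit in a given string has the same width; only the width lookup and the multiply are added.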
Yes, code size is a good reason to pick a single encoding rather than using list-of-codepoints.
From reading your posts in that other thread, it sounds like you think a universal string type can work for 90+% of languages, but I don't think that's true at all. Instead it would only work for a small handful of languages. Like I said before, this issue is so much bigger than just UTF-8 vs UTF-16. The JVM and CLR are great examples of that: languages which are ported to them end up with significant restrictions and incompatibilities (e.g. Clojure doesn't have tail calls, Jython uses WTF-16, Kotlin is stuck with using WTF-16 because that's what the JVM uses, Eta uses WTF-16 because that's what the JVM uses, etc.) It's simply impossible to support languages as diverse as Scheme, Haskell, OCaml, Python, Ruby, Clojure, JS, Rust, Go, C++, etc. with a single string type. No matter what representation you choose, you will be excluding dozens and dozens of languages from the Wasm ecosystem, which is not acceptable. So let's step back a bit... the goal of having One String To Rule Them All is not going to happen, languages will never agree on a single string type. So there are three (achievable) goals:
So you can already use And if another language happens to also use JS strings (e.g. Java or Clojure or C#) then you can efficiently pass the JS string to them as well. And after interface types are ready, you can then integrate the Given the above situation, are there any remaining problems for AssemblyScript? |
(Also note that it is possible for the host or WASI to provide helper functions for encoding/decoding/validation, which will remove a lot of code bloat, so I don't think code bloat is a good justification for choosing a particular string representation) |
I am primarily interested in evaluating what would be best for AssemblyScript at this point (i.e. switch to UTF-8), under the assumption that for instance a universal string type is not going to happen (i.e. GC strings are likely going to have the same problem). The relevant challenges to your points are, in short:
Everything else is merely me thinking out loud about improving upon the options we have. The "won't happen" and "can't be done" sentiment is still worrying me, though. The encoder/decoder/validator aren't the biggest size problems btw; these are single functions. Adapting the entire standard library to do something sophisticated, like exposing a WTF-16 API with UTF-8 encoding underneath with breadcrumbs, or supporting both WTF-16 and UTF-8 with a single string class, is the size problem. Either just WTF-16 or just UTF-8, with IT as-is, is not a size problem. But list of codepoints, to for example "fix" WTF-16 at the occasion (also in context of the other thread I linked), needs machinery that I'd wish Wasm could help with. Of course we can also just call it a day and say it won't happen anyway, if you prefer, but then we are really only back to square one, picking either because nobody gives a damn. |
Note that WTF-16 and list-of-codepoints can use Also note that if your goal is to have efficient interop with the browser, then your only option is to use JS strings, and so the entire UTF-8 vs UTF-16 argument becomes moot. And if your goal is to have efficient interop with the maximum number of languages, JS strings are also probably your best option, since other languages also want efficient interop with the browser, so JS strings are the de facto universal "wasm string". Breadcrumbs are not necessary for list-of-codepoints or optimizing WTF-16. I explained in my previous post how to implement them. Breadcrumbs are used in Swift exclusively for calculating indexes when calling Objective-C APIs, that's not really relevant for AssemblyScript. Here are a couple good articles on Swift strings: https://swift.org/blog/utf8-string/ https://forums.swift.org/t/piercing-the-string-veil/21700
I don't think it would be a lot of code, the implementation strategy I explained is quite simple, certainly much simpler than breadcrumbs. |
What do you mean? ASCII strings are not two-way compatible with WTF-16, unless stored with the compaction trick, but then it isn't WTF-16 anymore but ASCII. Perhaps this misunderstanding explains...
I don't think you explained breadcrumbs, at all. I see some basic bit flags there, explaining something that is kinda simple and obvious (multiple encodings need a flag somewhere and can use compaction, surprise). Perhaps actually read this section of the document you linked.
They are used to abstract the difference in underlying encoding away, i.e. if the API is designed to do indexed access in a way not mapping linearly to the underlying encoding. This would be the case in AS if it kept a JS-compatible API, but would use UTF-8 (or multiple string representations, with one of them being UTF-8) under the hood. How is that not relevant to this issue, which is about UTF-8? What are you even talking about?
You don't think? Have you implemented the AS standard library or what to come to this conclusion, with its underlying libc-like helpers and related APIs like in arrays assuming a specific string encoding etc.? Are you even interested in improving AssemblyScript? Or what is this? Why are you here? I just don't understand. I think I remember again why I became so frustrated about this. |
I appreciate the creative problem solving in the comment you linked me to, but I don't have an immediate reaction to it. "fix WTF-16 and expose strings as a list of codepoints" makes it sound like it's not that different from what AS currently has, but I assume I'm lacking context. My thumbs up was just for the idea that WASI can provide helper functions. That's a specific idea that I can understand without a lot of context here, and is something I generally think is a good idea, independently of other concerns here. |
I see, thank you for explaining. Providing helpers sounds good to me as well as these will be useful independently. Sorry for the noise.
And yeah, that's why I came to think about it as well, as it fixes two problems people are interested in at once with little API impact on AS. The underlying encoding could then be UTF-8, but it would benefit significantly from Wasm providing that breadcrumbs mechanism where non-trivial indexed access is performed, so the language doesn't have to ship it itself where strings are exposed as a list of codepoints (or UTF-16 for that matter), and inter-language communication can take advantage of it as well. Sounds like a neat compromise to me, but it requires some sort of specified string type to enable the mechanism. So that'd be where Wasm could come into play. Not saying we should do it, but I am interested in thinking about it more and hearing your thoughts. |
If the string uses 1 bit to mark whether it's ASCII or WTF-16 then it is trivial (and fast) to use
Yes, exactly, I was explaining how Python implements list-of-codepoints, which does not require breadcrumbs at all. It is just 2 bits marking which type of encoding it is using:
The downside is that it needs some extra code to handle the multiple encodings (but that shouldn't be too complicated, since each encoding is fixed length, so it just needs to multiply the index, it doesn't need breadcrumbs).
Yes... I did read it (and I even explained what breadcrumbs are used for). You seem to have poor reading comprehension, since you've multiple times misunderstood what multiple other people are saying. I think your frustration comes mostly from your own misunderstandings and ignorance.
I don't recommend that, I think that would be worse performance than simply using WTF-16, which is why I assumed that wasn't an option.
All I have ever done is given factually correct information and explained various possible options and tradeoffs. I am not a part of the AssemblyScript community, so it's not my place to make decisions, I can only give information. It seems you aren't interested in that, so this will be my last message to you. |
It's actually very complicated, even for compact strings (when we have only a latin1 bit). With three different encodings we would have to handle 8 different implementations for binary methods like string concat, compare, replace, includes / indexOf, etc. It will increase the runtime significantly |
@MaxGraey You would not need whole separate implementations, as you couldn't know, at compile-time, if a string is in one encoding or another. More likely, there would just be a simple switch case or match inside of every method, slowing down runtime performance. Dealing with How about, just leaving the current WTF-16, and taking the alleged performance drop incurred by using interface type strings? |
No, because the key point is that all string operations are normalized to the highest encoding, and the only difference between the encodings is the number of bytes (there are no surrogate pairs). That means you only need 3 conversions ( For example, here is some pseudo-code for a concat implementation:

```js
function concat(x, y) {
  // Get the byte size of the strings
  const xSize = x[0];
  const ySize = y[0];
  const size = Math.max(xSize, ySize);
  // Preallocate the precise amount of memory
  const output = malloc(1 + (x.length * size) + (y.length * size));
  // The output byte size is always the biggest of the input byte sizes
  output[0] = size;
  if (xSize >= ySize) {
    // Standard string concatenation
    memcpy(x, output, 1, x.length * size);
  } else {
    // Same as memcpy except it adds padding to each byte
    memcpyPad(x, output, 1, x.length * size, xSize, size);
  }
  if (ySize >= xSize) {
    // Standard string concatenation
    memcpy(y, output, 1 + (x.length * size), y.length * size);
  } else {
    // Same as memcpy except it adds padding to each byte
    memcpyPad(y, output, 1 + (x.length * size), y.length * size, ySize, size);
  }
  return output;
}
```

You only need that one implementation, not a separate implementation per encoding. Compared to a standard string concat, the above implementation only has a tiny bit of extra overhead: looking up the byte size, doing The same principle applies to other functions like

```js
function split(str, needle) {
  // Get the byte size of the strings
  const sSize = str[0];
  const nSize = needle[0];
  // If this is true then we know the needle can never match, so we return early.
  if (nSize > sSize) {
    return str;
  }
  // If the needle encoding is different then we have to add byte padding.
  // This can be avoided with some tricky logic, but it's probably not worth it.
  if (nSize < sSize) {
    needle = reencode(needle, sSize);
  }
  // Do standard split algorithm...
  // Since both strings are guaranteed to be the same encoding, and we know the byte
  // size, we can use the same algorithm for all encodings.
  return realSplit(str, needle, sSize);
}
```

As you can see, If you decide to reject list-of-codepoints, that's fine, but reject it for the right reasons, not based on misinformation. |
I don't see any benefits at all for Your implementation doesn't make sense at all. You can't just memcpy with some padding. It takes the same time as converting |
Just see the unicode comparison in Python: https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L11228 Languages like Java and Python have a fat runtime that they don't include with every program, so they can afford anything. In addition, Java / C# also have JIT compilation that allows them to speed up the whole thing by eliminating most of the extra runtime checks. We cannot afford to complicate the runtime this way. |
Maybe insight to be had — they do a hashmap to convert from internal u32 for pandas interop and note in the blog "Hashing data is not free, but counterintuitively it can be faster in addition to being vastly more memory efficient...with many instances of the same string values." Which may be the case not only with categorical data columns, but also templating strings (HTML and CSS). Also, the Arrow format might even be reasonable for the AS runtime; I'd guess the workloads for Arrow-formatted data and Wasm processing are probably pretty similar. |
UTF-32 is critical to make it work; it is not optional.
UTF-32 is necessary in order to ensure that the string is a fixed length encoding (and to ensure that algorithms are efficient since they only need to pad bytes).
No... as I have explained many times, this system does not have surrogate pairs, that is the secret to why it is efficient and easy to implement. And that's also why it requires UTF-32 (for code points above U+FFFF).
No, you don't, my implementation handles both perfectly fine. For example, here is Python's implementation of concat (you'll note it is very similar to the implementation I posted before, except mine is shorter, since I'm using It calls the
They are handling each case individually with

```js
function compare(x, y) {
  const xSize = x[0];
  const ySize = y[0];
  const minLength = Math.min(x.length, y.length);
  if (xSize === ySize) {
    const result = memcmp(x, y, 1, 1, minLength);
    if (result !== 0) {
      return result;
    }
  } else {
    for (let i = 0; i < minLength; ++i) {
      const xChar = readChar(x, i, xSize);
      const yChar = readChar(y, i, ySize);
      if (xChar < yChar) {
        return -1;
      } else if (xChar > yChar) {
        return 1;
      }
    }
  }
  if (x.length === y.length) {
    return 0;
  } else if (x.length < y.length) {
    return -1;
  } else {
    return 1;
  }
}
```
You are misreading the code. That isn't for Latin1 strings (since those are already handled by |
It's not copy! It's conversion (lowering or widening per element) from |
You would need an extra bit to determine it's two If you're suggesting to use fixed encodings instead of variable ones, it will be even worse, because we would have to check and re-encode to the upper char type after every string operation |
Yes... it's adding a zero byte to the beginning of each code unit (i.e. padding). The loop unrolling is just a micro-optimization; it's not necessary and it's not relevant to this discussion. And in any case the loop unrolling would be done in a function, so it wouldn't bloat up the code size. Here is a single implementation which would work for all the encodings:

```js
function memcpyPad(input, output, index, length, fromSize, toSize) {
  const diff = toSize - fromSize;
  while (index < length) {
    const fromIndex = index * fromSize;
    const toIndex = index * toSize;
    for (let i = 0; i < diff; ++i) {
      output[toIndex + i] = 0;
    }
    for (let i = 0; i < fromSize; ++i) {
      output[toIndex + i + diff] = input[fromIndex + i];
    }
    ++index;
  }
}
```
Yes... because memcpy does not pad bytes, so it uses a loop, just like my implementation.
No, it is not necessary, look at my implementation, it handles all possible combinations.
No... because the byte size is stored as the first byte of the string, as I explained, so it always knows the byte size.
But this is not a UTF-8 encoding, it is list-of-codepoints.
No, none of that is necessary.
Yes, as I have clearly stated multiple times, I am explaining how list-of-codepoints is implemented (which is a fixed length encoding, as I have stated multiple times). And yes it must re-encode one of the strings if the encodings are different and it is returning a new string. But in exchange for that you get lower memory usage, And in practice the APIs which return new strings must loop over the string anyways (e.g. If you are willing to give up And if you want to have fast interop with the browser, then your only option is to use JS strings (with |
No, it's called widening and narrowing conversions; copying with padding is an absolutely different operation. You're suggesting adding extra amortized time and space to all string methods to avoid O(N) time for string length and to get O(1) random indexed access. Are you sure it's faster in the general case? Because even if it is fast enough, it has a lot of disadvantages: a huge memory footprint, a huge code-size footprint, and huge maintenance overhead. And for some reason it seems to me that it will not be fast; it will be very mediocre, since this approach is only used in Python, which never had performance goals. |
Of course you would have to run benchmarks to be sure of the speed, but similar systems are used all over the place (including Java, and JS engines, which care about performance a lot). And it's complicated further because of
How is it a huge memory footprint? Most strings are ASCII, so they use 1 byte, and anything less than U+FFFF uses 2 bytes (same as UTF-16), so in practice only emojis would cause an inflation to 4 bytes. Of course there are pathological cases where it would use 4 bytes a lot (e.g. if you're processing emojis a lot), but in that case UTF-8 and UTF-16 will also use more memory (since they both use 4 bytes per emoji). Rather than guessing, it's better to use benchmarks on real programs. The results are often not what you expect.
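One way to sanity-check the footprint argument, assuming a JS host and purely for illustration, is to measure a few sample strings directly:

```ts
// Rough footprint comparison: UTF-8 byte length via TextEncoder, UTF-16 as
// 2 bytes per code unit, and the code point count (what a 1/2/4-byte
// list-of-codepoints layout would scale with).
const samples = ["hello", "héllo", "日本語", "😀😀😀"];
for (const s of samples) {
  const utf8Bytes = new TextEncoder().encode(s).length;
  const utf16Bytes = s.length * 2;
  const codePoints = [...s].length;
  console.log(s, { utf8Bytes, utf16Bytes, codePoints });
}
```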
As I showed, the implementation only adds a fairly small amount of extra code to each function.
I agree it is more complex than using a single encoding, and that's a valid reason to prefer a single encoding.
From a performance perspective, yes, I agree pure UTF-8 is generally better; however, it breaks compatibility with JS (and JS programmers' expectations). JS programmers are used to having fast indexing, and they're used to ignoring Unicode and assuming that each character is a code point. That assumption works for characters below U+FFFF, and JS programmers rarely test for characters above that. All of that would fall apart with UTF-8: now every API needs to be changed to be Unicode-aware, and JS programmers can no longer just process strings in a simple loop; instead they have to take into account the variable length of every character. In other words, this sort of thing would no longer work (despite being a very common idiom):

```js
function foo(str) {
  for (let i = 0; i < str.length; ++i) {
    processCharacter(str[i]);
  }
}
```

Since AssemblyScript is specifically catering and marketing to JS programmers, that is a big problem. So if you want to maintain the same API, and maintain JS programmers' expectations, I think list-of-codepoints is a good system that makes good tradeoffs (at the cost of a small amount of performance and code size). But if instead you want to break from JS and make AssemblyScript into its own separate thing, with a different API, then yes I would recommend UTF-8. |
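For comparison, here is a hedged sketch of what the same loop looks like when written to be Unicode-aware; it works on today's JS/WTF-16 strings too, since `codePointAt` decodes surrogate pairs. `processCodePoints` is an illustrative name, not an existing API.

```ts
// Sketch: iterate code points instead of relying on fixed-width indexing,
// which is what a variable-width (UTF-8) string type would force anyway.
function processCodePoints(str: string, processCharacter: (cp: number) => void): void {
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i)!;   // full code point, even above U+FFFF
    processCharacter(cp);
    i += cp > 0xFFFF ? 2 : 1;         // skip both halves of a surrogate pair
  }
}
```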
UTF-32 is in most cases unnecessary. UTF-16 based languages like Java and C# use string compaction. It's pretty similar to your approach but uses only one boolean flag to indicate a latin1 / UTF-16 string. Btw, in Java, string compaction mode is turned off by default and can be enabled via
|
Yes, that is correct, but then you (and the users) have to deal with surrogate pairs. And if you're already adding in that complexity, you might as well fix the problem with surrogate pairs.
No, since Java 9 it is on by default, because in general it improves the performance of strings. https://www.baeldung.com/java-9-compact-string#1-difference-in-performance https://openjdk.java.net/jeps/254 https://www.javagists.com/compact-strings-java-9 And of course it is always enabled in JS engines (also because it improves performance in general). |
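As a rough illustration of the compaction being discussed here (one flag choosing between a 1-byte latin1 payload and 2-byte WTF-16 code units), a minimal sketch follows; the class shape and names are illustrative, not Java's or AssemblyScript's internals.

```ts
// Illustrative compact-string layout: latin1 or little-endian WTF-16 payload.
class CompactString {
  constructor(readonly isLatin1: boolean, readonly bytes: Uint8Array) {}

  get length(): number {
    return this.isLatin1 ? this.bytes.length : this.bytes.length >> 1;
  }

  charCodeAt(i: number): number {
    if (this.isLatin1) return this.bytes[i];                      // 1 byte per unit
    return this.bytes[2 * i] | (this.bytes[2 * i + 1] << 8);      // 2 bytes per unit
  }
}
```

Indexing stays O(1) in both representations; the cost is one flag check per access and surrogate pairs still exist in the WTF-16 case.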
As I mentioned before, we can't do this in our case. Standalone VMs can use rope data structures for string concatenation, hashing and many other clever tricks because they're not limited by size. We should expose a small runtime |
Before I comment, I want to mention that I'm not as well versed in AS internals nor the WASM specification as you all are. Apologies if I'm misunderstanding something. Throwing in a +1 for UTF-8 by default (or at least, by configuration) due to conversions in other languages being trickier to do than in Javascript (which can be achieved, albeit messily, using While performance is critical to most people in most cases, and that someone will have to pay for the encoding discrepancies between platforms (since Javascript will most likely always be UTF-16 specified), it makes more sense to me that if you're running WASM payloads in a native environment, you probably have a more performance-critical application than e.g. running in Electron, a browser, or some other web-based environment. Thus, being able to avoid allocations in native environments would be more preferable than in the web, where such performance is indeed a concern but not as much relative to native environments - in my opinion, at least. For context, I intend to have my WASM payload execute in both web and native environments; as such, I very much prefer my native environment to be the more performant of the two. I realize this is subjective, but I'm here in this issue simply because of the headache of having to convert all of my string parameters between UTF-16 and UTF-8 manually, for all of the import bindings that work with strings. Doing this conversion in the Javascript world instead would be less of a headache due to the syntactic and semantic sugars afforded by the Javascript language. If nothing else, being able to specify that I want the strings to use UTF-8 at compile time would be a killer feature - unless I'm missing something obvious, this is not currently the case. I can see the ability to switch between the two depending on my application being very, very useful. |
The main problem with UTF-8 strings is that it will be a completely different String API and DX. No |
@MaxGraey are strings immutable under the hood? If so, this solves the C pseudo-code:

```c
typedef uint64_t strlen_t;

// allocate
char *ptr = malloc(sizeof(strlen_t) + string_length + 1);
*(strlen_t *)ptr = string_length;
ptr += sizeof(strlen_t);
memcpy(ptr, src_string, string_length + 1); // remove +1 if src_string is NOT null terminated
ptr[string_length] = 0; // remove entirely if src_string IS null terminated

// free
free(ptr - sizeof(strlen_t));

// get length
strlen_t len = ((strlen_t *)ptr)[-1];
```

I would also argue that the For the most part, I use those to convert from character codes to ordinals for single characters at a time, e.g. the index parameter almost always being

Another thing to consider, though I don't know if it's possible - if you go the configuration route, then have two different string types (or one for however many string types you want to be configurable) and change an alias for the main This might place more burden on library vendors but in the case of AS I think it makes a lot of sense - we already do this in other languages without much of a problem. |
Yes, but
Hmm. Which languages? All UTF-16 based languages like Java, C# and Dart still use UTF-16 strings. Only Swift migrated to UTF-8, but it uses some special mechanism for compatibility with NSString and a completely changed API for strings |
Sure though there are ways to encode this (both bytes and codepoints, using all-1-bits to signal "not yet calculated" and doing a lazy calculation, etc.). I understand the concern though - it's not cheap. The problem with u32, on the other hand, is indeed static string space consumption. Given that WASM payloads are already going to be larger than we'd all like, and that they're going to be on the wire much more often than other binary application payloads have historically been (installers, MSIs, etc.) then it's important to trim sizes where it matters. I think u32 works against that. I also second the sentiment of usefulness to developers - the only time I've personally used u32 is when writing unicode-aware parsers.
- C (namely WinAPI, which has both
- Zig is a new and upcoming one, and all strings are UTF-8 encoded.
- Rust uses
- Go also uses UTF-8 encoded strings.

Keep in mind NSString came before a time of utf-8 proliferation, and was largely unchanged from the original NextStep days if I understand correctly. This would explain its encoding not matching newer languages', along with Java's and Python's. As far as other VM languages go, Lua uses 8-bit under the hood (though its encoding is unspecified, generally considered to be ASCII but it supports utf-8 storage - the

EDIT: honorable mention PHP, where strings have no specified encoding and thus literals are encoded depending on the file encoding (lol). I know it seems like "lol PHP who needs that" but I can guarantee you there will be a PECL module that exposes a WebAssembly engine at some point. Admittedly, those users will have the conversion done for them under the hood, so I wouldn't worry about them as much. |
Sorry, I meant: do you know of languages which had UTF-16 strings in the past and switched to UTF-8 without problems, other than Swift? Languages which were already designed with UTF-8 and latin1 (like C/C++) in mind are not good examples for us. |
Oh, no. Which is why I suggest a compilation option, defaulted to whatever it is now (utf-16). :) This shifts the migration scope to the individual developer, not the entire community - meaning, if someone wants to change the string encoding from UTF-16 to UTF-8, it only affects their project, not all of the users of AssemblyScript. Of course it's something they will have to bear the cost of migrating, but at least it's a conscious decision. And for the rest of the users who are starting today or next week or whatever, they can choose this from the start. I agree with both sides of the issue and can see a lot of valid usecases for both. Why not give the developer the option? As long as it's well-documented that by changing the encoding away from the default you change the underlying data structures passed around, that's fine. To better detail what a compilation flag would look like:
It's a fair bit of work, but if the problem of encoding is a consistent issue, this could be one approach. EDIT: Also, if you ask me, this would be a really impressive and attractive feature to have for those evaluating AS.
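To illustrate what such a flag would mean for the host side of the bindings, here is a hypothetical sketch; AssemblyScript has no such option today, and `readString` is an assumed helper name, not part of any real loader. The point is simply that the JS side must know which encoding the module was compiled with and decode accordingly.

```ts
// Hypothetical host-side helper: decode a string from a module's linear memory,
// given a pointer, a byte length, and the encoding the module was built with.
function readString(memory: WebAssembly.Memory, ptr: number, byteLength: number,
                    encoding: "utf-8" | "utf-16le"): string {
  const bytes = new Uint8Array(memory.buffer, ptr, byteLength);
  return new TextDecoder(encoding).decode(bytes);
}
```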
This ignores those demographics, though. I think they should be taken into consideration - after all, WASM is a format specifically designed to be run in different environments as a sort of machine code "lingua franca". The fact it's web-first only means that it has inherent compatibility where other VMs do not. The fact is, WASM will be utilized from those languages and as such the DX should be considered just as you're considering the DX in Java, Swift, C#, etc. Just my opinion, however. I think ultimately, string-heavy languages or environments are going to benefit more from a fixed-width encoding such as UTF-16, whereas more imperative/binary/network environments are going to benefit from UTF-8 more. This ultimately comes down to application. I understand if AS wants to be opinionated in this regard, and it's certainly not going to deter me personally from using AS, but it's still something to be considered from my point of view. |
Right now, only the end application developer compiles their application (including all the libraries they import). This is like using Webpack or Rollup to make a final bundle that you run in your web app. This means the user has a single This implies that the app developer makes the compiler argument choices. Now suppose the app dev chooses to compile with UTF-16 (or perhaps WTF-16). Now assume that app dev installed a library A whose author wrote code expecting UTF-8, and app dev also installed some other library B whose author wrote their library assuming UTF-16. Note, the library authors do not compile their code, they simply ship TypeScript source code on Considering that the app dev compiled the whole app using the option for Will that work seamlessly the way you are imagining it? |
In which case would this happen where the compiled code is not all compiled under the same compilation session? If you're interfacing two separate WASM payloads, it makes sense that you're going to have to know the datatypes and formats between them. We do that already with e.g. the Windows API and other system APIs, where (at least historically) we had to write stubs that converted window's This is no different. As far as I understand, there is no way to determine the underlying byte structure from the Javascript/Typescript end, and the API functions don't make that distinction either (except for, of course, time complexity). I guess I don't see the "library A -> library B" problem since everything is compiled at once. If you mean e.g. some sort of intermediate file, such as a COFF object file equivalent, it makes sense that two different ABIs cannot be sensibly compiled together. This is also the case in the C world. If I compile a source file with When entering the realm of compiled languages, these are things programmers have to care about - there's really no way around it. The environment in which you run your WASM will always matter, just like the environment you run any other system code matters. |
One of the most important goals of WebAssembly is to abstract from the CPU (and it really is). The main goal of WASI is to abstract from the operating system and it really is, although this only happens for standalone runtimes for now, browsers do not yet support WASI and require polyfill or work directly with the browser API. The upcoming Interface Types should bring another abstraction that will allow to interop data between wasm modules built in different languages without having to consider their specific circumstances, including the format of strings used in those languages. This is the power and philosophy of the WebAssembly ecosystem. |
But this isn't in regards to CPU specifics, this is string encoding. Calling conventions have little to do with CPUs too, which was the only analogy I mentioned. What you're asking for is a magic way to have both worlds, which isn't possible. The next best thing is giving the developer the power to do what they need to do to solve their problems. I'm reading between the lines here a bit, but it sounds like a "magic" solution is being sought, which I personally think is the absolute wrong approach. AssemblyScript is not as simple and easy as a Typescript runtime. It looks similar but in reality you're working with a slightly lower level set of tools and concepts. I think it's fine to allow users to specify which string encoding implementation they want to use during compilation. Understanding string encodings and the tradeoffs therein is a common part of software engineering, after all. If it's a gap in an individual's knowledge, it'll only benefit them to understand the difference anyway, and it won't limit the usefulness of the tool for those that already do. |
The same thing over and over again =) You cannot make strings with the same interface for UTF-8 and UTF-16 / WTF-16. They will be absolutely two different APIs; even a thing like str.length will produce two different values for UTF-8 and UTF-16 strings. Not to mention the fact that it's going to be quite hard to maintain (because you will have to debug and test for both modes) |
There are solutions to the Use compat types for Make I'm sure there are a handful of similar great ideas. But it is impossible to get both worlds at once. If that is a hard requirement, then AssemblyScript is doomed. A tradeoff is necessary. My main criticism of this discussion at the meta level, however, is that it seems the committee is unwilling to make a tradeoff here where one is clearly necessary. There is no incredible, best solution here. All of the problems of interop between the two have been cleanly laid out. It's just politics at this point, about which approach is the least obtrusive. Beyond what I've said I don't think I can contribute much else. |
I thought AssemblyScript was not going to implement Interface Types proposal? |
it's THE problem all kinds of programming languages face. My suggestions aim for "least war" (in other words, the most functionality provided):
(if no checking is desired, replace all utf with wtf) |
Given the amount of foregoing heated discussions on the topic, especially in context of Interface Types and GC, I am not getting the impression that anything of relevance is going to change, and we should start planning for the worst case.
So I have been thinking about what would be the implications of switching AssemblyScript's string encoding to W/UTF-8 by default again, and that doesn't look too bad if all one really wants is to get rid of WTF-16, is willing to break certain string APIs, wants it to be efficient after the breakage and otherwise is not judging.
Implications:
- `String#charCodeAt` would be removed in favor of `String#codePointAt`
- `String#charAt` would be removed, or changed to retain codepoint boundaries if viable
- `String#[]` would be removed, or changed to retain codepoint boundaries if viable, or to return bytes numerically like C
- `String#length` would return the length of the string in bytes
- `String.fromCharCode` would be removed, or deprecated and polyfilled
- `String#split` with an empty string separator would split at codepoints

Means we'd essentially jump the UTF-8 train to have
.className = "abc
)Note that the proposition of switching AS to UTF-8 is different from most of what has been discussed more recently, even though it has always been lingering in the background. Hasn't been a real topic so far due to the implied breakage with JS, but unlike the alternatives it can be made efficient when willing to break with JS. Certainly, the support-most-of-TypeScript folks may disagree as it picks a definite site.
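To make the listed implications concrete, here is a small comparison under the assumption of a JS-like host; the commented "today" values follow WTF-16 semantics, and the byte counts are what the proposal above implies.

```ts
// Illustration only: a string containing a non-BMP character.
const s = "a😀b";
console.log(s.length);          // 4 today (UTF-16 code units); 6 under UTF-8 (bytes: 1 + 4 + 1)
console.log(s.charCodeAt(1));   // 0xD83D today (a lone surrogate half); removed under the proposal
console.log(s.codePointAt(1));  // 0x1F600, what codePointAt-based APIs report for the emoji
```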
If anything, however, we should switch the entire default and make a hard cut because
Thoughts?