Embedded SDK hash in kernel binaries #41802
It is not clear why we can't bump the kernel version when making incompatible changes in transformations to solve this problem. While the format remains intact, if a change is not backwards compatible we should definitely bump the kernel version and adjust the minimum supported version in the VM, so the VM doesn't pretend to support older versions. I agree that there is value in bumping the version automatically, but there are also disadvantages. The vast majority of changes are actually binary compatible, which provides more flexibility during development (for example, consider bisecting/debugging a crash in gen_snapshot in the Dart SDK repo on a dill file which was generated once in Flutter). I think it would be useful to have an option to suppress this check if needed. In the past, there was a requirement to allow 1-2 weeks of version skew most of the time, which we fulfilled with backwards compatible support of older kernel files in the VM. Incompatible changes were extremely rare and only made when it was impossible to do a soft transition. While there might not be such a requirement right now, hash-based versioning would prevent that in the future if it is actually needed somewhere. I guess we will find out what actually relies on backwards compatibility of kernel binaries only when we roll out the new versioning scheme.
There are two situations: If we have to be compatible (due to customer needs), some breaking changes become really hard (or not feasible at all) to roll out. If we don't have to maintain compatibility (no customer needs it), not providing it will significantly improve the VM team's productivity when doing breaking changes and gives our team more flexibility, with no real downsides. (As you say, for VM development we can provide a way to opt out of the verification.) Right now it seems that there are no known customer needs for maintaining compatibility. We will test this assumption with this work. If there happen to be cases where customers depend on compatibility, we will now actually know - which is great :) Right now the only enforcement we have is the kernel binary format itself. There is no enforcement of ABI, which kernel transformers ran, etc. We have had several occurrences of customers accidentally using incompatible kernel files which the VM didn't refuse, where the VM crashed in various ways. This caused a lot of wasted time debugging issues! So we want to enforce it, which we can only do by embedding a hash/version in the kernel file and making the VM verify it when it consumes kernel. Using a manually maintained number can be error prone if someone forgets to update it, and it also slows down development (we know that from experience with the ABI bot).
It's also worth remembering that we can initialize from a dill and reuse the parts that are unchanged. This is for instance done in Flutter. That it can be "error prone" is - in my opinion - not a good excuse. If nothing else you should be able to write a test that reminds you (e.g. have a file with your hash tied to a binary version; if the hash doesn't match, throw an error telling the user to bump the version and update the hash-to-version mapping, or similar).
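The "test that reminds you" suggested here could look something like the following sketch. The function names and the idea of fingerprinting the format spec file (e.g. pkg/kernel/binary.md) against a recorded (version, hash) pair are hypothetical, not code from the SDK:

```python
import hashlib

def format_fingerprint(spec_bytes: bytes) -> str:
    """Fingerprint of the kernel binary format spec file's contents."""
    return hashlib.sha256(spec_bytes).hexdigest()

def check_version_bumped(spec_bytes: bytes, recorded_version: int,
                         recorded_hash: str) -> None:
    """Raise if the spec changed but the recorded (version, hash) pair
    was not updated - reminding the author to bump the format version."""
    actual = format_fingerprint(spec_bytes)
    if actual != recorded_hash:
        raise AssertionError(
            f"Format spec changed but the recorded fingerprint for version "
            f"{recorded_version} was not updated; if the change is "
            f"incompatible, bump the version, then record the new hash "
            f"{actual}.")
```

Such a test catches forgotten bumps mechanically, at the cost of one extra file to update on every intentional format change.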
From my perspective the benefits of having this mechanism outweigh the negatives - for the majority of use cases we need to be able to verify total consistency of the VM + kernel configuration without allowing any version skew. I think having this mechanism can overall lead to a healthier user experience where our tools can automatically recover (e.g. ignore a kernel binary which can't be used and produce a fresh one) or produce meaningful error messages. I don't think the kernel version number gives us enough information for that, because it only covers the Kernel AST format itself and does not cover core libraries or custom transformers applied to Kernel. AFAIK @johnniwinther was on board with this proposal. So overall I think we should move forward with tightening this up - if just to detect situations where people rely on the VM allowing some version skew (which is not something that we have documented anywhere!) - though potentially maintain an escape hatch if we discover late in the roll process that we need to allow some version skew.
Pros:
Cons:
But you can certainly update it despite not having changed the actual format. |
I am not sure why this is needed. Are you talking about the development environment? If you are working on a change locally, we can set things up in such a way that this check is either disabled or uses the base commit hash for your change - this means you don't have to rebuild anything.
It is compiled using the new CFE running on an old version of the VM. The key here is that files produced by the new CFE should only then be consumed by the new VM, not an old VM.
I think I don't understand how this test would look. Could you elaborate a bit more? I think the problem we are facing right now is that we don't really have a very clear definition of the ABI boundary, so we can't just write a simple test for it (we can slowly expand the test whenever we hit an incompatibility issue during the roll - but I don't think this is the right approach).
If I update from commit
The way I see it, there are two options:
Ah. So we know that there is something that we can change that might change the compatibility, but we don't know what that something is so the "solution" is to assume it's everything? I'm all for making stuff blow up less (and if that includes having more information in the dill that's fine by me (although it also has to be easily checkable on the dart side which shouldn't accept it either)), but this feels like "shooting sparrows with cannons" (Danish saying that I'm not sure exists in English).
We can make it work for local commits - but if you rebase your local branch then you need to throw out your local dills. We can also give you an escape hatch for local CFE development, e.g. if you just read and write dill files and don't run them on the VM, then you can keep dills around without regenerating them.
It can also get the hash from
Yes, but the sparrow is guaranteed to be kaput after this. Ideally, we need to have a really well defined and statically enforced API between all involved components. Unfortunately we are not in the ideal situation - and getting it fully under control would require considerable effort. So we need some solid but relatively easy solution to prevent
This would only work when checked out via git, and it would require running the git command every time, which is certainly going to slow things down.
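The resolution later in the thread sidesteps both objections by computing the hash once at build time and baking it into the binaries. A sketch of that build-time step (the helper and its fallback are illustrative; the `--short=10` invocation and the '0000000000' placeholder are the ones mentioned later in this thread):

```python
import subprocess

PLACEHOLDER = "0000000000"  # used when no hash is available / verification is off

def compute_sdk_hash(repo_root: str = ".") -> str:
    """Run once at build time; the result is baked into the built binaries,
    so no git invocation happens at runtime. Falls back to the all-zero
    placeholder when not building from a git checkout (e.g. a tarball)."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short=10", "HEAD"],
            cwd=repo_root, capture_output=True, text=True, check=True,
        ).stdout.strip()
        # Depending on git version/config, the short hash may be longer.
        return out[:10] if len(out) >= 10 else PLACEHOLDER
    except (OSError, subprocess.CalledProcessError):
        return PLACEHOLDER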
I'm guessing there's been an actual bug in regards to the async transformation, since that's what's mentioned above. And yes, this would probably have been "guaranteed" to have caught/avoided that. There'll probably still be somewhat similar problems - along the lines of customers accidentally using wrong dills - that it doesn't fix, though.
Note that there have been at least 3-4 similar issues in the past couple of months - most of which took time to understand and triage. So this is not happening just because there was a single issue - it is happening because similar issues occur repeatedly.
Some tools do use forwards compatibility of snapshots as an optimization. It would be unfortunate to never be able to have that optimization and always have to re-snapshot across SDK versions, but that would be strongly preferred over having no indication of incompatibility until there is a runtime error.
Embedding the SDK hash inside the kernel file, checking for it, and throwing an error makes kernel files almost similar to AOT snapshots in terms of version compatibility. It is stricter than the proposal of using version numbers in the kernel file and providing compatibility until there is a breaking change, where the versions have to be checked by the VM as proposed in #38969.
Bumping to P1 as per #41913 (comment). |
After a discussion with @jakemac53 I learned that the optimization of reuse across SDK versions is rarer than I thought. If it's easier to embed the SDK hash, I think it's fine to push on that rather than to try for something less strict.
In order to enable us to do breaking changes, we do not promise any compatibility (**) windows. So the usefulness of number-based / manual versioning reduces to a performance optimization via caching (we can avoid re-computing kernel files on Dart/Flutter SDK upgrades if there happened to be no breaking changes). What cadence do we expect Flutter / Dart users to upgrade to new SDKs on beta/stable channels? My guess would be that it's probably measured in weeks. How much gain do we expect then from this performance optimization vs what does it cost us? (**) Compatibility = Guarantee that VM at commit
Our AOT snapshot format is also not invalidated on every commit (far from it). Does that mean we should do manual versioning here as well? (The arguments for manually versioning kernel files also apply here: local development, avoiding recompilation (which takes much longer than re-computing kernel files), ...) Ultimately this is a tradeoff: reduced cacheability of kernel files with a guarantee of no bugs, versus the cost of manual versioning and the risk of bugs for users. We have seen many bugs like #41913 and I'm in favor of taking the safe, bug-free approach.
I'll summarize (my understanding of) the meeting we had May 12th: "The VM team would create a document describing in more practical details how this would actually work to see if it was feasible to actually include the hash (or something equivalent) in the dill."
Yes. I've created such a doc, but I've been asked to focus on creating a Proof of Concept CL to show a practical implementation instead. |
I have now put together a proposal CL. Thanks |
Clement has now put together a concrete CL, cl/150343, which has all the practical details of how this would work in it. The summary of how it works is:
The Dart and Flutter SDKs contain only snapshots of our tools (such as kernel service, front end server, gen kernel, ...) which will all have the evaluated SDK hash embedded in them. So all of them will automatically produce kernels with the right hash and will verify against the correct hash when consuming kernel.

Local development has a very easy way to opt out of this via

G3 will use the hash that rolls into it instead of trying to consult

Summary: We will include the SDK hash in generated kernels and all tools that read kernel files will validate it. To the best of our knowledge nobody relies on compatibility between commit N and N+1 (if so, we will find out once this lands and rolls). The only downside of this is less effective caching of kernel files for some tools (e.g.

@mraleph @a-siva @jensjoha @alexmarkov If you have any concerns regarding this or concrete reasons why we should not do this, please let us know. We can also set up a VC for it.
Adds a new SDK hash to kernels and the VM which is optionally checked to verify kernels are built for the same SDK as the VM. This helps catch incompatibilities that are currently causing subtle bugs and (not so subtle) crashes.

The SDK hash is encoded in kernels as a new field in components. The hash is derived from the 10 byte git short hash.

This new check can be disabled via: tools/gn.py ... --no-verify-sdk-hash

This CL bumps the min. (and max.) supported kernel format version, making the VM backwards incompatible from this point back.

Bug: #41802
Change-Id: I3cbb2d481239ee64dafdaa0e4aac36c80281931b
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/150343
Commit-Queue: Clement Skau <[email protected]>
Reviewed-by: Jens Johansen <[email protected]>
Reviewed-by: Martin Kustermann <[email protected]>
This reverts commit edde575.

Reason for revert: Breaks the Dart to Flutter roll and golem

Original change's description:
> [SDK] Adds an SDK hash to kernels and the VM.
>
> Adds a new SDK hash to kernels and the VM which is optionally checked
> to verify kernels are built for the same SDK as the VM.
> This helps catch incompatibilities that are currently causing
> subtle bugs and (not so subtle) crashes.
>
> The SDK hash is encoded in kernels as a new field in components.
> The hash is derived from the 10 byte git short hash.
>
> This new check can be disabled via:
> tools/gn.py ... --no-verify-sdk-hash
>
> This CL bumps the min. (and max.) supported kernel format version,
> making the VM backwards incompatible from this point back.
>
> Bug: #41802
> Change-Id: I3cbb2d481239ee64dafdaa0e4aac36c80281931b
> Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/150343
> Commit-Queue: Clement Skau <[email protected]>
> Reviewed-by: Jens Johansen <[email protected]>
> Reviewed-by: Martin Kustermann <[email protected]>

[email protected],[email protected],[email protected]

Change-Id: I34cc7d378e2babdaaca4d932d19c19d0f35422fc
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: #41802
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/152703
Reviewed-by: Siva Annamalai <[email protected]>
Commit-Queue: Siva Annamalai <[email protected]>
@mkustermann @cskau-g There are existing tests for ABI compatibility:
@kevmoo Just to be very clear - the actual issue is not a Dart VM issue: The issue is that

=> The right solution is to change our tools to invalidate caches on SDK updates - cached kernel files are only safe to re-use if they are run on the same VM which was used to produce them. (**)

What this GitHub issue is about is to avoid such accidental, incorrect usages of kernel files and fail earlier with a clear error message - instead of trying to run incompatible kernel programs, which may result in unexpected behavior.

(**) As a good example of a tool that does it the right way: The main
Note: This is a reland of https://dart-review.googlesource.com/c/sdk/+/150343

Adds a new SDK hash to kernels and the VM which is optionally checked to verify kernels are built for the same SDK as the VM. This helps catch incompatibilities that are currently causing subtle bugs and (not so subtle) crashes.

The SDK hash is encoded in kernels as a new field in components. The hash is derived from the 10 byte git short hash.

This new check can be disabled via: tools/gn.py ... --no-verify-sdk-hash

This CL bumps the min. (and max.) supported kernel format version, making the VM backwards incompatible from this point back. This also bumps the min. and current ABI version.

Bug: #41802
Change-Id: I2f85945045a603eb9dcfd1f2c0d0d024bd84a956
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/152802
Commit-Queue: Clement Skau <[email protected]>
Reviewed-by: Martin Kustermann <[email protected]>
This is a VM issue. It was an observable change in behavior which has caused ongoing breakage for over 2 months after being reported. The existence (and non-closure) of these issues led us to believe we did not need to do anything on our end and that it would be fixed. For the past several years the VM has had predictable behavior that did not require users (such as pub, build_runner, etc.) to do their own invalidation of snapshots. They could assume a snapshot (maybe it's a kernel file now, technically) was up to date, run it, and then use the failure of that run to detect that it was out of date. This was useful behavior that made sense to rely upon, because anything else would require additional tracking of previous SDK versions etc. by all of these tools. That means additional files on disk and bookkeeping, which would come with its own bugs. If the VM would like to change this behavior, a breaking change proposal/announcement should be sent out indicating as such.
We have been relying on the VM reporting the incompatibility for us with exit code 253 since 2014: https://codereview.chromium.org//745153002 We have been asking for a better way to detect this since around then as well.
To the best of my knowledge there was no such guarantee for the past several years: During the transition from Dart 1 to Dart 2 the VM was changed to use kernel for script snapshots (4b1180a was seemingly the first commit to introduce this behavior; bc7220a probably made it the default). Before this time, the VM embedded a hash (based on a subset of VM source files) into script snapshots, assuming the hash would change if snapshots became incompatible (see 3dfb90f) - whether this list covered everything (see #41802 (comment)) is another question. With the introduction of kernel, a kernel format version number was added to kernel files. The VM would, as usual, issue an API error if it cannot read such a kernel file. But as outlined in #41802 (comment), that version number covers only the kernel format itself and not various other compatibilities. => I agree, we could consider this a breaking change, back in 2018.
Originally this seemingly came from c859d25. Though notice that exit code 253 is not reserved for incompatible snapshots; we issue API errors in other situations as well.
I understand that this seems very useful, though one could argue that it would be much better to add a flag to the VM to ask it whether it can run a snapshot, instead of trying to run it and seeing what happens (e.g. imagine a user program that actually exits with this exit code - would it trigger infinite re-snapshotting attempts?). In any case, because of all these problems I have advocated (a lot, despite strong opposition) for embedding the SDK hash into the kernel, which we have now - the CL re-landed in 0ce8398. We will leave this bug open until it has rolled through Flutter and then close it.
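The run-and-detect pattern described above, guarded against the infinite-loop scenario raised here, can be sketched as follows (the function and the callable parameters are hypothetical; exit code 253 is the one discussed in this thread):

```python
KERNEL_IS_INCOMPATIBLE = 253  # exit code the VM uses for API errors,
                              # including a stale/unreadable kernel file

def run_with_one_rebuild(run, rebuild):
    """run() executes the cached kernel and returns the exit code;
    rebuild() regenerates the kernel with the current SDK.
    On 253 we rebuild and retry exactly once - never loop, in case the
    user's program itself happens to exit with 253."""
    code = run()
    if code == KERNEL_IS_INCOMPATIBLE:
        rebuild()
        code = run()
    return code
```

In a real tool, `run` would wrap something like `subprocess.call([dart_vm, kernel_file, *args])`; the single-retry bound is what keeps a user program that legitimately exits 253 from triggering endless re-snapshotting.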
Thanks, @mkustermann ! |
Was this communicated as a breaking change when the guarantee was dropped?
Yes, and in fact we have argued that would be better. #20802
I'm glad to see it's getting fixed. Can I assume that the previously existing concept of kernel format versioning is going away with this change? Will all tools that read and write kernel files be invalidating them by SDK hash?
My understanding, from poking around this code, is that the existing 'Kernel Format Version' is concerned with the binary format of the kernel - i.e. which fields are expected to be there for a given version (see sdk/runtime/vm/kernel_binary.cc, lines 84 to 107 at 0ce8398).
So KFV and SDK Hash are perhaps complementary to each other. For that reason I imagine the two will coexist, and different tools will be interested in one or the other.
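For illustration, a consumer checking both layers might look like the sketch below. The header layout assumed here - a big-endian uint32 magic, a uint32 format version, then the 10-byte ASCII SDK hash - is inferred from this thread, not quoted from the format spec, and the function names are hypothetical:

```python
import struct

KERNEL_MAGIC = 0x90ABCDEF  # magic constant at the start of a dill file
SDK_HASH_LEN = 10          # the 10-byte git short hash from this change

def read_kernel_header(data: bytes):
    """Parse the leading fields of a kernel (dill) component."""
    magic, version = struct.unpack_from(">II", data, 0)
    if magic != KERNEL_MAGIC:
        raise ValueError("not a kernel file")
    sdk_hash = data[8:8 + SDK_HASH_LEN].decode("ascii")
    return version, sdk_hash

def is_compatible(data: bytes, vm_version: int, vm_sdk_hash: str) -> bool:
    """Two complementary checks: the format version guards the container
    (KFV), the SDK hash guards the contents (transformers, core libs)."""
    # Note: the real VM reportedly skips the hash check when the all-zero
    # placeholder is used (verification disabled); omitted here for brevity.
    version, sdk_hash = read_kernel_header(data)
    return version == vm_version and sdk_hash == vm_sdk_hash
```

A text-dump tool would only consult the format version; the VM and frontend server would additionally require the hash to match.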
Do we know of any specific use case for having different SDK hash kernel files that can be read or written by a single version of a tool? Are we retaining generality that is never exercised? @jakemac53, @jonahwilliams - are you aware of any places that we cache kernel files across SDK versions that aren't VM snapshots?
The --initialize-from-dill frontend server option accepts a previously output kernel file to speed up initialization. The expectation is that this will correctly handle updates to the kernel version, or changes to the Dart SDK. This is used by the flutter tool, but the expectation is mostly that it speeds up subsequent runs on the same version.
I would imagine that the frontend server would need to also use the SDK hash for this use case. @mkustermann @cskau-g - do we know if it does? If it doesn't, will it cause problems? |
Also, there is whatever fuchsia is doing with the google3 rolls. I know previously there was a coordinated double roll, but I don't know if that has been solved by moving the fuchsia+flutter dependencies into flutter/engine. FYI @iskakaushik
Maybe - but not AFAIK. Probably because the people who knew about this (still undocumented, not really tested) safe-to-consume-or-253-exit-code behavior didn't know that kernel cannot be re-used, and the people who knew that kernel cannot be re-used didn't know about this safe-to-consume-or-253-exit-code behavior.
There was no activity on that issue since 2018. Maybe the issue should have received stronger push for it and/or be escalated more.
As @cskau-g says, they are two different things. The version number in the kernel format is mainly versioning the container format (as opposed to the contents of it)
For example a tool that dumps the kernel as text format can consume any kernel as long as it understands the container format (content doesn't matter).
See below.
We have one place in our code base that produces binary kernel files: All our end-user Dart tools (e.g. frontend_server, gen_kernel, ...) are distributed in Dart/Flutter SDKs as AppJIT or kernel snapshots. The build rules used for generating those should cause the correct SDK hash to be embedded into the code mentioned above. Similarly, the VM gets the SDK hash baked in via its build rules. For local development there are ways to disable this checking. If one runs those tools from source, the checking is also disabled atm. That being said,

=> Let me try to verify that the implementation does what I have described above and maybe harden the checks that were added.
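As a sketch of the build-rule side: the helper below is hypothetical, but the `-Dsdk_hash` define and the all-zero placeholder are the ones mentioned in this thread (Dart tools would read such a define back with `String.fromEnvironment`):

```python
def sdk_hash_define(git_short_hash: str, verify: bool = True) -> str:
    """Produce the compile-time define a build rule might pass so the SDK
    hash gets baked into tool snapshots. With verification disabled, the
    all-zero placeholder is substituted, which turns the check off."""
    sdk_hash = git_short_hash if verify else "0000000000"
    if len(sdk_hash) != 10:
        raise ValueError("expected a 10-character short hash")
    return f"-Dsdk_hash={sdk_hash}"
```

Missing this define in even one build rule silently yields the placeholder hash, which is exactly the class of bug the follow-up fix (the missing `--short=10` and the two missing `-Dsdk_hash` sites) addressed.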
I'm more concerned with the frontend server consuming and producing kernel files than I am with the snapshot of frontend_server. The flutter tool currently caches dill files across SDK versions and passes them via the
…at produce Kernel

The missing --short=10 was causing (depending on git version and configuration) us to sometimes default to using '0000000000'. Furthermore the build rules were missing two places where -Dsdk_hash has to be set.

Issue #41802
Change-Id: I83dbfcce677e2594074c1139093bd9592d4fa3ee
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/154684
Commit-Queue: Martin Kustermann <[email protected]>
Reviewed-by: Daco Harkes <[email protected]>
The snapshot of frontend_server contains the CFE, which contains

When the frontend_server snapshot is then run, it will verify kernel binaries it reads against this hash and will also produce kernel binaries with the right hash. If it is given an incompatible kernel file, it will throw an exception.
How does flutter tool know whether it can give an old dill to a new frontend server?
Yes. The VM should validate that the kernel was produced by a frontend server which was built at the same Dart SDK commit as the VM itself.
I have made now an experiment with a locally built engine where I observed the following: First I clear the caches and build the app with my locally built engine and check the SDK hash of the kernel file:
Then I make another build of the locally built engine with a different Dart SDK commit (everything else the same), rebuild the app, and re-examine the SDK hash:
As we can see, the flutter tool correctly notices that the SDK has changed and re-runs the compilation using the new frontend server. We can also see that the SDK hash got correctly embedded into the kernel files. So after 90bba3a landed, things seem to be working exactly as intended. We're going to leave this issue open a little longer until the latest change has rolled further (also into google3).
@natebosch The pub tool should now get its 253 exit code behavior after every update of the Dart SDK.
I'm still not clear on why we need both mechanisms for invalidation here. The kernel-to-text tool being able to read kernel files across SDKs doesn't seem like it warrants keeping the complexity of both mechanisms long term, but I won't push on it any further. Thanks for the fix!
Is more work anticipated on this? Can we close the issue?
Only for g3. Let me make an internal bug and close this one. |
There are several layers of compatibility in the VM with regard to kernel. One is whether the VM can correctly parse kernel files. Another is whether the kernel file actually has the right information inside it (e.g. the right async transformer has run, the right classes/fields/variables are in it).
The second point does not have any validation atm, and can lead to customers accidentally using kernel files of an older version which the VM can still parse, but which might have the wrong information in them (e.g. the old async transformer ran instead of the new one).
To make such cases not crash the VM and instead give a nicer error message, we want the VM to validate that the Kernel it consumes comes from the same Dart SDK commit as the VM was built at.
We already have support for embedding a hash in the VM which validates that e.g. AOT snapshots given to the VM were produced by gen_snapshot from the same sources.

=> We would like to extend this to kernel files and embed a hash in those.
/cc @mraleph @johnniwinther