
Embedded SDK hash in kernel binaries #41802


Closed
mkustermann opened this issue May 7, 2020 · 44 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. P1 A high priority bug; for example, a single project is unusable or has many test failures

Comments

@mkustermann
Member

There are several layers of compatibility in the VM regarding kernel. One is whether the VM can correctly parse Kernel files. Another is whether the kernel file actually has the right information inside it (e.g. the right async transformer has run, the right classes/fields/variables are in it).

The second point does not have any validation atm, and can lead to customers accidentally using kernel files of an older version, which the VM can still parse, but which might have the wrong information in them (e.g. the old async transformer ran instead of the new one).

To make such cases not crash the VM and instead give a nicer error message, we want the VM to validate that the Kernel it consumes comes from the same Dart SDK commit as the VM was built at.

We already have support for embedding a hash in the VM which validates that e.g. AOT snapshots given to the VM were produced by gen_snapshot from the same sources.

=> We would like to extend this to kernel files and embed a hash on those.

/cc @mraleph @johnniwinther

@mkustermann mkustermann added the area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. label May 7, 2020
@mkustermann mkustermann assigned ghost May 7, 2020
@alexmarkov
Contributor

It is not clear why we can't bump the kernel version when making incompatible changes in transformations to solve this problem. Even while the format remains intact, if a change is not backwards compatible we should definitely bump the kernel version and adjust the minimum supported version in the VM, so the VM doesn't pretend it supports older versions.

I agree that there is value in bumping the version automatically, but there are also disadvantages. The vast majority of changes are actually binary compatible, which provides more flexibility during development (for example, consider bisecting/debugging a crash in gen_snapshot in the Dart SDK repo on a dill file which was generated once in flutter). I think it would be useful to have an option to suppress this check if needed.

In the past, there was a requirement to allow a 1-2 week version skew most of the time, which we fulfilled with backwards compatible support of older kernel files in the VM. Incompatible changes were extremely rare and happened only when it was impossible to make a soft transition. While there might not be such a requirement right now, hash-based versioning would prevent that in the future if it is actually needed somewhere. I guess we will only find out what actually relies on backwards compatibility of kernel binaries when we roll out the new versioning scheme.

@mkustermann
Member Author

There are two situations: If we have to be compatible (due to customer needs), some breaking changes become really hard (or not feasible at all) to roll out. If we don't have to maintain compatibility (no customer needs it), not providing it will significantly improve the VM team's productivity when doing breaking changes and provides our team with more flexibility, with no real downsides. (As you say, for VM development we can provide a way to opt out of the verification.)

Right now it seems that there are no known customer needs for maintaining compatibility. We will test this assumption with this work. If there happens to be cases where customers depend on compatibility, we will now actually know - which is great :)

Right now the only enforcement we have is the kernel binary format itself. There is no enforcement of ABI, kernel transformers that run, etc. We have had several occurrences of customers accidentally using incompatible kernel files which the VM didn't refuse, where the VM crashed in various ways. This caused a lot of wasted time to debug issues!

So we want to enforce it, which we can only do by embedding a hash/version in the kernel file and making the VM verify it when it consumes kernel. Using a manually maintained number is error prone if someone forgets to update it, and it also slows down development (we know this from experience with the ABI bot).

Maybe @mraleph @a-siva have a different opinion?

@jensjoha
Contributor

It's also worth remembering that we can initialize from a dill and reuse the parts that are unchanged. This is for instance done in flutter.
Adding some "hash" that has to match for the VM to be able to load it would mean that we would either create some file that the VM couldn't consume, or that we'd have to check that hash as well as the binary version. Neither seems good to me, and bumping the binary version seems like the right choice to me.

That it can be "error prone" - in my opinion - is not a good excuse. If nothing else you should be able to write a test that reminds you (e.g. have a file with your hash tied up to a binary version, if the hash doesn't match throw an error telling the user to bump the version and update the hash-to-version mapping or similar).

@mraleph
Member

mraleph commented May 12, 2020

From my perspective the benefits of having this mechanism outweigh the negatives - for the majority of use cases we need to be able to verify total consistency of the VM + kernel configuration without allowing any version skew.

I think having this mechanism overall can lead to healthier user experience where our tools can automatically recover (e.g. ignore kernel binary which can't be used and produce a fresh one) or produce meaningful error messages.

I don't think Kernel version number gives us enough information for that because it only covers Kernel AST format itself and does not cover core libraries or custom transformers applied to Kernel. AFAIK @johnniwinther was on board with this proposal.

So overall I think we should move forward with tightening this up - if just to detect situations where people rely on VM allowing some version skew (which is not something that we have documented anywhere!) - though potentially maintain an escape hatch if we discover late in the roll process that we need to allow some version skew.

@jensjoha
Contributor

Pros:

  • Happens automatically

Cons:

  • Happens automatically, even when not needed
  • Would require to rebuild all sdk etc for every change
  • Would for all intents and purposes be a "subversion" of the "binary version" that every tool reading and writing dill files would have to know about and deal with.
  • Isn't the new sdk compiled with an old version of the VM?

I don't think Kernel version number gives us enough information for that because it only covers Kernel AST format itself

But you can certainly update it despite not having changed the actual format.
I still fail to see why you can't get the same "security" from a test...

@mraleph
Member

mraleph commented May 12, 2020

Happens automatically, even when not needed

Would require to rebuild all sdk etc for every change

I am not sure why this is needed. Are you talking about development environment? If you are working on a change locally we can set things in such a way that this check is either disabled or is using base commit hash for your change - this means you don't have to rebuild anything.

Isn't the new sdk compiled with an old version of the VM?

It is compiled using new CFE running on old version of the VM. The key here is that files produced by the new CFE should only be then consumed by the new VM, not old VM.

I still fail to see why you can't get the same "security" from a test...

I think I don't understand how this test would look like. Could you elaborate a bit more?

I think the problem we are facing right now is that we don't really have a very clear definition of the ABI boundary, so we can't just write a simple test for it (we can slowly expand the test whenever we hit an incompatibility issue during the roll - but I don't think this is the right approach).

@jensjoha
Contributor

Happens automatically, even when not needed

Would require to rebuild all sdk etc for every change

I am not sure why this is needed. Are you talking about development environment? If you are working on a change locally we can set things in such a way that this check is either disabled or is using base commit hash for your change - this means you don't have to rebuild anything.

If I update from commit n to commit n+1, the way I understand what's written here, it should throw away the old dills and create new ones (even for local commits).

Isn't the new sdk compiled with an old version of the VM?

It is compiled using new CFE running on old version of the VM. The key here is that files produced by the new CFE should only be then consumed by the new VM, not old VM.

The way I see it, there's two options:

  • It gets the hash from the VM --- that way you wouldn't be able to create it using an old VM.
  • It has the hash built into its sources --- that way you would have to have it committed in a file --- like the binary version.

I still fail to see why you can't get the same "security" from a test...

I think I don't understand how this test would look like. Could you elaborate a bit more?

I think the problem we are facing right now is that we don't really have a very clear definition of the ABI boundary, so we can't just write a simple test for it (we can slowly expand the test whenever we hit an incompatibility issue during the roll - but I don't think this is the right approach).

Ah. So we know that there is something that we can change that might change the compatibility, but we don't know what that something is so the "solution" is to assume it's everything?

I'm all for making stuff blow up less (and if that includes having more information in the dill that's fine by me (although it also has to be easily checkable on the dart side which shouldn't accept it either)), but this feels like "shooting sparrows with cannons" (Danish saying that I'm not sure exists in English).

@mraleph
Member

mraleph commented May 12, 2020

If I update from commit n to commit n+1 the way I understand what's written here is that it should throw away the old dills and create new ones. (even doing local commits)

We can make it work for local commits - but if you rebase your local branch then you need to throw out your local dills.

We can also give you an escape hatch for local CFE development, e.g. if you just read and write dill and don't run them on the VM, then you can keep dills around without regenerating them.

The way I see it, there's two options

It has the hash built into its sources --- that way you would have to have it committed in a file --- like the binary version.

It can also get the hash from the git command itself. I think the precise mechanism here is TBD by the involved teams - we need something that works for SDK development, SDK releases, as well as SDK rolls into other repositories.

but this feels like "shooting sparrows with cannons" (Danish saying that I'm not sure exists in English).

Yes, but the sparrow is guaranteed to be kaput after this.

Ideally, we need to have a really well defined and statically enforced API between all involved components. Unfortunately we are not in the ideal situation - and getting it fully under control would require considerable effort. So we need some solid but relatively easy solution to prevent those pesky sparrows (version skews) from complicating rolls in confusing ways.

@jensjoha
Contributor

It can also get the hash from the git command itself. I think the precise mechanism here is TBD by the involved teams - we need something that works for SDK development, SDK releases, as well as SDK rolls into other repositories.

This would only work when checked out via git, and it would require running the git command every time, which is certainly going to slow things down.

but this feels like "shooting sparrows with cannons" (Danish saying that I'm not sure exists in English).

Yes, but the sparrow is guaranteed to be kaput after this.

I'm guessing there's been an actual bug with regard to the async transformation, since that's what's mentioned above. And yes, this would probably have been "guaranteed" to have caught/avoided that. There will probably still be somewhat similar problems, along the lines of customers accidentally using wrong dills, that it doesn't fix though.

@mraleph
Member

mraleph commented May 12, 2020

I'm guessing there's been an actual bug with regard to the async transformation, since that's what's mentioned above. And yes, this would probably have been "guaranteed" to have caught/avoided that.

Note that there have been at least 3-4 similar issues in the past couple months - most of which took time to understand and triage. So this is not happening just because there was a single issue - it is happening because similar issues occur repeatedly.

@natebosch
Member

Right now it seems that there are no known customer needs for maintaining compatibility.

Some tools do use forwards compatibility of snapshots as an optimization. It would be unfortunate to never be able to have that optimization and always have to re-snapshot across SDK versions, but that would be strongly preferred over having no indication of incompatibility until there is a runtime error.

@a-siva
Contributor

a-siva commented May 20, 2020

Embedding the SDK hash inside the Kernel file, checking for it and throwing an error makes kernel files almost similar to AOT snapshots in terms of version compatibility, and is stricter than the proposal of using version numbers in the kernel file and providing compatibility until there is a breaking change (the versions would have to be checked by the VM, as proposed in #38969).
Incompatible async transformations like the one described above would be considered breaking and would require a version bump. Of course, there is the issue of how we guarantee that the version is bumped when a breaking change is made (the ABI compatibility bot was meant to detect this; maybe it is not doing the job).

@devoncarew devoncarew added the P1 A high priority bug; for example, a single project is unusable or has many test failures label May 20, 2020
@devoncarew
Member

Bumping to P1 as per #41913 (comment).

@natebosch
Member

After a discussion with @jakemac53 I learned that the optimization of reuse across SDK versions is rarer than I thought. If it's easier to embed the SDK hash, I think it's fine to push on that rather than to try for something less strict.

@mkustermann
Member Author

In order to enable us to do breaking changes, we do not promise any compatibility (**) windows.

So the usefulness of number-based / manual versioning reduces to a performance optimization via caching (we can avoid re-computing kernel files on Dart/Flutter SDK upgrades if there happened to be no breaking changes).

What cadence do we expect Flutter / Dart users to upgrade to new SDKs on beta/stable channels? My guess would be that it's probably measured in weeks. How much gain do we expect then from this performance optimization vs what does it cost us?

(**) Compatibility = Guarantee that VM at commit C and C^ will both understand same kernel file.

Embedding the SDK hash inside the Kernel file, checking for it and throwing an error makes
kernel files almost similar to AOT snapshots in terms of version compatibility and is stricter
than the proposal of using version numbers in the kernel file and providing compatibility
until there is a breaking change ...

Our AOT snapshot format is also not invalidated on every commit (far from it). Does it mean we should do manual versioning here as well? (The arguments for manually versioning kernel files also apply here. Local development, avoid recompilation (takes much longer than re-computing kernel files) ...)

Ultimately this is a tradeoff: reduced cacheability of kernel files with a guarantee of no bugs, vs. the cost of manual versioning and the risk of bugs for users.

We have seen many bugs like #41913 and I'm in favor of taking the safe, bug-free approach.

@jensjoha
Contributor

I'll summarize (my understanding of) the meeting we had May 12th: "The VM team would create a document describing in more practical details how this would actually work to see if it was feasible to actually include the hash (or something equivalent) in the dill."
Is there any update on that front?

@ghost

ghost commented Jun 12, 2020

Yes. I've created such a doc, but I've been asked to focus on creating a Proof of Concept CL to show a practical implementation instead.
I believe the idea is that we'd discuss the details of this CL instead of the doc.

@ghost

ghost commented Jun 17, 2020

I have now put together a proposal CL.
Anyone interested is welcome to look it over and provide feedback, concerns, questions, comments, etc.

Thanks

@mkustermann
Member Author

The VM team would create a document describing in more practical details how this would
actually work to see if it was feasible to actually include the hash (or something equivalent)
in the dill."

Clement has now made a concrete CL, cl/150343, which has all the practical details of how this would work.

The summary of how it works is:

  • During the build we determine which sdk_hash to use
    • either obtaining it from git
    • or using 00...00 for local development if tools/gn.py --no-verify-sdk-hash was specified
  • During the build we then
    • include that sdk_hash in the generated C++ code which makes it available via Version::SdkHash()
    • pass it as -Dsdk_hash=<...> to all tools that produce kernel files or snapshots (package:kernel will use this hash and write it into generated kernel files)
  • Verification happens in
    • the VM which will validate that any kernel given to it has the expected hash
    • the kernel reader in package:kernel which will validate that any kernel it reads has the expected hash

The Dart and Flutter SDKs contain only snapshots of our tools (such as kernel service, front end server, gen kernel, ...) which will all have the evaluated sdk hash embedded in them. So all of them will automatically produce kernels with the right hash and will verify against the correct hash when consuming kernel.

Local development has a very easy way to opt out of this via tools/gn.py --no-verify-sdk-hash.

G3 will use the hash that rolls into it instead of trying to consult git. Everything else can work the same.

Summary: We will include the SDK hash in generated kernels and all tools that read kernel files will validate it. To the best of our knowledge nobody relies on compatibility between commit N and N+1 (if so, we will find out once this lands and rolls). The only downside of this is less effective caching of kernel files for some tools (e.g. pub) - but users don't update SDKs that often. The big benefit is that it becomes actually safe (e.g. the caching of tools like pub) and there is no longer a need to debug issues with mismatched kernel versions.

@mraleph @a-siva @jensjoha @alexmarkov If you have any concerns regarding this or concrete reasons why we should not do this, please let us know. We can also setup a VC for it.

dart-bot pushed a commit that referenced this issue Jun 26, 2020
Adds a new SDK hash to kernels and the VM which is optionally checked
to verify kernels are built for the same SDK as the VM.
This helps catch incompatibilities that are currently causing
subtle bugs and (not so subtle) crashes.

The SDK hash is encoded in kernels as a new field in components.
The hash is derived from the 10 byte git short hash.

This new check can be disabled via:
  tools/gn.py ... --no-verify-sdk-hash

This CL bumps the min. (and max.) supported kernel format version,
making the VM backwards incompatible from this point back.

Bug: #41802
Change-Id: I3cbb2d481239ee64dafdaa0e4aac36c80281931b
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/150343
Commit-Queue: Clement Skau <[email protected]>
Reviewed-by: Jens Johansen <[email protected]>
Reviewed-by: Martin Kustermann <[email protected]>
dart-bot pushed a commit that referenced this issue Jun 26, 2020
This reverts commit edde575.

Reason for revert: Breaks the Dart to Flutter roll and golem

Original change's description:
> [SDK] Adds an SDK hash to kernels and the VM.
>
> Adds a new SDK hash to kernels and the VM which is optionally checked
> to verify kernels are built for the same SDK as the VM.
> This helps catch incompatibilities that are currently causing
> subtle bugs and (not so subtle) crashes.
>
> The SDK hash is encoded in kernels as a new field in components.
> The hash is derived from the 10 byte git short hash.
>
> This new check can be disabled via:
>   tools/gn.py ... --no-verify-sdk-hash
>
> This CL bumps the min. (and max.) supported kernel format version,
> making the VM backwards incompatible from this point back.
>
> Bug: #41802
> Change-Id: I3cbb2d481239ee64dafdaa0e4aac36c80281931b
> Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/150343
> Commit-Queue: Clement Skau <[email protected]>
> Reviewed-by: Jens Johansen <[email protected]>
> Reviewed-by: Martin Kustermann <[email protected]>

[email protected],[email protected],[email protected]

Change-Id: I34cc7d378e2babdaaca4d932d19c19d0f35422fc
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: #41802
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/152703
Reviewed-by: Siva Annamalai <[email protected]>
Commit-Queue: Siva Annamalai <[email protected]>
@alexmarkov
Contributor

@mkustermann @cskau-g There are existing tests for ABI compatibility: vm-dartkb-linux-release-x64-abi bot, pkg/front_end/test/old_dill. If we no longer provide any compatibility, then we should stop testing that. We should remove these tests before landing the SDK hash change, otherwise this change (or subsequent changes with different hashes) will break those tests.

@mkustermann
Member Author

@kevmoo The CL was landed in edde575, but got reverted. @cskau-g has made a re-land in cl/152802 with a few changes (to address issues uncovered when the first CL landed)

@mkustermann
Member Author

We are hitting this a LOT it seems. Causing a lot of failures on our Travis CI runs.

@kevmoo Just to be very clear - the actual issue is not a Dart VM issue:

The issue is that pub and various tools produce kernel files (which is an internal format used between our SDK tools) and make assumptions about their compatibility across SDK versions - those assumptions are incorrect: we provide no compatibility guarantees!

=> The right solution is to change our tools to invalidate caches on SDK updates - cached kernel files are only safe to re-use if they are run on the same VM which was used to produce them. (**)

What this github issue is about is to avoid such accidental, incorrect, usages of kernel files and fail earlier with a clear error message - instead of trying to run incompatible kernel programs which may result in unexpected behavior.

(**) As a good example of a tool that does it the right way: The main flutter tool will re-generate the flutter tools kernel file whenever the SDK is updated

dart-bot pushed a commit that referenced this issue Jul 7, 2020
Note: This is a reland of https://dart-review.googlesource.com/c/sdk/+/150343

Adds a new SDK hash to kernels and the VM which is optionally checked
to verify kernels are built for the same SDK as the VM.
This helps catch incompatibilities that are currently causing
subtle bugs and (not so subtle) crashes.

The SDK hash is encoded in kernels as a new field in components.
The hash is derived from the 10 byte git short hash.

This new check can be disabled via:
  tools/gn.py ... --no-verify-sdk-hash

This CL bumps the min. (and max.) supported kernel format version,
making the VM backwards incompatible from this point back.

This also bumps the min. and current ABI version.

Bug: #41802
Change-Id: I2f85945045a603eb9dcfd1f2c0d0d024bd84a956
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/152802
Commit-Queue: Clement Skau <[email protected]>
Reviewed-by: Martin Kustermann <[email protected]>
@jakemac53
Contributor

@kevmoo Just to be very clear - the actual issue is not a Dart VM issue:

This is a VM issue. It was an observable change in behavior which has caused ongoing breakage for over 2 months after being reported. The existence (and not closure) of these issues has led us to believe we did not need to do anything on our end and it would be fixed.

For the past several years the VM has had predictable behavior that did not require users (such as pub, build_runner, etc) to do their own invalidation of snapshots. They could assume a snapshot (maybe it's a kernel file now, technically) was up to date, run it, and then use the failure of that run to detect that it was out of date.

This was useful behavior that made sense to rely upon because anything else would require additional tracking of the previous sdk versions etc by all of these tools. That means additional files on disk and bookkeeping, which would come with its own bugs.

If the VM would like to change this behavior, a breaking change proposal/announcement should be sent out indicating as such.

@natebosch
Member

We have been relying on the VM reporting for us the incompatibility with exit code 253 since 2014.

https://codereview.chromium.org//745153002
#15982

We have been asking for a better way to detect this since around then as well.

#20802

@mkustermann
Member Author

This is a VM issue. It was an observable change in behavior which has caused ongoing breakage for over 2 months
after being reported. The existence (and not closure) of these issues has led us to believe we did not need to do anything on
our end and it would be fixed.

For the past several years the VM has had predictable behavior that did not require users (such as pub, build_runner, etc) to

To the best of my knowledge there was no such guarantee for the past several years: During the transition from Dart 1 to Dart 2 the VM was changed to use kernel for script snapshots (4b1180a was seemingly the first commit to introduce this behavior; bc7220a probably made it the default).

Before this time, the VM embedded a hash (based on a subset of VM source files) into script snapshots, assuming the hash would change if snapshots became incompatible (see 3dfb90f) - whether this list covered everything (see #41802 (comment)) is another question.

With the introduction of kernel, a kernel format version number was added to kernel files. The VM would, as usual, issue an API error if it cannot read such a kernel file. But as outlined in #41802 (comment), that version number covers only the kernel format itself and not various other compatibilities.

=> I agree, we could consider this a breaking change, back in 2018.
=> Since then it was effectively not really safe to re-use kernel files across SDK versions.
=> The reason this was not really noticed is because the VM only very infrequently updates kernel transformers, adds recognized methods, ... and only in certain situations would a user actually notice it.

We have been relying on the VM reporting for us the incompatibility with exit code 253 since 2014.

Originally this seemingly came from c859d25. Though notice that exit code 253 is not reserved for incompatible snapshots; we issue API errors in other situations as well.

This was useful behavior that made sense to rely upon because anything else would require additional
tracking of the previous sdk versions etc by all of these tools.

I understand that this seems very useful, though one could argue that it would be much better to add a flag to the VM to ask it whether it can run a snapshot, instead of trying to run it and seeing what happens (e.g. imagine a user program actually exiting with this exit code - would it trigger infinite re-snapshotting attempts?)

In any case, because of all these problems I have advocated (a lot, despite strong opposition) for embedding the SDK hash into the kernel, which we have now - the CL re-landed in 0ce8398.

We will leave this bug open until it has rolled through flutter and then close it.

@kevmoo
Member

kevmoo commented Jul 7, 2020

Thanks, @mkustermann !

@natebosch
Member

To the best of my knowledge there was no such guarantee for the past several years

Was this communicated as a breaking change when the guarantee was dropped?

I understand that this seems very useful, though one could argue that it would be much better to add a flag to the VM to ask it whether it can run a snapshot instead of trying to run it and see what happens

Yes, and in fact we have argued that would be better. #20802

because of all these problems I have advocated (a lot, despite strong opposition) for embedding the SDK hash into the kernel

I'm glad to see it's getting fixed. Can I assume that the previously existing concept of kernel format versioning is going away with this change? Will all tools that read and write kernel files be invalidating them by SDK hash?

@ghost

ghost commented Jul 9, 2020

I'm glad to see it's getting fixed. Can I assume that the previously existing concept of kernel format versioning is going away with this change? Will all tools that read and write kernel files be invalidating them by SDK hash?

My understanding, from poking around this code, is that the existing 'Kernel Format Version' is concerned with the binary format of the kernel - i.e. which fields are expected to be there for a given version:

const char* kKernelInvalidBinaryFormatVersion =
    "Invalid kernel binary format version";
const char* kKernelInvalidSizeIndicated =
    "Invalid kernel binary: Indicated size is invalid";
const char* kKernelInvalidSdkHash = "Invalid SDK hash";

const int kSdkHashSizeInBytes = 10;
const char* kSdkHashNull = "0000000000";

std::unique_ptr<Program> Program::ReadFrom(Reader* reader, const char** error) {
  if (reader->size() < 70) {
    // A kernel file (v43) currently contains at least the following:
    //   * Magic number (32)
    //   * Kernel version (32)
    //   * SDK Hash (10 * 8)
    //   * List of problems (8)
    //   * Length of source map (32)
    //   * Length of canonical name table (8)
    //   * Metadata length (32)
    //   * Length of string table (8)
    //   * Length of constant table (8)
    //   * Component index (11 * 32)
    //
    // so is at least 74 bytes.

So KFV and SDK Hash are perhaps complementary to each other.
You could encapsulate the same kernel content in different versions of the binary format. Likewise you could encapsulate different kernel content in the same kernel binary format.

For that reason I imagine the two will coexist, and different tools will be interested in one or the other.

@natebosch
Member

For that reason I imagine the two will coexist, and different tools will be interested in one or the other.

Do we know of any specific use case for having different SDK hash kernel files that can be read or written by a single version of a tool? Are we retaining generality that is never exercised?

@jakemac53, @jonahwilliams - are you aware of any places that we cache kernel files across SDK versions that aren't VM snapshots?

@jonahwilliams
Contributor

the --initialize-from-dill frontend server option accepts a previously output kernel file to speed up initialization. The expectation is that this will correctly handle updates to the kernel version, or changes to the Dart SDK. This is used by the flutter tool, but the expectation is mostly that it speeds up subsequent runs on the same version.

@natebosch
Copy link
Member

I would imagine that the frontend server would need to also use the SDK hash for this use case. @mkustermann @cskau-g - do we know if it does? If it doesn't, will it cause problems?

@jonahwilliams
Copy link
Contributor

Also, there is whatever fuchsia is doing with the google3 rolls. I know previously there was a coordinated double roll, but I don't know if that has been solved by moving the fuchsia+flutter dependencies into flutter/engine. FYI @iskakaushik

@mkustermann
Copy link
Member Author

> Was this communicated as a breaking change when the guarantee was dropped?

Maybe - but not afaik.

Probably because the people who knew about this (still undocumented, not really tested) safely-to-consume-or-253-exit-code didn't know that kernel cannot be re-used, and the people who knew that kernel cannot be re-used didn't know about this safely-to-consume-or-253-exit-code.

Yes, and in fact we have argued that would be better. #20802

There has been no activity on that issue since 2018. Maybe the issue should have received a stronger push and/or been escalated more.

> I'm glad to see it's getting fixed. Can I assume that the previously existing concept of kernel format versioning is going away with this change?

As @cskau-g says, they are two different things. The version number in the kernel format is mainly versioning the container format (as opposed to the contents of it).

> Do we know of any specific use case for having different SDK hash kernel files that can be read or written by a single version of a tool?

For example a tool that dumps the kernel as text format can consume any kernel as long as it understands the container format (content doesn't matter).

> I would imagine that the frontend server would need to also use the SDK hash for this use case. @mkustermann @cskau-g - do we know if it does? If it doesn't, will it cause problems?

See below.

> Will all tools that read and write kernel files be invalidating them by SDK hash?

We have one place in our code base that produces binary kernel files: pkg/kernel/lib/binary/ast_to_binary.dart
We have two places in our code base that consume kernel: pkg/kernel/lib/binary/ast_from_binary.dart and runtime/vm/kernel_binary.cc
=> Those have been updated to embed an SDK hash when producing kernel and to verify against the SDK hash when consuming it.

All our end-user Dart tools (e.g. frontend_server, gen_kernel, ...) are distributed in Dart/Flutter SDKs as AppJIT or Kernel snapshots. The build rules used for generating those should cause the correct SDK hash to be embedded into the code mentioned above. Similarly, the VM gets the SDK hash baked in via its build rules.

For local development there are ways to disable this checking. If one runs those tools from source, the checking is also disabled atm.

That being said, package:kernel/binary/ast_from_binary.dart will not have the safely-to-consume-or-253-exit-code behavior, instead it would throw an exception I believe.

=> Let me try to verify that the implementation does what I have described above and maybe harden the checks that were added.
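The checking described above (hash embedded at produce time, verified at consume time, disabled for local development) boils down to a simple acceptance policy. A hedged sketch follows — `SdkHashesMatch` is a hypothetical name, while `kSdkHashNull` is the null-hash constant from the excerpt at the top of this issue:

```cpp
#include <string>

// The placeholder hash used when no real git hash was baked in
// (e.g. local development, tools run from source).
const char* kSdkHashNull = "0000000000";

// A kernel file is accepted if its embedded SDK hash matches the hash
// the consumer was built with, or if either side carries the null hash,
// which disables the check entirely.
bool SdkHashesMatch(const std::string& expected, const std::string& embedded) {
  if (expected == kSdkHashNull || embedded == kSdkHashNull) {
    return true;  // Checking disabled on one side.
  }
  return expected == embedded;
}
```

Under this policy a mismatch between two real hashes is the only case that is rejected, which is what produces the "Invalid SDK hash" error instead of a crash.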

@natebosch
Copy link
Member

> All our end-user Dart tools (e.g. frontend_server, gen_kernel, ...) are distributed in Dart/Flutter SDKs as AppJIT or Kernel snapshots.

I'm more concerned with the frontend server consuming and producing kernel files than I am with the snapshot of frontend_server.

The flutter tool currently caches dill files across SDK versions and passes them via the --initialize-from-dill flag. Will that flag invalidate by SDK hash? If it won't, will that cause problems? Ultimately flutter will be using the dill file produced by frontend server with the VM, right?

dart-bot pushed a commit that referenced this issue Jul 16, 2020
…at produce Kernel

The missing --short=10 was causing (depending on git version and
configuration) us to sometimes default to using '0000000000'.

Furthermore the build rules were missing two places where -Dsdk_hash has
to be set.

Issue #41802

Change-Id: I83dbfcce677e2594074c1139093bd9592d4fa3ee
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/154684
Commit-Queue: Martin Kustermann <[email protected]>
Reviewed-by: Daco Harkes <[email protected]>
@mkustermann
Copy link
Member Author

> I'm more concerned with the frontend server consuming and producing kernel files than I am with the snapshot of frontend_server.

The snapshot of frontend_server contains the CFE, which contains package:kernel which needs to get the SDK hash baked in (i.e. the SDK hash is embedded into frontend_server during the build of the snapshot).

When the frontend_server snapshot is then run, it will verify kernel binaries it reads against this hash and will also produce kernel binaries with the right hash. If it is given an incompatible kernel file it will throw an exception.

> The flutter tool currently caches dill files across SDK versions and passed them via the --initialize-from-dill flag. Will that flag invalidate by SDK hash?

How does flutter tool know whether it can give an old dill to a new frontend server?
What happens if it gives an incompatible dill file to frontend server?
Is there an agreed protocol how frontend server should respond to that?

> Ultimately flutter will be using the dill file produced by frontend server with the VM right?

Yes. The VM should validate that the kernel was produced by a frontend server which was built at same Dart SDK commit as the VM itself.

@mkustermann
Copy link
Member Author

> How does flutter tool know whether it can give an old dill to a new frontend server?

I have now run an experiment with a locally built engine where I observed the following:

First I clear the caches and build the app with my locally built engine and check the SDK hash of the kernel file:

```
flutter/dev/integration_tests/flutter_gallery % rm -rf .dart_tool
flutter/dev/integration_tests/flutter_gallery % ../../../bin/flutter --local-engine-src-path=$HOME/fe --local-engine=android_profile_arm64 build -v apk --profile --target-platform=android-arm64
...
flutter/dev/integration_tests/flutter_gallery % hexdump -n18 -C  .dart_tool_backup2/flutter_build/d91780e3f1c8a999ad4157f0c159c779/app.dill
00000000  90 ab cd ef 00 00 00 2b  62 62 65 38 63 35 35 32  |.......+bbe8c552|
00000010  64 31                                             |d1|
```

Then I make another build of the locally built engine with a different Dart SDK commit (everything else the same), rebuild the app, and re-examine the SDK hash:

```
flutter/dev/integration_tests/flutter_gallery % ../../../bin/flutter --local-engine-src-path=$HOME/fe --local-engine=android_profile_arm64 build -v apk --profile --target-platform=android-arm64
...
[  +83 ms] [+1165 ms] kernel_snapshot: Starting due to {InvalidatedReason.inputChanged}
[        ] [  +37 ms] /.../fe/out/host_profile/dart-sdk/bin/dart --disable-dart-dev /.../fe/out/host_profile/gen/frontend_server.dart.snapshot --sdk-root /.../fe/out/android_profile_arm64/flutter_patched_sdk/ --target=flutter -Ddart.developer.causal_async_stacks=false -Ddart.vm.profile=true -Ddart.vm.product=false --bytecode-options=source-positions --aot --tfa --packages /.../flutter/dev/integration_tests/flutter_gallery/.packages --output-dill /.../flutter/dev/integration_tests/flutter_gallery/.dart_tool/flutter_build/d91780e3f1c8a999ad4157f0c159c779/app.dill --depfile /.../flutter/dev/integration_tests/flutter_gallery/.dart_tool/flutter_build/d91780e3f1c8a999ad4157f0c159c779/kernel_snapshot.d package:flutter_gallery/main.dart
...
flutter/dev/integration_tests/flutter_gallery % hexdump -n18 -C  .dart_tool/flutter_build/d91780e3f1c8a999ad4157f0c159c779/app.dill
00000000  90 ab cd ef 00 00 00 2b  37 65 33 37 33 38 33 31  |.......+7e373831|
00000010  63 65                                             |ce|
```

As we can see, the flutter tool correctly notices that the SDK has changed and re-runs the compilation using the new frontend server.

We can also see that the SDK hash got correctly embedded into the kernel files.

So after 90bba3a landed things seem to be working exactly as intended.

We're going to leave this issue open a little longer until the latest change has rolled further (also into google3).

@mkustermann
Copy link
Member Author

@natebosch The pub tool should now get its 253 exit code behavior after every update of the Dart SDK.

@natebosch
Copy link
Member

I'm still not clear on why we need both mechanisms for invalidation here. The kernel -> text tool being able to read kernel files across SDKs doesn't seem like it warrants keeping the complexity of both mechanisms long term, but I won't push on it any further.

Thanks for the fix!

dart-bot pushed a commit that referenced this issue Jul 21, 2020
@a-siva
Copy link
Contributor

a-siva commented Jul 30, 2020

Is more work anticipated on this? Can we close the issue?

@mkustermann
Copy link
Member Author

> Is more work anticipated on this? Can we close the issue?

Only for g3. Let me make an internal bug and close this one.
