Skip to content
This repository was archived by the owner on Feb 25, 2025. It is now read-only.

[Impeller] add additional setup method that caches more pipelines, warms internal shader code #50521

Merged

Conversation

jonahwilliams
Copy link
Member

@jonahwilliams jonahwilliams commented Feb 10, 2024

Part of flutter/flutter#138236

Fixes a number of issues related to startup performance by adding an "InitializeCommonlyUsedShadersIfNeeded" method to content context.

On the first frame of a flutter application renders, the backend will populate the glyph atlas for the first time. On the Vulkan backend, this executes vkCmdCopyBufferToImage. The first time this runs, Arm Mali drivers will generate an additional shader program. Creates a 1x1 texture and sets the contents to force this code to compile eagerly.

image

The first time a render pass is constructed, a shader is complied. This does not seem to depend on the properties of the render pass, so we just create a trivial one and submit it.

image

Finally there are a few missing shader variants. Lets just go ahead a populate that cache a bit more

@@ -604,4 +606,70 @@ void ContentContext::FlushCommandBuffers() const {
GetContext()->GetCommandQueue()->Submit(buffers);
}

void ContentContext::AdditionalSetup() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this method is private, but the name makes me die a little bit.

I like the first line of this method instead:

void ContentContext::InitializeCommonlyUsedShadersIfNeeded()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

GetClipPipeline(options);
}

std::string vendor = GetContext()->DescribeGpuModel();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we imagine more changes like this in the future?

  1. Does it makes sense to have this in a backend-specific context thingy?
  2. Should there be a TODO with what we'd do if more of these cases are found later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No, because the backend context doesn't have access to any of the shader caches, that is all in ContentContext. I could move the texture/rp creation stuff to context_vk, but then we'd have different setup methods in different places, and I kind of think its better to keep them together. I could see either case.

  2. I dunno, that feels speculative. If we find more weird cases we'll figure it out then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we leave this just as a comment and run the logic unconditionally?

It'll be difficult to keep track of where we have different vendor based driver hacks - if we're going to warm up like this why not just warm it up everywhere?

const RenderTarget& render_target)
: RenderPass(context, render_target), delegate_(std::move(delegate)) {}

// |RenderPass|
void RecordingRenderPass::SetPipeline(
const std::shared_ptr<Pipeline<PipelineDescriptor>>& pipeline) {
pending_.pipeline = pipeline;
delegate_->SetPipeline(pipeline);
if (delegate_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this all conditional now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test helper, by making it conditional I can make delegating to a real commmand buffer optional.

@jonahwilliams jonahwilliams marked this pull request as ready for review February 12, 2024 18:09
Copy link
Member

@chinmaygarde chinmaygarde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for the speculation around PSO variant construction which seems unrelated. The rest are nits or suggestions.

I concede we need a way to reason about things we don't understand about driver performance. And I don’t think we are done nearly enough thinking about how to handle such a need.

We should design this in a way that allows us to visualize a reasonable path towards getting rid of these hacks (henceforth referred to as voodoo) once our understanding of driver behavior evolves, or we apply fixes in other parts of the system. Put another way, if the person adding this voodoo were to go away, could someone else

  1. Reason about why it was added.
  2. Check if it is actually effective.
  3. Device systemic ways to make that setup unnecessary.
  4. Back it out.

As such, I believe the bar to add additional voodoo should be incredibly high.

As mentioned in the comments inline, the variant construction fails 1 and 2.

The texture upload and render pass construction voodoo passes 1 but currently probably fails 2. I’d like the need for this voodoo to be measurable. This could be as simple as all the voodoo being performed in a function that is decorated with a trace event so our tools can measure it later for effectiveness.

We can call the mechanism something like, VoodooManager (or Shaman) that performs invocations of the abstract base class IVoodoo. This along with some affordance for measuring async voodoo. Make it sound absolutely ridiculous because it is doing something we actually don’t fully understand.

Each bit of voodoo can then have linked documentation and traces so someone can understand what is happening and back it out. Or not, depending on prioritization.

If we concede that we need a mechanism for performing this voodoo, we should also make it reside closer to the subsystems that need said voodoo. The render pass construction and texture upload voodoo should ideally just be in the vulkan backend in //impeller/renderer. Why does the entity framework care about this? Perhaps the context has a shaman that the different sub-systems can talk to. That way, only Vulkan can perform that voodoo and non-entity framework users can benefit too. I don’t think sub-systems above //impeller/renderer need to know how to read the device model.

I apologize if the additional process for adding voodoo seems tedious. Especially when there are pressures to ship things yesterday. But I believe a little bit of care in adding stuff here will go a long way in building a maintainable system. For instance, I didn’t understand the need for PSO construction, but I trust there is a good reason you added it. Even if I don’t know it. In the future I can totally see someone else cycling over all variants in ContentContext just to be safe. And then we run into similar problems.

@@ -604,4 +606,68 @@ void ContentContext::FlushCommandBuffers() const {
GetContext()->GetCommandQueue()->Submit(buffers);
}

void ContentContext::InitializeCommonlyUsedShadersIfNeeded() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For accuracy, CommonlyUsedPipelinesIfNeeded since the shaders are already in their libraries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose even that isn't entirely accurate since the render pass is being initialized. lol, just DoOneTimeVoodoo?

return;
}

// On ARM devices, the initial usage of vkCmdCopyBufferToImage has been
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't completely in the dark about the reasons as to why this is happening. And we haven't fully convinced ourselves this isn't user error yet.

.color_attachment_pixel_format =
context_->GetCapabilities()->GetDefaultColorFormat()};

for (const auto mode : {BlendMode::kSource, BlendMode::kSourceOver}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the proceedings of the last go/impeller-weekly, my understanding was that PSO variant construction time did not show up in traces yet. What gives for the need for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two traces posted are also unrelated to PSO variant construction and the linked bug makes no mention of PSO variants being slow to initialize. This bit seems unrelated to the linked issue or any problem we've discussed in the past.

@jonahwilliams
Copy link
Member Author

I disagree with how you have framed the problem. This is not magic/vodo/hacks, instead this is the clearly observable behavior of an extremely common driver running on a specific platform that we care to support - meaning we must ensure that it is tested and performant. We are not targeting hypothetical or theoretical Vulkan devices.

We should design this in a way that allows us to visualize a reasonable path towards getting rid of these hacks (henceforth referred to as voodoo) once our understanding of driver behavior evolves, or we apply fixes in other parts of the system. Put another way, if the person adding this voodoo were to go away, could someone else

The path for measuring changes that improve performance has and will always be benchmarks, and the mechanism by which we understand what the driver is doing is through CPU traces which I've posted above. The process for how we take those has been documented by myself for both Android and iOS (See https://github.com/flutter/engine/blob/main/impeller/docs/android_cpu_profile.md and https://github.com/flutter/engine/blob/main/impeller/docs/ios_cpu_profile.md ).

We have numerous benchmarks in the framework repo that all report first frame times. I suspect that the first frame times reported in benchmarks will be improved by these changes, but I can certainly confirm that locally.

I apologize if the additional process for adding voodoo seems tedious. Especially when there are pressures to ship things yesterday. But I believe a little bit of care in adding stuff here will go a long way in building a maintainable system. For instance, I didn’t understand the need for PSO construction, but I trust there is a good reason you added it. Even if I don’t know it. In the future I can totally see someone else cycling over all variants in ContentContext just to be safe. And then we run into similar problems.

The process you are proposing here is not reasonable. Our code should be tested, yes. Our code should be benchmarked, yes. But we must take on the cost of maintaining good performance and behavior on the devices we support. I do not understand how your proposed process would help that. Instead, your proposal seems to me as if it would prevent us from ever adapting to the driver/platform and thus ensuring we have mediocre performance.

@chinmaygarde
Copy link
Member

@jonahwilliams @dnfield and I had a chat about this in person. The takeaways were:

Did I miss anything?

@chinmaygarde chinmaygarde self-requested a review February 14, 2024 21:22
@jonahwilliams
Copy link
Member Author

Nope, that was accurate. I believe I have all of the appropriate tests moved over to capturing the first frame events, so i will look at this one now.

CreateIfNeeded(clip_pipelines_, options);
}

if (GetContext()->GetBackendType() == Context::BackendType::kVulkan) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chinmaygarde where did we land on making this vulkan only? IIRC we agreed not to do vendor checks since this code should be harmless.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the Vulkan only check since it makes testing more of a pain.

@flutter-dashboard
Copy link

Golden file changes have been found for this pull request. Click here to view and triage (e.g. because this is an intentional change).

If you are still iterating on this change and are not ready to resolve the images on the Flutter Gold dashboard, consider marking this PR as a draft pull request above. You will still be able to view image results on the dashboard, commenting will be silenced, and the check will not try to resolve itself until marked ready for review.

Changes reported for pull request #50521 at sha 7c595cb

@flutter-dashboard
Copy link

Golden file changes are available for triage from new commit, Click here to view.

Changes reported for pull request #50521 at sha ada402b

@jonahwilliams jonahwilliams added the autosubmit Merge PR when tree becomes green via auto submit App label Feb 17, 2024
@auto-submit auto-submit bot merged commit be0f6aa into flutter:main Feb 17, 2024
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Feb 17, 2024
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Feb 17, 2024
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Feb 17, 2024
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Feb 17, 2024
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Feb 17, 2024
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Feb 17, 2024
auto-submit bot pushed a commit to flutter/flutter that referenced this pull request Feb 17, 2024
…143652)

flutter/engine@e51d4f1...c807aea

2024-02-17 [email protected] Roll Skia from bb61c2b4614e to 0d2dbf53aef6 (3 revisions) (flutter/engine#50744)
2024-02-17 [email protected] [Impeller] add additional setup method that caches more pipelines, warms internal shader code (flutter/engine#50521)
2024-02-17 [email protected] [Impeller] Assign subpass depth on restore rather than creation. (flutter/engine#50626)
2024-02-17 [email protected] Roll Dart SDK from fa66195a3814 to 6d659f880394 (1 revision) (flutter/engine#50739)

If this roll has caused a breakage, revert this CL and stop the roller
using the controls here:
https://autoroll.skia.org/r/flutter-engine-flutter-autoroll
Please CC [email protected],[email protected],[email protected] on the revert to ensure that a human
is aware of the problem.

To file a bug in Flutter: https://github.com/flutter/flutter/issues/new/choose

To report a problem with the AutoRoller itself, please file a bug:
https://issues.skia.org/issues/new?component=1389291&template=1850622

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+doc/main/autoroll/README.md
@gaaclarke
Copy link
Member

@jonahwilliams did we ever get measurements on how InitializeCommonlyUsedShadersIfNeeded helped us?

With the way they are implemented now the ContentContext fires off tons of tasks asynchronously, then immediately synchronizes on a subset of them for InitializeCommonlyUsedShadersIfNeeded. If we want to do that it would be preferable if the ones we are synchronizing on are the first ones to get issued. Alternatively we can just append the initialize process to the creation process so we don't need to synchronize inside of ContentContext.

I'm look at this again because of flutter/flutter#145843 and what Jia Hao showed us the other day.

@jonahwilliams
Copy link
Member Author

Sorry I can't figure out how to share graphs at specific times:

image

@gaaclarke
Copy link
Member

It does not synchronize on them, CreateIfNeeded does not call WaitAndGet.

That's not what I'm seeing in a debugger:

impeller_unittests!impeller::RenderPipelineT<impeller::SolidFillVertexShader, impeller::SolidFillFragmentShader>::WaitAndGet() (/Users/aaclarke/dev/engine/src/flutter/impeller/renderer/pipeline.h:116)
impeller_unittests!impeller::RenderPipelineT<impeller::SolidFillVertexShader, impeller::SolidFillFragmentShader>* impeller::ContentContext::CreateIfNeeded<impeller::RenderPipelineT<impeller::SolidFillVertexShader, impeller::SolidFillFragmentShader>>(impeller::ContentContext::Variants<impeller::RenderPipelineT<impeller::SolidFillVertexShader, impeller::SolidFillFragmentShader>>&, impeller::ContentContextOptions) const (/Users/aaclarke/dev/engine/src/flutter/impeller/entity/contents/content_context.h:1028)
impeller_unittests!impeller::ContentContext::InitializeCommonlyUsedShadersIfNeeded() const (/Users/aaclarke/dev/engine/src/flutter/impeller/entity/contents/content_context.cc:614)
impeller_unittests!impeller::ContentContext::ContentContext(std::_fl::shared_ptr<impeller::Context>, std::_fl::shared_ptr<impeller::TypographerContext>, std::_fl::shared_ptr<impeller::RenderTargetAllocator>) (/Users/aaclarke/dev/engine/src/flutter/impeller/entity/contents/content_context.cc:465)
impeller_unittests!impeller::ContentContext::ContentContext(std::_fl::shared_ptr<impeller::Context>, std::_fl::shared_ptr<impeller::TypographerContext>, std::_fl::shared_ptr<impeller::RenderTargetAllocator>) (/Users/aaclarke/dev/engine/src/flutter/impeller/entity/contents/content_context.cc:264)
impeller_unittests!std::_fl::__unique_if<impeller::ContentContext>::__unique_single std::_fl::make_unique[abi:v15000]<impeller::ContentContext, std::_fl::shared_ptr<impeller::Context>&, std::_fl::shared_ptr<impeller::TypographerContext>, std::_fl::shared_ptr<impeller::RenderTargetAllocator>>(std::_fl::shared_ptr<impeller::Context>&, std::_fl::shared_ptr<impeller::TypographerContext>&&, std::_fl::shared_ptr<impeller::RenderTargetAllocator>&&) (/Users/aaclarke/dev/engine/src/third_party/libcxx/include/__memory/unique_ptr.h:714)
impeller_unittests!impeller::AiksContext::AiksContext(std::_fl::shared_ptr<impeller::Context>, std::_fl::shared_ptr<impeller::TypographerContext>, std::_fl::optional<std::_fl::shared_ptr<impeller::RenderTargetAllocator>>) (/Users/aaclarke/dev/engine/src/flutter/impeller/aiks/aiks_context.cc:23)

@jonahwilliams
Copy link
Member Author

That is calling wait and get on the pipeline prototype, not the newly created pipeline. Its possible that since those aren't done yet that we've added additional synchronization here we don't need, and instead we could thread it a bit more?

@jonahwilliams
Copy link
Member Author

We could also change this code to not depend on the prototype

@gaaclarke
Copy link
Member

@jonahwilliams oh does "RenderPipelineT" mean "pipeline prototype"? I think we should probably rename that and/or add a docstring. Would "PipelinePrototype" be more appropriate?

@jonahwilliams
Copy link
Member Author

No that is just the pipeline type.

I think you could make this whole thing non-blocking by just shoving it into a worker thread. That shoud be mostly safe. That said if you are interested in working on making startup faster this is not where I would squeeze. Do you wanna GVC for a bit?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
autosubmit Merge PR when tree becomes green via auto submit App e: impeller will affect goldens
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants