[QNN-EP]: Fix inference failures while running with htp_shared_memory #23892

Merged: 1 commit, Mar 5, 2025

quic-ashigarg
Contributor

Description

When the enable_htp_shared_memory feature is used, the address of the buffer passed to rpcmem_free is incorrect. As a result, the RPC buffers are never freed, which leads to memory exhaustion.

Motivation and Context

When the enable_htp_shared_memory_allocator feature is used for QNN in GenAI extensions, inference fails during the second prompt. Because GenAI workloads request more memory, the problem surfaces sooner in GenAI use cases.
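
A minimal sketch of this failure mode, written in Python rather than the EP's C++. The PR text does not say how the address went wrong, so the sketch assumes that the allocator hands the caller a pointer offset from the base address and that the free path must recover that base. `FakeRpcMem`, `HEADER_SIZE`, `run_prompts`, and the pool size are hypothetical and only illustrate why a wrong address leaks buffers until a later prompt fails.

```python
class FakeRpcMem:
    """Toy stand-in for an rpcmem-style pool; not the real rpcmem API."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.live = {}          # base address -> size of live allocation
        self.next_addr = 0x1000

    def alloc(self, size):
        if sum(self.live.values()) + size > self.capacity:
            raise MemoryError("rpc pool exhausted")
        base = self.next_addr
        self.next_addr += size
        self.live[base] = size
        return base

    def free(self, addr):
        # Only a base address handed out by alloc() releases memory.
        # Any other address is ignored here, so the buffer silently leaks.
        self.live.pop(addr, None)


HEADER_SIZE = 64  # hypothetical bookkeeping header in front of the user data


def run_prompts(pool, num_prompts, size, fixed):
    """Allocate and release one buffer per prompt; return prompts completed."""
    completed = 0
    for _ in range(num_prompts):
        try:
            base = pool.alloc(size + HEADER_SIZE)
        except MemoryError:
            break
        user_ptr = base + HEADER_SIZE  # address the caller sees
        # Buggy path frees the user pointer; fixed path recovers the base.
        pool.free(user_ptr - HEADER_SIZE if fixed else user_ptr)
        completed += 1
    return completed
```

With a pool that fits one buffer at a time, the buggy path completes the first prompt and fails on the second, which matches the symptom described above; the fixed path keeps running.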

When the htp_shared_memory feature is used, the buffer freed via
rpcmem_free is not passed the right address. This leads to memory
exhaustion and then to inference failure.
@HectorSVC
Contributor

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline

@HectorSVC
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline

Azure Pipelines successfully started running 10 pipeline(s).

Azure Pipelines successfully started running 8 pipeline(s).

@HectorSVC
Contributor

/azp run Linux OpenVINO CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

@HectorSVC added the ep:QNN (issues related to QNN execution provider) label Mar 5, 2025
@jywu-msft merged commit 788ca51 into microsoft:main Mar 5, 2025
68 of 74 checks passed
amarin16 pushed a commit that referenced this pull request Mar 5, 2025
guschmue pushed a commit that referenced this pull request Mar 6, 2025
amarin16 added a commit that referenced this pull request Mar 6, 2025
The second round of cherry-picks into
[rel-1.21.0](https://github.com/microsoft/onnxruntime/tree/rel-1.21.0).
The first one was done in
#23846.
- #23779
- #23856
- #23827
- #23834
- #23876
- #23892

---------

Co-authored-by: Jambay Kinley <[email protected]>
Co-authored-by: Yulong Wang <[email protected]>
Co-authored-by: Ashish Garg <[email protected]>
Co-authored-by: Ashish Garg <[email protected]>
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Mar 10, 2025
* Fix flash attention for GQA (Phi4) (microsoft#23850)

### Description
This change fixes GQA for Flash Attention on Nvidia GPUs. The root cause
appears to be the
`k_start + capped_sg_id < seq_causal_length`
check. This is either because:
a. seq_causal_length varies per lane, so the check becomes non-uniform
control flow, which interacts badly with subgroupShuffle,
or
b. the check itself is incorrect and wipes out values of v based on
the source lane's seq_causal_length, whereas the values of v
need to be causal with respect to the lane that multiplies them with qkt.

qkt is already causal because earlier values of qk for out-of-bounds k
are set to min_value, and the exponential of such values is effectively 0.

This fix works by removing that causal check and relying on the qk being
wiped out earlier. The documentation of the causality behavior for GQA is
insufficient to determine which of these two reasons is the true cause.

Prior to this fix, prompts with a sequence length between 16 and 32, or of 1k,
would break with Phi 4, while smaller prompts would work.
Tested on Intel Alderlake, Nvidia 4070.

* Model Builder API (microsoft#23223)

### Description
Supports creating a model programmatically using the ORT C or C++ API.
Supports augmenting an existing model to add nodes.

* Fix typo: change `Upample` to `Upsample`. (microsoft#23838)

### Description
Fixed a typo in function names related to the Upsample CUDA kernel.
Changed incorrect spelling Upample to Upsample across relevant
functions.

### Motivation and Context
This change is necessary to maintain consistency and prevent potential
confusion caused by incorrect function names.

* [doc] Fix typos in csharp/src/Microsoft.ML.OnnxRuntime/ (microsoft#23848)

### Description
Fix typos in csharp/src/Microsoft.ML.OnnxRuntime/

* Quant tool: Consistent `get_qdq_config` and `get_qnn_qdq_config` behavior (microsoft#23856)

* Change the logic to generate the default ep context file name (microsoft#23788)

Change the logic to generate the default ep context file name

### Description
Applies to all EPs: replace the `.onnx` suffix with `_ctx.onnx` instead of appending the extra string `_ctx.onnx` directly to the existing model path. In QNN EP, also shorten the context binary `.bin` file name by removing `QNNExecutionProvider_` from it.
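
The naming rule can be sketched as follows. This is a Python illustration, not the ORT implementation; both function names are hypothetical, and the handling of paths that do not end in `.onnx` is an assumption, since the description does not cover that case.

```python
def default_ep_context_path(model_path: str) -> str:
    """Hypothetical helper: derive the default EP context model file name."""
    suffix = ".onnx"
    if model_path.endswith(suffix):
        # New behavior described above: swap the suffix instead of appending.
        return model_path[: -len(suffix)] + "_ctx.onnx"
    # Assumption: paths without the .onnx suffix keep the append behavior.
    return model_path + "_ctx.onnx"


def old_ep_context_path(model_path: str) -> str:
    """Old behavior described above: append to the existing model path."""
    return model_path + "_ctx.onnx"
```

For `model.onnx`, the old rule yields `model.onnx_ctx.onnx` and the new rule yields `model_ctx.onnx`.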

* Make Nuget QNN package pipeline 1ES compliant (microsoft#23805)

### Description
Make
[QNN_Nuget_Windows](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1234) 1ES
compliant

* [js/common] allows using Uint16Array as data for float16 tensor (microsoft#23827)

### Description

Resolve microsoft#23817




* [js/webgpu] Reland the optimization of ConvTranspose (microsoft#23858)

This PR fixes the errors in the ConvTranspose optimization and adds
tests to ensure the correctness of the implementation.

* [OpenVINO] Fix a build warning (microsoft#23877)

### Description
Fix a warning with std::move usage



### Motivation and Context
Possibly allow building without --compile_no_warning_as_error flag

* Change gsl::byte to std::byte (microsoft#23872)

To be compatible with the latest GSL library. Without this fix we will
get:

```
onnxruntime\core\providers\cpu\controlflow\loop.cc(247): error C4996: 'gsl::byte': Use std::byte instead.
```

* Allow using extended minimal build for several EPs (microsoft#23834)

### Description

#### Background

From code search, the following EPs use
`onnxruntime::GetCpuPreferredNodes()` in their `GetCapabilities()`
methods:
- CANN
- CUDA
- DML
- JS
- ROCM
- WebGPU

However, the source file that implements
`onnxruntime::GetCpuPreferredNodes()` is excluded when minimal build is
ON:
https://github.com/microsoft/onnxruntime/blob/6df0973e58ba5399fcaa98686f70ed9a9e59aaef/cmake/onnxruntime_framework.cmake#L38-L42

This means that none of the EPs mentioned above can compile with a
minimal build.

#### Solution

The excluded file `core/framework/fallback_cpu_capability.cc` cannot
build in minimal build because some of its dependencies are not included
in the minimal build. However, in extended minimal build mode, all
dependencies are available.

This PR loosens the restriction and allows this file to be compiled in an
extended minimal build. After this change, those EPs are able to compile
in an extended minimal build.

* Add dawn to ThirdPartyNotices (microsoft#23876)

### Description

Add `dawn` to ThirdPartyNotices.

* Enable QNN EP weight sharing generation using public API (microsoft#23702)

### Description
Enable QNN EP weight sharing generation using the public API instead of internal interfaces, so that users can integrate it into their own toolchain. The change shares the QnnBackendManager across ORT sessions if ep.share_ep_contexts is enabled. There is also an extra option to end the sharing, so that we know when to remove the shared QnnBackendManager from the singleton.

Change the tool name from onnxruntime_qnn_ctx_gen to ep_weight_sharing_ctx_gen, so that it can be shared with other EPs.

* [QNN-EP]: Fix inference failures while running with htp_shared_memory (microsoft#23892)


* Fix enable_pix_capture build for WebGPU (microsoft#23857)

The build option --enable_pix_capture is broken. This fixes the problem.

---------

Co-authored-by: wp <[email protected]>

* [WebGPU-EP Native] Add ReduceMean (microsoft#23860)


* [WebGPU EP] introduce BiasAdd contrib op (microsoft#23861)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Dynamo export and improve benchmark script for SAM2 encoder (microsoft#23887)

### Description
* Add dynamo export for Sam2 image encoder
* Verify fp32 onnx model with CPU EP (to avoid error message from TRT
EP).
* Update benchmark script:
  - output ORT profiling
  - output torch compiled code and unique kernel name for compiled kernel
  - add an option for nightly package installation
  - uninstall existing ort packages before installing

The node metadata of the dynamo-exported model can help map nodes in the ONNX
model back to the PyTorch modeling script. Currently, graph optimization
is not performed on the dynamo-exported model, so this is experimental for now.

### Motivation and Context

To support profiling of torch compiled CUDA kernel.

* [js/web] improve workaround for bundlers (microsoft#23902)

### Description
This PR improves the workaround for bundlers in onnxruntime-web.
Specifically, the following changes have been made:

- Use [this
workaround](xenova@9c50aa2)
as suggested by @xenova in
huggingface/transformers.js#1161 (comment)

- Use `url > "file:" && url < "file;"` instead of
`url.startsWith("file:")` to allow minifiers to remove dead code
correctly.

This change allows removing unnecessary dependencies on the file parsed
from `new URL("ort.bundle.min.js", import.meta.url)` in Vite, and
optimizing code like `if("file://filepath.js".startsWith("file:"))
{do_sth1(); } else {do_sth2();}` into `do_sth1()` for webpack/terser
usages.
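
The range comparison can be checked with a short sketch. It is written in Python for illustration; the actual code is JavaScript, which compares strings by UTF-16 code unit, while Python compares by code point. The two agree for the ASCII boundary characters used here. The function name is hypothetical.

```python
def starts_with_file_scheme(url: str) -> bool:
    """Range check that holds only for strings beginning with "file:"."""
    # ';' is the character right after ':' in ASCII, so every string strictly
    # between "file:" and "file;" must begin with "file:".
    return "file:" < url < "file;"
```

One difference from a plain prefix test: the bare string `"file:"` is not strictly greater than itself, so the range check rejects it, whereas `startsWith("file:")` accepts it. For real URLs with a path after the scheme, the two tests agree.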

Resolves huggingface/transformers.js#1161

* [webgpu] Restore MatMulNBits workgroup size for Phi-3.5 (microsoft#23349)

### Description
This change restores the MatMulNBits workgroup size from (8, 8, 1) back
to (16, 8, 1) to resolve a performance regression observed on Intel
iGPUs during token generation (M=1).

### Motivation and Context
As above.

Signed-off-by: Jianhui Dai <[email protected]>

* [webgpu] support Pad operator (microsoft#23141)


* [WebNN] Accept Float16Array for float16 data type if it is available (microsoft#23894)

Float16Array is now shipping and WebNN Chromium implementation has
accepted it. We should allow it in WebNN EP as well.

* Ensure that the 'cmake_minimum_required' is version 3.5 or greater (microsoft#23888)

### Description
CMake 4.0 release candidate 2.0 is available, and it cannot compile all
of OnnxRuntime out of the box. There are portions of the OnnxRuntime
codebase that specify a `cmake_minimum_required` version of 3.0, and
CMake 4.0 has removed support for compatibility with CMake < 3.5. The
following error is reported:

```
CMake Error at winml_sdk_helpers.cmake:4 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```

Since CMake 3.5 appears to have shipped in 2016, it seems reasonable to
set that as a minimum version to fix the error. The root CMakeLists.txt
does ask for a minimum version of 3.28, so we could snap to that, but
I'm still ramping up on the build, so wanted to propose a minimally
sufficient fix.

### Motivation and Context
Being able to build with the latest CMake - when it ships - reduces the
barrier to entry to building OnnxRuntime, and allows the OnnxRuntime to
leverage the latest and greatest tooling.

* WebGPU: Remove deprecated subgroups-f16 from WebGPU native and JS EP (microsoft#23898)

This PR removes the deprecated subgroups-f16 from WebGPU native and JS
EP, and also removes the unused deviceInfo in WebGPU JS EP.

* [JSEP/WebGPU] Fixed error in softmax dispatch. (microsoft#23906)

### Description
Fixed an error in softmax dispatch



### Motivation and Context
Produce expected results for the LLaMA model

* enable WebGPU EP in WebAssembly build (microsoft#23913)

### Description

This PR is the first step for migrating the webgpu backend of
onnxruntime-web from JSEP based to WebGPU EP based.

In this change, we enable building WebGPU EP in a wasm build (i.e.,
`--build_wasm` `--use_webgpu` `--use_jsep`). However, the old build
flags should still keep their previous behavior.

* Adding OpenVINO Windows CI Pipeline (microsoft#23919)

### Description
Enable an OpenVINO Windows CI pipeline. This includes:
- Downloading the OpenVINO toolkit for Windows from an external source.
- Setting up OpenVINO environment variables.
- Building the ONNX Runtime OpenVINO Execution Provider.
- Running unit tests.

### Motivation and Context
This change is required to run checks on precommit and commit in the
ONNX Runtime project. It ensures that the code is tested with the
OpenVINO toolkit on Windows, improving the reliability and compatibility
of the project.

* [WebGPU EP] SoftMax Implementation (microsoft#23538)

Increase coverage for WebGPU Op

* Exclude MAUI projects from GPU C# packaging builds (microsoft#23923)

### Description
Use 'desktop only' solution in GPU C# packaging builds. We don't need to
include any MAUI support for those builds.

* Support all block sizes that are multiples of 32 for DP4A (microsoft#23907)

### Description
Simple change 
1. The DP4A shader actually supports all block sizes that are multiples
of 32, relaxing the restriction and making a small tweak to support
sizes other than 32.
2. Moved the shader to a separate file for maintainability.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Example custom op with output type inferencing (microsoft#23916)

### Description
Add example of a custom op that is required to do type inference for the
output type for the model load to work.
Also acts as an example of how to override an ONNX op with a custom
implementation.

### Motivation and Context
microsoft#23891

* Enabling L2+ Optimizations for EPs (microsoft#23517)

Some graph modifications are specific to the EP/hardware.
ORT has a hard-coded EP list for optimizations, but that can't scale and
is hard to extend to enable EP custom optimizations.

Here is the prototype to enable L2+ optimizations for EPs (the original
overview was provided by @skottmckay), as well as the TRT EP
implementation of the ConstantFoldingDQ optimization.

Signatures for selection and optimization functions:
````
  - Selection: std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&, const KeyValueConfig&)>
  - Optimization: std::function<Status(const Graph&, const ComputeCapability& this_optimization, ComputeCapability& cc_to_update)>
````
GetCapability

- call (new) provider bridge API to lookup pre-defined optimizer by name
and get selection function
- ComputeCapability.optimize_func, i.e. optimization function, would be
set by the optimizer to the function that does the optimization

- EP has to update the returning ComputeCapability to include the
optimization ComputeCapability in nodes_to_optimize. So that later ORT
can perform optimization/transformation accordingly.

GraphPartitioner

- After assigning the ComputeCapability to the EP and prior to Compile,
if the ComputeCapability has nodes_to_optimize, iterate that list
  - optimization function needs to be called with
    - a mutable Graph instance
    - the ComputeCapability for the individual optimization
    - the overall ComputeCapability so it can be updated
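
The flow above can be sketched in Python as a rough analogue of the C++ design. The field names `optimize_func` and `nodes_to_optimize` come from the text; the dataclass layout, `apply_ep_optimizations`, and the toy `fold_dq` optimizer are hypothetical and are not the ORT API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ComputeCapability:
    """Python stand-in for the C++ type; field names follow the text above."""
    nodes: List[str]
    optimize_func: Optional[Callable] = None
    nodes_to_optimize: List["ComputeCapability"] = field(default_factory=list)


def apply_ep_optimizations(graph: dict, capability: ComputeCapability) -> None:
    """Run each attached optimization before the EP compiles the capability."""
    for optimization in capability.nodes_to_optimize:
        if optimization.optimize_func is not None:
            # Arguments mirror the list above: the mutable graph, the individual
            # optimization capability, and the overall capability to update.
            optimization.optimize_func(graph, optimization, capability)


def fold_dq(graph, this_optimization, cc_to_update):
    """Hypothetical optimizer: drop the selected nodes from graph and capability."""
    for name in this_optimization.nodes:
        graph["nodes"].remove(name)
        if name in cc_to_update.nodes:
            cc_to_update.nodes.remove(name)
```

The key design point is that the optimizer receives the overall capability as well, so the set of nodes the EP later compiles stays consistent with the transformed graph.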

* fix binplace file in web pipeline (microsoft#23930)

* Updated run_CIs_for_external_pr.py to support the Windows OpenVINO CI pipeline (microsoft#23931)

* Fix ConvInteger handling of optional inputs. (microsoft#23935)

### Description
Fix ConvInteger handling of optional inputs. Need to check Exists() and
not just the number of inputs.

### Motivation and Context
microsoft#23927
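
A small Python sketch of why counting inputs is not enough. It assumes the ONNX convention that an omitted optional input is written as an empty name, so a later optional input can still be present. The helper names are hypothetical; the real fix checks `Exists()` in the C++ kernel.

```python
def optional_input(inputs, index):
    """Return the input name at `index`, or None when it is absent."""
    if index >= len(inputs):      # counting inputs alone is not enough...
        return None
    return inputs[index] or None  # ...because the slot itself may be empty


def zero_points(inputs):
    """ConvInteger inputs: x, w, x_zero_point (optional), w_zero_point (optional)."""
    return optional_input(inputs, 2), optional_input(inputs, 3)
```

A node with inputs `["x", "w", "", "w_zp"]` has four inputs, yet `x_zero_point` is absent. A check based only on the input count would treat it as present.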

* Updated ov version in pipeline (#595) (microsoft#23882)

### Description
This PR updates the OpenVINO version used in the pipeline from 2024.5.0
to 2025.0.0

Co-authored-by: jatinwadhwa921 <[email protected]>

* [AIX] External data handling (microsoft#23859)

### Description
On big-endian (BE) systems, model tensor data loaded from an external file
is not handled properly.
This was found while debugging
(microsoft/onnxruntime-genai#1104)

This PR performs the endianness conversion of data loaded from an
external file on BE systems.
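
A minimal sketch of the conversion, assuming the external file stores tensor elements in little-endian order. Function names are hypothetical, and the `host_byteorder` parameter exists only so the big-endian path can be exercised on any machine.

```python
import struct
import sys


def to_host_order(raw: bytes, elem_size: int, host_byteorder: str = sys.byteorder) -> bytes:
    """Byte-swap little-endian element data when the host is big-endian."""
    if host_byteorder == "little" or elem_size == 1:
        return raw  # already in host order
    if len(raw) % elem_size != 0:
        raise ValueError("buffer length is not a multiple of the element size")
    # Reverse the bytes of each element; element order is unchanged.
    return b"".join(
        raw[i:i + elem_size][::-1] for i in range(0, len(raw), elem_size)
    )


def decode_float32(raw: bytes, host_byteorder: str = sys.byteorder) -> tuple:
    """Decode float32 values the way a host of the given endianness would."""
    fmt = (">" if host_byteorder == "big" else "<") + f"{len(raw) // 4}f"
    return struct.unpack(fmt, to_host_order(raw, 4, host_byteorder))
```

The swap is per element, so the element size has to be known from the tensor's data type; single-byte types need no conversion.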

* Create a packaging pipeline for a custom nuget package (microsoft#23918)

* Fix license in example test code. (microsoft#23936)

* replace usage of gsl::narrow and gsl::narrow_cast in WebGPU EP (microsoft#23926)

### Description

`gsl::narrow` does not work in a no-exception build.
- use `onnxruntime::narrow` if necessary;
- or change to `static_cast` if it's obviously safe.

Also apply the changes to usages of `gsl::narrow_cast`, which does not
apply checks.
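
The difference between a checked and an unchecked narrowing conversion can be illustrated in Python. This is only an analogue: Python integers do not overflow, so the unchecked variant emulates a two's-complement cast to a 32-bit signed integer. Both function names are hypothetical.

```python
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def narrow_checked(value: int) -> int:
    """Checked narrowing to int32: fail loudly if the value does not fit."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in int32")
    return value


def narrow_unchecked(value: int) -> int:
    """Unchecked narrowing: keep the low 32 bits, as a two's-complement cast would."""
    low = value & 0xFFFFFFFF
    return low - 2 ** 32 if low > INT32_MAX else low
```

The checked form is the right default when a value comes from model or user data; the unchecked form is only appropriate where the range is known to be safe, which corresponds to the "obviously safe" case above.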

* VCPKG improvement: set VCPKG_OSX_DEPLOYMENT_TARGET (microsoft#23933)

### Description
1. Set VCPKG_OSX_DEPLOYMENT_TARGET for macOS targets
2. Enable VCPKG in more pipelines.

* Allow using a different version of flatbuffers when building with vcpkg (microsoft#23946)

### Description
Allow using a different version of flatbuffers when building with vcpkg,
so that users do not need to pin flatbuffer's version, which provides
more flexibility in the build process.

Delete utf8_range from the dependencies, because it is an indirect
dependency of protobuf, which is already included in the build process.

* Make python package pipeline 1ES compliant (microsoft#23800)

### Description
Make [Python packaging
pipeline](https://aiinfra.visualstudio.com/530acbc4-21bc-487d-8cd8-348ff451d2ff/_build?definitionId=841)
1ES compliant




### Checklist

- [x] Make Onnxruntime-QNNEP-Windows-2022-CPU stateless

* Delete ROCM Nuget Publishing Pipeline (microsoft#23948)

* Bump SixLabors.ImageSharp from 2.1.9 to 2.1.10 in /csharp/sample/Microsoft.ML.OnnxRuntime.FasterRcnnSample (microsoft#23924)

Bumps [SixLabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.9 to 2.1.10.
Release notes (sourced from SixLabors.ImageSharp's releases):

v2.1.10, What's Changed:
- Backport #2859 to release/2.1.x by @antonfirsov in SixLabors/ImageSharp#2890
- Backport #2701 to 2.1.x [copy] by @antonfirsov in SixLabors/ImageSharp#2891

Full Changelog: https://github.com/SixLabors/ImageSharp/compare/v2.1.9...v2.1.10

Commits:
- `d133ef9` Set lang version
- `5dfe5a8` Missed cache action update
- `4d3a851` Use latest cache action
- `4cb9f40` Merge pull request #2891 from SixLabors/af/backport-2701
- `bb82f79` #2701 to 2.1.x [copy]
- `627b5f7` Merge pull request #2890 from SixLabors/af/backport-2859
- `67f7848` try to fix LFS for *.BMP
- `44d294e` 8.0.x is not needed
- `adb85d9` Another attempt for a Linux-specific skip
- `efc3fc4` Disable BmpDecoder_CanDecode_Os2BitmapArray on Linux
- Additional commits viewable in the compare view: https://github.com/SixLabors/ImageSharp/compare/v2.1.9...v2.1.10



Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: Jianhui Dai <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Sushanth Rajasankar <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: Seungtaek Kim <[email protected]>
Co-authored-by: co63oc <[email protected]>
Co-authored-by: Jambay Kinley <[email protected]>
Co-authored-by: Hector Li <[email protected]>
Co-authored-by: Jian Chen <[email protected]>
Co-authored-by: Yulong Wang <[email protected]>
Co-authored-by: Jiajia Qin <[email protected]>
Co-authored-by: Alessio Soldano <[email protected]>
Co-authored-by: Changming Sun <[email protected]>
Co-authored-by: Ashish Garg <[email protected]>
Co-authored-by: Ashish Garg <[email protected]>
Co-authored-by: Jie Chen <[email protected]>
Co-authored-by: wp <[email protected]>
Co-authored-by: Satya Kumar Jandhyala <[email protected]>
Co-authored-by: Prathik Rao <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Jianhui Dai <[email protected]>
Co-authored-by: xhcao <[email protected]>
Co-authored-by: Wanming Lin <[email protected]>
Co-authored-by: Mark Schofield <[email protected]>
Co-authored-by: jiangzhaoming <[email protected]>
Co-authored-by: Yi-Hong Lyu <[email protected]>
Co-authored-by: vraspar <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
Co-authored-by: saurabh <[email protected]>
Co-authored-by: Ranjit Ranjan <[email protected]>
Co-authored-by: Baiju Meswani <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025
jatinwadhwa921 added a commit to intel/onnxruntime that referenced this pull request Apr 10, 2025
* Quant tool: Add `nodes_to_exclude` in `get_qnn_qdq_config` (#23779)

* [ORT/CI_Pipeline] Use --enable_generic_interface in ORT builds for EP testing (#23801)

Summary of changes:
- Changed the OpenVINO test case to use --enable_generic_interface
- Changed the TensorRT test case to use --enable_generic_interface
- Fixed ORT builds to use USE_FULL_PROTOBUF, as OpenVINO/TensorRT require
it
- Fixed a preprocessor macro definition which accidentally got removed when
ORT is built without an EP

Co-authored-by: Karim Vadsariya <[email protected]>

* Increase npm package pipeline ReactNative_CI_iOS timeout to 120 mins (#23825)

### Description
Increase [npm package
pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1080&_a=summary)
ReactNative_CI_iOS timeout to 120 mins




* [Mlas] Unblock hardcoded matmul blocking size (#23815)

### Description

In GemmBatch, the target matrix is cut into blocks that are dispatched to
multiple threads for intra-op parallelism.

Currently the block size is hard-coded to 16. If the CPU has > 16 cores,
the cores are not fully utilized in one op.

This change lifts the hard-coded limit on the number of blocks in various MatMul implementations.
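
A one-dimensional simplification of the effect, in Python. It reads the description as a cap of 16 on the number of blocks handed to threads, which is an assumption because the text speaks of a "block size". `partition_rows` and its parameters are hypothetical and do not reflect the MLAS code.

```python
def partition_rows(total_rows: int, num_threads: int, max_blocks=None):
    """Split `total_rows` into contiguous blocks, at most one per worker thread."""
    if total_rows <= 0 or num_threads <= 0:
        return []
    blocks = min(num_threads, total_rows)
    if max_blocks is not None:
        blocks = min(blocks, max_blocks)  # the hard-coded cap being modeled
    base, extra = divmod(total_rows, blocks)
    ranges, start = [], 0
    for i in range(blocks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges
```

With 64 threads and a cap of 16, only 16 blocks exist, so at most 16 threads have work; without the cap, each of the 64 threads receives a block.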

__Benchmark results__

Model:
llmlingua-2-bert-base-multilingual-cased-meetingbank--add-force-token-100--max-seq-len-512-CPU-INT8.onnx
set up: 96 core x86 linux

Before: 
Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.485097 s
First inference time cost: 356 ms
Total inference time cost: 17.731 s
Total inference requests: 50
__Average inference time cost: 354.619 ms__
Total inference run time: 17.7312 s
Number of inferences per second: 2.81989
Avg CPU usage: 65 %
Peak working set size: 542265344 bytes
Avg CPU usage:65
Peak working set size:542265344

After:

Setting intra_op_num_threads to 32
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.523394 s
First inference time cost: 316 ms
Total inference time cost: 12.2739 s
Total inference requests: 50
__Average inference time cost: 245.478 ms__
Total inference run time: 12.2741 s
Number of inferences per second: 4.07362
Avg CPU usage: 33 %
Peak working set size: 611241984 bytes
Avg CPU usage:33
Peak working set size:611241984


Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.497698 s
First inference time cost: 289 ms
Total inference time cost: 9.49205 s
Total inference requests: 50
__Average inference time cost: 189.841 ms__
Total inference run time: 9.49226 s
Number of inferences per second: 5.26745
Avg CPU usage: 65 %
Peak working set size: 548470784 bytes
Avg CPU usage:65
Peak working set size:548470784
Runs:50
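
As a quick check on the figures above, the speedup implied by the reported averages can be computed directly. The numbers are copied from the benchmark output; the variable names are illustrative.

```python
# Average inference time cost in milliseconds, from the runs above.
before_64_threads = 354.619
after_32_threads = 245.478
after_64_threads = 189.841

# Same thread count before and after the change.
speedup_same_threads = before_64_threads / after_64_threads
# After the change with half the threads, versus before with 64 threads.
speedup_half_threads = before_64_threads / after_32_threads

print(f"64 threads: {speedup_same_threads:.2f}x")
print(f"32 threads vs. old 64 threads: {speedup_half_threads:.2f}x")
```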

### Motivation and Context
This issue is reported by M365 research team.

* Revert changes on mac-react-native-ci-pipeline.yml (#23845)


* Fix flash attention for GQA (Phi4) (#23850)

### Description
This change fixes GQA for Flash Attention on Nvidia GPUs. The root cause
appears to be the
`k_start + capped_sg_id < seq_causal_length`
check. This is either because:
a. seq_causal_length varies per lane, so the check becomes non-uniform
control flow, which interacts with subgroupShuffle,
or
b. the check itself is incorrect and wipes out values of v based on
the source lane's seq_causal_length, whereas the values of v actually
need to be causal with respect to the lane that is going to multiply them with qkt.

qkt is already causal because earlier values of qk for out of bounds k
are set to min_value, and exp(<-4) are 0.
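
The argument that masking qk is enough can be illustrated with a minimal sketch. This is plain Python rather than the WGSL shader; `MIN_VALUE`, `causal_attention_row`, and the scalar values are assumptions chosen for illustration.

```python
import math

# Assumption: a stand-in for the shader's min_value (lowest finite fp16).
MIN_VALUE = -65504.0


def causal_attention_row(q_pos, scores, values):
    """Softmax-weighted sum for one query position.

    Scores for key positions beyond q_pos are replaced by MIN_VALUE,
    so their softmax weights underflow to 0 and the matching entries
    of `values` contribute nothing, without masking `values` itself.
    """
    masked = [s if k <= q_pos else MIN_VALUE for k, s in enumerate(scores)]
    peak = max(masked)
    weights = [math.exp(s - peak) for s in masked]
    total = sum(weights)
    weights = [w / total for w in weights]
    output = sum(w * v for w, v in zip(weights, values))
    return weights, output


weights, output = causal_attention_row(
    q_pos=1,
    scores=[0.5, 1.0, 2.0, 3.0],
    values=[1.0, 2.0, 100.0, 1000.0],
)
print(weights, output)
```

The large entries of `values` at positions 2 and 3 never reach the output, because their weights are exactly zero after the exponential.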

This fix works by removing that causal check and relying on the qk values being
wiped out earlier. Documentation of the causality behavior for GQA is
missing, so it is unclear which of these is the true cause.

Prior to this fix, prompts with sequence length > 16 and < 32, or 1k, would break
with Phi 4, but smaller prompts would work.
Tested on Intel Alderlake, Nvidia 4070.

* Model Builder API (#23223)

### Description
Supports creating a model programmatically using the ORT C or C++ API. 
Supports augmenting an existing model to add nodes.


* Fix typo: change `Upample` to `Upsample`. (#23838)

### Description
Fixed a typo in function names related to the Upsample CUDA kernel.
Changed incorrect spelling Upample to Upsample across relevant
functions.


### Motivation and Context
This change is necessary to maintain consistency and prevent potential
confusion caused by incorrect function names.

* [doc] Fix typos in csharp/src/Microsoft.ML.OnnxRuntime/ (#23848)

### Description
Fix typos in csharp/src/Microsoft.ML.OnnxRuntime/



* Quant tool: Consistent `get_qdq_config` and `get_qnn_qdq_config` behavior (#23856)

* Change the logic to generate the default ep context file name (#23788)

Change the logic to generate the default ep context file name

### Description
Applies to all EPs: replace the .onnx extension with _ctx.onnx, instead of directly appending the extra string _ctx.onnx to the existing model path. In QNN EP, also shorten the context binary .bin file name by removing QNNExecutionProvider_ from it.
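
The naming change can be sketched as follows. This is an illustration in Python, not the ORT code; the function names are hypothetical, and the fallback for paths that do not end in `.onnx` is an assumption, since the description does not cover that case.

```python
def old_default_ctx_path(model_path: str) -> str:
    """Previous behavior as described: append _ctx.onnx to the full path."""
    return model_path + "_ctx.onnx"


def new_default_ctx_path(model_path: str) -> str:
    """New behavior as described: replace a trailing .onnx with _ctx.onnx."""
    suffix = ".onnx"
    if model_path.endswith(suffix):
        return model_path[: -len(suffix)] + "_ctx.onnx"
    # Assumption: without a .onnx extension, fall back to appending.
    return model_path + "_ctx.onnx"


print(old_default_ctx_path("models/phi.onnx"))
print(new_default_ctx_path("models/phi.onnx"))
```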

* Make Nuget QNN package pipeline 1ES compliant (#23805)

### Description
Make
[QNN_Nuget_Windows](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1234) 1ES
compliant




* [js/common] allows using Uint16Array as data for float16 tensor (#23827)

### Description

Resolve #23817




* [js/webgpu] Reland the optimization of ConvTranspose (#23858)

This PR fixes the errors in the ConvTranspose optimization and adds
tests to ensure the correctness of the implementation.

* [OpenVINO] Fix a build warning (#23877)

### Description
Fix a warning with std::move usage



### Motivation and Context
Possibly allow building without --compile_no_warning_as_error flag

* Change gsl::byte to std::byte (#23872)

To be compatible with the latest GSL library. Without this fix we will
get:

```
onnxruntime\core\providers\cpu\controlflow\loop.cc(247): error C4996: 'gsl::byte': Use std::byte instead.
```

* Allow using extended minimal build for several EPs (#23834)

### Description

#### Background

From code search, the following EPs use
`onnxruntime::GetCpuPreferredNodes()` in their `GetCapabilities()`
methods:
- CANN
- CUDA
- DML
- JS
- ROCM
- WebGPU

However, the source file that implements
`onnxruntime::GetCpuPreferredNodes()` is excluded when minimal build is
ON:
https://github.com/microsoft/onnxruntime/blob/6df0973e58ba5399fcaa98686f70ed9a9e59aaef/cmake/onnxruntime_framework.cmake#L38-L42

This means that none of the EPs mentioned above can compile with a
minimal build.

#### Solution

The excluded file `core/framework/fallback_cpu_capability.cc` cannot
build in minimal build because some of its dependencies are not included
in the minimal build. However, in extended minimal build mode, all
dependencies are available.

This PR loosens the restriction and allows this file to be compiled in an
extended minimal build. After this change, those EPs are able to compile
in extended minimal build.

* Add dawn to ThirdPartyNotices (#23876)

### Description

Add `dawn` to ThirdPartyNotices.

* Enable QNN EP weight sharing generation using public API (#23702)

### Description
Enable QNN EP weight sharing generation using the public API instead of internal interfaces, so that users can integrate it into their own toolchain. The change shares the QnnBackendManager across ORT sessions if ep.share_ep_contexts is enabled. There is also an extra option to end the sharing, so that we know when to remove the shared QnnBackendManager from the singleton.

Change the tool name from onnxruntime_qnn_ctx_gen to ep_weight_sharing_ctx_gen, so that it can be shared with other EPs.

* [QNN-EP]: Fix inference failures while running with htp_shared_memory (#23892)

### Description
When using the enable_htp_shared_memory feature, we see that the address
of the buffer passed to rpcmem_free is incorrect. As a result, the RPC buffers are
not freed, leading to memory exhaustion.

### Motivation and Context
When using the enable_htp_shared_memory_allocator feature for QNN in
GenAI extensions, this leads to inference failures during the second
prompt. Because GenAI memory demands are higher, the problem surfaces sooner in GenAI
use cases.

Co-authored-by: Ashish Garg <[email protected]>

* Fix enable_pix_capture build for WebGPU (#23857)

The build option --enable_pix_capture is broken. This fixes the problem.

---------

Co-authored-by: wp <[email protected]>

* [WebGPU-EP Native] Add ReduceMean (#23860)


* [WebGPU EP] introduce BiasAdd contrib op (#23861)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Dynamo export and improve benchmark script for SAM2 encoder (#23887)

### Description
* Add dynamo export for Sam2 image encoder
* Verify the fp32 ONNX model with the CPU EP (to avoid error message from TRT
EP).
* Update benchmark script:
  - output ORT profiling
  - output torch compiled code and unique kernel name for compiled kernel
  - add an option for nightly package installation
  - uninstall existing ort packages before installing

The node metadata of the dynamo-exported model can help map nodes in the ONNX
model back to the PyTorch modeling script. Currently, graph optimization
is not done on the dynamo-exported model, so it is experimental right now.

### Motivation and Context

To support profiling of torch compiled CUDA kernel.

* [js/web] improve workaround for bundlers (#23902)

### Description
This PR improves the workaround for bundlers in onnxruntime-web.
Specifically, the following changes have been made:

- Use [this
workaround](https://github.com/xenova/onnxruntime/commit/9c50aa2c63bad4cb73ad77ff1c43e0c43da0907f)
as suggested by @xenova in
https://github.com/huggingface/transformers.js/pull/1161#issuecomment-2695785730

- Use `url > "file:" && url < "file;"` instead of
`url.startsWith("file:")` to allow minifiers to remove dead code
correctly.

This change makes it possible to remove unnecessary dependencies on the file parsed
from `new URL("ort.bundle.min.js", import.meta.url)` in Vite, and to
optimize code like `if("file://filepath.js".startsWith("file:"))
{do_sth1(); } else {do_sth2();}` into `do_sth1()` for webpack/terser
usages.
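
The range comparison works because `;` is the character immediately after `:` in code-point order, so the strings that sort strictly between `"file:"` and `"file;"` are those that begin with `"file:"` and have at least one more character. The original code is JavaScript; the sketch below reproduces only the ordering argument in Python, with a hypothetical function name, and does not model how a minifier folds the constant expression.

```python
def looks_like_file_url(url: str) -> bool:
    """Range check analogous to `url > "file:" && url < "file;"`."""
    return "file:" < url < "file;"


samples = [
    "file:///home/user/ort.bundle.min.js",
    "file:x",
    "file:",          # the bare prefix is excluded by the strict '>'
    "file;x",
    "files://x",
    "https://example.com/ort.bundle.min.js",
    "",
]
for s in samples:
    print(repr(s), looks_like_file_url(s), s.startswith("file:"))
```

The only input where the two checks disagree is the bare string `"file:"`, which is not a usable URL in this context.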

Resolves https://github.com/huggingface/transformers.js/pull/1161

* [webgpu] Restore MatMulNBits workgroup size for Phi-3.5 (#23349)

### Description
This change restores the MatMulNBits workgroup size from (8, 8, 1) back
to (16, 8, 1) to resolve a performance regression observed on Intel
iGPUs during token generation (M=1).

### Motivation and Context
As above.

Signed-off-by: Jianhui Dai <[email protected]>

* [webgpu] support Pad operator (#23141)


* [WebNN] Accept Float16Array for float16 data type if it is available (#23894)

Float16Array is now shipping and WebNN Chromium implementation has
accepted it. We should allow it in WebNN EP as well.

* Ensure that the 'cmake_minimum_required' is version 3.5 or greater (#23888)

### Description
CMake 4.0 release candidate 2.0 is available, and it cannot compile all
of OnnxRuntime out-of-the-box. There are portions of the OnnxRuntime
codebase that specify a `cmake_minimum_required` version of 3.0, and
CMake 4.0 has removed support for compatibility with CMake < 3.5. The
following error is reported:

```
CMake Error at winml_sdk_helpers.cmake:4 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```

Since CMake 3.5 appears to have shipped in 2016, it seems reasonable to
set that as a minimum version to fix the error. The root CMakeLists.txt
does ask for a minimum version of 3.28, so we could snap to that, but
I'm still ramping up on the build, so wanted to propose a minimally
sufficient fix.

### Motivation and Context
Being able to build with the latest CMake - when it ships - reduces the
barrier to entry to building OnnxRuntime, and allows the OnnxRuntime to
leverage the latest and greatest tooling.

* WebGPU: Remove deprecated subgroups-f16 from WebGPU native and JS EP (#23898)

This PR removes the deprecated subgroups-f16 from WebGPU native and JS
EP, and also removes the unused deviceInfo in WebGPU JS EP.

* [JSEP/WebGPU] Fixed error in softmax dispatch. (#23906)

### Description
Fixed an error in softmax dispatch



### Motivation and Context
Produce expected results for the LLaMA model

* enable WebGPU EP in WebAssembly build (#23913)

### Description

This PR is the first step for migrating the webgpu backend of
onnxruntime-web from JSEP based to WebGPU EP based.

In this change, we enable building WebGPU EP in a wasm build (i.e.
`--build_wasm` `--use_webgpu` `--use_jsep`). However, the old build
flags should still keep the previous behavior.

* Adding OpenVINO Windows CI Pipeline (#23919)

### Description

Enable an OpenVINO Windows CI pipeline. This includes:
- Downloading the OpenVINO toolkit for Windows from an external source.
- Setting up OpenVINO environment variables.
- Building the ONNX Runtime OpenVINO Execution Provider.
- Running unit tests.

### Motivation and Context

This change is required to run checks on precommit and commit in the
ONNX Runtime project. It ensures that the code is tested with the
OpenVINO toolkit on Windows, improving the reliability and compatibility
of the project.

* [WebGPU EP] SoftMax Implementation (#23538)

Increase coverage for WebGPU Op

* Exclude MAUI projects from GPU C# packaging builds (#23923)

### Description
Use 'desktop only' solution in GPU C# packaging builds. We don't need to
include any MAUI support for those builds.


* Support all block sizes that are multiples of 32 for DP4A (#23907)

### Description
Simple change:
1. The DP4A shader actually supports all block sizes that are multiples
of 32, so this relaxes the restriction and makes a small tweak to support
sizes other than 32.
2. Moved the shader to a separate file for maintainability.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Example custom op with output type inferencing (#23916)

### Description
Add example of a custom op that is required to do type inference for the
output type for the model load to work.
Also acts as an example of how to override an ONNX op with a custom
implementation.

### Motivation and Context
#23891

* Enabling L2+ Optimizations for EPs (#23517)

There are some requirements to modify the graph that are specific to
the EP/hardware.
ORT has a hardcoded EP list for optimizations, but that does not scale and
is hard to extend to enable EP custom optimizations.

Here is the prototype to enable L2+ optimizations for EPs (the original
overview was provided by @skottmckay) as well as the TRT EP
implementation for the ConstantFoldingDQ optimization.

Signatures for selection and optimization functions:
````
  - Selection: std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&, const KeyValueConfig&)>
  - Optimization: std::function<Status(const Graph&, const ComputeCapability& this_optimization, ComputeCapability& cc_to_update)>
````
GetCapability

- call the (new) provider bridge API to look up a pre-defined optimizer by name
and get its selection function
- ComputeCapability.optimize_func, i.e. the optimization function, would be
set by the optimizer to the function that does the optimization

- the EP has to update the returned ComputeCapability to include the
optimization ComputeCapability in nodes_to_optimize, so that ORT
can later perform the optimization/transformation accordingly.

GraphPartitioner

- After assigning the ComputeCapability to the EP and prior to Compile,
if the ComputeCapability has nodes_to_optimize, iterate that list
  - optimization function needs to be called with
    - a mutable Graph instance
    - the ComputeCapability for the individual optimization
    - the overall ComputeCapability so it can be updated
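
The flow described above can be sketched roughly as follows. This is a toy model in Python, not the ORT C++ API: the graph is a list of node names, and `OPTIMIZER_REGISTRY`, `select_dq_nodes`, `get_capability`, and `partition` are hypothetical stand-ins for the provider bridge lookup, the selection function, `GetCapability`, and the `GraphPartitioner` step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ComputeCapability:
    nodes: List[str]
    # Set by the optimizer to the function that performs the optimization.
    optimize_func: Optional[Callable[..., None]] = None
    # Optimization capabilities the EP wants ORT to apply before Compile.
    nodes_to_optimize: List["ComputeCapability"] = field(default_factory=list)


def select_dq_nodes(graph: List[str], config: Dict[str, str]) -> List[ComputeCapability]:
    """Hypothetical selection function: pick DQ nodes to constant-fold."""

    def fold(graph, this_optimization, cc_to_update):
        # Toy optimization: rename each selected node in the graph and
        # in the overall capability so both stay consistent.
        for name in this_optimization.nodes:
            folded = name + "_folded"
            graph[graph.index(name)] = folded
            cc_to_update.nodes[cc_to_update.nodes.index(name)] = folded

    selected = [n for n in graph if n.startswith("DQ")]
    return [ComputeCapability(nodes=selected, optimize_func=fold)] if selected else []


# Stand-in for looking up a pre-defined optimizer by name.
OPTIMIZER_REGISTRY = {"ConstantFoldingDQ": select_dq_nodes}


def get_capability(graph: List[str]) -> ComputeCapability:
    selection = OPTIMIZER_REGISTRY["ConstantFoldingDQ"]
    overall = ComputeCapability(nodes=list(graph))
    overall.nodes_to_optimize = selection(graph, {})
    return overall


def partition(graph: List[str]) -> ComputeCapability:
    overall = get_capability(graph)
    # After assignment and before Compile: run each requested optimization.
    for optimization in overall.nodes_to_optimize:
        optimization.optimize_func(graph, optimization, overall)
    return overall


graph = ["DQ_weight", "MatMul", "Relu"]
capability = partition(graph)
print(graph, capability.nodes)
```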

* fix binplace file in web pipeline (#23930)

* Updated run_CIs_for_external_pr.py to support the Windows OpenVINO CI pipeline (#23931)

* Fix ConvInteger handling of optional inputs. (#23935)

### Description
Fix ConvInteger handling of optional inputs. Need to check Exists() and
not just the number of inputs.
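
A minimal sketch of why counting inputs is not enough: in ONNX, ConvInteger takes `x`, `w`, and the optional `x_zero_point` and `w_zero_point`, and an omitted optional input can still occupy a slot when a later one is provided. The list-of-names model and the helper names below are assumptions for illustration; `None` stands in for an input whose `Exists()` would be false.

```python
from typing import List, Optional


def has_input_by_count(inputs: List[Optional[str]], index: int) -> bool:
    """Buggy check: only looks at how many input slots there are."""
    return len(inputs) > index


def has_input(inputs: List[Optional[str]], index: int) -> bool:
    """Corrected check: the slot must be present and actually provided."""
    return len(inputs) > index and inputs[index] is not None


# x_zero_point (slot 2) is omitted, w_zero_point (slot 3) is provided.
conv_integer_inputs = ["x", "w", None, "w_zero_point"]

print(has_input_by_count(conv_integer_inputs, 2))  # slot exists
print(has_input(conv_integer_inputs, 2))           # but no input is there
```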



### Motivation and Context
#23927

* Updated ov version in pipeline (#595) (#23882)

### Description
This PR updates the OpenVINO version used in the pipeline from 2024.5.0
to 2025.0.0

Co-authored-by: jatinwadhwa921 <[email protected]>

* [AIX] External data handling (#23859)

### Description
On big-endian (BE) systems, model tensor data coming from an external file is not handled
properly.
This was found during the debugging of
https://github.com/microsoft/onnxruntime-genai/issues/1104

This PR performs the endianness conversion of data loaded from an
external file on BE systems.
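
The kind of conversion involved can be sketched as below. This is an illustration in Python, not the ORT code: it assumes the external file holds float32 values in little-endian order (as ONNX raw tensor data is defined to be), and the function names are hypothetical.

```python
import struct


def decode_float32_little_endian(raw: bytes) -> list:
    """Decode float32 values stored little-endian, on any host.

    The '<' prefix fixes the byte order, so a big-endian host gets the
    same values as a little-endian one.
    """
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[: count * 4]))


def decode_float32_as_big_endian(raw: bytes) -> list:
    """What a big-endian host sees if it reads the bytes without swapping."""
    count = len(raw) // 4
    return list(struct.unpack(f">{count}f", raw[: count * 4]))


raw = struct.pack("<3f", 1.0, 2.5, -3.0)  # bytes as stored in the file
print(decode_float32_little_endian(raw))
print(decode_float32_as_big_endian(raw))  # garbage without conversion
```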

* Create a packaging pipeline for a custom nuget package (#23918)

* Fix license in example test code. (#23936)

* replace usage of gsl::narrow and gsl::narrow_cast in WebGPU EP (#23926)

### Description

`gsl::narrow` does not work in a no-exception build.
- use `onnxruntime::narrow` if necessary;
- or change to `static_cast` if it's obviously safe.

Also apply the changes to usages of `gsl::narrow_cast`, which does not
apply checks.

* VCPKG improvement: set  VCPKG_OSX_DEPLOYMENT_TARGET (#23933)

### Description
1. Set  VCPKG_OSX_DEPLOYMENT_TARGET for macOS targets
2. Enable VCPKG in more pipelines.

* Allow using a different version of flatbuffers when building with vcpkg (#23946)

### Description
Allow using a different version of flatbuffers when building with vcpkg,
so that users do not need to pin flatbuffer's version, which provides
more flexibility in the build process.

Delete utf8_range from the dependencies, because it is an indirect
dependency of protobuf, which is already included in the build process.

* Make python package pipeline 1ES compliant (#23800)

### Description
Make [Python packaging
pipeline](https://aiinfra.visualstudio.com/530acbc4-21bc-487d-8cd8-348ff451d2ff/_build?definitionId=841)
1ES compliant




### Checklist

- [x] Make Onnxruntime-QNNEP-Windows-2022-CPU stateless

* Delete ROCM Nuget Publishing Pipeline (#23948)

* Bump SixLabors.ImageSharp from 2.1.9 to 2.1.10 in /csharp/sample/Microsoft.ML.OnnxRuntime.FasterRcnnSample (#23924)

Bumps [SixLabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.9 to 2.1.10.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">SixLabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.10</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2859">#2859</a>
to release/2.1.x by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2890">SixLabors/ImageSharp#2890</a></li>
<li>Backport <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2701">#2701</a>
to 2.1.x [copy] by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2891">SixLabors/ImageSharp#2891</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.9...v2.1.10">https://github.com/SixLabors/ImageSharp/compare/v2.1.9...v2.1.10</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/d133ef99e8becfc3b924b0bb4315e63b8681d307"><code>d133ef9</code></a>
Set lang version</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/5dfe5a800367581239de442cc18de659da6e9b1d"><code>5dfe5a8</code></a>
Missed cache action update</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/4d3a85112b03c89d2cb8616a5b747684b6e73730"><code>4d3a851</code></a>
Use latest cache action</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/4cb9f40a722ab2b837157862f0320c6a652da4d0"><code>4cb9f40</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2891">#2891</a>
from SixLabors/af/backport-2701</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/bb82f79db0197166271d4355b5fb5ceda370a906"><code>bb82f79</code></a>
<a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2701">#2701</a>
to 2.1.x [copy]</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/627b5f721f30f6d529acb50bd81f92bd3db754eb"><code>627b5f7</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2890">#2890</a>
from SixLabors/af/backport-2859</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/67f7848d6e975e7956c8056823555de49a5fdf6d"><code>67f7848</code></a>
try to fix LFS for *.BMP</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/44d294e06606111195152ead3006452357ef1bb9"><code>44d294e</code></a>
8.0.x is not needed</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/adb85d9e66aa3a588a86f4a4ef9a0539a8502117"><code>adb85d9</code></a>
Another attempt for a Linux-specific skip</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/efc3fc4ee15eec4e523c26f7130e786541b00df2"><code>efc3fc4</code></a>
Disable BmpDecoder_CanDecode_Os2BitmapArray on Linux</li>
<li>Additional commits viewable in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.9...v2.1.10">compare
view</a></li>
</ul>
</details>
<br />



Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Make python CUDA package pipeline 1ES compliant (#23802)

### Description
Make
[Python-Cuda-Publishing-Pipeline](https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1311&_a=summary)
1ES compliant




* Migrate yarn to npm (#22116)

### Description
This PR changes all references to yarn to npm



### Motivation and Context
This PR is needed to address all Component Governance issues that ORT is
facing

### Current issue

- [x] use_react_native!(:path => config["reactNativePath"]) returns nil
- [x] For error `CocoaPods could not find compatible versions for pod
"RCTRequired"`, we might need to increase the iOS target version from 13.0 to
a higher version.
- [x] For 'react-native' >= 0.73.x, the react-native/react.gradle file is
no longer used
- [x] We need to update to gradle 7.6 or above to upgrade the RN.
The gradlew version 7.3.3 that we currently use does not work on RN 71+.
- [x] The instructions on how to implement React-Native have changed since
[0.72](https://reactnative.dev/docs/integration-with-existing-apps).
- [x] Error `The new Java toolchain feature cannot be used at the
project level in combination with source and/or target compatibility`
from gradle.
- [x] duplicate class: com.facebook.react.PackageList
solution: remove `apply from:
file("../../node_modules/@react-native-community/cli-platform-android/native_modules.gradle");
applyNativeModulesAppBuildGradle(project)` from the bottom of
android/app/build.gradle

- [x] Need to update the OnnxruntimeModuleTest because
`ReactApplicationContext` is now an abstract class.

---------

Co-authored-by: Edward Chen <[email protected]>

* [WebGPU/JSEP] Support group query attention do_rotary attribute (#23524)


* Fix npm audit in js/react-native/e2e (#23975)

* Suppress some warnings in WebGPU EP generated by GCC 13 (#23984)

### Description

Replace #23445, resolve conflicts and add one new file.

---------

Co-authored-by: Changming Sun <[email protected]>

* Fix NPM audit in js/react-native (#23974)


* Bump axios from 1.7.9 to 1.8.2 in /js/node (#23963)

* GCC 14: fix insert_or_assign() call (#23955)

Resolve #23954

* ADD emsdk env vars to VCPKG_KEEP_ENV_VARS (#23997)

### Description
The vars are set by  cmake\external\emsdk\emsdk_env.bat


### Motivation and Context
By default they are filtered by vcpkg to make build reproducible.
However, emscripten's cmake toolchain file needs this information.
emcc.bat has the following code:
```
@set EM_PY=%EMSDK_PYTHON%
@if "%EM_PY%"=="" (
  set EM_PY=python
)
```
Actually, it doesn't work as expected. The line
```
set EM_PY=python
``` 
should be changed to 
```
set EM_PY=python.exe
```

We haven't hit this issue because usually the var EM_PY is set.

* Fix  ONNX Runtime Python Test Pipeline (#23990)

### Description
[Fix ONNX Runtime Python Test Pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1164&_a=summary)



* [webgpu] Fix the continuation issue (#23999)


* [WebGPU EP] Implements Gelu, BiasSplitGelu, and QuickGelu (#23981)

Increases WebGPU operator coverage

* [Native WebGPU] Added ReduceMax and ReduceSum (#23934)

### Description
Added ReduceMax and ReduceSum




* Convert Windows CPU CI Pipeline to Github Actions (#23996)

* [Fix] Dependencies find_package Eigen error (#23939)

### Description
Fixes the CMake configuration error that occurs when a dependency brought in via
FetchContent uses find_package(Eigen3 REQUIRED)

Major Changes:
- enable EIGEN_BUILD_CMAKE_PACKAGE
- [optional] rename eigen to Eigen3 

### Motivation and Context

The following build error occurs when dependencies use find_package(Eigen3
REQUIRED):
```
By not providing "FindEigen3.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Eigen3", but
  CMake did not find one.

  Could not find a package configuration file provided by "Eigen3" with any
  of the following names:

    Eigen3Config.cmake
    eigen3-config.cmake

  Add the installation prefix of "Eigen3" to CMAKE_PREFIX_PATH or set
  "Eigen3_DIR" to a directory containing one of the above files.  If "Eigen3"
  provides a separate development package or SDK, be sure it has been
  installed.
```
Eigen needs **EIGEN_BUILD_CMAKE_PACKAGE** enabled when it is brought in via FetchContent in order to
generate **Eigen3Config.cmake**:

https://gitlab.com/libeigen/eigen/-/blob/master/CMakeLists.txt?ref_type=heads#L213

In addition, Eigen's project name is "Eigen3" and the CMake configuration file
it provides is "Eigen3Config.cmake":

https://gitlab.com/libeigen/eigen/-/blob/master/CMakeLists.txt?ref_type=heads#L36

https://gitlab.com/libeigen/eigen/-/blob/master/CMakeLists.txt?ref_type=heads#L252
So I think it's best for FetchContent_Declare Name to be consistent with
the project name to avoid potential errors.

Co-authored-by: mingyue <[email protected]>

* Update onnxruntime_c_api.h to work with MinGW (#24006)

### Description

Same as #23169

### Motivation and Context

Same as #23169

* Add DNNL github workflow (#24011)

### Description
Add DNNL github workflow which is migrated from "Windows CPU CI
pipeline" from Azure DevOps.
This PR also adds "--build_nuget" to test the C# part. 
However, then I hit an error when building the tests in
"test\Microsoft.ML.OnnxRuntime.Tests.NetCoreApp\Microsoft.ML.OnnxRuntime.Tests.NetCoreApp.csproj".
The error message was:

```
D:\a\_work\onnxruntime\onnxruntime\csharp\test\Microsoft.ML.OnnxRuntime.Tests.Common\TrainingTest.cs(34,81): error CS0103: The name 'CheckpointState' does not exist in the current context [D:\a\_work\onnxruntime\onnxruntime\csharp\test\Microsoft.ML.OnnxRuntime.Tests.NetCoreApp\Microsoft.ML.OnnxRuntime.Tests.NetCoreApp.csproj]
```
Then I checked the code. I couldn't understand how it worked before. In
this build, `__TRAINING_ENABLED_NATIVE_BUILD__` is not defined. But the
"CheckpointState" class is defined in
https://github.com/microsoft/onnxruntime/blob/main/csharp/src/Microsoft.ML.OnnxRuntime/Training/CheckpointState.shared.cs#L21
And the file is empty when __TRAINING_ENABLED_NATIVE_BUILD__ is not
defined. So I don't understand how it could work in a normal build
without dnnl.

Here is my build command:

```
python tools\ci_build\build.py  --config RelWithDebInfo --build_dir dnnlbuild --skip_submodule_sync --build_csharp --parallel --use_binskim_compliant_compile_flags --cmake_generator "Visual Studio 17 2022" --build_shared_lib --enable_onnx_tests --build_wheel --msbuild_extra_options "IncludeMobileTargets=false" --build_nuget --use_vcpkg --use_vcpkg_ms_internal_asset_cache --use_dnnl
```

This PR removes the failed test.

* Qnn weight sharing improvement (#23945)

### Description
Improve QNN weight sharing so that only the last session in the weight sharing group (the session that has both share_ep_contexts and stop_share_ep_contexts enabled) generates the .bin file. The .bin file name is decided by the first session, and all generated *_ctx.onnx models point to this single .bin to avoid post-processing work.
Previously, each session generated a _ctx.onnx model with its own .bin file, so post-processing was required to go through the generated *_ctx.onnx models, get the last generated *_ctx.bin file, update all *_ctx.onnx models to point to that .bin file, and remove the unused .bin files.

* Correct generated cmake syntax (#24016)

### Description


Previously, the build failed with:
CMake Error at
build/Android/intermediates/armeabi-v7a/vcpkg/buildtrees/0.vcpkg_dep_info.cmake:15:
  Parse error.  Expected a newline, got identifier with text "set".

* [webgpu] allow to specify UseIndicesTypeAlias for Indices (#24019)

### Description

Allow specifying `UseIndicesTypeAlias` for `AddIndices` in
`ShaderHelper`.

* [webgpu] allow overloads to Program::AddIndices (#24021)

### Description
This change allows more overloads for the `Program::AddIndices` method,
and makes use of r-value references for parameters when possible.

Also fixed the implementation of the `AddInputs` and `AddOutputs`
methods to use r-value references for the parameters

* fix test for RotaryEmbedding (#24022)

### Description

the `BaseTester::Run` function signature is:
```c++
void BaseTester::Run(ExpectResult expect_result, const std::string& expected_failure_string,
                     const std::unordered_set<std::string>& excluded_provider_types,
                     const RunOptions* run_options,
                     std::vector<std::unique_ptr<IExecutionProvider>>* execution_providers,
                     ExecutionMode execution_mode,
                     const Graph::ResolveOptions& options);
```

Its behavior is:
- If the parameter `execution_providers` is empty, it aggregates all
execution providers available in the build and, for each EP, creates an
inference session and runs the test.
- If the parameter `execution_providers` is not empty, it runs a
single inference session that uses the passed-in `execution_providers` as
session options, and runs the test.

The old code could put multiple EPs into a single inference session, but at
runtime only one EP runs the test. Specifically, the WebGPU EP comes after
the CPU EP in this case, so the test never ran on the WebGPU EP.
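
The behavior described above can be modeled with a small Python sketch. It is not the C++ implementation: `run_test`, `AVAILABLE_EPS`, and the rule that the first listed EP claims the node are simplifying assumptions, used only to illustrate why listing several EPs in one session exercises just one of them.

```python
# Illustrative model only; these names are hypothetical, not ONNX Runtime APIs.
AVAILABLE_EPS = ["CPU", "WebGPU"]  # assumed set of EPs available in the build


def run_test(execution_providers=None):
    """Return the list of EPs that actually execute the test."""
    if not execution_providers:
        # Empty: one inference session is created per available EP.
        return list(AVAILABLE_EPS)
    # Non-empty: a single session. Assume the first listed EP claims the
    # node, so EPs listed after it never run the test.
    return [execution_providers[0]]


old_style = run_test(["CPU", "WebGPU"])  # WebGPU is never exercised
per_ep = [run_test([ep])[0] for ep in ["CPU", "WebGPU"]]
print(old_style, per_ep)
```

Passing one EP per call, as in `per_ep`, is the pattern that guarantees each EP is tested.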

**To reviewers**: if you see **a lot of** changes, click the "setting"
button next to the "Jump to",

<img width="277" alt="image"
src="https://github.com/user-attachments/assets/e8947ffb-f230-4c59-a5b7-36c0aedd2b7c"
/>


and check the "Hide Whitespace" and load it again.

<img width="137" alt="{4D60F676-35F4-4546-B8E1-E2F42411A9E6}"
src="https://github.com/user-attachments/assets/f4c58e6e-c290-49f7-aca7-c413db1e3c77"
/>

* Fix attention bias broadcast (#24017)

### Description

* Fix broadcast on attention bias dim 1.
* Increase test cases in test_mha.py in pipeline to cover the testing.

### Motivation and Context

This feature was added in
https://github.com/microsoft/onnxruntime/pull/21710.

There was a bug in computing the offset when the attention bias is broadcast on
dim 1 only, in both the CUDA and CPU kernels.

It can be triggered when the attention bias shape is like [batch_size, 1,
sequence_length, total_sequence_length] with batch_size > 1 and the unfused
kernel is selected. Note that cuDNN flash attention and CUTLASS fused
attention also support attention bias, so the bug in the unfused kernel was
not discovered previously.
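
For illustration, here is a minimal sketch of broadcast-aware offset computation for a 4-D bias tensor. It shows the general indexing rule, not the actual kernel code; `bias_offset` and its parameters are hypothetical names.

```python
def bias_offset(b, n, bias_shape):
    """Flat offset of the [S, T] slice of the bias for batch b and head n.

    bias_shape is (B or 1, N or 1, S, T); a dimension of size 1 is broadcast.
    """
    dim0, dim1, seq_len, total_len = bias_shape
    slice_size = seq_len * total_len
    b_idx = b if dim0 > 1 else 0  # broadcast over batch
    n_idx = n if dim1 > 1 else 0  # broadcast over heads
    # Stride of the batch dimension is dim1 * slice_size, not
    # num_heads * slice_size, when dim 1 is broadcast.
    return (b_idx * dim1 + n_idx) * slice_size


# Shape from the description: [batch_size, 1, S, T] with batch_size > 1.
shape = (2, 1, 3, 4)
print([bias_offset(b, n, shape) for b in range(2) for n in range(2)])
```

With dim 1 broadcast, every head of a given batch maps to the same slice, while different batches map to different slices.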

* Remove unused parameter in csharp InferenceTest (#24031)

### Description
Fix a warning from analyzers:
```
Theory method 'CanRunInferenceOnAModelDotnetTensors' on test class 'InferenceTest' does not use parameter 'enableParallelExecution'. Use the parameter, or remove the parameter and associated data. (https://xunit.net/xunit.analyzers/rules/xUnit1026
```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

* [TensorRT EP] Call cudaSetDevice at compute function for handling multithreading scenario (#24010)

The GPU device is set again at compute function/compute time to handle
multithreading scenarios.

Consider the following:
Users can create multiple threads to initialize separate inference
sessions on different devices (not just the default device 0)
Later, additional threads may be spawned to execute
inference_session.Run(), which calls this compute function.
Since new threads default to using device 0, it’s necessary to
explicitly set the correct device to ensure computations run on the
intended GPU.

Example code:
````python

provider = [
        [
            ('TensorrtExecutionProvider', {
            'device_id': 0,
            }),
        ],
        [
            ('TensorrtExecutionProvider', {
            'device_id': 1,
            }),
        ]
       ]

class ThreadObj():
    def __init__(self, model_path: str, iterations: int, idx: int):
        ...
        sess_opt = ort.SessionOptions()
        self.inference_session = ort.InferenceSession(model_path, sess_opt, provider[idx % 2])
     
    def warmup(self):
        self.inference_session.run(None, self.input)

    def run(self, thread_times, threads_complete):
        for iter in range(self.iterations):
            self.inference_session.run(None, self.input)

def thread_target(obj, thread_times, threads_complete):
    obj.run(thread_times, threads_complete)

...

iterations = 500
num_threads = 13
t_obj_list = []
thread_list = []

for tidx in range(num_threads):
    obj = ThreadObj(model_path, iterations, tidx)
    t_obj_list.append(obj)
    obj.warmup()
    
for t_obj in t_obj_list:
    thread = threading.Thread(target=thread_target, daemon=True, args=(t_obj,thread_times,threads_complete,))
    thread.start()
    thread_list.append(thread)

...
````


Note: Based on our measurements (using cuda event) on the A100 GPU with
CUDA 12, the execution time for `cudaSetDevice` is approximately 0.004
ms, which is negligible and does not impact runtime performance.

* Increase timeout for ARM64-Xcode16-targeting-iphonesimulator (#24030)

* Support tvOS build (#24000)

* [TensorRT EP] Stop enforcing oss parser during Windows debug build (#24036)

### Description
<!-- Describe your changes. -->
Reverting, as this issue disappeared after adopting the newer TRT API.

This has been validated by building ORT 1.20.1/1.21.0 debug build and
testing on FRCNN/resnet50 models.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

* Set CMAKE_POLICY_DEFAULT_CMP0069 to NEW to ensure that IPO flags are added for dependencies. (#24034)

Set CMAKE_POLICY_DEFAULT_CMP0069 to NEW to ensure that interprocedural optimization (IPO) flags are added for dependencies.
If the OLD behavior is used, the IPO flags are only added for the Intel compiler on Linux.

* Make Cuda packaging pipeline 1ES compliant (#23806)

### Description
Make [Cuda packaging
pipeline](https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1287&_a=summary)
1ES compliant



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Check List

- [x] pool `onnxruntime-Win-CPU-2022` not found

* [webgpu/wasm] allow runtime switch between WebGPUEP and JSEP (#24032)

### Description

Add `--webgpu-ep=runtime` to allow building ort-web with both WebGPUEP and
JSEP, and use `globalThis.WEBGPU_EP` at runtime to switch between
them.

This change makes perf comparison between WebGPU EP and JSEP much
easier.

* Move call to MLAS_CPUIDINFO::GetCPUIDInfo() out of MlasSQNBitGemmDispatchNeon initialization. (#24018)

Move call to `MLAS_CPUIDINFO::GetCPUIDInfo()` out of `MlasSQNBitGemmDispatchNeon` initialization.

Reduce binary size when MatMulNBits op is not included in the build.

I believe the side effect of `MLAS_CPUIDINFO::GetCPUIDInfo()` (e.g., initializing a static object) prevents the linker from discarding the code in a build where the associated MLAS functions are unused.

* [webgpu] fix the wrong dispatch size in flash_attention (#24020)

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: Yulong Wang <[email protected]>

* avoid copy unnecessary files for nodejs pkg (#23992)

### Description

remove duplicated file in nodejs package.

#23956

* Add support for custom position ids and attention bias to GQA CPU operator (#23944)

### Description

- Added support for custom position ids and attention masks to the GQA
CPU operator (fp32 and fp16)
- Added MLAS eltwise add kernel for mask application for FP32 and FP16
- Added unit tests for the added eltwise add MLAS kernel
- Modified python tests to test the new GQA inputs


### Motivation and Context
Custom position ids and attention mask are required in order to
implement speculative decoding in PhiSilica

### Benchmarks

All the benchmarks are executed on the GQA op configuration that will
be used in the PhiSilica speculative decoding scenario. The
configuration is as follows:

- num_heads: 32
- kv_num_heads: 32
- do_rotary: 1
- local_window_size: -1
- head_size: 96
- sequence_length: 6
- packed_qkv: True

Benchmarks were executed on Cadmus with Snapdragon(R) X 12-core X1E80100
@ 3.40 GHz

In the tables below, the column headers are the total sequence length values
used for benchmarking, and the rows indicate whether the attention bias was
used. Values are average inference times in ms over 100000 runs.

#### Fp16 results

| Total sequence length | 50 | 100 | 250 | 500 | 750 | 1000 | 1500 | 2000 | 2500 | 3000 | 3500 | 4000 |
|:-----------------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:--------|:--------|:--------|:--------|:--------|
| Without bias | 0.284054 | 0.257449 | 0.275806 | 0.334123 | 0.458324 | 0.614133 | 0.912791 | 1.38585 | 1.92186 | 2.39203 | 2.88808 | 3.46262 |
| With bias | 0.250926 | 0.253072 | 0.279724 | 0.337774 | 0.499058 | 0.585388 | 0.914316 | 1.40701 | 1.87311 | 2.47475 | 3.3906 | 3.47474 |
| Runtime increase | -11.66% | -1.7% | +1.42% | +1.09% | +8.89% | -4.68% | +0.17% | +1.53% | -2.54% | +3.46% | +17.4% | +0.35% |

#### Fp32 results

| Total sequence length | 50 | 100 | 250 | 500 | 750 | 1000 | 1500 | 2000 | 2500 | 3000 | 3500 | 4000 |
|:-----------------|:---------|:---------|:---------|:---------|:---------|:---------|:--------|:--------|:--------|:--------|:--------|:--------|
| Without bias | 0.259049 | 0.270541 | 0.304583 | 0.376708 | 0.554013 | 0.633217 | 1.20696 | 1.65985 | 1.95169 | 2.45807 | 3.05637 | 4.05169 |
| With bias | 0.261631 | 0.268002 | 0.300853 | 0.370452 | 0.529865 | 0.735216 | 1.43493 | 1.4385 | 1.99028 | 2.3858 | 2.99425 | 4.80197 |
| Runtime increase | +1.0% | -0.94% | -1.22% | -1.66% | -4.36% | +16.11% | +18.89% | -13.34% | +1.98% | -2.94% | -2.03% | +18.52% |
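
As a reading aid, the "Runtime increase" rows appear to be the relative change of the "With bias" time against the "Without bias" time. The sketch below assumes that formula (it is inferred from the numbers, not stated in the PR) and checks it against two table entries; `runtime_increase` is a hypothetical helper name.

```python
def runtime_increase(without_bias_ms, with_bias_ms):
    """Relative change in percent; negative means the biased run was faster."""
    return (with_bias_ms - without_bias_ms) / without_bias_ms * 100.0


# Inputs copied from the tables: Fp16 at length 50, Fp32 at length 4000.
fp16_50 = runtime_increase(0.284054, 0.250926)
fp32_4000 = runtime_increase(4.05169, 4.80197)
print(f"{fp16_50:+.2f}% {fp32_4000:+.2f}%")
```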

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* [WebNN] Better int64 integration (#23831)

This PR adds some workarounds to enable int64 support for some WebNN
backends which don't support int64 data type.

- Do not fall back ops whose only blocker is the int64 limitation.
- Convert all int64 initializer and input values to int32 and handle
potential overflow errors.
- Register all int64 model inputs and outputs as int32 ml-tensor.
- Handle ONNX ops that need input or output conversion between int64
and int32, e.g. ArgMax, ArgMin, Cast, etc.
- Convert int64 output data back from int32.
- Disallow int64 outputs as 'ml-tensor' preferredOutputLocation.
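
The int64-to-int32 narrowing step with overflow handling could look roughly like the sketch below. It is a stand-alone illustration in Python, not the WebNN EP code; `narrow_int64_to_int32` is a hypothetical helper, and raising `OverflowError` is one assumed way to surface out-of-range values.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def narrow_int64_to_int32(values):
    """Return the values unchanged if every one fits in int32, else raise."""
    narrowed = []
    for v in values:
        if not INT32_MIN <= v <= INT32_MAX:
            # Out-of-range data cannot be represented after narrowing.
            raise OverflowError(f"{v} does not fit in int32")
        narrowed.append(v)
    return narrowed


print(narrow_int64_to_int32([0, -1, INT32_MAX]))
```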

Fixed #21401

* Convert Windows GPU pipelines and Windows OpenVino pipeline to Github Actions (#24029)

### Description
Convert Windows GPU pipelines and Windows OpenVino pipeline to Github
Actions

* [ARM CPU] Fix fp16 const initialization on no-fp16 platform (#23978)

### Description
Fix fp16 const initialization on no-fp16 platform [such as Raspberry
PI](https://github.com/microsoft/onnxruntime/issues/23957)



### Motivation and Context
Resolve #23957

* [Native WebGPU EP] Add packedQKV and do_rotary attribute support to GroupQueryAttention operator (#23386)

### Description
Add Packed QKV inputs and do_rotary attribute to GQA.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Packed QKV inputs and do_rotary attribute are required for certain
models.

* Whisper Redesigned Solution (#23549)

### Description

This PR re-designs how Whisper is created and supported in ONNX Runtime.
The new solution leverages [previous optimization
work](https://github.com/microsoft/onnxruntime/pull/15473), and it is
designed to be used in conjunction with [this
work](https://github.com/microsoft/onnxruntime-genai/pull/1229) in ONNX
Runtime GenAI.

Some of the added changes include:
- Re-designed export that creates new ONNX models without needing a
`WhisperBeamSearch` op
- Creates one encoder model that also pre-computes the cross-attention
KV caches (since they only need to be calculated once)
- Creates one decoder model that can be used during pre-fill and token
generation
- Creates one jump-times model that can be used for word-level
timestamps
- Removes need for a `WhisperBeamSearch` op to chain the encoder and
decoder subgraphs
  - Removes need to duplicate decoder's weights in memory
- Previous solution with the `WhisperBeamSearch` op created an
encoder-decoder-init model and decoder-with-past model. The decoder was
duplicated twice, one in each.
- Removes need for separate logic to export the PyTorch model coming
from OpenAI vs. the PyTorch model coming from Hugging Face
- Re-factors common parameters and logic used in CPU and CUDA attention
kernels
- Adds `DUMP_STRING` to enable easy logging of intermediate information
when running in debug mode to debug a problem. This info is not printed
in release mode so it will not impact performance.
- Integrates `DecoderMaskedMultiHeadAttention` into `MultiHeadAttention`
- Enables past-present buffer sharing in the `MultiHeadAttention` op for
improved performance
- Adds `cache_indirection` and `past_sequence_length` as new optional
inputs to `MultiHeadAttention`
  - Adds `output_qk` as new optional output to `MultiHeadAttention`
- Enables calculating `output_qk` tensor with FP16 or FP32 precision,
regardless of the model's precision
- CI tests that run end-to-end across various flag combinations that are
used by many customers internally and externally

The existing solutions are still available if desired.

### Known Issues

- The FP32 CPU model with the `WhisperBeamSearch` op and output QK is
currently disabled. This is because ONNX Runtime doesn't currently
support output QK kernels on CPU, only on CUDA.
- The `DecoderMaskedMultiHeadAttention` CPU kernel has a parity mismatch
with the `DecoderMaskedMultiHeadAttention` CUDA kernel.
- Using `DecoderMaskedMultiHeadAttention` for the FP32 CPU model is not
enabled. Currently, it uses `MultiHeadAttention` to avoid the parity
mismatch issue.

### Motivation and Context

Using the beam search op has made it more difficult to debug and fix
errors that are encountered. This new approach is more flexible and more
customizable for users (e.g. by running with ONNX Runtime GenAI). It
also helps [this
issue](https://github.com/microsoft/onnxruntime/issues/18216).

---------

Co-authored-by: mindest <[email protected]>

* Windows: Show more useful DLL load errors to say exactly what DLL is missing (#24053)

### Description
When we fail to load a provider shared DLL on Windows, the error is not
very specific. Users have to figure out whether the onnxruntime file is
missing, a CUDA file is missing, or cuDNN is not installed (and perhaps
others). And this is just the CUDA provider. It would be far more useful if
the error said exactly which file is missing so the user can fix the actual
problem.

Plus, this will likely result in far fewer GitHub issues about this
problem, and any that are filed will be much easier to resolve.

This fix adds a function that will try loading a dll and its
dependencies recursively to figure out which file is missing. It uses
the OS dbghelp library to do it and is not very complex.
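
The recursive search can be sketched as follows. This is a conceptual model in Python, not the Windows implementation: the real change reads imports via dbghelp, whereas here `deps` (a map from DLL name to its imports) and `present` (the set of files that can be found) are hypothetical stand-ins.

```python
def find_missing(dll, deps, present, seen=None):
    """Return the load chain ending at the first missing file, or None."""
    seen = set() if seen is None else seen
    if dll not in present:
        return [dll]
    if dll in seen:
        return None  # already checked; avoids revisiting shared imports
    seen.add(dll)
    for dep in deps.get(dll, []):
        chain = find_missing(dep, deps, present, seen)
        if chain is not None:
            return [dll] + chain
    return None


deps = {"onnxruntime_providers_cuda.dll": ["cudart64_12.dll", "cudnn64_9.dll"]}
present = {"onnxruntime_providers_cuda.dll", "cudart64_12.dll"}
print(find_missing("onnxruntime_providers_cuda.dll", deps, present))
```

The returned chain carries the same information as the "which depends on ... which is missing" messages shown below.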

This also fixes a years-old bug, introduced in the change to
use FormatMessage in env.cc, where the system error would always be an
empty string (`error 126 ""`) due to passing 0 as the format buffer
length. We will now see the more useful `The specified module could not
be found.` style error messages.

### Motivation and Context

Previously if we fail to load the cuda provider, the error would look
like this, which is limited:

`unknown file: error: C++ exception with description "
onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL :
LoadLibrary failed with error 126 "" when trying to load
"C:\Example\Path\To\Library\onnxruntime_providers_cuda.dll"`

Now it will look like this if cudnn is not installed:

`unknown file: error: C++ exception with description
onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : Error
loading "C:\Example\Path\To\Library\onnxruntime_providers_cuda.dll"
which depends on "cudnn64_9.dll" which is missing. (Error 126: "The
specified module could not be found.")`

If cuda is not installed:

`unknown file: error: C++ exception with description
onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : Error
loading "C:\Example\Path\To\Library\onnxruntime_providers_cuda.dll"
which depends on "cudart64_12.dll" which is missing. (Error 126: "The
specified module could not be found.")`

And if onnxruntime_providers_cuda.dll is not installed:

`unknown file: error: C++ exception with description
onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : Error
loading "C:\Example\Path\To\Library\onnxruntime_providers_cuda.dll"
which is missing. (Error 126: "The specified module could not be
found.")
`

* Extend CMAKE_CUDA_FLAGS with all Blackwell compute capacity  (#23928)

### Description
<!-- Describe your changes. -->
* Update range to build SASS on all arch and PTX on highest arch
* when cuda>=12.8, build all arch (including latest blackwell)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html

https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

* [WebGPU] Reduce staging buffers for uploading initializers (#23968)

This change reduces the number of staging buffers used for uploading
initializers to the GPU. On the one hand, we release the upload
staging buffers early. On the other hand, we use the BufferMapExtendedUsages
feature of Dawn on UMA GPUs, which allows us to write directly into the
destination GPU buffer without a staging buffer. To achieve this,
we need to ensure the UMA GPU buffers are mapped at creation. We make
BufferManager aware of OnSessionInitializationEnd(), so that it
can handle buffer Create() and Upload() calls properly.

Credits to @fs-eire for the overall design of the implementation.

* [WebGPU EP] Implement Remaining Reduction Ops (#24045)

### Description
<!-- Describe your changes. -->

Adds naive implementations of ReduceMin, ReduceProd, ReduceL1, ReduceL2,
ReduceLogSum, ReduceSumSquare, and ReduceLogSumExp. Will optimize to use
shared memory in a later PR.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Increases WebGPU EP operator coverage.

* add bool support to EPContext schema to unblock some models (#24065)

### Description
add bool support to EPContext schema to unblock some models

* [WebGPU EP] fix for reduce min/max error on MacOS CI (#24077)

### Error

```Traceback
/onnxruntime/onnxruntime/core/providers/webgpu/reduction/reduction_ops.cc:146 [allow_multi_axes = true] Axes values must be in the range [-rank, rank-1]. Got: 446098880
```

* Upgrade current MacOS-13 to 14 (#23293)

### Description
Upgrade current MacOS-13 to 14


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

- [x] Update the RN to 0.73.x+ to have the newer version of boost

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Fix CUDA EP Abs and Sign bfloat16 support (#23914)

### Description
<!-- Describe your changes. -->
Abs and Sign had bfloat16 kernels created but not registered with the
CUDA EP. Additionally Sign bfloat16 didn't work.
* register bfloat16 kernels with CUDA EP
* fix incorrectly named macro by adding 'X' as they add bfloat16
registration
* add specialization for bfloat16 to _Sign
  * copied existing pattern. not sure if there's a better way
* update tests



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#23875

* Improve typing for OrtValue and other public Python interfaces (#24086)

### Description

Improve the OrtValue interface typing and change `staticmethod` to
`classmethod` for constructors to follow Python conventions
(https://google.github.io/styleguide/pyguide.html#2174-decision).
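
A short generic example of why `classmethod` suits alternative constructors: the method receives the class it was called on, so subclasses get instances of their own type. `Value`, `TaggedValue`, and `from_list` are made-up names for illustration and are not part of the onnxruntime API.

```python
class Value:
    def __init__(self, data):
        self.data = data

    @classmethod
    def from_list(cls, items):
        # cls is the class the method was called on, so a subclass
        # gets an instance of its own type rather than of Value.
        return cls(list(items))


class TaggedValue(Value):
    pass


v = TaggedValue.from_list([1, 2, 3])
print(type(v).__name__, v.data)
```

With a `staticmethod` that hard-codes `Value(...)`, the same call would return a plain `Value`.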

* [webgpu] Limit that K must be divisible by 128 to apply dp4a matmul (#24078)

The DP4AMatMulQuantize shader requires K to be divisible by
128. Otherwise, we would need to align the scale
to have shape [M, ceil(K / 128)]. To simplify the shader, we apply
dp4a matmul only when K is divisible by 128.
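
The condition and the alternative scale shape can be written out as a small sketch. The helper names are hypothetical and the code only restates the arithmetic in the description; it is not the EP's dispatch logic.

```python
import math

BLOCK = 128  # granularity named in the description


def can_use_dp4a(k):
    """dp4a matmul is applied only when K is a multiple of 128."""
    return k % BLOCK == 0


def aligned_scale_shape(m, k):
    """Scale shape that would be needed if K were not a multiple of 128."""
    return (m, math.ceil(k / BLOCK))


print(can_use_dp4a(4096), can_use_dp4a(200), aligned_scale_shape(4, 200))
```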

* Add macOS ARM64 pipeline for webgpu (#24060)

### Description

Add macOS ARM64 pipeline for webgpu.

This pipeline is a temporary one. I created it because the
current code already fails on macOS ARM64 for WebGPU EP. Adding this
pipeline allows us to check the status of the fix, and eventually, when the
build passes, this pipeline will be merged with the existing macOS arm64
pipeline.

* [WebNN/WebGPU JS] Fix shared Module methods overriding each other (#23998)

- Renamed all conflicting WebNN methods from `jsep*` to `webnn*`.
- WebNN doesn't need flush(), therefore it doesn't need to set
`jsepBackend`.

This PR addresses issue microsoft/webnn-developer-preview#78

* Enable multithreading on FP16 to FP32 cast operator (#23619)

### Description
Enables multithreading on FP16 to FP32 cast operator.



### Motivation and Context
Improves CPU performance on FP16 models that require casting to FP32.

* Move Android CI Pipeline to Github Actions (#24094)

### Description
Move Android CI Pipeline to Github Actions

* Cleanup CoreML EP's code to remove COREML_ENABLE_MLPROGRAM (#23490)

### Description
Clean up CoreML EP's code to remove the COREML_ENABLE_MLPROGRAM macro.
Also, increase MINIMUM_COREML_VERSION (the first version we support) to 5.

* webgpu ep support for argmax/argmin (#24089)

* [mobile/reactnative] Remove namespace from AndroidManifest.XML to resolve warning (#23847)

### Description
Removes namespace from AndroidManifest.XML



### Motivation and Context
- Resolves #21681

* [WebGPU EP] fix implementation of Pow (#24088)

### Description

Use custom implementation for Pow to fix test failures.

* Increase timeout to 90min for ARM64-Xcode16-targeting-iphonesimulator (#24091)

### Description
<!-- Describe your changes. -->

The pipeline still times out occasionally. Further extend the
timeout to 90 minutes for ARM64-Xcode16-targeting-iphonesimulator.

It takes quite a while if the entire build cache is missing.

### Motivation and Context

The pipeline sometimes failed because of a timeout. A previous PR,
#24030, increased the timeout from 60 min to 75 min, but that appears
not to be enough.

* [WebGPU] fix test failure in Reduce operators on macOS ARM64 (#24108)

### Description

fix test failure in Reduce operators on macOS ARM64

```
[E:onnxruntime:ReduceL1, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running ReduceL1 node. Name:'node1' Status Message: webgpu_context.cc:259 Run Uniform variable[0] (output_size) data type mismatch in program "ReduceL1", Expected: u32, Actual: i32
```

* [WebGPU EP] Implements CumSum Operator (#24047)

Increases WebGPU EP op coverage.

* [webgpu] Use 1d dispatch group size (#24084)

This PR uses a 1d dispatch group size and uses workgroup_idx instead of
workgroup.x|workgroup.y in case they are normalized.

* [WebGPU] fix test failure in MatMulNBits on macOS ARM64 (#24109)

### Description

abs_error is slightly loosened from 0.02 to 0.03 to allow test cases on
macOS arm64 to pass.

* [QNN-EP] Add support for Sum operator with 2 inputs (#24098)

### Description
<!-- Describe your changes. -->
* Add Sum to op builder in QNN-EP
* For now, support is limited to Sum with 2 inputs.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* Enhance QNN-EP support for Sum with two inputs

* [WebNN] Replace narrow with SafeInt for consistency in integer handling (#24059)

Also remove redundant header files.

* [QNN-EP] Add Lora Support with offline QNN context binary (#24026)

### Description
- Add the new run option called lora_config to feed the information from lora binary
- Parse and apply the lora binary in OnRunStart

### Motivation and Context
- Support Lora Adapter Binary with QNN Context Binary Usage

* [TensorRT EP] support TensorRT 10.9-GA (#23905)

### Description
<!-- Describe your changes. -->
* Update to trt10.9
* oss parser tested (here's testing method https://onnxruntime.ai/docs/build/eps.html#note-to-ort-1210-open-sourced-parser-users)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

* [webgpu] Apply dp4a for generation shader (#24064)

This PR applies DP4A to the generation shader. It also supports any
block_size where block_size % 32 = 0.

* [CUDA] Support slide window in cutlass fused attention (#24072)

### Description
Add slide window support in cutlass fused attention

### Motivation and Context

The change was previously created by Ye:
https://github.com/microsoft/onnxruntime/pull/21926
I merged the change and resolved some conflicts. I also reverted some of
Ye's changes in kernel_forward.h so that our code is consistent with
the PyTorch code.

* [MIGraphX EP] rename HIPPinnedAllocator to MIGraphXPinnedAllocator (#24103)

### Description
Rename class HIPPinnedAllocator to MIGraphXPinnedAllocator

### Motivation and Context
To align allocators' naming for the MIGraphX EP

* [MIGraphX EP] check POLICY CMP0144 availability before used (#24104)

### Description
For a newer CMake, suppress warnings about incorrect letter cases in
package names.

### Motivation and Context
To avoid reporting for newer CMake that a package name contains capital
letters when small letters are required.

* [JSEP] handles edge case in gridsample operator (#24121)

fix for https://github.com/microsoft/onnxruntime/issues/24070

* [OpenVINO]Session Options Appended After AppendExecutionProvider (#23852)

### Description
To honor the SessionOption API contract, the ordering of AddConfigOption and
AppendExecutionProvider_OpenVINO should not matter. This PR fixes
that issue.

### Motivation and Context
This PR fixes a regression in the ordering of SessionOptions that was
introduced by the last PR.

* [webgpu]Add MaxPool and AveragePool (#23714)

This adds Max and Average pool operators for webgpu-native. Basically,
this is a rewrite of the corresponding JSEP operators with some
improvements:
1) 'dilations' support
2) Pooling with kernelShape.length > 2 for NHWC format
3) code cleanup

However, there are still a few missing features:
1) 'ceil_mode'
2) column major 'storage_order'
3) 'Indices' output for Max pools.

* [webgpu EP] put GetMaxComponents and SumVector to one place. (#24122)

### Description

put `GetMaxComponents` and `SumVector` to one place.

fix a bug in `SumVector`:

```diff
-      return "(" + x + ".x + " + x + ".y + " + x + ".w + " + x + ".z" + ")";
+      return "(" + x + ".x + " + x + ".y + " + x + ".z + " + x + ".w" + ")";
```
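
To make the corrected component order concrete, here is a Python analogue of a helper that builds the sum expression as a shader source string. It is only a sketch: the real helper is C++, and the one- and two-component branches here are assumptions, since the diff shows only the four-component case.

```python
def sum_vector(x, components):
    """Build an expression string that sums the components of vector x."""
    if components == 1:
        return x  # assumed: a scalar needs no summation
    if components == 2:
        return f"({x}.x + {x}.y)"  # assumed two-component form
    if components == 4:
        # Order matches the fixed line in the diff: x, y, z, w.
        return f"({x}.x + {x}.y + {x}.z + {x}.w)"
    raise ValueError(f"unsupported component count: {components}")


print(sum_vector("v", 4))
```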

* skip MOE python test when MPI is not installed (#24116)

### Description
It is uncommon for a dev machine to have MPI installed. Skip the test if
MPI is not installed.

### Motivation and Context

Make it easy to run pytest on a dev machine without needing to skip the
test manually.
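
One common way to express such a conditional skip is shown below. This is a generic sketch, not the code from this PR: detecting MPI through the presence of the `mpi4py` package is an assumption, and `TestMoe` and `test_moe_sharded` are placeholder names.

```python
import importlib.util
import unittest

# Assumption: MPI availability is detected via the mpi4py package.
HAS_MPI = importlib.util.find_spec("mpi4py") is not None


class TestMoe(unittest.TestCase):
    @unittest.skipUnless(HAS_MPI, "MPI (mpi4py) is not installed")
    def test_moe_sharded(self):
        # Placeholder body; a real test would exercise the MoE op here.
        self.assertTrue(HAS_MPI)
```

pytest collects `unittest.TestCase` classes and reports the test as skipped when the condition is false, so no manual deselection is needed.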

* Integrate KleidiAI for MatMulNBits via MlasQNBitGemm (#23627)

### Description
This PR integrates Arm® KleidiAI™ to provide optimized assembly kernels
for matrix multiplication with 4-bit quantized weights. These changes
target the MlasQNBitGemm functions, and can be utilized via the
MatMulNBits operator.

* add test cases for webgpu ep in web (#24117)

### Description

This PR enables web tests (NPM suite tests) for WebGPU EP.

There are some test failures expected, so the specific job is marked as
"continueOnError".

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

* Refactor Webnn IsSupported*() to use constant initializers. (#24118)

### Description
<!-- Describe your changes. -->
This PR continues the work started at
https://github.com/microsoft/onnxruntime/pull/19401.

### Motivation and Context
An overridable initializer should not have a fixed value included in a
WebNN model, as it could be changed at runtime. The current check doesn't
validate that the initializer is constant.

* Deleted the constant SKIP_CUDA_TEST_WITH_DML (#24113)

### Description
Deleted the constant SKIP_CUDA_TEST_WITH_DML. It does not seem to be
used anywhere.

### Motivation and Context
The constant SKIP_CUDA_TEST_WITH_DML prevents onnxruntime from being
compiled when both the -use_cuda and -use_dml flags are set.

Co-authored-by: Andreas Hussing <[email protected]>

* Update T5 Onnx Export and Optimization (#23949)

Previously, the encoder onnx model added extra initialization for the decoder
to generate the kv cache from the prompt. That is not necessary. Here we
redesign the onnx export for the T5 model to output two separate models for
the encoder and decoder.

Move the Linear that generates cross features based on encoder_hidden_states
to the encoder onnx model. This way, the encoder does not need to output
encoder_hidden_states, and only needs to output the features for cross
attention used in the decoder.

Major changes:
- [x] update t5 onnx export script
- [x] update convert_generation script
- [x] update beam search to support changes of inputs and outputs (detail
can be found below).
- [x] add a tiny t5 model, and enable t…
Labels
ep:QNN issues related to QNN exeution provider release:1.21.0