Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backmerging with Msft Commits #654

Open
wants to merge 5 commits into
base: ovep-develop
Choose a base branch
from

Conversation

jatinwadhwa921
Copy link

Backmerging with Msft Commits

snnn and others added 5 commits April 10, 2025 08:26
The Azure DevOps pipeline template
[/nuget/templates/dml-vs-2022.yml](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/azure-pipelines/nuget/templates/dml-vs-2022.yml)
is used to build the ONNX Runtime DirectML (DML) components. It
historically contained two potential mechanisms for creating NuGet
packages:

1. Invoking `python tools/ci_build/build.py` with the `--build_nuget`
flag.
2. Executing a specific `NuPackScript` (usually calling `msbuild
/t:CreatePackage`).

This redundancy created a significant problem during release builds
(when the pipeline parameter IsReleaseBuild is set to true). Here's why:
- Duplicate Package Creation: Both packaging methods would execute.
- build.py --build_nuget created a package with a
development/pre-release version suffix (e.g.,
Microsoft.ML.OnnxRuntime.DirectML.1.21.1-dev-20250408-0849-84808eb710.nupkg).
- The NuPackScript's msbuild call, influenced by IsReleaseBuild=true,
created the clean release version package (e.g.,
Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg).
- ren Command Failure: For the x86 and arm64 builds, the NuPackScript
contains a command like:
    ```Bash
    ren Microsoft.ML.OnnxRuntime.DirectML.* win-dml-x86.zip
    ``` 
This command fails when two files match the pattern
Microsoft.ML.OnnxRuntime.DirectML.* (the dev package and the release
package), as ren requires a single source file when using wildcards for
renaming.
- Result: This caused build failures specifically when attempting to
create release candidates or final release builds for x86 and arm64 DML
components. This issue did not typically occur in regular nightly builds
(IsReleaseBuild: false) because only one package (the dev version) was
likely produced, allowing the ren command to succeed. Therefore we only
found the problem when doing a patch release for ONNX Runtime 1.21.

(@amarin16, the release manager of ONNX Runtime 1.21, found the issue
and explained it to us why the pipeline was not working)

The change is relatively simple. This PR removes the `--build_nuget`
flag from the `python tools/ci_build/build.py` command within the
dml-vs-2022.yml template. By removing the redundant packaging step from
build.py, only the NuPackScript's msbuild command generates a package
file. This ensures only one file matches the
Microsoft.ML.OnnxRuntime.DirectML.* pattern, allowing the subsequent ren
command in the x86 and arm64 scripts to execute successfully during
release builds.

# Background (how the DML packaging pipeline works)

The build has two stages:

1. Individual Architecture Builds (Using dml-vs-2022.yml): Each stage
(x64, x86, arm64) runs, now reliably using only its specific
NuPackScript to generate its artifact without the risk of the ren
command failing during release.
x64 produces: Microsoft.ML.OnnxRuntime.DirectML.[version].nupkg
x86 produces: win-dml-x86.zip
arm64 produces: win-dml-arm64.zip
(arm32 is not built/included).
2. Final Packaging Stage (e.g., stages/nuget_dml_packaging_stage.yml):
Downloads these artifacts and combines them by unpacking the base x64
.nupkg, injecting the contents of the .zip files into the appropriate
runtimes/ directories (e.g., runtimes/win-x86/native/,
runtimes/win-arm64/native/), and re-packing the final,
multi-architecture Microsoft.ML.OnnxRuntime.DirectML.nupkg.

In stage 1 only x64 produces a nuget package, therefore specific MSBuild
parameters: `/p:IsReleaseBuild=${{ parameters.IsReleaseBuild }}` is
passed to all architectures' MSBuild calls, while
`/p:CurrentData=$(BuildDate) /p:CurrentTime=$(BuildTime)` are passed
only in the x64 script. BTW, the property "CurrentData" apparently is a
typo. It should be `CurrentDate`.
…#24348)

### Description
<!-- Describe your changes. -->
Make test `CApiTest.RequestLoadCancellation` deterministic by removing
the `terminator` thread.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The test contributes to CI failures
### Description

This change allows NPM tests to run the nodejs binding for webgpu. This
helps to debug test failures much easier because WebAssembly is
generally very difficult to debug.

Steps to debug:

0. build
   - {ORT_ROOT}> build --config Debug --use_webgpu --build_nodejs
   - {ORT_ROOT}\js\web> npm ci
   - {ORT_ROOT}\js\web> npm run pull:wasm
2. run `npm test -- <args> -b=webgpu -e=node` once. ( this command
generates necessary .js files and `testdata-config.json`.)
3. use native debugger to debug:
   ```
C:\Program Files\nodejs\node.exe
{ORT_ROOT}\js\node_modules\mocha\bin\_mocha --timeout 999999 --colors -r
{ORT_ROOT}\js/web/dist/ort.node.min.js {ORT_ROOT}\js/web/test/test-main
   ```
### Description

MlasTranspose was running single-thread and was not performing well
enough on a multi-threaded CPU. Therefore, I modified it to run with
multi-thread to improve performance.

The `MlasTranspose` was previously running in a single-threaded, which
resulted in suboptimal performance on multi-threaded CPUs. To address
this, I have modified it to utilize multi-threading.

### Motivation and Context

We encountered this issue while running the
[multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large),
which was converted to ONNX format and executed on a multi-core CPU
(Xeon 6338). Below are the performance metrics before and after the
modification:

| | INTER_NUM_THREADS | INTRA_NUM_THREADS | INPUT_LENGTH | BATCH_SIZE |
Duration time[sec] |
| ------ | ----------------- | ----------------- | ------------ |
---------- | ------------------ |
| BEFORE | 1 | 16 | 512 | 4 | 1.24 |
| AFTER | 1 | 16 | 512 | 4 | 1.09 |

Condition
- FP32
- CPUExecutionProvider

This change resulted in a performance improvement of approximately 14%.
MlasTranspose stand-alone performance improvements are as follows

| | INTRA_NUM_THREADS | BEFORE | AFTER |
| --------------------------------- | ---- | -------------- |
------------- |
| MlasTranspose [msec] | 16 | 182.55 [ms] | 11.60 [ms] |

`MlasTranspose` is x15~16 faster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants