[ET-VK] Minor performance improvements to native layer norm. #9892

trivedivivek · 2025-04-04T04:54:57Z

Stack from ghstack (oldest at bottom):

[ET-VK] Tuning native layer norm local workgroup size to improve thread occupancy during reduce. #9984
-> [ET-VK] Minor performance improvements to native layer norm. #9892

This diff introduces minor performance improvements to the native layer norm function in the Vulkan backend of ExecuTorch.

In this new approach:
The mean and variance values are calculated in 2 separate passes.
The shader is dispatched based on the input texture size, and each input texel is read and stored in shared memory.
The input stored in shared memory is then summed using a reduce function.

This implementation makes better use of a GPU's parallel processing capabilities.
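
For readers who want a concrete picture of the approach described above, here is a minimal CPU-side sketch in Python. It is not the shader from this diff: the function names `tree_reduce_sum` and `layer_norm_row`, the `eps` default, and the zero padding to a power of two are assumptions made for illustration, and the affine weight and bias are omitted. The list `buf` stands in for workgroup shared memory, and each iteration of the inner loop stands in for one thread.

```python
import math


def tree_reduce_sum(values):
    """Sum values with a pairwise (tree) reduction over a shared buffer.

    On a GPU, each index i of the inner loop would be handled by a separate
    thread, with a barrier between strides. Here it runs sequentially.
    """
    buf = list(values)  # stands in for workgroup shared memory
    size = 1
    while size < len(buf):
        size *= 2
    # Pad with zeros so the buffer length is a power of two (an assumption
    # of this sketch, not something stated in the PR).
    buf.extend([0.0] * (size - len(buf)))

    stride = size // 2
    while stride > 0:
        for i in range(stride):
            buf[i] += buf[i + stride]
        stride //= 2
    return buf[0]


def layer_norm_row(row, eps=1e-5):
    """Normalize one non-empty row using two separate reduction passes."""
    n = len(row)
    # Pass 1: reduce the inputs to get the mean.
    mean = tree_reduce_sum(row) / n
    # Pass 2: reduce the squared deviations to get the variance.
    var = tree_reduce_sum([(x - mean) ** 2 for x in row]) / n
    rstd = 1.0 / math.sqrt(var + eps)
    out = [(x - mean) * rstd for x in row]
    return out, mean, rstd


if __name__ == "__main__":
    out, mean, rstd = layer_norm_row([1.0, 2.0, 3.0, 4.0])
    print(out, mean, rstd)
```

Computing the variance in a second pass, after the mean is known, costs an extra reduction but avoids the cancellation error that the single-pass formula E[x^2] - E[x]^2 can suffer from.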

Differential Revision: D72430290

pytorch-bot · 2025-04-04T04:55:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9892

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cfb8351 with merge base 1facfa9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Differential Revision: [D72430290](https://our.internmc.facebook.com/intern/diff/D72430290/) ghstack-source-id: 276053981 Pull Request resolved: #9892

facebook-github-bot · 2025-04-04T04:55:05Z

This pull request was exported from Phabricator. Differential Revision: D72430290

Pull Request resolved: #9892 Differential Revision: [D72430290](https://our.internmc.facebook.com/intern/diff/D72430290/) ghstack-source-id: 276439596

facebook-github-bot · 2025-04-07T02:03:48Z

This pull request was exported from Phabricator. Differential Revision: D72430290

Pull Request resolved: #9892 ghstack-source-id: 276575089 Differential Revision: [D72430290](https://our.internmc.facebook.com/intern/diff/D72430290/)

facebook-github-bot · 2025-04-07T18:14:25Z

This pull request was exported from Phabricator. Differential Revision: D72430290

Pull Request resolved: #9892 ghstack-source-id: 276877983 Differential Revision: [D72430290](https://our.internmc.facebook.com/intern/diff/D72430290/)

facebook-github-bot · 2025-04-08T20:30:26Z

This pull request was exported from Phabricator. Differential Revision: D72430290

trivedivivek requested a review from SS-JIA as a code owner April 4, 2025 04:54

facebook-github-bot added the CLA Signed label Apr 4, 2025

trivedivivek added the topic: not user facing label Apr 4, 2025

facebook-github-bot added the fb-exported label Apr 7, 2025

SS-JIA approved these changes Apr 7, 2025

trivedivivek mentioned this pull request Apr 8, 2025

[ET-VK] Tuning native layer norm local workgroup size to improve thread occupancy during reduce. #9984

Open
