
[ET-VK] Improve packing format for int4 linear operator + misc improvements #9883

Merged: 6 commits, Apr 7, 2025

Conversation

SS-JIA (Contributor) commented Apr 3, 2025

Stack from ghstack (oldest at bottom):

## Context

Improve the performance of the quantized int4 linear shader by packing the scales and zeros tensors, as well as the weight tensor, in a more optimal layout.

See the comments in the `pack_int4_linear_weight_transposed_interleave` shader for more details about how the new packing works.
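The shader's exact transposed, interleaved layout is documented in its source; the basic building block of any int4 packing, though, is storing two 4-bit codes per byte. The sketch below is a hypothetical illustration of that nibble packing only, not the shader's actual layout (function names and nibble order are assumptions):

```python
def pack_int4(values):
    """Pack ints in [0, 15] into bytes, two per byte, low nibble first.

    Illustrative only: the real shader uses a transposed, interleaved
    block layout on top of this basic nibble packing.
    """
    assert len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append(((hi & 0xF) << 4) | (lo & 0xF))
    return bytes(out)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the original 4-bit values."""
    out = []
    for b in packed:
        out.append(b & 0xF)         # low nibble first
        out.append((b >> 4) & 0xF)  # then high nibble
    return out
```

Packing halves the bytes fetched per weight, which is why the layout of those nibbles relative to the shader's access pattern matters so much for GPU memory bandwidth.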

## Changes

* Split int8 quantized linear and int4 quantized linear into separate C++ files for better code organization
* Introduce a packing shader for int4 weights
* Update the int4 linear shader to account for the packed weights
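For context on why the scales and zeros tensors are packed alongside the weights: a group-wise affine int4 scheme typically recovers each weight as `(q - zero) * scale`, with one scale/zero pair per group. A minimal sketch of one output element of such a linear operator (group size, names, and signature are illustrative, not this operator's actual API):

```python
def int4_linear_row(x, q_row, scales, zeros, group_size):
    """Dot product of activation x with a dequantized int4 weight row.

    q_row holds 4-bit codes; scales/zeros hold one entry per group of
    group_size weights. Illustrative of group-wise affine dequantization,
    not the shader's actual implementation.
    """
    acc = 0.0
    for i, q in enumerate(q_row):
        g = i // group_size
        acc += x[i] * (q - zeros[g]) * scales[g]
    return acc
```

Because every multiply-accumulate touches a scale and zero, laying those tensors out to match the shader's traversal order avoids scattered reads in the inner loop.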

## Impact

This change massively improves the performance of the int4 weight-quantized linear operator.

With this change, running LLaMa 3.2 1B on an Adreno 740 goes from 0.9 tok/s to 10 tok/s, roughly a 10x improvement!
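The ~10x figure can be sanity-checked directly from the per-token generation rates reported in the logs (9.861244 tok/s after vs 0.977669 tok/s before):

```python
# Sanity-check the claimed ~10x speedup from the logged generation rates.
rate_after = 9.861244   # tokens/second with this change (from the logs)
rate_before = 0.977669  # tokens/second before this change (from the logs)
speedup = rate_after / rate_before
print(f"{speedup:.2f}x")  # roughly 10x
```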

With this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 332.3 MB/s (74692800 bytes in 0.214s)
I 00:00:00.003353 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003533 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003563 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003685 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003747 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003799 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003852 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003902 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003976 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004289 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:04.841690 executorch:runner.cpp:101] Reading metadata from model
I 00:00:04.841808 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:04.841830 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:04.841851 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:04.841874 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:04.841893 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:04.841909 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:04.841927 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:04.841945 executorch:runner.cpp:133] eos_id = 128009
I 00:00:04.841951 executorch:runner.cpp:133] eos_id = 128001
I 00:00:04.841963 executorch:runner.cpp:188] RSS after loading model: 2229.828125 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:06.239633 executorch:runner.cpp:258] RSS after prompt prefill: 2229.828125 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:00:17.699086 executorch:runner.cpp:272] RSS after finishing text generation: 2229.828125 MiB (0 if unsupported)
I 00:00:17.699155 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:00:17.699161 executorch:stats.h:114]       Model Load Time:                4.837000 (seconds)
I 00:00:17.699165 executorch:stats.h:124]       Total inference time:           12.857000 (seconds)              Rate:  8.788987 (tokens/second)
I 00:00:17.699168 executorch:stats.h:132]               Prompt evaluation:      1.398000 (seconds)               Rate:  10.014306 (tokens/second)
I 00:00:17.699171 executorch:stats.h:143]               Generated 113 tokens:   11.459000 (seconds)              Rate:  9.861244 (tokens/second)
I 00:00:17.699174 executorch:stats.h:151]       Time to first generated token:  1.398000 (seconds)
I 00:00:17.699177 executorch:stats.h:158]       Sampling time over 127 tokens:  549246500.843000 (seconds)
```

Before this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 302.0 MB/s (74637464 bytes in 0.236s)
I 00:00:00.003050 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003200 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003226 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003337 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003396 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003449 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003502 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003553 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003629 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004075 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:05.417531 executorch:runner.cpp:101] Reading metadata from model
I 00:00:05.417647 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:05.417669 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:05.417698 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:05.417716 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:05.417735 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:05.417751 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:05.417768 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:05.417787 executorch:runner.cpp:133] eos_id = 128009
I 00:00:05.417793 executorch:runner.cpp:133] eos_id = 128001
I 00:00:05.417808 executorch:runner.cpp:188] RSS after loading model: 2230.812500 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:19.689616 executorch:runner.cpp:258] RSS after prompt prefill: 2230.812500 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:02:15.269693 executorch:runner.cpp:272] RSS after finishing text generation: 2230.812500 MiB (0 if unsupported)
I 00:02:15.269810 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:02:15.269825 executorch:stats.h:114]       Model Load Time:                5.414000 (seconds)
I 00:02:15.269832 executorch:stats.h:124]       Total inference time:           129.852000 (seconds)             Rate:  0.870221 (tokens/second)
I 00:02:15.269837 executorch:stats.h:132]               Prompt evaluation:      14.271000 (seconds)              Rate:  0.981010 (tokens/second)
I 00:02:15.269841 executorch:stats.h:143]               Generated 113 tokens:   115.581000 (seconds)             Rate:  0.977669 (tokens/second)
I 00:02:15.269844 executorch:stats.h:151]       Time to first generated token:  14.271000 (seconds)
I 00:02:15.269847 executorch:stats.h:158]       Sampling time over 127 tokens:  549711269.115000 (seconds)

PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1743712527974,"model_load_end_ms":1743712533388,"inference_start_ms":1743712533388,"inference_end_ms":1743712663240,"prompt_eval_end_ms":1743712547659,"first_token_ms":1743712547659,"aggregate_sampling_time_ms":549711269115,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)

SS-JIA added a commit that referenced this pull request Apr 3, 2025

pytorch-bot bot commented Apr 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9883

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 4106f49 with merge base 6adff9c:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label Apr 3, 2025

facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D72412950

I 00:00:05.417531 executorch:runner.cpp:101] Reading metadata from model
I 00:00:05.417647 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:05.417669 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:05.417698 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:05.417716 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:05.417735 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:05.417751 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:05.417768 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:05.417787 executorch:runner.cpp:133] eos_id = 128009
I 00:00:05.417793 executorch:runner.cpp:133] eos_id = 128001
I 00:00:05.417808 executorch:runner.cpp:188] RSS after loading model: 2230.812500 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:19.689616 executorch:runner.cpp:258] RSS after prompt prefill: 2230.812500 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:02:15.269693 executorch:runner.cpp:272] RSS after finishing text generation: 2230.812500 MiB (0 if unsupported)
I 00:02:15.269810 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:02:15.269825 executorch:stats.h:114]       Model Load Time:                5.414000 (seconds)
I 00:02:15.269832 executorch:stats.h:124]       Total inference time:           129.852000 (seconds)             Rate:  0.870221 (tokens/second)
I 00:02:15.269837 executorch:stats.h:132]               Prompt evaluation:      14.271000 (seconds)              Rate:  0.981010 (tokens/second)
I 00:02:15.269841 executorch:stats.h:143]               Generated 113 tokens:   115.581000 (seconds)             Rate:  0.977669 (tokens/second)
I 00:02:15.269844 executorch:stats.h:151]       Time to first generated token:  14.271000 (seconds)
I 00:02:15.269847 executorch:stats.h:158]       Sampling time over 127 tokens:  549711269.115000 (seconds)

PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1743712527974,"model_load_end_ms":1743712533388,"inference_start_ms":1743712533388,"inference_end_ms":1743712663240,"prompt_eval_end_ms":1743712547659,"first_token_ms":1743712547659,"aggregate_sampling_time_ms":549711269115,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```
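As a rough illustration of the kind of weight packing this PR deals with, here is a minimal NumPy sketch of packing two signed int4 values into each byte and unpacking them again. The function names are hypothetical, and this deliberately omits the transposition and row interleaving that the actual `pack_int4_linear_weight_transposed_interleave` shader performs; see the comments in that shader for the real layout.

```python
import numpy as np

def pack_int4_pairs(weights_int4: np.ndarray) -> np.ndarray:
    """Pack pairs of signed int4 values (range [-8, 7]) into uint8 bytes.

    Illustrative only: the real shader additionally transposes and
    interleaves the weight matrix so that the texels read by one GPU
    thread are adjacent in memory.
    """
    assert weights_int4.shape[-1] % 2 == 0
    # Shift signed int4 values [-8, 7] into unsigned nibbles [0, 15].
    nibbles = (weights_int4 + 8).astype(np.uint8)
    lo = nibbles[..., 0::2]  # even columns -> low nibble
    hi = nibbles[..., 1::2]  # odd columns  -> high nibble
    return lo | (hi << 4)

def unpack_int4_pairs(packed: np.ndarray) -> np.ndarray:
    """Invert pack_int4_pairs, recovering signed int4 values."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out
```

The packed buffer is half the size of the int8 representation, which is the main reason a linear shader operating directly on this layout saves memory bandwidth.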

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Apr 3, 2025
Pull Request resolved: #9883

ghstack-source-id: 275989812
@exported-using-ghexport

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D72412950

SS-JIA added a commit that referenced this pull request Apr 4, 2025
## Context

Due to the poor performance of Vulkan's int4 linear operator, the final logit layer of the transformer model was not being delegated to Vulkan; it was instead quantized and executed with the XNNPACK delegate.

However, with D72412950 / #9883, decent performance can now be achieved with Vulkan's int4 linear op. Therefore, the final logit layer can be lowered to Vulkan.

## Changes

* Remove limit from `VkInt4WeightOnlyQuantizer` that was causing it to ignore the logit layer of the transformer
* Do not apply XNNPACK partitioner and quantizer when lowering with Vulkan
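The first bullet can be sketched conceptually: the quantizer previously excluded the final logit projection, and this change removes that exclusion. The predicate below is purely hypothetical (it is not the actual `VkInt4WeightOnlyQuantizer` logic, and the layer name "output" is an assumption):

```python
def should_quantize_linear(layer_name: str, include_logit_layer: bool) -> bool:
    """Hypothetical predicate illustrating the change: previously the
    final logit projection (assumed here to be named "output") was
    excluded because Vulkan's int4 linear op was too slow; with the
    faster op it is quantized and delegated like every other linear."""
    if layer_name == "output" and not include_logit_layer:
        return False
    return True
```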

Differential Revision: [D72480177](https://our.internmc.facebook.com/intern/diff/D72480177/)

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Apr 4, 2025
ghstack-source-id: 276219519
Pull Request resolved: #9918

…misc improvements"

## Context

Improve performance of the quantized int4 linear shader by packing the scales and zeros tensor, as well as the weight tensor in a more optimal way.

See the comments in the `pack_int4_linear_weight_transposed_interleave` shader for more details about how the new packing works.

## Changes

* Split int8 quantized linear and int4 quantized linear into separate C++ files for better code organization
* Introduce packing shader for int4 weights
* Update int4 linear shader to account for packed weights

## Impact

This change massively improves the performance of the weight int4 quantized linear operator.

With this change, running LLaMa 3.2 1B can now achieve 10 tok/s, from 0.9 tok/s on an Adreno 740. This is a 10x improvement!

With this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 332.3 MB/s (74692800 bytes in 0.214s)
I 00:00:00.003353 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003533 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003563 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003685 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003747 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003799 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003852 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003902 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003976 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004289 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:04.841690 executorch:runner.cpp:101] Reading metadata from model
I 00:00:04.841808 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:04.841830 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:04.841851 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:04.841874 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:04.841893 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:04.841909 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:04.841927 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:04.841945 executorch:runner.cpp:133] eos_id = 128009
I 00:00:04.841951 executorch:runner.cpp:133] eos_id = 128001
I 00:00:04.841963 executorch:runner.cpp:188] RSS after loading model: 2229.828125 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:06.239633 executorch:runner.cpp:258] RSS after prompt prefill: 2229.828125 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:00:17.699086 executorch:runner.cpp:272] RSS after finishing text generation: 2229.828125 MiB (0 if unsupported)
I 00:00:17.699155 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:00:17.699161 executorch:stats.h:114]       Model Load Time:                4.837000 (seconds)
I 00:00:17.699165 executorch:stats.h:124]       Total inference time:           12.857000 (seconds)              Rate:  8.788987 (tokens/second)
I 00:00:17.699168 executorch:stats.h:132]               Prompt evaluation:      1.398000 (seconds)               Rate:  10.014306 (tokens/second)
I 00:00:17.699171 executorch:stats.h:143]               Generated 113 tokens:   11.459000 (seconds)              Rate:  9.861244 (tokens/second)
I 00:00:17.699174 executorch:stats.h:151]       Time to first generated token:  1.398000 (seconds)
I 00:00:17.699177 executorch:stats.h:158]       Sampling time over 127 tokens:  549246500.843000 (seconds)
```

Before this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 302.0 MB/s (74637464 bytes in 0.236s)
I 00:00:00.003050 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003200 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003226 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003337 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003396 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003449 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003502 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003553 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003629 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004075 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:05.417531 executorch:runner.cpp:101] Reading metadata from model
I 00:00:05.417647 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:05.417669 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:05.417698 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:05.417716 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:05.417735 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:05.417751 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:05.417768 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:05.417787 executorch:runner.cpp:133] eos_id = 128009
I 00:00:05.417793 executorch:runner.cpp:133] eos_id = 128001
I 00:00:05.417808 executorch:runner.cpp:188] RSS after loading model: 2230.812500 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:19.689616 executorch:runner.cpp:258] RSS after prompt prefill: 2230.812500 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:02:15.269693 executorch:runner.cpp:272] RSS after finishing text generation: 2230.812500 MiB (0 if unsupported)
I 00:02:15.269810 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:02:15.269825 executorch:stats.h:114]       Model Load Time:                5.414000 (seconds)
I 00:02:15.269832 executorch:stats.h:124]       Total inference time:           129.852000 (seconds)             Rate:  0.870221 (tokens/second)
I 00:02:15.269837 executorch:stats.h:132]               Prompt evaluation:      14.271000 (seconds)              Rate:  0.981010 (tokens/second)
I 00:02:15.269841 executorch:stats.h:143]               Generated 113 tokens:   115.581000 (seconds)             Rate:  0.977669 (tokens/second)
I 00:02:15.269844 executorch:stats.h:151]       Time to first generated token:  14.271000 (seconds)
I 00:02:15.269847 executorch:stats.h:158]       Sampling time over 127 tokens:  549711269.115000 (seconds)

PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1743712527974,"model_load_end_ms":1743712533388,"inference_start_ms":1743712533388,"inference_end_ms":1743712663240,"prompt_eval_end_ms":1743712547659,"first_token_ms":1743712547659,"aggregate_sampling_time_ms":549711269115,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```
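As a quick sanity check, the per-stage rates in the stats block above follow directly from the logged token counts and timings, and comparing against the post-change log reproduces the quoted ~10x speedup (a minimal recomputation, with numbers copied from the logs):

```python
# Rates from the "Before this change" log above.
prompt_tokens, generated_tokens = 14, 113
total_s, prefill_s, decode_s = 129.852, 14.271, 115.581

total_rate = generated_tokens / total_s    # matches the logged 0.870221 tok/s
prefill_rate = prompt_tokens / prefill_s   # matches the logged 0.981010 tok/s
decode_rate = generated_tokens / decode_s  # matches the logged 0.977669 tok/s

# Total rate from the "With this change" log was 8.788987 tok/s.
speedup = 8.788987 / total_rate            # roughly 10x
```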

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)
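The actual layout is documented in the `pack_int4_linear_weight_transposed_interleave` shader; the snippet below is only a generic, hypothetical illustration of 4-bit groupwise quantization in general (two nibbles per byte plus per-group scales and zero points), not the PR's Vulkan memory layout, and all names in it are made up for illustration:

```python
import numpy as np

def quantize_int4_grouped(w, group_size=32):
    """Groupwise 4-bit quantization: per-group scale + zero point.

    Returns packed bytes (two 4-bit values per byte), scales, and zeros.
    """
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)  # 4-bit range: 0..15
    zero = w_min
    q = np.clip(np.round((w - zero) / scale), 0, 15).astype(np.uint8)
    q = q.reshape(-1)
    # Pack two 4-bit values per byte: even index in the low nibble.
    packed = (q[0::2] & 0xF) | (q[1::2] << 4)
    return packed, scale.ravel(), zero.ravel()

def dequantize_int4_grouped(packed, scale, zero, group_size=32):
    """Unpack nibbles and undo the groupwise affine quantization."""
    lo = packed & 0xF
    hi = packed >> 4
    q = np.empty(packed.size * 2, dtype=np.uint8)
    q[0::2], q[1::2] = lo, hi
    q = q.reshape(-1, group_size).astype(np.float32)
    return (q * scale[:, None] + zero[:, None]).reshape(-1)
```

The roundtrip error of such a scheme is bounded by half a quantization step per group, which is what makes weight-only int4 viable for linear layers.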

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Apr 4, 2025
… lowered to Vulkan"

## Context

Due to poor performance of Vulkan's int4 linear operator, the final logit layer of the transformer model was not being delegated to Vulkan; it was instead quantized and executed with the XNNPACK delegate.

However, with D72412950 / #9883, decent performance can now be achieved with Vulkan's int4 linear op, so the final logit layer can be lowered to Vulkan.

## Changes

* Remove the limit in `VkInt4WeightOnlyQuantizer` that caused it to ignore the logit layer of the transformer
* Do not apply the XNNPACK partitioner and quantizer when lowering with Vulkan

Differential Revision: [D72480177](https://our.internmc.facebook.com/intern/diff/D72480177/)

cc manuelcandales cbilgin

[ghstack-poisoned]
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D72412950

@SS-JIA SS-JIA added the release notes: vulkan Changes to the Vulkan backend delegate label Apr 7, 2025

@facebook-github-bot facebook-github-bot merged commit 079d734 into gh/SS-JIA/206/base Apr 7, 2025
81 of 84 checks passed
@facebook-github-bot facebook-github-bot deleted the gh/SS-JIA/206/head branch April 7, 2025 21:33
kirklandsign pushed a commit that referenced this pull request Apr 7, 2025
Labels: CLA Signed, fb-exported, release notes: vulkan