
[ET-VK] Improve packing format for int4 linear operator + misc improvements #9883

Merged: 6 commits, Apr 7, 2025

Conversation

SS-JIA (Contributor) commented Apr 3, 2025

Stack from ghstack (oldest at bottom):

## Context

Improve the performance of the quantized int4 linear shader by packing the scales and zeros tensors, as well as the weight tensor, in a more optimal layout.

See the comments in the `pack_int4_linear_weight_transposed_interleave` shader for more details about how the new packing works.
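The shader's exact transposed, interleaved layout is documented in its source; the basic building block of any int4 packing, though, is storing two 4-bit codes per byte. The sketch below is a hypothetical illustration of that nibble packing only, not the shader's actual layout (function names and nibble order are assumptions):

```python
def pack_int4(values):
    """Pack ints in [0, 15] into bytes, two per byte, low nibble first.

    Illustrative only: the real shader uses a transposed, interleaved
    block layout on top of this basic nibble packing.
    """
    assert len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append(((hi & 0xF) << 4) | (lo & 0xF))
    return bytes(out)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the original 4-bit values."""
    out = []
    for b in packed:
        out.append(b & 0xF)         # low nibble first
        out.append((b >> 4) & 0xF)  # then high nibble
    return out
```

Packing halves the bytes fetched per weight, which is why the layout of those nibbles relative to the shader's access pattern matters so much for GPU memory bandwidth.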

## Changes

* Split int8 quantized linear and int4 quantized linear into separate C++ files for better code organization
* Introduce a packing shader for int4 weights
* Update the int4 linear shader to account for the packed weights
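For context on why the scales and zeros tensors are packed alongside the weights: a group-wise affine int4 scheme typically recovers each weight as `(q - zero) * scale`, with one scale/zero pair per group. A minimal sketch of one output element of such a linear operator (group size, names, and signature are illustrative, not this operator's actual API):

```python
def int4_linear_row(x, q_row, scales, zeros, group_size):
    """Dot product of activation x with a dequantized int4 weight row.

    q_row holds 4-bit codes; scales/zeros hold one entry per group of
    group_size weights. Illustrative of group-wise affine dequantization,
    not the shader's actual implementation.
    """
    acc = 0.0
    for i, q in enumerate(q_row):
        g = i // group_size
        acc += x[i] * (q - zeros[g]) * scales[g]
    return acc
```

Because every multiply-accumulate touches a scale and zero, laying those tensors out to match the shader's traversal order avoids scattered reads in the inner loop.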

## Impact

This change massively improves the performance of the int4 weight-quantized linear operator.

With this change, running LLaMa 3.2 1B on an Adreno 740 goes from 0.9 tok/s to 10 tok/s, roughly a 10x improvement!
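The ~10x figure can be sanity-checked directly from the per-token generation rates reported in the logs (9.861244 tok/s after vs 0.977669 tok/s before):

```python
# Sanity-check the claimed ~10x speedup from the logged generation rates.
rate_after = 9.861244   # tokens/second with this change (from the logs)
rate_before = 0.977669  # tokens/second before this change (from the logs)
speedup = rate_after / rate_before
print(f"{speedup:.2f}x")  # roughly 10x
```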

With this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 332.3 MB/s (74692800 bytes in 0.214s)
I 00:00:00.003353 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003533 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003563 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003685 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003747 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003799 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003852 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003902 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003976 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004289 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:04.841690 executorch:runner.cpp:101] Reading metadata from model
I 00:00:04.841808 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:04.841830 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:04.841851 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:04.841874 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:04.841893 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:04.841909 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:04.841927 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:04.841945 executorch:runner.cpp:133] eos_id = 128009
I 00:00:04.841951 executorch:runner.cpp:133] eos_id = 128001
I 00:00:04.841963 executorch:runner.cpp:188] RSS after loading model: 2229.828125 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:06.239633 executorch:runner.cpp:258] RSS after prompt prefill: 2229.828125 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:00:17.699086 executorch:runner.cpp:272] RSS after finishing text generation: 2229.828125 MiB (0 if unsupported)
I 00:00:17.699155 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:00:17.699161 executorch:stats.h:114]       Model Load Time:                4.837000 (seconds)
I 00:00:17.699165 executorch:stats.h:124]       Total inference time:           12.857000 (seconds)              Rate:  8.788987 (tokens/second)
I 00:00:17.699168 executorch:stats.h:132]               Prompt evaluation:      1.398000 (seconds)               Rate:  10.014306 (tokens/second)
I 00:00:17.699171 executorch:stats.h:143]               Generated 113 tokens:   11.459000 (seconds)              Rate:  9.861244 (tokens/second)
I 00:00:17.699174 executorch:stats.h:151]       Time to first generated token:  1.398000 (seconds)
I 00:00:17.699177 executorch:stats.h:158]       Sampling time over 127 tokens:  549246500.843000 (seconds)
```

Before this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 302.0 MB/s (74637464 bytes in 0.236s)
I 00:00:00.003050 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003200 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003226 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003337 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003396 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003449 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003502 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003553 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003629 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004075 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:05.417531 executorch:runner.cpp:101] Reading metadata from model
I 00:00:05.417647 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:05.417669 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:05.417698 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:05.417716 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:05.417735 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:05.417751 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:05.417768 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:05.417787 executorch:runner.cpp:133] eos_id = 128009
I 00:00:05.417793 executorch:runner.cpp:133] eos_id = 128001
I 00:00:05.417808 executorch:runner.cpp:188] RSS after loading model: 2230.812500 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:19.689616 executorch:runner.cpp:258] RSS after prompt prefill: 2230.812500 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:02:15.269693 executorch:runner.cpp:272] RSS after finishing text generation: 2230.812500 MiB (0 if unsupported)
I 00:02:15.269810 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:02:15.269825 executorch:stats.h:114]       Model Load Time:                5.414000 (seconds)
I 00:02:15.269832 executorch:stats.h:124]       Total inference time:           129.852000 (seconds)             Rate:  0.870221 (tokens/second)
I 00:02:15.269837 executorch:stats.h:132]               Prompt evaluation:      14.271000 (seconds)              Rate:  0.981010 (tokens/second)
I 00:02:15.269841 executorch:stats.h:143]               Generated 113 tokens:   115.581000 (seconds)             Rate:  0.977669 (tokens/second)
I 00:02:15.269844 executorch:stats.h:151]       Time to first generated token:  14.271000 (seconds)
I 00:02:15.269847 executorch:stats.h:158]       Sampling time over 127 tokens:  549711269.115000 (seconds)

PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1743712527974,"model_load_end_ms":1743712533388,"inference_start_ms":1743712533388,"inference_end_ms":1743712663240,"prompt_eval_end_ms":1743712547659,"first_token_ms":1743712547659,"aggregate_sampling_time_ms":549711269115,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)

SS-JIA added a commit that referenced this pull request Apr 3, 2025

pytorch-bot bot commented Apr 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9883

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 4106f49 with merge base 6adff9c:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label Apr 3, 2025

facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D72412950

I 00:00:05.417531 executorch:runner.cpp:101] Reading metadata from model
I 00:00:05.417647 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:05.417669 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:05.417698 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:05.417716 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:05.417735 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:05.417751 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:05.417768 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:05.417787 executorch:runner.cpp:133] eos_id = 128009
I 00:00:05.417793 executorch:runner.cpp:133] eos_id = 128001
I 00:00:05.417808 executorch:runner.cpp:188] RSS after loading model: 2230.812500 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:19.689616 executorch:runner.cpp:258] RSS after prompt prefill: 2230.812500 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:02:15.269693 executorch:runner.cpp:272] RSS after finishing text generation: 2230.812500 MiB (0 if unsupported)
I 00:02:15.269810 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:02:15.269825 executorch:stats.h:114]       Model Load Time:                5.414000 (seconds)
I 00:02:15.269832 executorch:stats.h:124]       Total inference time:           129.852000 (seconds)             Rate:  0.870221 (tokens/second)
I 00:02:15.269837 executorch:stats.h:132]               Prompt evaluation:      14.271000 (seconds)              Rate:  0.981010 (tokens/second)
I 00:02:15.269841 executorch:stats.h:143]               Generated 113 tokens:   115.581000 (seconds)             Rate:  0.977669 (tokens/second)
I 00:02:15.269844 executorch:stats.h:151]       Time to first generated token:  14.271000 (seconds)
I 00:02:15.269847 executorch:stats.h:158]       Sampling time over 127 tokens:  549711269.115000 (seconds)

PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1743712527974,"model_load_end_ms":1743712533388,"inference_start_ms":1743712533388,"inference_end_ms":1743712663240,"prompt_eval_end_ms":1743712547659,"first_token_ms":1743712547659,"aggregate_sampling_time_ms":549711269115,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```
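As a rough illustration of the kind of weight packing this PR deals with, here is a minimal NumPy sketch of packing two signed int4 values into each byte and unpacking them again. The function names are hypothetical, and this deliberately omits the transposition and row interleaving that the actual `pack_int4_linear_weight_transposed_interleave` shader performs; see the comments in that shader for the real layout.

```python
import numpy as np

def pack_int4_pairs(weights_int4: np.ndarray) -> np.ndarray:
    """Pack pairs of signed int4 values (range [-8, 7]) into uint8 bytes.

    Illustrative only: the real shader additionally transposes and
    interleaves the weight matrix so that the texels read by one GPU
    thread are adjacent in memory.
    """
    assert weights_int4.shape[-1] % 2 == 0
    # Shift signed int4 values [-8, 7] into unsigned nibbles [0, 15].
    nibbles = (weights_int4 + 8).astype(np.uint8)
    lo = nibbles[..., 0::2]  # even columns -> low nibble
    hi = nibbles[..., 1::2]  # odd columns  -> high nibble
    return lo | (hi << 4)

def unpack_int4_pairs(packed: np.ndarray) -> np.ndarray:
    """Invert pack_int4_pairs, recovering signed int4 values."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out
```

The packed buffer is half the size of the int8 representation, which is the main reason a linear shader operating directly on this layout saves memory bandwidth.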

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Apr 3, 2025
Pull Request resolved: #9883

ghstack-source-id: 275989812
@exported-using-ghexport

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D72412950

SS-JIA added a commit that referenced this pull request Apr 4, 2025
## Context

Due to the poor performance of Vulkan's int4 linear operator, the final logit layer of the transformer model was not being delegated to Vulkan; it was instead quantized and executed with the XNNPACK delegate.

However, with D72412950 / #9883, decent performance can now be achieved with Vulkan's int4 linear op. Therefore, the final logit layer can be lowered to Vulkan.

## Changes

* Remove limit from `VkInt4WeightOnlyQuantizer` that was causing it to ignore the logit layer of the transformer
* Do not apply XNNPACK partitioner and quantizer when lowering with Vulkan
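The first bullet can be sketched conceptually: the quantizer previously excluded the final logit projection, and this change removes that exclusion. The predicate below is purely hypothetical (it is not the actual `VkInt4WeightOnlyQuantizer` logic, and the layer name "output" is an assumption):

```python
def should_quantize_linear(layer_name: str, include_logit_layer: bool) -> bool:
    """Hypothetical predicate illustrating the change: previously the
    final logit projection (assumed here to be named "output") was
    excluded because Vulkan's int4 linear op was too slow; with the
    faster op it is quantized and delegated like every other linear."""
    if layer_name == "output" and not include_logit_layer:
        return False
    return True
```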

Differential Revision: [D72480177](https://our.internmc.facebook.com/intern/diff/D72480177/)

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Apr 4, 2025
ghstack-source-id: 276219519
Pull Request resolved: #9918

…misc improvements"

## Context

Improve performance of the quantized int4 linear shader by packing the scales and zeros tensor, as well as the weight tensor in a more optimal way.

See the comments in the `pack_int4_linear_weight_transposed_interleave` shader for more details about how the new packing works.

## Changes

* Split int8 quantized linear and int4 quantized linear into separate C++ files for better code organization
* Introduce packing shader for int4 weights
* Update int4 linear shader to account for packed weights

## Impact

This change massively improves the performance of the weight int4 quantized linear operator.

With this change, running LLaMa 3.2 1B can now achieve 10 tok/s, from 0.9 tok/s on an Adreno 740. This is a 10x improvement!

With this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 332.3 MB/s (74692800 bytes in 0.214s)
I 00:00:00.003353 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003533 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003563 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003685 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003747 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003799 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003852 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003902 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003976 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004289 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:04.841690 executorch:runner.cpp:101] Reading metadata from model
I 00:00:04.841808 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:04.841830 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:04.841851 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:04.841874 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:04.841893 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:04.841909 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:04.841927 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:04.841945 executorch:runner.cpp:133] eos_id = 128009
I 00:00:04.841951 executorch:runner.cpp:133] eos_id = 128001
I 00:00:04.841963 executorch:runner.cpp:188] RSS after loading model: 2229.828125 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:06.239633 executorch:runner.cpp:258] RSS after prompt prefill: 2229.828125 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:00:17.699086 executorch:runner.cpp:272] RSS after finishing text generation: 2229.828125 MiB (0 if unsupported)
I 00:00:17.699155 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:00:17.699161 executorch:stats.h:114]       Model Load Time:                4.837000 (seconds)
I 00:00:17.699165 executorch:stats.h:124]       Total inference time:           12.857000 (seconds)              Rate:  8.788987 (tokens/second)
I 00:00:17.699168 executorch:stats.h:132]               Prompt evaluation:      1.398000 (seconds)               Rate:  10.014306 (tokens/second)
I 00:00:17.699171 executorch:stats.h:143]               Generated 113 tokens:   11.459000 (seconds)              Rate:  9.861244 (tokens/second)
I 00:00:17.699174 executorch:stats.h:151]       Time to first generated token:  1.398000 (seconds)
I 00:00:17.699177 executorch:stats.h:158]       Sampling time over 127 tokens:  549246500.843000 (seconds)
```

Before this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 302.0 MB/s (74637464 bytes in 0.236s)
I 00:00:00.003050 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003200 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003226 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003337 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003396 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003449 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003502 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003553 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003629 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004075 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:05.417531 executorch:runner.cpp:101] Reading metadata from model
I 00:00:05.417647 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:05.417669 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:05.417698 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:05.417716 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:05.417735 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:05.417751 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:05.417768 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:05.417787 executorch:runner.cpp:133] eos_id = 128009
I 00:00:05.417793 executorch:runner.cpp:133] eos_id = 128001
I 00:00:05.417808 executorch:runner.cpp:188] RSS after loading model: 2230.812500 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:19.689616 executorch:runner.cpp:258] RSS after prompt prefill: 2230.812500 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:02:15.269693 executorch:runner.cpp:272] RSS after finishing text generation: 2230.812500 MiB (0 if unsupported)
I 00:02:15.269810 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:02:15.269825 executorch:stats.h:114]       Model Load Time:                5.414000 (seconds)
I 00:02:15.269832 executorch:stats.h:124]       Total inference time:           129.852000 (seconds)             Rate:  0.870221 (tokens/second)
I 00:02:15.269837 executorch:stats.h:132]               Prompt evaluation:      14.271000 (seconds)              Rate:  0.981010 (tokens/second)
I 00:02:15.269841 executorch:stats.h:143]               Generated 113 tokens:   115.581000 (seconds)             Rate:  0.977669 (tokens/second)
I 00:02:15.269844 executorch:stats.h:151]       Time to first generated token:  14.271000 (seconds)
I 00:02:15.269847 executorch:stats.h:158]       Sampling time over 127 tokens:  549711269.115000 (seconds)

PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1743712527974,"model_load_end_ms":1743712533388,"inference_start_ms":1743712533388,"inference_end_ms":1743712663240,"prompt_eval_end_ms":1743712547659,"first_token_ms":1743712547659,"aggregate_sampling_time_ms":549711269115,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```
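As a quick sanity check, the per-stage rates in the stats block above follow directly from the logged token counts and timings, and comparing against the post-change log reproduces the quoted ~10x speedup (a minimal recomputation, with numbers copied from the logs):

```python
# Rates from the "Before this change" log above.
prompt_tokens, generated_tokens = 14, 113
total_s, prefill_s, decode_s = 129.852, 14.271, 115.581

total_rate = generated_tokens / total_s    # matches the logged 0.870221 tok/s
prefill_rate = prompt_tokens / prefill_s   # matches the logged 0.981010 tok/s
decode_rate = generated_tokens / decode_s  # matches the logged 0.977669 tok/s

# Total rate from the "With this change" log was 8.788987 tok/s.
speedup = 8.788987 / total_rate            # roughly 10x
```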

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)
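The actual layout is documented in the `pack_int4_linear_weight_transposed_interleave` shader; the snippet below is only a generic, hypothetical illustration of 4-bit groupwise quantization in general (two nibbles per byte plus per-group scales and zero points), not the PR's Vulkan memory layout, and all names in it are made up for illustration:

```python
import numpy as np

def quantize_int4_grouped(w, group_size=32):
    """Groupwise 4-bit quantization: per-group scale + zero point.

    Returns packed bytes (two 4-bit values per byte), scales, and zeros.
    """
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)  # 4-bit range: 0..15
    zero = w_min
    q = np.clip(np.round((w - zero) / scale), 0, 15).astype(np.uint8)
    q = q.reshape(-1)
    # Pack two 4-bit values per byte: even index in the low nibble.
    packed = (q[0::2] & 0xF) | (q[1::2] << 4)
    return packed, scale.ravel(), zero.ravel()

def dequantize_int4_grouped(packed, scale, zero, group_size=32):
    """Unpack nibbles and undo the groupwise affine quantization."""
    lo = packed & 0xF
    hi = packed >> 4
    q = np.empty(packed.size * 2, dtype=np.uint8)
    q[0::2], q[1::2] = lo, hi
    q = q.reshape(-1, group_size).astype(np.float32)
    return (q * scale[:, None] + zero[:, None]).reshape(-1)
```

The roundtrip error of such a scheme is bounded by half a quantization step per group, which is what makes weight-only int4 viable for linear layers.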

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Apr 4, 2025
… lowered to Vulkan"

## Context

Due to poor performance of Vulkan's int4 linear operator, the final logit layer of the transformer model was not being delegated to Vulkan; it was instead quantized and executed with the XNNPACK delegate.

However, with D72412950 / #9883, decent performance can now be achieved with Vulkan's int4 linear op, so the final logit layer can be lowered to Vulkan.

## Changes

* Remove the limit in `VkInt4WeightOnlyQuantizer` that caused it to ignore the logit layer of the transformer
* Do not apply the XNNPACK partitioner and quantizer when lowering with Vulkan

Differential Revision: [D72480177](https://our.internmc.facebook.com/intern/diff/D72480177/)

cc manuelcandales cbilgin

[ghstack-poisoned]
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D72412950

@SS-JIA SS-JIA added the release notes: vulkan Changes to the Vulkan backend delegate label Apr 7, 2025

@facebook-github-bot facebook-github-bot merged commit 079d734 into gh/SS-JIA/206/base Apr 7, 2025
81 of 84 checks passed
@facebook-github-bot facebook-github-bot deleted the gh/SS-JIA/206/head branch April 7, 2025 21:33
kirklandsign pushed a commit that referenced this pull request Apr 7, 2025
Labels: CLA Signed, fb-exported, release notes: vulkan