Replies: 7 comments 11 replies
-
This is a good analysis. Let me dive into it a bit.
-
Thanks for compiling this. Looking at the next steps, are we intending to analyze the impact of warmup iterations or high iteration counts? It would be very interesting to see the per-iteration runtimes over, say, 1000 iterations run back to back. Does the latency stabilize, or does it spike randomly? Can we trust the median latency to stay reliable over a large number of iterations?
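A rough sketch of what such a check could look like (the `run_inference` call below is a placeholder for the actual benchmark iteration, not an existing API):

```python
import time
import statistics

def run_inference():
    """Placeholder for a single benchmark iteration; replace with the real call."""
    time.sleep(0.01)

WARMUP, ITERATIONS = 50, 1000
latencies_ms = []

for _ in range(WARMUP):        # warmup iterations, not recorded
    run_inference()

for _ in range(ITERATIONS):    # back-to-back measured iterations
    start = time.perf_counter()
    run_inference()
    latencies_ms.append((time.perf_counter() - start) * 1e3)

# Does the median drift or stay flat as more iterations accumulate?
for n in (100, 250, 500, 1000):
    window = latencies_ms[:n]
    print(f"{n} iters -> median {statistics.median(window):.2f} ms, "
          f"max {max(window):.2f} ms")
```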
-
Amazing how we can do such analysis in OSS. ❤️ A couple of random thoughts while reading the post:
-
FYI, now that more data from public Android devices has been found, I just updated the post to incorporate the private vs. public comparison. The metrics from the new data strengthen the conclusions, indicating that private AWS devices can provide decent stability for Android benchmarking. cc: @cbilgin @kimishpatel @digantdesai
-
@guangy10 The overall conclusion makes sense. I would like to offer my views on the way forward.
-
If there is no objection, I will go ahead and order more S22 private devices. From our records, over the last 7 days https://github.com/pytorch/executorch/actions/workflows/android-perf.yml ran for a total of 2578 minutes, and it's configured to run every 8 hours, i.e. 3 times per day. So each run takes around 2578 / (7 * 3) minutes, or roughly 2 hours. On the other hand, https://github.com/pytorch/executorch/actions/workflows/android-perf-private-device-experiment.yml with 2 S22 devices ran for a total of 2353 minutes, running every 4 hours or 6 times per day, so each run is roughly 2353 / (7 * 6) minutes, or about 1 hour. My ballpark estimate is that ordering 2 more S22 devices would be sufficient (x2), but I think I will request 4 more to have a buffer for PR runs and broken devices like #11083, where we might need to remove a device from the pool.
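A quick back-of-the-envelope check of those numbers:

```python
# Public pool: 2578 minutes over 7 days at 3 runs/day
public_minutes_per_run = 2578 / (7 * 3)   # ~122.8 min, i.e. ~2 hours

# Private pool (2 x S22): 2353 minutes over 7 days at 6 runs/day
private_minutes_per_run = 2353 / (7 * 6)  # ~56 min, i.e. ~1 hour

print(f"public ~{public_minutes_per_run:.0f} min/run, "
      f"private ~{private_minutes_per_run:.0f} min/run")

# The "(x2)" estimate refers to doubling the current 2-device private pool
# to 4 devices; the additional 2 requested on top of that are buffer for
# PR runs and broken devices.
```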
-
FYI, after carefully reviewing the iOS benchmark results and including more iOS data, the latest analysis shows converged conclusions for both iOS and Android. The post has been updated to reflect the new findings. All data sources and the script are updated and linked.
-
Benchmark Infra Stability Assessment with Private AWS Devices
TL;DR
Analysis reveals that private AWS devices can provide acceptable stability across all tested platforms (Android, iOS), delegates (QNN, XNNPACK, CoreML, MPS), and models (Llama3.2-1b and MobileNetV3), demonstrating that our private AWS infrastructure can deliver consistent benchmarking results.
Understanding Stability Metrics
To properly assess the stability of ML model inference latency, I use several key statistical metrics: the coefficient of variation (CV), the Max/Min latency ratio, and the P99/P50 latency ratio.
A composite stability score (0-100 scale) is then calculated using a weighted combination of CV, Max/Min ratio, and P99/P50 ratio.
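As a rough illustration of how these metrics can be computed from a list of per-run latencies (the penalty terms and weights in the composite score below are illustrative assumptions, not the exact formula used in the linked script):

```python
import numpy as np

def stability_metrics(latencies_ms):
    """Compute basic stability metrics for a list of per-run latencies (ms)."""
    x = np.asarray(latencies_ms, dtype=float)
    cv = x.std(ddof=1) / x.mean()                           # coefficient of variation
    max_min = x.max() / x.min()                             # spread of extremes
    p99_p50 = np.percentile(x, 99) / np.percentile(x, 50)   # tail vs. median
    return cv, max_min, p99_p50

def stability_score(cv, max_min, p99_p50, weights=(0.5, 0.25, 0.25)):
    """Map the three metrics onto a 0-100 scale (higher = more stable).

    The weighting here is a placeholder; the actual analysis script may
    weight and normalize the terms differently.
    """
    penalties = (cv, max_min - 1.0, p99_p50 - 1.0)
    penalty = sum(w * p for w, p in zip(weights, penalties))
    return max(0.0, 100.0 * (1.0 - penalty))

# Example: a fairly stable dataset
cv, mm, tail = stability_metrics([12.1, 12.3, 12.0, 12.4, 12.2, 12.6])
print(f"CV={cv:.3f}  Max/Min={mm:.3f}  P99/P50={tail:.3f}  "
      f"score={stability_score(cv, mm, tail):.1f}/100")
```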
Intra-primary (private) Dataset Stability Comparison
I will begin the analysis by examining the key metrics for the primary (private) dataset. This section focuses on assessing the inherent stability of our benchmarking environment before making any comparison to the public infrastructure. By analyzing the statistical metrics described above across different model and device combinations, we can establish a baseline understanding of performance consistency.
Overall Stability Summary:
Device-based Comparison:
My insights and recommendations
The analysis of latency stability across private AWS devices reveals certain patterns in performance consistency:
Intra-private analysis reveals that the private iPhone and S22 devices can provide acceptable stability across all tested delegates (QNN, CoreML, MPS, XNNPACK) and models (Llama3.2-1b and MobileNetV3), demonstrating that our private AWS infrastructure can deliver consistent benchmarking results.
Inter-dataset (private & public) Stability Comparison
To assess whether private AWS devices provide better stability than their public counterparts, I conducted a detailed comparison between matching datasets from both environments. This section presents an apples-to-apples comparison of benchmark stability for identical model-device combinations, allowing us to directly evaluate the benefits of moving to private infrastructure.
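For each pair, the relative differences quoted below (e.g. "X% higher stability score", "Y% lower CV") can be derived along these lines; a minimal sketch, where `private` and `public` are assumed to be dicts of metrics for one matching model+device combination and the example values are purely hypothetical:

```python
def relative_improvement(private, public):
    """Percent improvement of the private dataset over the public one."""
    score_gain = (private["stability_score"] - public["stability_score"]) \
        / public["stability_score"] * 100
    cv_reduction = (public["cv"] - private["cv"]) / public["cv"] * 100
    return score_gain, cv_reduction

# Hypothetical metric values, purely for illustration
gain, cv_drop = relative_improvement(
    private={"stability_score": 85.0, "cv": 0.05},
    public={"stability_score": 40.0, "cv": 0.12},
)
print(f"{gain:.1f}% higher stability score, {cv_drop:.1f}% lower CV")
```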
1. llama3_spinq+s22_android13 (Private) vs. llama3_spinq+s22_android13 (Public)
Metrics Comparison:
Interpretation:
2. mv3_qnn+s22_android13 (Private) vs. mv3_qnn+s22_android13 (Public)
Metrics Comparison:
Interpretation:
3. mv3_xnnq8+s22_android13 (Private) vs. mv3_xnnq8+s22_android13 (Public)
Metrics Comparison:
Interpretation:
4. llama3_qlora+iphone15max_ios17 (Private) vs. llama3_qlora+iphone15max_ios17 (Public)
Metrics Comparison:
Interpretation:
5. mv3_xnnq8+iphone15max_ios17 (Private) vs. mv3_xnnq8+iphone15max_ios17 (Public)
Metrics Comparison:
Interpretation:
Though neither environment is ideal, the private environment shows better stability, with an 837.9% higher stability score (Private: 10.8/100 vs. Public: 1.2/100) and a 27.5% lower coefficient of variation, indicating more consistent performance than the public devices.
6. mv3_coreml+iphone15max_ios17 (Private) vs. mv3_coreml+iphone15max_ios17 (Public)
Metrics Comparison:
Interpretation:
Both environments show perfect and identical stability scores.
7. mv3_mps+iphone15max_ios17 (Private) vs. mv3_mps+iphone15max_ios17 (Public)
Metrics Comparison:
Interpretation:
Overall Private vs Public Comparison:
Summary:
Private devices consistently outperform public devices on both platforms, with Android showing slightly larger performance gains and more dramatic stability improvements.
Detailed Stability Analysis on Individual Dataset - Primary (Private)
The full list of individual dataset analyses can be downloaded here. In this section I highlight detailed statistical metrics for only a few selected datasets.
1. Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
2. Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
3. Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
4. Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
5. Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Primary)
6. Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Primary)
7. Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Primary)
Summary of Conclusions and Next Steps
ExecuTorch Benchmarking
The analysis shows that private AWS devices provide significantly better stability for both Android and iOS benchmarking, with Android showing slightly larger performance gains and more dramatic stability improvements.
As next steps, I would suggest:
DevX Improvements
Our current benchmarking infrastructure has critical gaps that limit our ability to understand and address stability issues. These limitations are particularly problematic when trying to diagnose the root causes of performance variations we've observed across devices.
Current Gaps
Addressing these gaps is urgent for establishing a reliable benchmarking infrastructure. Without these improvements, we risk being unable to make timely decisions and may base conclusions on misleading or incomplete data.
References
Here I have attached the source data and my script in case anyone wants to reproduce the work. Please also use them as a reference when filling the infra gaps above.
The script used for analysis
Data source:
Datasets from Primary/Private AWS devices:
Benchmark Dataset with Private AWS Devices.xlsx
Datasets from Reference/Public AWS devices:
Benchmark Dataset with Public AWS Devices.xlsx
Each tab represents one dataset collected for one model+config+device combination. The data are copied from the ExecuTorch benchmark dashboard.
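If you want to reproduce the analysis, a minimal sketch of loading every tab of one of the workbooks with pandas could look like this (the latency column name below is an assumption and may differ in the actual sheets):

```python
import pandas as pd

# Each sheet is one model+config+device dataset copied from the dashboard.
sheets = pd.read_excel("Benchmark Dataset with Private AWS Devices.xlsx",
                       sheet_name=None)  # dict of {tab_name: DataFrame}

for name, df in sheets.items():
    # "latency_ms" is a placeholder; use the actual column name in the sheet.
    lat = df["latency_ms"].dropna()
    cv = lat.std(ddof=1) / lat.mean()
    print(f"{name}: n={len(lat)}, CV={cv:.3f}, "
          f"max/min={lat.max() / lat.min():.2f}")
```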