RFC: Backend Test Infrastructure #11140
GregoryComer started this conversation in General
Introduction
The purpose of this RFC is to propose shared test suites and stress test infrastructure that can run against all ExecuTorch backends. These backends should be able to integrate with this framework with minimal effort and should cover all major capabilities such as partitioner modes, tensor dtypes, operator implementations, and quantization flows.
The initial goal of this effort is to ensure that all ExecuTorch backends meet a standard of functional reliability and correctness defined by ExecuTorch as a platform. Practically, this means being able to take any PT2 graph, partition only what a backend can handle, lower, generate .pte, load it, and run without errors or crashes on most relevant platforms for a given backend.
This document proposes a design aimed at the needs of the ExecuTorch GA release. However, we are intentionally limiting the implementation scope, as reflected in the milestones for the v0.7 release, so that we remain adaptable to GA requirements. Consequently, performance is explicitly a non-goal for the ET v0.7 release. Strict numerical correctness is also outside the scope of v0.7, as the primary focus is on functionality. Additionally, exhaustively measuring coverage across platforms and operating systems for each backend, such as Vulkan on Windows, is not part of the early goals. However, as highlighted earlier, we ultimately anticipate the proposed design supporting all of these aspects if needed for the GA release.
Notes for Reviewers
We’ve intentionally focused on high-level interfaces and goals in this document over implementation details. We expect that the implementation will not be particularly controversial, but we would be interested in points that you foresee as being potentially complex or problematic.
For GA, how do we set the bar for backend stability? Quantitatively, how do we weigh the different axes: functionality, performance (vs. competition), platform support, ease of enabling new models, debugging (performance/numerical issues), LLM-specific goals, etc.?
We are especially looking for feedback from backend authors on the following questions:
Are we testing for the right things?
Are the proposed backend integration points sufficient to cover all backend capabilities?
Are there additional use cases that we should consider?
How should we handle tolerances in a way that scales well across backends and quantization schemes?
Motivation
As ExecuTorch approaches its 1.0 GA release in October of this year, we have explicit goals around software reliability and the out-of-box experience.
Scope
Design Goals for 0.7
5.1 Which operators were or weren't partitioned
5.2 What percentage of nodes are partitioned on the model test suite
5.3 PTE size and delegate blob count/size in bytes
5.4 AOT preprocessing time
5.5 Runtime statistics
6.1 It should be easy to add additional metrics to the test job, as well as second-order metrics derived from these primary metrics
6.2 It should be easy to add a new test for one or all backends
6.3 It should be relatively straightforward to add a new backend
Potential design goals for GA include performance, numerical correctness, and platform coverage.
Non-Goals for 0.7
Motivating Use Cases
The proposed design should serve to support a number of related use cases for backend validation. Some specific motivating examples include:
Design
The proposed design involves a core test harness, which takes a matrix of configurations as input. The output of the test run is a test report, which includes pass/fail status for all tests and any additional collected metrics. The test harness should be able to run ad-hoc locally or be integrated into CI jobs.
Configuration Matrix
The configuration matrix determines which tests to run, which quantization and lowering flows to use, and which runtime targets to run on. Each axis takes a set of one or more configurations. The test harness is responsible for evaluating each combination in the Cartesian product of the configurations.
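As a rough sketch of how the harness might expand this matrix, consider the snippet below. All names here (TestCombination, expand_matrix) are illustrative, not a final API.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass(frozen=True)
class TestCombination:
    """One concrete combination drawn from the configuration matrix."""
    test_name: str  # e.g. "op_add" or "model_mobilenet_v3"
    dtype: Any      # e.g. torch.float32, torch.float16
    flow: str       # quantization/lowering recipe registered by the backend
    runtime: str    # e.g. "pybindings" or a device-farm target


def expand_matrix(
    tests: Sequence[str],
    dtypes: Sequence[Any],
    flows: Sequence[str],
    runtimes: Sequence[str],
) -> list[TestCombination]:
    """Evaluate every combination in the Cartesian product of the axes."""
    return [
        TestCombination(t, d, f, r)
        for t, d, f, r in itertools.product(tests, dtypes, flows, runtimes)
    ]
```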
Orthogonal Configuration Axes:
Tests
Backend-independent configuration
Backend interface and recipes (quantization and lowering)
Device configurations
Runtime interface
Let’s talk about each in a bit more detail.
1. Tests
Each test is defined by an eager-mode model and a set of test inputs. The test itself is intended to be orthogonal to the backend, quantization, and runtime configuration. We may need to relax tolerances between backends, but we expect functional correctness and reliability across all backends and configurations, which is what this effort aims to achieve.
We intend to introduce two primary test suites: models and operators. The operator test suite should cover the entire ATen op set, and any other operators that are common. In practice, we see many non-core ops that are not decomposed, and there is a general de-emphasis on a single op-set, so we will want to add operators as we see them.
Model tests will leverage our existing example models as a baseline, but it should be easy to integrate external model libraries, such as HF Transformers. We will also need to create artificial models to validate specific configurations, such as dynamically quantized linears with multiple consumers, or other cases we have seen problems with in the past.
For operator-level tests, we anticipate using FACTO to generate input permutations, including dtypes. For model-level tests, we will likely need to manage dtype as an independent axis, and input tensor generation will be coupled to the dtype.
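As an illustration, an operator-level test case might be declared as an eager module plus representative inputs. The names below are hypothetical; in practice FACTO would generate the input permutations across dtypes and shapes.

```python
import torch


class AddModel(torch.nn.Module):
    """Minimal eager-mode model exercising a single ATen operator."""

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return x + y


# A test case pairs the eager model with representative inputs; the harness
# would repeat this across dtypes and input permutations.
OPERATOR_TESTS = {
    "op_add": (AddModel(), (torch.randn(2, 8), torch.randn(2, 8))),
}
```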
2. Backend-Independent Configuration
This configuration controls model DTypes and whether the model is exported with dynamic shapes. Even in cases where backends do not support dynamic shapes, it is beneficial to validate that they do not attempt to partition nodes that they cannot handle.
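A minimal sketch of what this axis controls, using the standard torch.export API; the surrounding harness function and its parameters are assumptions for illustration.

```python
import torch
from torch.export import Dim, export


def export_for_test(model: torch.nn.Module, example_inputs, *, dtype, dynamic: bool):
    """Apply the backend-independent axis: cast the model and inputs to the
    target dtype and optionally export with a dynamic batch dimension."""
    model = model.to(dtype)
    example_inputs = tuple(x.to(dtype) for x in example_inputs)

    dynamic_shapes = None
    if dynamic:
        # Mark the leading (batch) dimension of each input as dynamic.
        batch = Dim("batch")
        dynamic_shapes = tuple({0: batch} for _ in example_inputs)

    return export(model, example_inputs, dynamic_shapes=dynamic_shapes)
```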
3. Backend Interface and Recipes
The backend interface is responsible for allowing backends to register quantization and lowering configurations. Quantization is coupled to the backend, but is considered a separate axis, such that we can test multiple quantization schemes against multiple lowering schemes independently. For example, we could test (no quantization, 8-bit, 4-bit) x (to_edge, to_edge_transform_and_lower), or perhaps different partitioner options.
When the high-level export API and recipes are available (Tarun's Recipe RFC), we may integrate with that work, as it provides effectively the same functionality. However, I’d like to avoid a hard dependency on it, so we will maintain test recipes independently for now. We can re-evaluate when the high-level export API is available.
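One possible shape for this registration point is sketched below; TestRecipe and register_backend_recipes are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TestRecipe:
    """A named quantization + lowering flow contributed by a backend.

    `lower` maps an exported program to an edge/ExecuTorch program (e.g. via
    to_edge or to_edge_transform_and_lower with the backend's partitioner);
    `quantize`, if provided, applies the backend's quantization scheme first.
    """
    name: str
    lower: Callable
    quantize: Optional[Callable] = None


_BACKEND_RECIPES: Dict[str, List[TestRecipe]] = {}


def register_backend_recipes(backend: str, recipes: List[TestRecipe]) -> None:
    """Entry point a backend uses to plug its flows into the shared test suites."""
    _BACKEND_RECIPES.setdefault(backend, []).extend(recipes)
```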
4. Device configurations
When integrating with an external executor, this configuration controls which devices are used to run benchmarks. There is necessarily a dependency on the specific backends used, as backends are largely hardware-dependent. The device configuration acts as one filter in this set, and the backend configuration also factors in, such that tests are run only on devices that pass all filters.
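As a minimal sketch of the combined filtering, with hypothetical names:

```python
def select_devices(
    configured_devices: set[str],
    backend_supported_devices: set[str],
) -> set[str]:
    """A test combination runs only on devices that pass every filter:
    the device-configuration axis and the backend's own support matrix."""
    return configured_devices & backend_supported_devices
```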
5. Runtime Interface
The runtime interface abstracts the underlying .pte execution mechanism. The test harness will provide a .pte file, expected outputs and tolerances, and any runtime configuration options (thread count, etc.). The runtime provider will be responsible for executing the .pte with the given inputs and validating that the outputs are within tolerance.
This design intends to provide an abstraction for the underlying runtime executor. Initially, we will use pybindings as the runtime executor. In a later milestone, we will add the option to run tests on-device using AWS device farm. This may be necessary to support all backends, and it provides a more realistic execution environment.
Question for reviewers: Which backends support pybindings currently? Are we going to enforce pybind support in ET (via simulators or similar) for all backends?
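As a rough sketch of a pybindings-based runtime provider, assuming the portable pybindings module is available; the exact entry points may differ across builds, so treat this as illustrative rather than a definitive implementation.

```python
import torch

# Assumes the portable pybindings extension is built and importable.
from executorch.extension.pybindings.portable_lib import _load_for_executorch


def run_pte(pte_path: str, inputs, expected_outputs, atol: float, rtol: float) -> bool:
    """Hypothetical runtime provider: load the .pte, execute it with the given
    inputs, and check the outputs against the expected values and tolerances."""
    module = _load_for_executorch(pte_path)
    outputs = module.forward(list(inputs))
    return all(
        torch.allclose(out, ref, atol=atol, rtol=rtol)
        for out, ref in zip(outputs, expected_outputs)
    )
```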
6. Metrics
As part of the test execution framework, we want to be able to collect metrics on the lowering, execution, and outputs. The test framework will include common logic for storing and aggregating metrics. Each metric should be recorded per-test case and aggregated for each configuration set.
Desired metrics include those listed under the design goals above: which operators were or weren't partitioned, the percentage of partitioned nodes on the model test suite, PTE size and delegate blob count/size, AOT preprocessing time, and runtime statistics.
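A sketch of how per-test-case recording and per-configuration aggregation might look; MetricStore and its methods are hypothetical names for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class MetricStore:
    """Records named metric values per test case and aggregates them for each
    configuration set (e.g. backend x flow x dtype)."""

    def __init__(self) -> None:
        self._values: Dict[str, Dict[str, List[float]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def record(self, config_key: str, metric: str, value: float) -> None:
        """Record one metric value for a single test case under a configuration."""
        self._values[config_key][metric].append(value)

    def aggregate(self, config_key: str) -> Dict[str, float]:
        """Aggregate (here: average) each metric across all test cases in a configuration."""
        return {m: mean(vs) for m, vs in self._values[config_key].items()}
```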
7. Outputs
The primary output of the test run is a report, which includes pass/fail information for each test, logging and output for failed tests, and individual and aggregated metrics for the run. At a high level, the goals are that it should be easy to view summary statistics, that the raw result and metric data are accessible for post-processing, and that it should be easy to debug failed tests.
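The report could be represented along these lines; this is a sketch, not a final schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TestResult:
    name: str
    config: str
    passed: bool
    metrics: Dict[str, float] = field(default_factory=dict)
    log: Optional[str] = None  # captured logging/output, kept for failed tests


@dataclass
class TestReport:
    results: List[TestResult]

    def summary(self) -> str:
        """Summary statistics; raw results remain available for post-processing."""
        passed = sum(r.passed for r in self.results)
        return f"{passed}/{len(self.results)} test combinations passed"
```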
Milestones for v0.7
The following features are proposed to be delivered with the ExecuTorch 0.7 milestone (mid/end of June, 2025).
Milestone 1 - end of May 2025:
Milestone 2 - end of June 2025:
Dependencies (XFN)
Risks