RFC: Backend Test Infrastructure #11140
GregoryComer started this conversation in General
Introduction
The purpose of this RFC is to propose shared test suites and stress test infrastructure that can run against all ExecuTorch backends. These backends should be able to integrate with this framework with minimal effort and should cover all major capabilities such as partitioner modes, tensor dtypes, operator implementations, and quantization flows.
The initial goal of this effort is to ensure that all ExecuTorch backends meet a standard of functional reliability and correctness defined by ExecuTorch as a platform. Practically, this means being able to take any PT2 graph, partition only what a backend can handle, lower, generate .pte, load it, and run without errors or crashes on most relevant platforms for a given backend.
This document proposes a design aimed at the needs of the ExecuTorch GA release. However, we are intentionally limiting the implementation scope, as reflected in the milestones for the v0.7 release, so that we remain adaptable to GA requirements. Consequently, performance is explicitly a non-goal for the ET v0.7 release. Strict numerical correctness is also outside the scope of v0.7, as the primary focus is on functionality. Additionally, exhaustively measuring coverage across platforms and operating systems for each backend, such as Vulkan on Windows, is not part of the early goals. However, as highlighted earlier, we ultimately anticipate the proposed design supporting all of these aspects if needed for the GA release.
Notes for Reviewers
We’ve intentionally focused on high-level interfaces and goals in this document over implementation details. We expect that the implementation will not be particularly controversial, but we would be interested in points that you foresee as being potentially complex or problematic.
For GA, how do we set the bar for backend stability? Quantitatively, how do we weigh the different axes: functionality, performance (vs. competition), platform support, ease of enabling new models, debugging (performance/numerical issues), LLM-specific goals, etc.?
We are especially looking for feedback from backend authors on the following questions:
Are we testing for the right things?
Are the proposed backend integration points sufficient to cover all backend capabilities?
Are there additional use cases that we should consider?
How should we handle tolerances in a way that scales well across backends and quantization schemes?
Motivation
As ExecuTorch approaches its 1.0 GA release in October of this year, we have explicit goals around software reliability and the out-of-box experience.
Scope
Design Goals for 0.7
5.1 Which operators were or weren't partitioned
5.2 What percentage of nodes are partitioned on the model test suite
5.3 PTE size and delegate blob count/size in bytes
5.4 AOT preprocessing time
5.5 Runtime statistics
6.1 It should be easy to add additional metrics to the test job, as well as second-order metrics derived from these primary metrics
6.2 It should be easy to add a new test for one or all backends
6.3 It should be relatively straightforward to add a new backend
Potential design goals for GA include performance, numerical correctness, and platform coverage.
Non-Goals for 0.7
Motivating Use Cases
The proposed design should serve to support a number of related use cases for backend validation. Some specific motivating examples include:
Design
The proposed design involves a core test harness, which takes a matrix of configurations as input. The output of the test run is a test report, which includes pass/fail status for all tests and any additional collected metrics. The test harness should be able to run ad-hoc locally or be integrated into CI jobs.
Configuration Matrix
The configuration matrix determines which tests to run, which quantization and lowering flows to use, and which runtime targets to run on. Each axis takes a set of one or more configurations. The test harness is responsible for evaluating each combination in the Cartesian product of the configurations.
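As a rough sketch of how the harness might expand this matrix, consider the snippet below. All names here (TestCombination, expand_matrix) are illustrative, not a final API.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass(frozen=True)
class TestCombination:
    """One concrete combination drawn from the configuration matrix."""
    test_name: str  # e.g. "op_add" or "model_mobilenet_v3"
    dtype: Any      # e.g. torch.float32, torch.float16
    flow: str       # quantization/lowering recipe registered by the backend
    runtime: str    # e.g. "pybindings" or a device-farm target


def expand_matrix(
    tests: Sequence[str],
    dtypes: Sequence[Any],
    flows: Sequence[str],
    runtimes: Sequence[str],
) -> list[TestCombination]:
    """Evaluate every combination in the Cartesian product of the axes."""
    return [
        TestCombination(t, d, f, r)
        for t, d, f, r in itertools.product(tests, dtypes, flows, runtimes)
    ]
```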
Orthogonal Configuration Axes:
Tests
Backend-independent configuration
Backend interface and recipes (quantization and lowering)
Device configurations
Runtime interface
Let’s talk about each in a bit more detail.
1. Tests
Each test is defined by an eager-mode model and a set of test inputs. The test itself is intended to be orthogonal to the backend, quantization, and runtime configuration. We may need to relax tolerances between backends, but we expect functional correctness and reliability across all backends and configurations, which is what this effort aims to achieve.
We intend to introduce two primary test suites: models and operators. The operator test suite should cover the entire ATen op set, and any other operators that are common. In practice, we see many non-core ops that are not decomposed, and there is a general de-emphasis on a single op-set, so we will want to add operators as we see them.
Model tests will leverage our existing example models as a baseline, but it should be easy to integrate external model libraries, such as HF Transformers. We will also need to create artificial models to validate specific configurations, such as dynamically quantized linears with multiple consumers, or other cases we have seen problems with in the past.
For operator-level tests, we anticipate using FACTO to generate input permutations, including dtypes. For model-level tests, we will likely need to manage dtype as an independent axis, and input tensor generation will be coupled to the dtype.
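As an illustration, an operator-level test case might be declared as an eager module plus representative inputs. The names below are hypothetical; in practice FACTO would generate the input permutations across dtypes and shapes.

```python
import torch


class AddModel(torch.nn.Module):
    """Minimal eager-mode model exercising a single ATen operator."""

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return x + y


# A test case pairs the eager model with representative inputs; the harness
# would repeat this across dtypes and input permutations.
OPERATOR_TESTS = {
    "op_add": (AddModel(), (torch.randn(2, 8), torch.randn(2, 8))),
}
```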
2. Backend-Independent Configuration
This configuration controls model DTypes and whether the model is exported with dynamic shapes. Even in cases where backends do not support dynamic shapes, it is beneficial to validate that they do not attempt to partition nodes that they cannot handle.
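A minimal sketch of what this axis controls, using the standard torch.export API; the surrounding harness function and its parameters are assumptions for illustration.

```python
import torch
from torch.export import Dim, export


def export_for_test(model: torch.nn.Module, example_inputs, *, dtype, dynamic: bool):
    """Apply the backend-independent axis: cast the model and inputs to the
    target dtype and optionally export with a dynamic batch dimension."""
    model = model.to(dtype)
    example_inputs = tuple(x.to(dtype) for x in example_inputs)

    dynamic_shapes = None
    if dynamic:
        # Mark the leading (batch) dimension of each input as dynamic.
        batch = Dim("batch")
        dynamic_shapes = tuple({0: batch} for _ in example_inputs)

    return export(model, example_inputs, dynamic_shapes=dynamic_shapes)
```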
3. Backend Interface and Recipes
The backend interface is responsible for allowing backends to register quantization and lowering configurations. Quantization is coupled to the backend, but is considered a separate axis, such that we can test multiple quantization schemes against multiple lowering schemes independently. For example, we could test (no quantization, 8-bit, 4-bit) x (to_edge, to_edge_transform_and_lower), or perhaps different partitioner options.
When the high-level export API and recipes are available (Tarun's Recipe RFC), we may integrate with that work, as it provides effectively the same functionality. However, I’d like to avoid a hard dependency on it, so we will maintain test recipes independently for now. We can re-evaluate when the high-level export API is available.
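One possible shape for this registration point is sketched below; TestRecipe and register_backend_recipes are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TestRecipe:
    """A named quantization + lowering flow contributed by a backend.

    `lower` maps an exported program to an edge/ExecuTorch program (e.g. via
    to_edge or to_edge_transform_and_lower with the backend's partitioner);
    `quantize`, if provided, applies the backend's quantization scheme first.
    """
    name: str
    lower: Callable
    quantize: Optional[Callable] = None


_BACKEND_RECIPES: Dict[str, List[TestRecipe]] = {}


def register_backend_recipes(backend: str, recipes: List[TestRecipe]) -> None:
    """Entry point a backend uses to plug its flows into the shared test suites."""
    _BACKEND_RECIPES.setdefault(backend, []).extend(recipes)
```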
4. Device configurations
When integrating with an external executor, this configuration controls which devices are used to run benchmarks. There is necessarily a dependency on the specific backends used, as backends are largely hardware-dependent. The device configuration acts as one filter in this set, and the backend configuration also factors in, such that tests are run only on devices that pass all filters.
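As a minimal sketch of the combined filtering, with hypothetical names:

```python
def select_devices(
    configured_devices: set[str],
    backend_supported_devices: set[str],
) -> set[str]:
    """A test combination runs only on devices that pass every filter:
    the device-configuration axis and the backend's own support matrix."""
    return configured_devices & backend_supported_devices
```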
5. Runtime Interface
The runtime interface abstracts the underlying .pte execution mechanism. The test harness will provide a .pte file, expected outputs and tolerances, and any runtime configuration options (thread count, etc.). The runtime provider will be responsible for executing the .pte with the given inputs and validating that the outputs are within tolerance.
This design intends to provide an abstraction for the underlying runtime executor. Initially, we will use pybindings as the runtime executor. In a later milestone, we will add the option to run tests on-device using AWS device farm. This may be necessary to support all backends, and it provides a more realistic execution environment.
Question for reviewers: Which backends support pybindings currently? Are we going to enforce pybind support in ET (via simulators or similar) for all backends?
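As a rough sketch of a pybindings-based runtime provider, assuming the portable pybindings module is available; the exact entry points may differ across builds, so treat this as illustrative rather than a definitive implementation.

```python
import torch

# Assumes the portable pybindings extension is built and importable.
from executorch.extension.pybindings.portable_lib import _load_for_executorch


def run_pte(pte_path: str, inputs, expected_outputs, atol: float, rtol: float) -> bool:
    """Hypothetical runtime provider: load the .pte, execute it with the given
    inputs, and check the outputs against the expected values and tolerances."""
    module = _load_for_executorch(pte_path)
    outputs = module.forward(list(inputs))
    return all(
        torch.allclose(out, ref, atol=atol, rtol=rtol)
        for out, ref in zip(outputs, expected_outputs)
    )
```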
6. Metrics
As part of the test execution framework, we want to be able to collect metrics on the lowering, execution, and outputs. The test framework will include common logic for storing and aggregating metrics. Each metric should be recorded per-test case and aggregated for each configuration set.
Desired metrics include those listed under the design goals above: which operators were or weren't partitioned, the percentage of partitioned nodes on the model test suite, PTE size and delegate blob count/size, AOT preprocessing time, and runtime statistics.
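A sketch of how per-test-case recording and per-configuration aggregation might look; MetricStore and its methods are hypothetical names for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class MetricStore:
    """Records named metric values per test case and aggregates them for each
    configuration set (e.g. backend x flow x dtype)."""

    def __init__(self) -> None:
        self._values: Dict[str, Dict[str, List[float]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def record(self, config_key: str, metric: str, value: float) -> None:
        """Record one metric value for a single test case under a configuration."""
        self._values[config_key][metric].append(value)

    def aggregate(self, config_key: str) -> Dict[str, float]:
        """Aggregate (here: average) each metric across all test cases in a configuration."""
        return {m: mean(vs) for m, vs in self._values[config_key].items()}
```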
7. Outputs
The primary output of the test run is a report, which includes pass/fail information for each test, logging and output for failed tests, and individual and aggregated metrics for the run. At a high level, the goals are that it should be easy to view summary statistics, that the raw result and metric data are accessible for post-processing, and that it should be easy to debug failed tests.
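The report could be represented along these lines; this is a sketch, not a final schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TestResult:
    name: str
    config: str
    passed: bool
    metrics: Dict[str, float] = field(default_factory=dict)
    log: Optional[str] = None  # captured logging/output, kept for failed tests


@dataclass
class TestReport:
    results: List[TestResult]

    def summary(self) -> str:
        """Summary statistics; raw results remain available for post-processing."""
        passed = sum(r.passed for r in self.results)
        return f"{passed}/{len(self.results)} test combinations passed"
```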
Milestones for v0.7
The following features are proposed to be delivered with the ExecuTorch 0.7 milestone (mid/end of June, 2025).
Milestone 1 - end of May 2025:
Milestone 2 - end of June 2025:
Dependencies (XFN)
Risks