
Commit 6d90225

FMA: Interop Transaction Handling
1 parent 00bdbe5 commit 6d90225

File tree

1 file changed: +137 −0 lines changed

security/fma-interop-tx-handling.md

@@ -0,0 +1,137 @@

# Interop Transaction Handling: Failure Modes and Recovery Path Analysis

|        |              |
|--------|--------------|
| Author | Axel Kingsley |
| Created at | 2025-03-31 |
| Needs Approval From | |
| Other Reviewers | |
| Status | Draft |

## Introduction

This document covers new considerations for Chain Operators in Interop-enabled spaces when processing transactions.

## Context and Problem

In an OP Stack chain, block builders (Sequencers) build blocks from user-submitted transactions. Most
Chain Operators arrange their infrastructure defensively so that RPC requests aren't handled directly by the
Sequencer; instead, the mempool is built over P2P.

In an Interop-enabled context, Executing Messages (Interop Transactions) hold special meaning within the
protocol. For every Executing Message in a block, there must be a matching "Initiating Message" (a plain log event)
at the specified index, with a matching content hash.

Including an Executing Message which does *not* match an Initiating Message is invalid under the protocol.
The result is that the *entire block* containing the Invalid Message is replaced, producing a reorg of the chain
at this height.
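
To make the check concrete, here is a minimal sketch of what validating an Executing Message entails, using hypothetical types and a simplified content hash (the real interop spec's identifier encoding and hashing differ): look up the log at the claimed position, re-derive its hash, and compare.

```go
package interop

import (
	"bytes"

	"golang.org/x/crypto/sha3"
)

// Identifier and Log are simplified stand-ins for the interop spec's types.
type Identifier struct {
	ChainID   uint64 // chain the Initiating Message lives on
	BlockNum  uint64 // block containing the log
	LogIndex  uint32 // index of the log within the block
	Timestamp uint64 // timestamp of that block
}

type Log struct {
	Topics [][]byte // 32-byte event topics
	Data   []byte   // event data
}

// payloadHash re-derives the "content hash" from the log itself
// (simplified here: keccak256 over topics then data).
func payloadHash(l Log) []byte {
	h := sha3.NewLegacyKeccak256()
	for _, t := range l.Topics {
		h.Write(t)
	}
	h.Write(l.Data)
	return h.Sum(nil)
}

// checkExecutingMessage is valid only if an Initiating Message exists at the
// claimed Identifier AND its content hash matches the claimed hash.
// logAt is a placeholder for whatever lookup the Supervisor performs.
func checkExecutingMessage(id Identifier, claimedHash []byte, logAt func(Identifier) (Log, bool)) bool {
	l, ok := logAt(id)
	if !ok {
		return false // no log at that index: the Executing Message is Invalid
	}
	return bytes.Equal(payloadHash(l), claimedHash)
}
```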

Because the consequence of an Invalid Message is so severe, Chain Operators are highly incentivized to check Executing
Messages before they are added to a block. *However*, excessive checking can interrupt the
chain's regular forward progress. A balance must be struck in how messages are checked.

### Two Extremes

To understand the purpose of these decisions, let's consider the extreme validity-checking policies we could adopt:

**Check Every Message Exhaustively**
- In this model, every Executing Message is checked constantly, at a maximum rate, and each message is checked as close to block building as possible.
- The compute cost of checking every message during block building adds time to *every* transaction, blowing out our ability to build a block in under 2s.
- The validity of the included Executing Messages would be *as correct* as possible. However! Even *after* the block is built, the data being relied upon (cross-unsafe data) could change if the Initiating Chain suffers a reorg. So while this policy is *most correct*, it is not *totally correct*.

**Never Check Any Messages**
- In this model, we optimize only to avoid additional compute, simply trusting every message.
- Naturally, there is no impact on block building or any other process, BUT...
- Blocks would easily become invalid: an attacker could submit Invalid Messages, even just one per 2s, and prevent the Sequencer from ever building a valid block.

So, no matter what solution we pick, we accept *some* amount of uncertainty and take on *some* amount of additional compute load.

## The Solution Design

The [Interop Topology and Tx Flow for Interop Chains Design Doc](https://github.com/ethereum-optimism/design-docs/pull/218)
describes the solution design we plan to go with:

- All Executing Messages are checked once at `proxyd` ingress.
- All Executing Messages are checked once at Node Mempool ingress (not counting the Sequencer).
- All Executing Messages in Node Mempools are batched and checked on a regular interval (see the sketch below).
- If an Executing Message is ever Invalid, it is discarded and not retried.
- *No* checks are done at block building time.

This FMA describes the potential negative consequences of this design. We selected this design because it maximizes
the opportunities for Invalid Messages to be caught and discarded, while keeping the block-building hot path free of
any new compute.
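
As a concrete illustration of the mempool-side policy, the following sketch (a hypothetical interface and pool shape; the real Supervisor API and mempool internals differ) batches every pending Executing Message into one periodic check and permanently drops anything reported Invalid:

```go
package mempool

import (
	"context"
	"time"
)

// Supervisor is a placeholder for the node's check client; CheckMessages is
// assumed to validate a whole batch in one RPC round trip.
type Supervisor interface {
	CheckMessages(ctx context.Context, msgs [][32]byte) ([]bool, error)
}

// Pool holds pending Executing Messages awaiting inclusion.
type Pool struct {
	pending [][32]byte
	dropped map[[32]byte]bool // once Invalid, never retried; set by a constructor
}

// recheckLoop batches every pending Executing Message on a fixed interval.
func (p *Pool) recheckLoop(ctx context.Context, sup Supervisor, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if len(p.pending) == 0 {
				continue
			}
			ok, err := sup.CheckMessages(ctx, p.pending)
			if err != nil || len(ok) != len(p.pending) {
				// Supervisor unavailable: hold Interop Messages rather than
				// risk including an Invalid one (see FM3b's mitigation).
				continue
			}
			kept := p.pending[:0]
			for i, m := range p.pending {
				if ok[i] {
					kept = append(kept, m)
				} else {
					p.dropped[m] = true // discard and never retry
				}
			}
			p.pending = kept
		}
	}
}
```

Batching keeps the per-interval cost to one RPC round trip, and the `dropped` set is what prevents a malicious message from consuming more than one check against the Supervisor (see FM3c).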

## Failure Modes

## FM1: Checks Fail to Catch an Invalid Executing Message
- Description
    - Due to a bug in either the Supervisor or the Callers, an Invalid Executing Message
    was allowed into block building.
    - When this happens, the block which is built is invalid.
    - The Sequencer for the network can choose to build from the Replacement of the Invalid block,
    or from the Parent, if this is still the Unsafe Chain. If it continues to build from the Invalid
    block itself, all of those appended blocks will be dropped.
- Risk Assessment
    - Has no effect on our ability to process transactions. Supervisor effects are described in
    the Supervisor FMA.
    - Negative UX and Customer Perception from building invalid block content.

## FM2: Checks Discard Valid Messages
- Description
    - Due to a bug in either the Supervisor or the Callers, some or all Executing Messages
    aren't being included in blocks.
    - When this happens, nothing invalid is being produced by block builders, but no Interop
    Messages are being included.
- Risk Assessment
    - More Negative UX and Customer Perception if Interop Messages aren't making it into the chain.
    - Failed transactions would cause customers to redrive transactions, potentially overwhelming
    infrastructure capacity.

## FM3a: Transaction Volume causes DOS Failures of Proxyd
- Description
    - Due to the new validation requirements on `proxyd` to check Interop Messages, an influx of
    Interop Messages may arrive, causing `proxyd` to become overwhelmed with work.
    - When `proxyd` becomes overwhelmed, it may reject customer requests or crash outright, affecting
    liveness for Tx Inclusion and RPC.
- Risk Assessment
    - Low Impact; Medium Likelihood
    - If a `proxyd` instance goes down, we should be able to replace it quickly, as it is a stateless
    service.
    - When `proxyd` goes down, it takes outstanding requests with it, meaning some requests fail to be handled,
    but excess load is shed.
- Mitigations
    - `proxyd` could feature a pressure-relief setting, where if too much time is spent waiting on the Supervisor,
    no additional Interop Messages are accepted through this gateway (a sketch follows this list).
    - We should deploy at least one "Utility Supervisor" to respond to Checks from `proxyd` instances.
    The size and quantity of the Supervisor(s) could be scaled if needed. (Note: a Supervisor also requires Nodes
    of each network to function.)
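
One way the pressure-relief setting could be realized is a small latency-triggered valve, sketched below with illustrative names (no such `proxyd` option exists today): record the latency of each Supervisor check, and shed new Interop Messages while recent checks are over budget.

```go
package proxyd

import (
	"errors"
	"sync"
	"time"
)

// ErrInteropShed is returned at ingress while the valve is open.
var ErrInteropShed = errors.New("interop checks shed: supervisor too slow")

// Breaker sheds new Interop Messages when Supervisor checks exceed a latency
// budget, so slow checks never back up the rest of proxyd's traffic.
type Breaker struct {
	mu        sync.Mutex
	budget    time.Duration // max acceptable per-check latency
	backoff   time.Duration // how long to shed once tripped
	openUntil time.Time     // while in the future, reject Interop Messages
}

// Observe records the latency of one Supervisor check; a too-slow check
// opens the valve for the backoff window.
func (b *Breaker) Observe(latency time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if latency > b.budget {
		b.openUntil = time.Now().Add(b.backoff)
	}
}

// Allow reports whether a new Interop Message may be accepted right now.
func (b *Breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if time.Now().Before(b.openUntil) {
		return ErrInteropShed
	}
	return nil
}
```

Ingress would call `Allow` before checking a new Interop Message and `Observe` after each Supervisor round trip; ordinary, non-Interop traffic is never shed.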

## FM3b: Transaction Volume causes DOS Failures of Supervisor
- Description
    - In any section of infrastructure, calls to the Supervisor to Check a given Interop Message might overwhelm the
    Supervisor.
    - If this happens, the Supervisor may become slow to respond, slow to perform its other sync duties, or may crash outright.
    - When a Supervisor crashes, any connected Nodes can't keep their Cross-Heads up to date, and Managed Nodes won't get L1 updates either.
- Risk Assessment
    - Medium Impact; Low Likelihood
    - The Supervisor is designed to respond to Check requests. Even though it hasn't been load tested in realistic settings, there is very little computational overhead in responding to an RPC request.
    - Supervisors can be scaled and replicated to serve high-need sections of the infrastructure. Supervisors
    sync identically (assuming a matching L1), so two of them should be able to share traffic.
- Mitigations
    - When the Supervisor is down, any block builder or mempool filter *should* treat unavailability as
    a negative: to protect correctness, when the Supervisor goes down, don't include any Interop Messages (a sketch follows this list).
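
The fail-closed rule can be expressed as a small wrapper, sketched here with placeholder types and an illustrative timeout: any error or slow response from the Supervisor is reported as a negative result, never as a pass.

```go
package interop

import (
	"context"
	"time"
)

// Checker is a placeholder for the Supervisor RPC client.
type Checker interface {
	CheckMessage(ctx context.Context, msgHash [32]byte) (bool, error)
}

// failClosedCheck treats an unreachable or slow Supervisor as a negative
// result, so no Interop Message is included while the Supervisor is down.
func failClosedCheck(ctx context.Context, sup Checker, msgHash [32]byte) bool {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond) // illustrative budget
	defer cancel()
	ok, err := sup.CheckMessage(ctx, msgHash)
	if err != nil {
		return false // unavailability counts against inclusion, not for it
	}
	return ok
}
```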

## FM3c: Transaction Volume causes DOS Failures of Node
- Description
    - Transactions make it past `proxyd` and arrive at a Sentry Node or the Sequencer.
    - Due to the high volume of Interop Messages, the work to check Interop Messages delays
    the handling of other work, or causes the Node to crash outright.
- Risk Assessment
    - Medium/Low Impact; Low Likelihood
    - It's not good if a Sequencer goes down, but the average node can crash without issue.
    - Conductor Sets keep block production healthy even when a Sequencer goes down.
- Mitigations
    - Callers should use Batched RPC requests to the Supervisor when they are regularly validating groups
    of Transactions. This minimizes the network latency experienced, allowing other work to get done (a sketch follows this list).
    - Mempool transactions which fail checks should be dropped and not retried. This prevents malicious transactions
    from using more than one check against the Supervisor.
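
For the batching mitigation, go-ethereum's `rpc` client already supports batch calls; the sketch below assumes a hypothetical `supervisor_checkMessage` method name (the real Supervisor endpoint may be named differently) and validates a group of messages in one round trip.

```go
package interop

import (
	"context"

	"github.com/ethereum/go-ethereum/rpc"
)

// checkBatch validates a group of message hashes against the Supervisor in a
// single round trip, instead of one request per message.
func checkBatch(ctx context.Context, c *rpc.Client, hashes []string) ([]bool, error) {
	batch := make([]rpc.BatchElem, len(hashes))
	results := make([]bool, len(hashes))
	for i, h := range hashes {
		batch[i] = rpc.BatchElem{
			Method: "supervisor_checkMessage", // hypothetical method name
			Args:   []interface{}{h},
			Result: &results[i],
		}
	}
	if err := c.BatchCallContext(ctx, batch); err != nil {
		return nil, err // transport failure: the whole batch is unchecked
	}
	for i := range batch {
		if batch[i].Error != nil {
			results[i] = false // per-message RPC error: fail closed
		}
	}
	return results, nil
}
```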
