# Interop Transaction Handling: Failure Modes and Recovery Path Analysis

| | |
|--------|--------------|
| Author | Axel Kingsley |
| Created at | 2025-03-31 |
| Needs Approval From | |
| Other Reviewers | |
| Status | Draft |

## Introduction

This document covers new considerations for Chain Operators on Interop-enabled networks when processing transactions.

## Context and Problem

In an OP Stack chain, block builders (Sequencers) build blocks from user-submitted transactions. Most
Chain Operators arrange their infrastructure defensively so that RPC requests aren't handled directly by the
Sequencer, instead building the mempool over P2P.

In an Interop-Enabled context, Executing Messages (Interop Transactions) hold special meaning within the
protocol. For every Executing Message in a block, there must be a matching "Initiating Message" (a plain log event)
which matches the specified index and content hash.

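To make the matching rule concrete, here is a minimal sketch in Go, using simplified illustrative types rather than the actual protocol structures (the real message identifier carries additional fields, such as the origin address, timestamp, and chain ID):

```go
package interopcheck

import (
    "bytes"

    "github.com/ethereum/go-ethereum/crypto"
)

// Identifier points at the Initiating Message (a log) on the source chain.
// Illustrative only: the real protocol identifier carries more fields.
type Identifier struct {
    ChainID     uint64
    BlockNumber uint64
    LogIndex    uint32
}

// ExecutingMessage pairs an Identifier with a hash of the log's content.
type ExecutingMessage struct {
    ID          Identifier
    PayloadHash [32]byte
}

// Validate checks the invariant: a log must exist at the specified index,
// and its content must hash to the message's PayloadHash. lookupLog is a
// stand-in for the Supervisor's indexed log database.
func Validate(msg ExecutingMessage, lookupLog func(Identifier) ([]byte, bool)) bool {
    payload, ok := lookupLog(msg.ID)
    if !ok {
        return false // no Initiating Message at that index: invalid
    }
    return bytes.Equal(crypto.Keccak256(payload), msg.PayloadHash[:])
}
```
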
Including an Executing Message which *does not* match an Initiating Message is considered invalid by the protocol.
The result is that the *entire block* containing this Invalid Message is replaced, producing a reorg on the chain
at this height.

Because the consequence of an Invalid Message is so severe, Chain Operators are highly incentivized to check Executing
Messages before they are added to a block. *However*, being excessive with these checks can interrupt the
chain's regular forward progress. A balance must be struck in checking messages.

### Two Extremes

To understand the purpose of these decisions, let's consider the extreme validity-checking policies we could adopt:

**Check Every Message Exhaustively**
- In this model, every Executing Message is checked constantly, at a maximum rate, and each message is checked as close to block building as possible.
- The compute cost to check every message during block building adds time to *every* transaction, blowing out our ability
to build a block in under 2s.
- The validity of the included Executing Messages would be *as correct* as possible. However! Even *after* the block is built, the data being relied upon (cross-unsafe data) could change on the Initiating Chain if it suffers a reorg.
So while this policy is *most correct*, it is not *totally correct*.

**Never Check Any Messages**
- In this model, we optimize only for avoiding any additional compute, instead just trusting every message.
- Naturally, there is no impact on block building or any other process, BUT...
- Blocks would easily become invalid: an attacker could submit Invalid Messages, even just one per 2s, and prevent the Sequencer from ever building a valid block.

So, no matter what solution we pick, we accept *some* amount of uncertainty and take on *some* amount of additional compute load.

## The Solution Design

The [Interop Topology and Tx Flow for Interop Chains Design Doc](https://github.com/ethereum-optimism/design-docs/pull/218)
describes the solution design we plan to go with:

- All Executing Messages are checked once at `proxyd` ingress.
- All Executing Messages are checked once at Node Mempool ingress (not counting the Sequencer).
- All Executing Messages in Node Mempools are batched and checked on a regular interval.
- If an Executing Message is ever Invalid, it is discarded and not retried.
- *No* checks are done at Block Building time.

This FMA describes the potential negative consequences of this design. We have selected this design because it maximizes
the opportunities for Invalid Messages to be caught and discarded, while keeping the block-building hot path free of new
compute work. A sketch of the ingress-time gate follows.

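As a rough illustration of the checks at `proxyd` and at mempool ingress, the gate might look like the sketch below. `SupervisorClient` and `extractExecutingMessages` are hypothetical stand-ins, not the Supervisor's actual RPC surface; `ExecutingMessage` is the illustrative type from the earlier sketch:

```go
package interopcheck

import "context"

// SupervisorClient is a hypothetical stand-in for the Supervisor's
// message-check RPC surface.
type SupervisorClient interface {
    CheckMessage(ctx context.Context, msg ExecutingMessage) (bool, error)
}

// AdmitTx gates a transaction at ingress (proxyd or mempool): every
// Executing Message it carries must pass a Supervisor check, or the
// transaction is dropped and not retried.
func AdmitTx(ctx context.Context, sup SupervisorClient, tx []byte) bool {
    for _, m := range extractExecutingMessages(tx) { // hypothetical parser for Interop Messages
        valid, err := sup.CheckMessage(ctx, m)
        if err != nil || !valid {
            return false // fail closed: drop on error or invalidity
        }
    }
    return true // transactions carrying no Interop Messages pass straight through
}
```
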
## Failure Modes

## FM1: Checks Fail to Catch an Invalid Executing Message
- Description
  - Due to a bug in either the Supervisor or the Callers, an invalid Executing Message
    was allowed into block building.
  - When this happens, the block which is built is invalid.
  - The Sequencer for the network can choose to build on the Replacement of the Invalid block,
    or on the Parent, if this is still the Unsafe Chain. If it continues to build on the Invalid
    block itself, all of those appended blocks will be dropped.
- Risk Assessment
  - Has no effect on our ability to process transactions. Supervisor effects are described in
    the Supervisor FMA.
  - Negative UX and Customer Perception from building invalid block content.

## FM2: Checks Discard Valid Messages
- Description
  - Due to a bug in either the Supervisor or the Callers, some or all Executing Messages
    aren't being included in blocks.
  - When this happens, nothing invalid is being produced by block builders, but no Interop
    Messages are being included.
- Risk Assessment
  - More Negative UX and Customer Perception if Interop Messages aren't making it into the chain.
  - Failed transactions would cause customers to redrive transactions, potentially overwhelming
    infrastructure capacity.

## FM3a: Transaction Volume Causes DOS Failures of Proxyd
- Description
  - Due to the new validation requirements on `proxyd` to check Interop Messages, an influx of
    Interop Messages may arrive, causing `proxyd` to become overwhelmed with work.
  - When `proxyd` becomes overwhelmed, it may reject customer requests or crash outright, affecting
    liveness for Tx Inclusion and RPC.
- Risk Assessment
  - Low Impact; Medium Likelihood
  - If a `proxyd` instance should go down, we should be able to replace it quickly, as it is a stateless
    service.
  - When `proxyd` goes down, it takes outstanding requests with it, meaning some requests fail to be handled,
    but excess load is shed.
- Mitigations
  - `proxyd` could feature a pressure-relief setting, where if too much time is spent waiting on the Supervisor,
    no additional Interop Messages will be accepted through this gateway (see the sketch after this list).
  - We should deploy at least one "Utility Supervisor" to respond to Checks from `proxyd` instances.
    The size and quantity of the Supervisor(s) could be scaled if needed. (Note: A Supervisor also requires Nodes
    of each network to function.)
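
One possible shape for that pressure-relief setting, sketched under the assumption that `proxyd` bounds in-flight Supervisor checks and sheds Interop Messages when the bound or a deadline is exceeded (this is not an existing `proxyd` feature):

```go
package interopcheck

import (
    "context"
    "errors"
    "time"
)

// ErrShedding is returned when the valve refuses to start a new check.
var ErrShedding = errors.New("interop checks at capacity; message rejected")

// ReliefValve bounds the number of in-flight Supervisor checks and the
// time any single check may take.
type ReliefValve struct {
    slots   chan struct{}
    timeout time.Duration
}

func NewReliefValve(maxInFlight int, timeout time.Duration) *ReliefValve {
    return &ReliefValve{slots: make(chan struct{}, maxInFlight), timeout: timeout}
}

// TryCheck runs check only if an in-flight slot is free, with a deadline.
// If no slot is free, the message is rejected up front: load is shed
// rather than queued behind a slow Supervisor.
func (v *ReliefValve) TryCheck(ctx context.Context, check func(context.Context) (bool, error)) (bool, error) {
    select {
    case v.slots <- struct{}{}:
        defer func() { <-v.slots }()
    default:
        return false, ErrShedding
    }
    cctx, cancel := context.WithTimeout(ctx, v.timeout)
    defer cancel()
    return check(cctx)
}
```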

## FM3b: Transaction Volume Causes DOS Failures of Supervisor
- Description
  - In any section of infrastructure, calls to the Supervisor to Check a given Interop Message might overwhelm the
    Supervisor.
  - If this happens, the Supervisor may become slow to respond, slow to perform its other sync duties, or may crash outright.
  - When a Supervisor crashes, any connected Nodes can't keep their Cross-Heads up to date, and Managed Nodes won't get L1 updates either.
- Risk Assessment
  - Medium Impact; Low Likelihood
  - The Supervisor is designed to respond to Check requests. Even though it hasn't been load tested in realistic settings, there is very little computational overhead when responding to an RPC request.
  - Supervisors can be scaled and replicated to serve high-need sections of the infrastructure. Supervisors
    sync identically (assuming a matching L1), so two of them should be able to share traffic.
- Mitigations
  - When the Supervisor is down, any block builder or mempool filter *should* treat unavailability as
    a failed check: to protect correctness, when the Supervisor goes down, include no Interop Messages (see the sketch after this list).
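
In code terms, the fail-closed policy is just a matter of collapsing errors and timeouts into "do not include" (reusing the hypothetical `SupervisorClient` from the earlier sketch):

```go
// Includable applies the fail-closed policy: only a successful, positive
// Supervisor response lets an Interop Message through. Errors, timeouts,
// and outages all read as "do not include".
func Includable(ctx context.Context, sup SupervisorClient, m ExecutingMessage) bool {
    valid, err := sup.CheckMessage(ctx, m)
    return err == nil && valid
}
```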

## FM3c: Transaction Volume Causes DOS Failures of Node
- Description
  - Transactions make it past `proxyd` and arrive at a Sentry Node or the Sequencer.
  - Due to the high volume of Interop Messages, the work to check Interop Messages causes
    a delay in the handling of other work, or causes the Node to crash outright.
- Risk Assessment
  - Medium/Low Impact; Low Likelihood
  - It's not good if a Sequencer goes down, but the average node can crash without issue.
  - Conductor Sets keep block production healthy even when a Sequencer goes down.
- Mitigations
  - Callers should use Batched RPC requests to the Supervisor when they are regularly validating groups
    of Transactions. This minimizes the network latency experienced, allowing other work to get done (see the batching sketch after this list).
  - Mempool transactions which fail checks should be dropped and not retried. This prevents malicious transactions
    from using more than one check against the Supervisor.
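
For the batching mitigation, go-ethereum's RPC client already supports batched calls; a sketch of how a mempool revalidation loop might use it follows. The method name `supervisor_checkMessage` is illustrative, not the Supervisor's actual endpoint:

```go
package interopcheck

import (
    "context"

    "github.com/ethereum/go-ethereum/rpc"
)

// BatchCheck validates a group of mempool messages in a single RPC round
// trip, instead of one network call per message.
func BatchCheck(ctx context.Context, c *rpc.Client, msgs []ExecutingMessage) ([]bool, error) {
    batch := make([]rpc.BatchElem, len(msgs))
    results := make([]bool, len(msgs))
    for i, m := range msgs {
        batch[i] = rpc.BatchElem{
            Method: "supervisor_checkMessage", // illustrative method name
            Args:   []interface{}{m},
            Result: &results[i],
        }
    }
    if err := c.BatchCallContext(ctx, batch); err != nil {
        return nil, err // transport failure: treat the whole batch as unverified
    }
    for i := range batch {
        if batch[i].Error != nil {
            results[i] = false // fail closed per message
        }
    }
    return results, nil
}
```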