fma-withdrawals-root-in-block-header.md

Withdrawals Root in Block Header: Failure Modes and Recovery Path Analysis

Author: George Knee
Created at: 2025-03-20
Initial Reviewers: Mark Tyneway
Need Approval From: Tom Assas, Michael Amadi (Shadowing)
Status: Implementing Actions 🛫

Introduction

The "Withdrawals Root in Block Header" feature copies some information stored in the L2 blockchain state into the block header, making it a part of the history (information stored by all, including non-archive, nodes). The information in question is the L2toL1MessagePasser account storage root, and it is stored in the previously unused withdrawalsRoot field of the block header.

It allows proposals to be made and verified without needing to bear the cost of running an archive node.

Below are references for this project:

Failure Modes and Recovery Paths

FM1: Withdrawals downtime due to inaccurate withdrawalsRoot in the block header

  • Description: If the withdrawalsRoot in the block header is incorrect, critical infra used to enable withdrawals may fail. Namely, output proposals and challenges would be incorrect, affecting chains with permissioned proofs as well as chains with permissionless proofs. This is because, with the activation of the Isthmus fork, these components use the withdrawalsRoot header field instead of querying the information from an archive node in the usual way. Output roots are returned by the op-node optimism_outputAtBlock RPC method, which behaves differently under Isthmus: when handling a request for an output root, it no longer delegates an eth_getProof call to op-geth and instead just reads the information from the block header.

    Triggers:

    • A failed hardfork activation in the execution client.

    • An execution client bug. For example, the root could be (incorrectly) added to the header before the state is fully committed.

    • If we were ever to introduce non-empty withdrawals in the block body, this might override the mechanism introduced with this feature and invalidate the interpretation of the withdrawalsRoot field.

  • Risk Assessment:

    High impact, low likelihood

    Temporary downtime for withdrawals, or loss of funds if not remediated in time to challenge a malicious proposal.

    Mitigations:

    • We rely on e2e tests to check for consistency between the outputs returned from op-node and those constructed manually in the old way.

    • Instead of waiting for the failure mode to materialize and then writing a patch in a rush, we could add an optional config var to op-node to switch it back into the old behavior. Rolling out the fix would not then require any software releases.

    • op-geth could be modified to log a critical error (triggering an alert) if the withdrawals list in the body is ever non-empty.

  • Detection: Fault proof monitoring systems may not detect this failure mode until an actor running patched software makes a proposal or challenge.

  • Recovery Path(s): Fault proof infra would need to be pointed at a patched op-node. The patch would restore the old behavior for generating output roots.
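The source selection described in FM1, together with the optional fallback switch proposed under Mitigations, can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the op-node implementation: the `Header`, `Config`, and `ProofFetcher` types, the `ForceLegacyProof` flag, and the `MessagePasserStorageRoot` function are all hypothetical names introduced here, and `ProofFetcher` merely stands in for the eth_getProof round trip to op-geth.

```go
package main

import (
	"errors"
	"fmt"
)

// Hash is a 32-byte value such as a storage root.
type Hash [32]byte

// Header is a hypothetical, trimmed-down view of an L2 block header.
type Header struct {
	Number          uint64
	WithdrawalsRoot *Hash // assumed nil when the field is not populated
}

// ProofFetcher is a hypothetical stand-in for the legacy path, which asks the
// execution client (via eth_getProof) for the L2toL1MessagePasser storage root.
type ProofFetcher func(blockNumber uint64) (Hash, error)

// Config holds hypothetical switches. ForceLegacyProof models the optional
// config var suggested in the mitigations; it is not an existing op-node flag.
type Config struct {
	IsthmusActive    bool
	ForceLegacyProof bool
}

// ErrMissingWithdrawalsRoot is returned when the header path is selected but
// the header carries no withdrawalsRoot.
var ErrMissingWithdrawalsRoot = errors.New("header has no withdrawalsRoot")

// MessagePasserStorageRoot picks the source of the storage root: the block
// header under Isthmus, or the legacy proof path before Isthmus or when the
// fallback switch is set.
func MessagePasserStorageRoot(cfg Config, h Header, fetch ProofFetcher) (Hash, error) {
	if !cfg.IsthmusActive || cfg.ForceLegacyProof {
		return fetch(h.Number)
	}
	if h.WithdrawalsRoot == nil {
		return Hash{}, ErrMissingWithdrawalsRoot
	}
	return *h.WithdrawalsRoot, nil
}

func main() {
	headerRoot := Hash{0x01}
	stateRoot := Hash{0x02}
	h := Header{Number: 100, WithdrawalsRoot: &headerRoot}
	fetch := func(uint64) (Hash, error) { return stateRoot, nil }

	fast, _ := MessagePasserStorageRoot(Config{IsthmusActive: true}, h, fetch)
	legacy, _ := MessagePasserStorageRoot(Config{IsthmusActive: true, ForceLegacyProof: true}, h, fetch)
	fmt.Printf("header path: %x\nlegacy path: %x\n", fast[:1], legacy[:1])
}
```

With a switch of this kind, recovery would amount to a configuration change on the op-node that the fault proof infra points at, rather than a rushed patch release.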

FM2: Failure of p2p network due to bug in new topic/message serde logic

  • Description: Because this feature introduces a new p2p gossip topic and message serialization format, a bug could cause p2p gossip to fail for any chain with Isthmus active. This would cause an unsafe chain halt on affected nodes (but the safe chain would still progress).

  • Risk Assessment:

    High impact, low likelihood

    Mitigations: We rely on end-to-end testing (including fuzzing) to catch any bugs in this code path. We could run extended fuzzing campaigns.

  • Detection: Continuous integration, or Kurtosis and/or devnet testing, would catch this. Failing that, if the bug makes it to production, our alerting infrastructure would notify us.

  • Recovery Path(s): The bug would need to be patched and a new op-node release cut and rolled out.
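The kind of property that fuzzing the serde logic is meant to exercise can be illustrated with a round-trip check: every encoded message must decode back to the same value, and malformed input must be rejected rather than misread. The sketch below uses a deliberately simplified, hypothetical `Envelope` layout (32-byte root, 4-byte length, payload). It is not the actual op-node gossip wire format, and `Encode`, `Decode`, and `CheckRoundTrip` are names invented for this example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"math/rand"
)

// Envelope is a hypothetical gossip message used only for illustration.
type Envelope struct {
	WithdrawalsRoot [32]byte
	Payload         []byte
}

// ErrMalformed is returned when the input is too short or its length prefix
// does not match the number of payload bytes present.
var ErrMalformed = errors.New("malformed envelope")

// Encode lays the message out as: 32-byte root || 4-byte big-endian payload
// length || payload.
func Encode(e Envelope) []byte {
	out := make([]byte, 36+len(e.Payload))
	copy(out[:32], e.WithdrawalsRoot[:])
	binary.BigEndian.PutUint32(out[32:36], uint32(len(e.Payload)))
	copy(out[36:], e.Payload)
	return out
}

// Decode is the inverse of Encode and rejects truncated or oversized input.
func Decode(b []byte) (Envelope, error) {
	var e Envelope
	if len(b) < 36 {
		return e, ErrMalformed
	}
	copy(e.WithdrawalsRoot[:], b[:32])
	n := binary.BigEndian.Uint32(b[32:36])
	if uint64(len(b)-36) != uint64(n) {
		return e, ErrMalformed
	}
	e.Payload = append([]byte{}, b[36:]...)
	return e, nil
}

// CheckRoundTrip generates pseudo-random envelopes from the given seed and
// verifies that Decode(Encode(e)) reproduces e for each of them.
func CheckRoundTrip(seed int64, iterations int) error {
	rng := rand.New(rand.NewSource(seed))
	for i := 0; i < iterations; i++ {
		var e Envelope
		rng.Read(e.WithdrawalsRoot[:])
		e.Payload = make([]byte, rng.Intn(256))
		rng.Read(e.Payload)

		got, err := Decode(Encode(e))
		if err != nil {
			return fmt.Errorf("iteration %d: %w", i, err)
		}
		if got.WithdrawalsRoot != e.WithdrawalsRoot || !bytes.Equal(got.Payload, e.Payload) {
			return fmt.Errorf("iteration %d: round trip mismatch", i)
		}
	}
	return nil
}

func main() {
	if err := CheckRoundTrip(1, 1000); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("round trip ok")
}
```

The same property could also be expressed as a native Go fuzz target (`go test -fuzz`), which would let the toolchain mutate inputs for as long as a campaign is allowed to run.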

Generic failure modes:

See the generic FMA:

  • Chain halt at activation (there is a change to the engine API, which elevates this risk)
  • Activation failure
  • Invalid setImplementation execution
  • Chain split (across clients)

Specific Action Items

Generic Action Items

  • (BLOCKING): We have implemented extensive unit and end-to-end testing of the activation flow: https://github.com/ethereum-optimism/optimism/blob/develop/op-e2e/actions/upgrades/isthmus_fork_test.go
  • (BLOCKING): We have implemented multi-client testing with kurtosis and/or devnets to reduce the chance of bugs. This should be in the form of acceptance tests which target all client types in the network ethereum-optimism/optimism#15102
  • (BLOCKING): We should ensure that our usual suite of alerts applies to devnets and that the alerts are routed to the protocol engineers signing off on the devnet completion.
  • (BLOCKING): Run fuzzing on the v4 p2p gossip code path for longer than 10s (assignee: @Ethnical @geoknee) ethereum-optimism/optimism#15068
  • (BLOCKING): We tested the activation on our devnets.
  • (non-BLOCKING): Create a monitor that performs differential testing between the Merkle tree inclusion computation and the withdrawalsRoot read from the block header (assignee: @Ethnical). Tracking -> Monitoring Security-Issue
  • (non-BLOCKING): We have implemented fuzz testing in a kurtosis multi-client devnet to reduce the chance of bugs
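The differential monitoring item above could take roughly the following shape: for each block in a range, obtain the storage root from both sources and report any block where they disagree. This is only a sketch under assumptions. `RootSource`, `Mismatch`, and `DiffRoots` are hypothetical names, and the two sources are passed in as plain functions so that the example does not presume any particular RPC client.

```go
package main

import "fmt"

// Hash is a 32-byte value such as a storage root.
type Hash [32]byte

// RootSource is a hypothetical abstraction over one way of obtaining the
// L2toL1MessagePasser storage root for a block (header field or proof).
type RootSource func(blockNumber uint64) (Hash, error)

// Mismatch records a block at which the two sources disagree.
type Mismatch struct {
	BlockNumber uint64
	FromHeader  Hash
	FromProof   Hash
}

// DiffRoots compares both sources for every block in [from, to] and returns
// the disagreements. It stops at the first source error, returning whatever
// mismatches were found up to that point.
func DiffRoots(from, to uint64, header, proof RootSource) ([]Mismatch, error) {
	var out []Mismatch
	for n := from; n <= to; n++ {
		h, err := header(n)
		if err != nil {
			return out, fmt.Errorf("header root at block %d: %w", n, err)
		}
		p, err := proof(n)
		if err != nil {
			return out, fmt.Errorf("proof root at block %d: %w", n, err)
		}
		if h != p {
			out = append(out, Mismatch{BlockNumber: n, FromHeader: h, FromProof: p})
		}
	}
	return out, nil
}

func main() {
	good := Hash{0x01}
	bad := Hash{0xff}
	header := func(n uint64) (Hash, error) {
		if n == 3 {
			return bad, nil // simulate an incorrect header field at block 3
		}
		return good, nil
	}
	proof := func(uint64) (Hash, error) { return good, nil }

	mismatches, err := DiffRoots(1, 5, header, proof)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range mismatches {
		fmt.Printf("ALERT: withdrawalsRoot mismatch at block %d\n", m.BlockNumber)
	}
}
```

A non-empty result is the signal that would feed an alert, since it indicates the FM1 failure mode before any proposal or challenge depends on the affected block.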

Audit Requirements

An audit has not been deemed necessary.