When L2s Fall

The goal of L2 protocols is to provide a better blockchain user experience by reducing costs and latency compared to transacting on L1 mainnet Ethereum. If you’ve spent any time in the blockchain space, you know about the blockchain trilemma.

[Figure: the blockchain trilemma]

Most, if not all, L2 chains built on mainnet Ethereum focus on scalability at the cost of decentralization. In contrast to mainnet Ethereum, top L2 chains require all transactions to pass through a single bottleneck, the L2 sequencer. The role of the L2 sequencer is to take L2 transactions, build L2 blocks, and submit the L2 blocks onto the L1 chain. The sequencer role is crucial to L2 uptime: if the sequencer or an associated system component goes down, the chain goes down. This article investigates major L2 downtime events across different L2 chains. Minor downtime events of under 30 minutes were not included, because these minor incidents are more common and harder to identify.
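As a concrete picture of that single bottleneck, here is a minimal, hypothetical Python sketch of the pipeline described above; the class and function names (for example post_batch_to_l1) are invented for illustration and do not correspond to any chain's actual sequencer code.

```python
# Minimal sketch of the centralized pipeline described above: a single
# process accepts L2 transactions, builds L2 blocks, and submits block data
# to L1. Nothing here is real chain code; the names are purely illustrative.
# The point is that one loop carries the whole chain's liveness: if it
# stops, block production stops.

import time
from typing import List


class CentralizedSequencer:
    def __init__(self) -> None:
        self.mempool: List[dict] = []    # pending L2 transactions
        self.l2_blocks: List[dict] = []  # blocks built so far

    def receive_transaction(self, tx: dict) -> None:
        # Every user transaction funnels through this single component.
        self.mempool.append(tx)

    def build_block(self) -> dict:
        block = {
            "number": len(self.l2_blocks),
            "timestamp": int(time.time()),
            "transactions": self.mempool[:],
        }
        self.mempool.clear()
        self.l2_blocks.append(block)
        return block

    def post_batch_to_l1(self, block: dict) -> None:
        # Placeholder for compressing and submitting block data to an L1 contract.
        print(f"posting L2 block {block['number']} "
              f"({len(block['transactions'])} txs) to L1")


def run(sequencer: CentralizedSequencer) -> None:
    # Single point of failure: if this loop halts, no new L2 blocks exist.
    while True:
        sequencer.post_batch_to_l1(sequencer.build_block())
        time.sleep(2)
```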

Centralized Sequencer Weakness

Murphy’s law states that anything that can go wrong will go wrong. Many L2s have experienced sequencer downtime, and if current trends continue, this will not change until a more decentralized sequencer approach is adopted. Some newer L2 chains do use decentralized sequencer models, mostly leveraging the HotStuff BFT consensus protocol, but such design differences come with their own tradeoffs.

One could imagine that a worst case scenario for an L2 chain is unlimited downtime, e.g. the chain goes down and the sequencer is never restarted. In such a scenario, assets would be locked on the chain with no way out. Luckily, L2Beat has a “Sequencer Failure” category to track which L2 chains support a fallback escape hatch if this scenario arises. Unfortunately, most chains have not made it clear exactly how this fallback escape hatch would work in reality, and there is no way to test any of these escape hatches to verify that they will work when needed.

The paranoid among us may imagine a worst case scenario where the sequencer server is compromised by black hats and the L2 team is somehow prevented from bringing a new sequencer back online. The entire TVL of the L2 chain could be held for ransom. A similar scenario could occur should the management of an L2 sequencer turn malicious and seek a way to extract value from users on the chain. Luckily, no such worst case scenario has occurred to date, but the idea of so much value relying on the trust of a few individuals or entities is problematic in the long run.

In an effort to highlight the risks of such a centralized design, this article is a first attempt to examine the causes of unplanned L2 sequencer downtime across different L2s. After examining the root cause of specific incidents, a brief explanation is given of the data collection methods used.

For clarification, a common example of L2 chain downtime that is not examined is planned maintenance downtime. Some examples of this include the OP Bedrock upgrade, which had an expected downtime of 4 hours, and the Arbitrum Nova upgrade, which had an expected downtime of 2-4 hours.

L2 Downtime Incidents (ordered chronologically)

Arbitrum: September 14, 2021

This 45-minute downtime incident was caused by a sudden increase in submitted transactions. Details on this incident are sparse, but perhaps that is not surprising given that it occurred in the early days (first six months) of Arbitrum.

Takeaway: Prepare to handle sudden increases in transaction volume.

Blocks: Downtime lasted from block 828469 until block 828470.

Arbitrum: January 9, 2022

This downtime incident may be unique because it was caused by hardware failure, not software failure. While backup measures for this scenario were supposedly in place, those measures failed due to a poorly timed software update.

Takeaway: Verify that backup measures are bulletproof, even in combined failure scenarios (e.g. hardware failure + software update).

Blocks: Downtime lasted from block 4509809 until block 4509828.

Polygon: March 11, 2022

This downtime incident started as a scheduled maintenance window but lasted longer than most users expected. The exact duration is either around 5 hours or 11 hours, depending on whether you use blockchain explorer timestamps or news articles as an information source. The main issue was in the consensus layer, where the Heimdall validators (Heimdall is a Tendermint fork) were unable to reach 2/3 consensus. The Bor component of the validators (a geth fork) was unaffected, but could not continue building blocks while consensus could not be reached. A temporary hotfix helped get the chain moving again, but it is unclear what the long-term solution was for the consensus issue in Heimdall.
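As a rough illustration (with invented voting-power numbers) of why the execution layer stalls when consensus stalls, here is a sketch of the Tendermint-style 2/3 voting-power check that a Heimdall-like consensus layer requires before a block can be committed.

```python
# Illustrative sketch of a Tendermint-style 2/3 quorum check; the voting
# power values are made up. While signed voting power stays at or below
# the 2/3 threshold, no block can be committed, which is why the execution
# layer (Bor) had nothing new to build on during the incident.

def has_quorum(signed_power: int, total_power: int) -> bool:
    # Committing a block requires strictly more than 2/3 of total voting power.
    return 3 * signed_power > 2 * total_power


total_voting_power = 90
signed_voting_power = 59  # e.g. some validators stuck on a diverged view

if has_quorum(signed_voting_power, total_voting_power):
    print("commit block")
else:
    print("halt: waiting until more than 2/3 of voting power agrees")
```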

Takeaway: Thoroughly test new software so that maintenance window operations go smoothly.

Blocks: Delays lasted from block 25811391 until block 25811439.

zkSync: April 1, 2023

This incident resulted in over 4 hours of downtime, even though the fix for the issue only took 5 minutes. One reason for the slow response is that the monitoring for the database that went down failed to work as expected. The protocol tweeted “The database health alert did not trigger because it could not connect to it to collect metrics.” Another reason given for the delay is that the entire team was at an off-site, so they did not have engineers spread across multiple time zones as usual.

Takeaway: Test that monitoring systems trigger alerts in all failure modes, including when the monitored component is unreachable. Make sure that the on-call engineer is always on call, even during team off-sites.

Blocks: Downtime lasted from block 5308 until block 5312.

Optimism: April 26, 2023

While the sequencer remained online, this two-hour incident was caused by a sudden 10x increase in transactions, which led to longer delays in Optimism’s read-only replica and therefore delayed inclusion of transactions. Users experienced transaction confirmation delays of several minutes, even though their transactions had already been processed. The solutions going forward were to add monitoring for such events and to design the then-upcoming Bedrock upgrade to handle such scenarios. Bedrock handles this scenario better by using a fixed two-second block time (instead of building a new block for every transaction) and by replacing the read-only replica design with a more efficient way of indexing blocks using a P2P gossip network. Despite Optimism publishing a postmortem writeup for this incident, it is not listed as an incident on the Optimism status page.

[Figure: block indexing delay chart from the Optimism postmortem]

The y-axis indicates the delay (in seconds) for indexing of blocks after the sequencer processed them. Note the increase on April 26.

Takeaway: Stress test systems at 10x or more of average traffic.

Blocks: Unlike the other incidents in this list, the exact blocks impacted during this incident are unclear. The block timestamps on Optimism L2 remained consistent during the stated time period of this incident. Likewise, the batch submitter on mainnet Ethereum continued submitting transactions during the stated time period without any unusual gaps. One possible explanation is that transactions could be processed faster by the sequencer than they could be included by the batch submitter. In other words, the high transaction volume processed by the sequencer meant the batch submitter could not keep up, delaying inclusion of some transactions.
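To make that explanation concrete with some invented numbers: if the sequencer accepts transactions faster than the batch submitter can post them to L1, the unposted backlog grows for as long as the surge lasts, even though every transaction has already been processed from the user’s perspective. The throughput figures below are purely hypothetical.

```python
# Toy model with invented throughput numbers: the sequencer keeps processing
# transactions during a surge, but anything the batch submitter cannot post
# to L1 accumulates as an unposted backlog.

sequencer_rate = 100  # txs/s processed by the sequencer (hypothetical)
submitter_rate = 60   # txs/s the batch submitter can include on L1 (hypothetical)

backlog = 0
for minute in range(1, 11):  # a 10-minute surge
    backlog += (sequencer_rate - submitter_rate) * 60
    print(f"after {minute:>2} min: {backlog} txs processed but not yet posted to L1")
```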

Arbitrum: June 7, 2023

This downtime incident lasted nearly two hours. The problem was not the sequencer, which continued working for the entirety of the incident. The issue was that the L1 batch poster (AKA the batch submitter, per the etherscan label), which submits L2 data to mainnet Ethereum, stopped doing so. The reason no transactions were submitted to mainnet Ethereum was that a bug introduced in PR #1640 caused the batch poster to use an incorrect block nonce, and when a certain transaction queue length was reached, the batch poster halted. The configuration which caused the batch poster to halt was an old temporary workaround that should have been removed earlier. To resolve this issue, the Redis storage was cleared and the batch poster was restarted on an older version without the newly introduced bug. Because the sequencer was functioning normally during this time, the timestamps of blocks on Arbitrum look normal, with many blocks produced during the period when the batch poster was down (blocks 98671050 to 98696350). There are several PRs mitigating this issue (#1682, #1684, and #1685).

Takeaway: Remove any unnecessary logic that can lead to denial-of-service conditions, such as a paused chain.

Blocks: Downtime of the Arbitrum batch poster, during which no transactions were sent to mainnet Ethereum, lasted from block 17427658 until block 17428180. After a restart, there was another period of delay and downtime between blocks 17428239 and 17428486.

Arbitrum: December 15, 2023

This downtime incident lasted just over an hour, but the side effects went beyond sequencer downtime. The root cause was a surge in transactions from inscriptions combined with a syncing bug in the Ethereum consensus client that the sequencer relied on. If the consensus client had synced normally, as was the case after the downtime incident, the chain would likely have been able to handle the increased number of transactions. The bug caused the backlog to grow in a way that metrics did not properly capture, and presumably did not trigger alerts, resulting in some servers running out of memory and a halted sequencer.

To resolve the issue, a new sequencer build was deployed and the servers that ran out of memory were restarted. There are several commits that are related to the fix for this issue (1, 2, 3, 4).

A separate issue that resulted from this incident was an imbalance in the gas pricing mechanism. Users paid too little gas at the start of the surge of transactions, and the resulting imbalance caused gas fees to jump in an effort to compensate. The Arbitrum Foundation decided to pay off the gas fee deficit and investigate improvements to its gas pricing mechanics. For curious readers, a deeper technical examination of this incident was performed by Dedaub.

[Figure: Arbitrum sequencer backlog chart, December 15, 2023]

After the sequencer fix was applied, the backlog was able to return to high levels without problems.

Takeaway: Monitoring and alerting should be applied to all parts of the system to properly identify anomalies in realtime. Prepare to handle sudden increases in transaction volume.

Blocks: Delays lasted from block 160384179 until block 160385961.

zkSync: December 25, 2023

This incident involved over 4 hours of chain downtime. Similar to the bad luck of the April 1 zkSync downtime incident, the developer team was not working normal hours during this event because of the Christmas holiday. The issue was caused by a combination of two factors. The first factor was that the operator calculated the state update incorrectly. The second factor was an unnecessary safety state that was then triggered. The operator’s output must match the state update calculated by the L1 contracts, and because the two did not match, the safety state was triggered, causing the sequencer to wait for manual intervention. The resolution was to fix the operator’s state update bug and to remove the safety state, which had been designed for an earlier codebase version and was no longer necessary.
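To illustrate the failure mode in code, here is a hedged sketch (not zkSync’s actual implementation) of an operator whose computed state update is compared against the one derived from the L1 contracts, with a safety state that halts the sequencer on any mismatch instead of simply rejecting the bad batch. All names and values are hypothetical.

```python
# Hedged sketch (not zkSync's actual code) of the two interacting factors:
# an operator-side miscalculation of the state update, plus a safety state
# that halts the whole sequencer and waits for manual intervention whenever
# the operator's result disagrees with the L1-derived state update.

import hashlib


def state_commitment(state: dict) -> str:
    # Stand-in for a real state commitment such as a Merkle root.
    return hashlib.sha256(repr(sorted(state.items())).encode()).hexdigest()


def apply_batch(operator_state: dict, l1_state: dict, batch: dict) -> None:
    operator_state.update(batch)
    l1_state.update(batch)

    # Hypothetical operator bug: one value drifts from the correct update.
    operator_state["balance:0xabc"] += 1

    if state_commitment(operator_state) != state_commitment(l1_state):
        # The "unnecessary safety state": rather than rejecting this batch,
        # the sequencer stops entirely until a human steps in.
        raise SystemExit("state mismatch: sequencer halted, manual intervention required")


operator_state = {"balance:0xabc": 100}
l1_state = {"balance:0xabc": 100}
apply_batch(operator_state, l1_state, {"balance:0xdef": 5})
```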

Takeaway: Remove any unnecessary logic that can lead to denial-of-service conditions, such as a paused chain.

Blocks: Downtime lasted from block 363034 until block 363035. The proving timestamp in the explorer is where the downtime is visible.

Optimism: February 15, 2024

This two-hour incident required an upgrade to the sequencer. The initial fix did not resolve the chain stability issues, so a secondary fix was needed. One of the comments in the incident notice states “we are waiting for the sequencer to come back online”, but the L2 timestamps do not actually show any sequencer downtime (unless the block timestamps in the explorer aren’t accurate and were somehow inferred). The downtime is instead visible in the irregular cadence of transactions from the batch submitter on Ethereum L1. Node operators with a stuck node needed to perform a full node restart. It is unclear if any commits to the Optimism monorepo on February 15 are related to this downtime incident, but there is one odd PR without any explanation that could be related.

Takeaway: None; the exact sequencer fix was not explained.

Blocks: Downtime and irregular transaction intervals from the Optimism batch submitter lasted from block 19231012 until block 19231545. The most notable period of complete downtime lasted from block 19231012 until block 19231145, with a shorter downtime period from block 19231183 until block 19231255.

Linea: June 2, 2024

This downtime incident of about 90 minutes may be the first known example of a purposeful pause of an L2 sequencer outside of a planned maintenance period. The cause of the sequencer pause was not a bug in the L2 chain itself, but rather a bug in a smart contract deployed on the chain. The protocol team explained that the sequencer was shut down to minimize the financial impact on Linea TVL during the hack of the Velocore DEX. Not surprisingly, there was some criticism after this event that a voluntary pause of an L2 chain could happen due to a unilateral decision by the protocol team.

Takeaway: If an L2 chain maintains a policy of pausing the chain in the event of a hack, users should be informed of this policy in advance.

Blocks: Downtime lasted from block 5081800 until block 5081801.

Data Collection Methods

If a researcher wishes to identify L2 downtime events using only on-chain data, there are a couple of options. The easiest shortcut is to use the Average Block Time Chart found on some L2 etherscan pages. Some chains (Optimism, Arbitrum) do not have this chart currently (although Arbitrum did earlier this month, so maybe it’s only temporarily gone?). If the L2 etherscan site does have this data, anomalies in the chart are normally caused by downtime incidents. One example is shown below, where the Polygon chart shows the increased block times caused by the downtime on March 10-11, 2022.

[Figure: Polygon Average Block Time Chart showing increased block times during the March 2022 downtime]

A second option for identifying downtime events relies on data collection from public blockchain RPC endpoints. The basic idea is to query the timestamps of a sequence of blocks and check whether the difference between block timestamps is outside of the normal range. The difficulty is in identifying what the normal range is. Even mainnet Ethereum does not generate a block every 12 seconds like clockwork. If mainnet Ethereum misses a slot, there may be a 24-second difference between block timestamps, or even more if multiple slots are missed in sequence. For readers who prefer hard proof, missed slots are visualized as red squares on beaconcha.in. Two examples are mainnet Ethereum blocks 19521620 and 19522751, which each have a gap of 36 seconds from the previous block. These blocks were identified using this small custom script, which proved very useful in identifying the relevant block numbers for each incident above. A more advanced tool to find block timestamp anomalies is here and is mentioned in this detailed twitter thread.
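For readers who want to reproduce this kind of check themselves, here is a minimal sketch of the idea (not the exact script linked above); the RPC URL, block range, and 60-second threshold are placeholder assumptions.

```python
# Minimal sketch (not the script referenced above) of flagging unusually
# large block timestamp gaps via a public JSON-RPC endpoint. The RPC URL,
# block range, and threshold below are placeholder assumptions.

import requests

RPC_URL = "https://eth.llamarpc.com"  # any Ethereum (or L2) JSON-RPC endpoint


def block_timestamp(number: int) -> int:
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getBlockByNumber",
        "params": [hex(number), False],
    }
    block = requests.post(RPC_URL, json=payload, timeout=10).json()["result"]
    return int(block["timestamp"], 16)


def find_gaps(start_block: int, end_block: int, threshold_seconds: int = 60) -> None:
    prev = block_timestamp(start_block)
    for n in range(start_block + 1, end_block + 1):
        ts = block_timestamp(n)
        if ts - prev > threshold_seconds:
            print(f"block {n}: {ts - prev}s since the previous block")
        prev = ts


# Example: scan a small mainnet range for long gaps between blocks.
find_gaps(19521600, 19521650)
```

The threshold needs tuning per chain: a 60-second gap is an anomaly on mainnet Ethereum, while an L2 with block times of a few seconds or less would need a much lower cutoff.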

The primary data collection method used to find the incidents above was manual searching on search engines with different keywords. Some AI search tools assisted as well. None of the L2 chains examined were found to have a single resource listing all downtime events for the chain. Some chains have status pages (Optimism, Arbitrum, Linea, zkSync, etc.), but they are often light on details about downtime incidents. Detailed postmortems are often found only in twitter threads posted soon after the downtime incident rather than on the official L2 status page. Understanding the root cause of L2 downtime is further hindered by the closed source code of most L2 sequencers. It’s unfortunate that L2s collectively holding billions in TVL rely on centralized and closed source sequencers. Therefore, when an L2 development team explains that they have upgraded the sequencer, there is no public “source of truth” where users can see the changes themselves; instead, users must trust that the L2 devs know what they are doing.

Conclusion

Overall, information about L2 downtime incidents is hard to find, often unclear when it comes to technical details, and not stored in consistent locations. One example of this is how one zkSync downtime announcement came from their @zksync twitter account while another came from their @zkSyncDevs twitter account. Similarly, Optimism has created a postmortems folder in their repository, but it is not updated for new downtime incidents like the February 15, 2024 incident. While the marketing teams of these L2 chains may wish to hide anything perceived as a problem with the chain, the lack of organized information sharing is a lost opportunity for technical knowledge sharing, both between different L2 chains and for engineers internal to an L2 chain who are less familiar with the details of past downtime incidents. Given that the recent inscriptions craze took down half a dozen chains, there is some distance to go before L2 chains become as robust as mainnet Ethereum. At the same time, L2 chains offer a playground to move faster and try riskier solutions than mainnet Ethereum would adopt, so it is understandable that this is where issues appear.

An honorable mention also goes out to Stacks, a Bitcoin L2, which had a 9-hour downtime incident on June 13-14, 2024. No technical details were given for that incident, but regardless of where the tech stack is built, L2 systems are hard.

Some suggestions I would offer L2 chains regarding their post-mortem writeups include:

  • In postmortems, clearly specify the timezone of any timestamps related to an incident. It’s annoying that some writeups use US Eastern Time when all blockchain explorers use UTC.
  • Compile incident postmortems in a single location, to avoid twitter threads getting lost to history.
  • Specify the block numbers where an incident starts and ends (the block numbers in this article were obtained manually).

One point to note is that developers in the Ethereum ecosystem have pointed to Solana downtime as an issue with Solana’s design. However, comparing the slow L1 mainnet Ethereum chain (as measured by TPS) to the much faster Solana chain is less apt than comparing L2 chains to Solana. The downtime incidents of L2 chains built on Ethereum make for a fairer comparison to Solana’s downtime incidents.