Alfajores L2 Node Incident

mariano · October 24, 2024, 9:15pm

On Thursday, October 24, cLabs detected an issue with the Alfajores Testnet full nodes it operates: they started crashing and were unable to sync with the network.

The specific issue was encountered in op-geth instances whose state was obtained by migrating the Celo L1 data directory during the L1 to L2 upgrade (i.e. nodes using --state.scheme=hash and --syncmode=consensus-layer). These nodes struggled to sync and started crash-looping. Nodes whose state was obtained via snap sync (i.e. --syncmode=snap) were not affected. Historical bad blocks have been identified as the cause of the incident, which led the Alfajores testnet to experience brief stalls.

After some investigation, it appears that full nodes are encountering an error due to the combination of two unrelated issues:

1. During the L1 to L2 upgrade, the data migration script that upgraded the Celo L1 data directory to the Celo L2 data directory did not handle “bad blocks” (blocks that don’t follow block validation logic) correctly. Instead of purging them as expected, the script left them unmigrated. This is an issue because, if op-geth tries to read this list at any point, it cannot decode the blocks properly and fails.
2. Recently, Alfajores full nodes received a bad block (the cause of this is still being investigated). This caused them to access the bad block list, which was corrupted due to the first issue, and thus made them crash.
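
To illustrate how these two issues combine, here is a minimal toy model in Python. It is not op-geth code: the record formats, the `Node` class and `report_bad_block` are hypothetical names made up for this sketch, and it assumes that receiving a bad block triggers a read of the whole stored list. The point is only that a stale entry stays harmless until something reads the list.

```python
import struct
from typing import List

L2_MAGIC = b"L2"  # hypothetical marker for the new (toy) record format

# Hypothetical legacy record that the migration left behind instead of purging.
LEGACY_RECORD = b"L1" + struct.pack(">I", 7)


def encode_l2(number: int) -> bytes:
    """Encode a block number in the toy L2 format: 2-byte marker + 8-byte integer."""
    return L2_MAGIC + struct.pack(">Q", number)


def decode_l2(raw: bytes) -> int:
    """Decode a toy L2 record and reject anything in another format."""
    if len(raw) != 10 or not raw.startswith(L2_MAGIC):
        raise ValueError("cannot decode block record")
    return struct.unpack(">Q", raw[2:])[0]


class Node:
    def __init__(self, bad_blocks: List[bytes]):
        # Persisted bad block list; may still hold records in the old format.
        self.bad_blocks = list(bad_blocks)

    def report_bad_block(self, raw: bytes) -> List[int]:
        """Store a newly received bad block, then read back the whole list."""
        self.bad_blocks.append(raw)
        return [decode_l2(record) for record in self.bad_blocks]


# A node whose bad block list was purged handles a new bad block without trouble.
purged_node_result = Node([]).report_bad_block(encode_l2(42))

# A node that still holds the legacy record fails as soon as the list is read.
try:
    Node([LEGACY_RECORD]).report_bad_block(encode_l2(42))
    migrated_node_failed = False
except ValueError:
    migrated_node_failed = True
```

In this sketch the node with the purged list returns `[42]`, while the node that kept the legacy record raises on the first read, which mirrors why purging the bad block database was one of the recovery options.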

Affected nodes were recovered by purging their bad block database, by snap syncing from one of the unaffected nodes, or by restoring from a prior backup. As a result, Alfajores is operational. Partners running Alfajores nodes may be experiencing the same issue, which can cause their nodes to crash-loop. For instructions on how to address the issue, please visit the live incident notes.

The cLabs team is actively investigating this issue. For up-to-date information, please visit the live incident notes, which will be updated with findings, instructions and fixes on an ongoing basis.

cLabs Team

Alex_APF · October 27, 2024, 2:37am

Thanks for the update.

gastonponti · November 6, 2024, 3:58pm

Incident Update

The issue impacting Alfajores Testnet full nodes has been resolved. The root cause was identified in the Sequencer, which wasn’t fully following the block-building rules outlined in CIP-43. Instead of calculating the rates for each whitelisted ERC-20 (feeCurrency) once at the start of a block, it was recalculating these rates after every transaction execution.

This led to a validation problem whenever a transaction within the block changed the rate of a whitelisted ERC-20 and a later transaction in the same block used that ERC-20 as feeCurrency. When this happened, the gas price charged during validation differed from the one the Sequencer had charged, producing a different output that couldn’t validate the original block (thus, the “bad block”).
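
The divergence can be reproduced with a small toy model in Python. This is a sketch under stated assumptions, not the Sequencer or op-geth implementation: `Tx`, `charge_fees`, the `TOKEN` currency and all numbers are hypothetical, and the fee is simplified to gas used times a base fee times the currency's rate.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Tx:
    gas_used: int
    fee_currency: Optional[str] = None                # None: fee paid in the native token
    sets_rate: Optional[Tuple[str, Fraction]] = None  # side effect: updates a currency's rate


def charge_fees(txs: List[Tx], start_rates: Dict[str, Fraction],
                base_fee: int, refresh_per_tx: bool) -> List[Fraction]:
    """Return the fee charged for each transaction, in its own fee currency."""
    live = dict(start_rates)      # rates as stored in state, mutated by transactions
    snapshot = dict(start_rates)  # rates read once at the start of the block
    fees = []
    for tx in txs:
        # refresh_per_tx=True models the faulty behaviour: rates are re-read
        # after every executed transaction instead of once per block.
        rates = live if refresh_per_tx else snapshot
        rate = rates[tx.fee_currency] if tx.fee_currency else Fraction(1)
        fees.append(tx.gas_used * base_fee * rate)
        if tx.sets_rate is not None:
            currency, new_rate = tx.sets_rate
            live[currency] = new_rate
    return fees


# Hypothetical block: the first transaction doubles TOKEN's rate,
# the second one pays its fee in TOKEN.
block = [
    Tx(gas_used=21_000, sets_rate=("TOKEN", Fraction(2))),
    Tx(gas_used=21_000, fee_currency="TOKEN"),
]
start = {"TOKEN": Fraction(1)}

sequencer_fees = charge_fees(block, start, base_fee=10, refresh_per_tx=True)
validator_fees = charge_fees(block, start, base_fee=10, refresh_per_tx=False)
```

In this toy block both sides charge 210000 for the first transaction, but the per-transaction refresh charges 420000 for the second one while the block-start snapshot charges 210000. A validator following the once-per-block rule therefore cannot reproduce the builder's result.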

We fixed the Sequencer to calculate rates only once at the start of each block, as specified in CIP-43. This fix was deployed to Alfajores on October 31, 2024. The validation rules of the full nodes were correct, so there’s no need to update their version.

Thank you for your patience and support during our investigation.

–cLabs Team
