On Thursday, October 24, cLabs detected an issue with the Alfajores Testnet full nodes that it operates: the nodes started crashing and were unable to sync with the network.
The issue was encountered in op-geth instances whose state was obtained by migrating the Celo L1 data directory during the L1 to L2 upgrade (i.e. nodes using --state.scheme=hash and --syncmode=consensus-layer). These nodes struggled to sync and started crash-looping. Nodes whose state was obtained via snap sync (i.e. --syncmode=snap) were not affected. Historical bad blocks have been identified as the cause of the incident, which led the Alfajores testnet to experience brief stalls.
After some investigation, it appears that full nodes are encountering an error due to the combination of two unrelated issues:
- During the L1 to L2 upgrade, the data migration script that converted the Celo L1 data directory into the Celo L2 data directory did not handle “bad blocks” (blocks that don’t follow block validation logic) correctly. Rather than being purged as expected, they were left unmigrated. This is a problem because if op-geth tries to read this list at any point, it cannot decode the blocks properly and fails.
- Recently, Alfajores full nodes received a bad block (the cause is still being investigated). This caused them to access the bad block list, which was corrupted because of the first issue, and the nodes crashed.
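
The interaction between the two issues can be illustrated with a toy model. This is a minimal sketch, not op-geth's actual storage or decoding code: the `BadBlockStore` class, the JSON "encodings", and all function names are hypothetical and exist only to show how one undecodable legacy entry can stay harmless until the list is read.

```python
import json


def encode_l1(number: int) -> bytes:
    # Hypothetical stand-in for the pre-migration (L1) encoding.
    return json.dumps({"format": "l1", "num": number}).encode()


def encode_l2(number: int) -> bytes:
    # Hypothetical stand-in for the post-migration (L2) encoding.
    return json.dumps({"format": "l2", "number": number}).encode()


def decode_l2(raw: bytes) -> int:
    # The L2 decoder only understands the L2 layout.
    obj = json.loads(raw)
    if obj.get("format") != "l2":
        raise ValueError("cannot decode bad block entry")
    return obj["number"]


class BadBlockStore:
    """Toy model of a node's bad block list (assumption, not the real schema)."""

    def __init__(self) -> None:
        self._entries: list[bytes] = []

    def put_raw(self, raw: bytes) -> None:
        self._entries.append(raw)

    def read_all(self) -> list[int]:
        # Reading decodes every entry, so one legacy entry fails the whole read.
        return [decode_l2(raw) for raw in self._entries]

    def purge(self) -> None:
        self._entries.clear()


def on_bad_block(store: BadBlockStore, number: int) -> None:
    # Receiving a bad block triggers a read of the existing list first.
    store.read_all()
    store.put_raw(encode_l2(number))


# A migrated node still holds an entry in the old encoding.
store = BadBlockStore()
store.put_raw(encode_l1(100))

try:
    on_bad_block(store, 200)
    crashed = False
except ValueError:
    crashed = True  # in this model, the analogue of the node crashing

# With the stale entries removed, the same event is handled normally.
store.purge()
on_bad_block(store, 200)
recovered = store.read_all()
```

In this model the stale entry causes no trouble until a new bad block forces a read of the list, which matches the delay between the migration and the crashes described above.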
Affected nodes were recovered by purging their bad block database, by snap syncing from one of the unaffected nodes, or by restoring from a prior backup. As a result, Alfajores is operational. Partners running Alfajores nodes may be experiencing the same issue, which could cause their nodes to crash-loop. For instructions on how to address the issue, please visit the live incident notes.
The cLabs team is actively investigating this issue. For up-to-date information, please visit the live incident notes, which will be updated with findings, instructions, and fixes on an ongoing basis.
cLabs Team