Post Mortem - Celo Stall, July 13 2022

On July 13, Celo Mainnet suffered a stall that lasted 24 hours. I’m writing this post to share the post mortem document about the incident.

You can find the details of it here.

Aside from that, we’ll be holding a community-wide post mortem meeting tomorrow, July 27th, at 8AM PST / 3PM UTC. The meeting will be recorded.

Meeting Link: meet.google.com/kim-ezgt-nwe

The Hotfix on July 20th

Additionally, I wanted to share that on July 20th a governance hotfix was approved to decrease the block gas limit from 50 million gas to 20 million gas. (For those not familiar with the hotfix mechanism, check our docs.)

After the network stall, we started analyzing other ways the network could generate consensus messages that exceed the 10MB size limit imposed by the p2p network protocol layer, so that we could proactively catch any future issues. Unfortunately, with a block gas limit of 50 million gas, there are still ways that a consensus message larger than 10MB can be generated.
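To give a rough feel for the numbers, here is a back-of-the-envelope sketch of how much calldata can fit in one block at each gas limit. It is only an illustration: the per-byte gas prices are assumed values (borrowed from Ethereum's Istanbul schedule, not taken from the post mortem), the 10MB limit is assumed to mean 10 * 1024 * 1024 bytes, and the function names are hypothetical. It ignores per-transaction base gas and encoding overhead, and it does not model how consensus messages actually wrap a block.

```python
# Rough upper bound on the calldata payload that fits in a single block.
# ASSUMPTIONS (not from the post mortem): calldata is priced per byte, and the
# per-byte prices below match Ethereum's Istanbul schedule.

P2P_MESSAGE_LIMIT_BYTES = 10 * 1024 * 1024  # assumed reading of "10MB"

ASSUMED_GAS_PER_ZERO_BYTE = 4      # assumption
ASSUMED_GAS_PER_NONZERO_BYTE = 16  # assumption


def max_payload_bytes(block_gas_limit: int, gas_per_byte: int) -> int:
    """Largest calldata payload (bytes) one block could carry.

    Ignores per-transaction base gas and encoding overhead, so the
    real figure is somewhat smaller.
    """
    return block_gas_limit // gas_per_byte


def block_exceeds_p2p_limit(block_gas_limit: int, gas_per_byte: int) -> bool:
    """True if a single block's payload alone could exceed the p2p limit."""
    return max_payload_bytes(block_gas_limit, gas_per_byte) > P2P_MESSAGE_LIMIT_BYTES


if __name__ == "__main__":
    for gas_limit in (50_000_000, 20_000_000):
        for label, price in (("zero bytes", ASSUMED_GAS_PER_ZERO_BYTE),
                             ("non-zero bytes", ASSUMED_GAS_PER_NONZERO_BYTE)):
            size = max_payload_bytes(gas_limit, price)
            over = block_exceeds_p2p_limit(gas_limit, price)
            print(f"{gas_limit:>11,} gas, {label:<14}: {size:>11,} bytes, over limit: {over}")
```

Under these assumptions, only the cheapest payload (all zero bytes) at 50 million gas gets a single block past the limit on its own. A real consensus message carries more than the raw block, so treat this as a lower-bound intuition rather than an analysis of the actual failure.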

To fix this, we’re working on refactoring consensus messages to avoid this scenario. While this change is in progress, it was important to take immediate action to prevent another outage, so we submitted a hotfix proposal to revert the block gas limit to its value before CGP-53 (20 million gas).

Using celocli, you can check the hotfix:

celocli governance:show --hotfix 0x781f90afc086489bda4e9ad15b881ca7f505bb4bdc6f46a4e0dfb016ad702467

We didn’t want to publish technical information before the approval of the Hotfix. After it, there was a technical debrief last Thursday, and now we are able to share the Post Mortem and have a proper community call with everyone.

Thank you for your patience!

Oops, I was the advocate for the 50M gas block, thinking it wouldn’t cause consensus side effects… Sorry.

In hindsight, I think we could have really stress tested things more with “realistic” mainnet workloads on Baklava before green-lighting the 50M gas block upgrade.

I guess before we do another upgrade, and for any bug fixes to consensus, we should try to reproduce workloads on Baklava and vet them against realistic TX patterns. I will help and join the Baklava stress testing.

May I suggest an additional action?
5. Incentivize realistic workloads on Baklava so that decentralized stress testing is carried out at scale.

Quite related to one of the root causes:

Stress testing for CGP-56 didn’t include scenarios with prepareCertificate or big blocks (in bytes); thus it was unable to catch this scenario

There’s only so much stress testing that a single human can think of and carry out, but everyone in the network can be a participant in “chaos” testing of the Baklava testnet before we upgrade mainnet.

@mariano would it be possible to share a link to the meeting recording above? Thanks in advance!

Yes, I like the idea. We first need to make Baklava a healthy network; too many validators are down. But if we do that, I’d like to have this kind of testing.