Hi Celorians,
My first post will be simple and short…
Are cLabs / the Celo Foundation going to release a detailed post-mortem of the events of 14 July that halted Mainnet for ~24 hours?
Thanks in advance for your answers.
This is the PR for the actual fix:
The main thing to pay attention to is this comment, which explains that the consensus message’s encoding format changed:
// ## RoundChangeCertificate ##############################################################
// To considerably reduce the bandwidth used by the RoundChangeCertificate type (which often
// contains repeated Proposal from different RoundChange messages), we break it apart during
// RLP encoding and then build it back during decoding. Proposals are sent just once, and
// Messages referencing them will use their Hash instead.
Here’s what I observed around the time of the halt:
There are probably more details that the team can fill in, but looking at the actual code fix that solved the problem, the solution seems to reduce the size of the messages that validators need to broadcast to each other as part of block validation.
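To make the idea in that comment concrete, here is a minimal Go sketch of "send each proposal once, reference it by hash". It is only an illustration under stated assumptions, not the code from the PR: the types `Proposal`, `RoundChange`, `CompactRoundChange` and `CompactCertificate`, and the functions `Compact` and `Expand`, are hypothetical names, and it uses SHA-256 over plain structs instead of the client's real block hash and RLP encoding.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Proposal is a hypothetical stand-in for a proposed block.
type Proposal struct {
	Payload []byte
}

// Hash identifies a proposal by the SHA-256 of its payload. This is an
// assumption of the sketch; the real client uses its own block hash.
func (p Proposal) Hash() [32]byte {
	return sha256.Sum256(p.Payload)
}

// RoundChange is a hypothetical message that embeds a full proposal.
type RoundChange struct {
	Sender   string
	Proposal Proposal
}

// CompactRoundChange refers to its proposal by hash only.
type CompactRoundChange struct {
	Sender       string
	ProposalHash [32]byte
}

// CompactCertificate stores each distinct proposal exactly once.
type CompactCertificate struct {
	Proposals []Proposal
	Messages  []CompactRoundChange
}

// Compact splits the messages into unique proposals plus hash references.
func Compact(msgs []RoundChange) CompactCertificate {
	var cert CompactCertificate
	seen := make(map[[32]byte]bool)
	for _, m := range msgs {
		h := m.Proposal.Hash()
		if !seen[h] {
			seen[h] = true
			cert.Proposals = append(cert.Proposals, m.Proposal)
		}
		cert.Messages = append(cert.Messages, CompactRoundChange{Sender: m.Sender, ProposalHash: h})
	}
	return cert
}

// Expand rebuilds the full messages and fails on a dangling hash reference.
func Expand(cert CompactCertificate) ([]RoundChange, error) {
	byHash := make(map[[32]byte]Proposal, len(cert.Proposals))
	for _, p := range cert.Proposals {
		byHash[p.Hash()] = p
	}
	msgs := make([]RoundChange, 0, len(cert.Messages))
	for _, m := range cert.Messages {
		p, ok := byHash[m.ProposalHash]
		if !ok {
			return nil, fmt.Errorf("message from %s references unknown proposal %x", m.Sender, m.ProposalHash)
		}
		msgs = append(msgs, RoundChange{Sender: m.Sender, Proposal: p})
	}
	return msgs, nil
}

func main() {
	// Many validators sending the same (large) proposal in their messages.
	block := Proposal{Payload: make([]byte, 100000)}
	var msgs []RoundChange
	for i := 0; i < 50; i++ {
		msgs = append(msgs, RoundChange{Sender: fmt.Sprintf("validator-%d", i), Proposal: block})
	}
	cert := Compact(msgs)
	fmt.Printf("messages: %d, proposals carried after compaction: %d\n", len(cert.Messages), len(cert.Proposals))
}
```

In this toy model, the proposal payload is carried once per distinct proposal instead of once per message, so the saving grows with the number of validators that repeat the same proposal. Each message still pays for a 32-byte hash.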
Once Baklava is back up and running, I’ll probably start running some stress tests on the testnet as well, to see if there are more edge cases like this that would strain the validators (with early notice to the validator operators so they don’t get caught by surprise).
@0xGoldo I’ll be posting the post mortem document today.
Tomorrow we are holding a post-mortem meeting that the community can join. Details:
We’ve scheduled a post-mortem for last week’s mainnet stall on Wednesday, July 27th at 8am PDT / 3pm UTC. We’ll attempt to record the meeting for those of you who can’t make it.
Here’s the meeting link: meet.google.com/kim-ezgt-nwe
Check it here: Post Mortem - Celo Stall, July 13 2022
Thanks!