On Friday, Jan 28th at 10:11pm PDT, the Baklava testnet stopped producing blocks. A number of validators were alerted, including cLabs. After investigating, the cLabs blockchain team found a bug that allowed validators to reach consensus on an invalid block: because of a missing validation check, validators would accept a block proposed by another validator even when it lacked a metadata field used for uptime score calculation.
Because this bug could have been used by a malicious validator to intentionally stop Mainnet, the patch was developed and distributed according to our security process. Now that more than ⅓ of the validators on Mainnet have applied the patch, there is no ongoing security risk to the network, and we can share a full disclosure of the problem.
After the cause of the network stall was found, an emergency patch was developed, tested, and distributed to the Mainnet validators as v1.5.2 at 3pm PDT on Jan 30th; the validators then upgraded their nodes. Following our security process, the patch was developed on a private repository and distributed only as a binary release. By early on Feb 2nd, a quorum of active validators had deployed a patched release.
Now that the security concern has been addressed, the source code for the v1.5.2 patch is publicly available on the celo-blockchain repository. Thank you to all the validators who helped quickly address this issue.
We are still working on remediating the Baklava network stall. With the help of all validators on Baklava, we should have the network running again soon.
- The validator acting as proposer for block 9949702 proposed a block missing the ParentAggregatedSeal field. This field is used for monitoring uptime of validators, and represents the “truth” about which validators actually signed the previous block. It’s a required field, and thus the proposed block was invalid.
- Instead of rejecting the block, validators accepted it, and consensus was achieved on the invalid block.
- Every other node (full nodes, proxies) rejected the block, leading to a split in the view of chain state between validators and the rest of the network.
- Upon receiving a BAD BLOCK from a peer, a node automatically disconnects from it. This means that proxies disconnected from their validators. Also, if a validator first receives the block from another validator (instead of achieving ⅔ commits on it), it will also reject the block and disconnect from the other validator.
- Since network connections between validators were broken, the Baklava network stalled.
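The disconnect behaviour in the timeline above can be illustrated with a toy model. This is a minimal sketch and not the celo-blockchain networking code: the `Node` class, `receive_block`, `broadcast`, and the boolean `block_is_valid` flag are hypothetical names introduced only for illustration.

```python
class Node:
    """Toy peer that drops any peer sending it an invalid block (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.peers = set()

    def connect(self, other):
        # Connections are symmetric in this sketch.
        self.peers.add(other)
        other.peers.add(self)

    def receive_block(self, sender, block_is_valid):
        # Nodes outside the consensus path run full validation, so an
        # invalid block is treated as a bad block and the sender is dropped.
        if not block_is_valid:
            self.peers.discard(sender)
            sender.peers.discard(self)
            return False
        return True


def broadcast(sender, block_is_valid):
    """Send a block to every current peer; return names of peers that accepted it."""
    accepted = []
    for peer in list(sender.peers):  # copy: the peer set shrinks on disconnect
        if peer.receive_block(sender, block_is_valid):
            accepted.append(peer.name)
    return sorted(accepted)
```

In this model, a validator that finalized the invalid block and then broadcasts it loses every peer that validates the block fully, which is the partition described above.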
Block headers in the Celo network contain 3 seals:
- ECDSA Seal (signature from the proposer)
- AggregatedSeal (BLS aggregated signature from ⅔ of the validators)
- ParentAggregatedSeal (BLS aggregated signature from all validators that signed the previous block)
Each validator considers a block valid (and broadcasts it) the moment it receives ⅔ commits from other validators. Because of that, commit messages received after that point are not added to the AggregatedSeal. Additionally, validators might not receive the same commit messages, so each validator’s AggregatedSeal might be different.
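To see why two honest validators can end up with different AggregatedSeals, consider the following minimal sketch. It assumes the quorum is the smallest integer that is at least ⅔ of the validator set, and it models only the set of signers, not the BLS signature itself. The function names are hypothetical.

```python
def quorum_size(n_validators):
    """Smallest integer >= 2/3 of the validator set (assumed quorum rule)."""
    return (2 * n_validators + 2) // 3


def seal_signers(commit_arrival_order, n_validators):
    """Signers captured at the moment quorum is reached; later commits are ignored."""
    needed = quorum_size(n_validators)
    signers = set()
    for validator in commit_arrival_order:
        signers.add(validator)
        if len(signers) >= needed:
            break
    return frozenset(signers)
```

With four validators, a node that sees commits in the order 0, 1, 2, 3 and a node that sees them in the order 3, 2, 1, 0 each stop at three signers, but the two signer sets differ.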
In order to consistently compute which validators signed a block, and thus compute their uptime, the ParentAggregatedSeal collects the signatures for the previous block, including those that were received after that block was sent.
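As a rough illustration of how a single agreed-upon record of signers can feed an uptime tally, the sketch below assumes the seal exposes its signers as a bitmap in which bit i is set when validator i signed. The helper names are hypothetical, and the scoring rule shown (fraction of blocks signed) is a simplification rather than the actual Celo uptime formula.

```python
def signers_from_bitmap(bitmap, n_validators):
    """Indices of validators whose bit is set in the seal bitmap."""
    return [i for i in range(n_validators) if (bitmap >> i) & 1]


def signed_fraction(parent_bitmaps, n_validators):
    """Fraction of blocks each validator signed, given one parent-seal bitmap per block."""
    counts = [0] * n_validators
    for bitmap in parent_bitmaps:
        for i in signers_from_bitmap(bitmap, n_validators):
            counts[i] += 1
    total = len(parent_bitmaps)
    if total == 0:
        return [0.0] * n_validators
    return [c / total for c in counts]
```

Because every node reads the same ParentAggregatedSeal from the header, every node derives the same tally, which is not the case for the per-validator AggregatedSeal.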
A missing ParentAggregatedSeal therefore only affects uptime scores. So, the consensus bug doesn’t offer any way for an attacker to steal funds, or otherwise modify EVM state outside of the uptime score. Instead, it affects liveness, as it will stall the network.
We found a bug in the block verification that happens during consensus: all header fields were checked except the ParentAggregatedSeal.
Details can be found in PR-1830.
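The shape of the problem can be sketched as follows. This is not the code from the patch: `ProposedHeader`, `other_fields_ok`, and both verification functions are hypothetical stand-ins, with `other_fields_ok` representing every header check that already existed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedHeader:
    """Hypothetical, heavily simplified header used only for illustration."""
    other_fields_ok: bool                    # stands in for all pre-existing checks
    parent_aggregated_seal: Optional[bytes]  # None models the missing field


def verify_during_consensus_buggy(header):
    # Models the described bug: the parent seal is never inspected.
    return header.other_fields_ok


def verify_during_consensus_fixed(header):
    # Adds the missing check: the required parent seal must be present.
    return header.other_fields_ok and bool(header.parent_aggregated_seal)
```

A header with `parent_aggregated_seal=None` passes the first function and fails the second, which matches the difference between what validators accepted during consensus and what the rest of the network rejected.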
The invalid block was sent due to a subtle bug during block creation.
If a validator performs a hot swap at the same moment that it is its turn to propose a block, a race condition results in the proposer not adding the ParentAggregatedSeal.
Although the network stall was caused by a rare race condition, triggered incidentally during normal operation, it was possible for a malicious validator to trigger it intentionally. This is why this issue was addressed as a security concern.