Validator consensus update plan

To all validator-operators:

The previous mainnet stall uncovered a defect in the celo-blockchain implementation of the Istanbul PBFT consensus, which was temporarily mitigated with the 1.5.8 patch and a hotfix lowering the per-block gas limit. In order to increase the gas limit again, a change to the implementation of the Istanbul message types is needed.

The technical TL;DR of the issue is the following: in the proposal message, a RoundChangeCertificate is included if the validator is proposing for round > 0. This certificate holds at least a quorum of RoundChange messages from validators. Those RoundChange messages may in turn hold a PreparedCertificate, if they have one, which contains a previous proposal that may have passed prepare consensus. In the worst case, this leads to the RoundChangeCertificate holding at least (validators × ⅓) + 1 different full blocks inside the proposal message.
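For reference, here is a simplified sketch of the current nesting; the field names and types are illustrative, not the exact celo-blockchain definitions:

```go
// Simplified sketch of the current (pre-fix) Istanbul message nesting.
// Names and types are illustrative, not the exact celo-blockchain ones.
package istanbul

// Block stands in for a full proposal (header, transactions, ...).
type Block struct{}

// PreparedCertificate proves a proposal reached the prepare phase; today it
// carries the full proposed block.
type PreparedCertificate struct {
	Proposal        Block    // the full block
	PrepareMessages [][]byte // signed PREPARE messages from a quorum
}

// RoundChange is sent by a validator that wants to move to a higher round.
type RoundChange struct {
	Round               uint64
	PreparedCertificate *PreparedCertificate // optional
}

// RoundChangeCertificate is attached to a proposal for round > 0 and must hold
// at least a quorum of RoundChange messages, so in the worst case it embeds
// many different full blocks inside a single proposal message.
type RoundChangeCertificate struct {
	RoundChangeMessages []RoundChange
}
```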

The change we plan to implement is to add a signed slim certificate to the RoundChange, referencing a block hash rather than carrying a full block via the PreparedCertificate. These slim messages can then be grouped into the new RoundChangeCertificate (since only the block of the highest-round PreparedCertificate is used).
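Roughly, the slim variant would look like this; again, this is an illustrative sketch and not the final wire format:

```go
// Illustrative sketch of the proposed slim certificate: the RoundChange signs
// and references only the hash of the prepared block, not the block itself.
package istanbul

type Hash [32]byte

// SlimPreparedCertificate replaces the full block with its hash.
type SlimPreparedCertificate struct {
	ProposalHash    Hash     // hash of the prepared block
	PrepareMessages [][]byte // signed PREPARE messages from a quorum
}

type SlimRoundChange struct {
	Round               uint64
	PreparedCertificate *SlimPreparedCertificate // optional, signed
}

// The new RoundChangeCertificate groups the slim messages; only the block of
// the highest-round PreparedCertificate needs to travel with the proposal.
type SlimRoundChangeCertificate struct {
	RoundChangeMessages []SlimRoundChange
}
```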

This change is backwards incompatible, and a careful approach must be taken to ensure the liveness guarantees of the blockchain.

After some thought, we at cLabs currently see two different upgrade paths that lead to the fix being applied while maintaining blockchain uptime:

1. Two-phase rollout

A phase 1 client uses both message types, for both sending and receiving, which increases the bandwidth used.

A phase 2 client uses only the new message types, bringing bandwidth usage back to normal; at this point the fix is completely applied.

It is important to note that phase 2 can only be rolled out when at least a quorum of validators is running phase 1, but the more, the safer. There is no particular rush to switch to phase 2, other than the community's need to have the issue fixed.

Also, it is not necessary to ship separate version patches, since phase 2 could easily be activated by a configuration flag or even an admin RPC call, as shown in the sketch below.
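A hypothetical sketch of that runtime toggle; the API name and wiring are assumptions, not the actual celo-blockchain admin namespace:

```go
// Hypothetical sketch of switching a running validator to phase 2 via an
// admin RPC, without shipping a separate release.
package istanbul

import "sync/atomic"

// phase2 is flipped once enough validators are already running the phase 1 client.
var phase2 atomic.Bool

// ConsensusAPI would be exposed over the node's admin/istanbul RPC namespace.
type ConsensusAPI struct{}

// ActivatePhase2 switches this validator to sending only the new message types.
func (api *ConsensusAPI) ActivatePhase2() bool {
	phase2.Store(true)
	return true
}

// usePhase2 is consulted when building RoundChange messages: phase 1 sends
// both encodings, phase 2 sends only the new slim encoding.
func usePhase2() bool { return phase2.Load() }
```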

2. Consensus fork

We would decide on a block number at which the new message types will be used for Istanbul, marking a clear point in time where the change activates. This gives a clear deadline, prompting operators to upgrade, with the risk of stalling the network if quorum is not reached by then.
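A minimal sketch of such an activation check, assuming a hypothetical forkBlock constant; the real plumbing would go through the chain config:

```go
// Minimal sketch of a consensus-fork activation check. forkBlock is a
// placeholder; the actual number would be decided and configured later.
package istanbul

import "math/big"

// forkBlock is the agreed-upon activation height for the new message types.
var forkBlock = big.NewInt(0)

// useSlimMessages returns true once the chain reaches the fork height, at which
// point validators encode and decode only the new slim RoundChange format.
func useSlimMessages(blockNumber *big.Int) bool {
	return blockNumber.Cmp(forkBlock) >= 0
}
```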

It’s important to note that this is not a chain hard fork, but only a consensus fork, and full nodes won’t even be aware that it is happening.


The reason a mixed approach is not viable is that if we mix old and new RoundChange messages, the not-yet-updated machines won’t be able to parse or understand the new RoundChange messages present in the RoundChangeCertificate.

We would like to hear your input on what you think is the best approach for applying the update.


Option 1 sounds lower risk in terms of liveness because there is no risk of stalling the network, whereas option 2 risks a stall if a quorum is not reached before the deadline. Is that correct?

That is correct.

Also, option 2 gives a clear timeline for when the fix will be completely implemented, while option 1 will depend heavily on when the phase 1 update is fully delivered, and then on waiting for phase 2 to be fully delivered as well.

So I would venture to say that option 1 will take longer than option 2, meaning it will take longer to raise the gas limit.


What are the risks of taking longer to raise the gas limit?

Technically speaking, and limited to L1 blockchain risks, probably none.

While option 1 sounds less risky, we have done option 2 successfully on previous updates. If enough time is given, it would not be a problem imo. (Also, ask every validator to reply to the signal email once upgraded.)

Is it technically possible to check whether quorum has been reached in phase 1?
Are there any other possible side effects/risks in phase 1 besides the increased bandwidth, and could there be a negative side effect if phase 2 (the admin RPC call) isn’t executed by all validators?

I think the gas limit being high makes this more likely to occur, but I have a hunch that with the right kind of “grinder” contract, even with the current block gas limit, I could make enough validator machines stall to the point where they start missing rounds, and hit the same stall.

Think of the gas limit like this:

  • there’s a budget for how many EVM operations you can run in a given block
  • a high gas limit means I can run a LOT of operations that make validators slow down and miss rounds
  • with a lower gas limit, I need to be more specific about which opcodes I pick, choosing pathologically mis-priced EVM opcodes like SSTORE and SLOAD against randomized addresses to really grind the block storage on the validator

So I don’t necessarily think we’re fully safe right now on mainnet, TBH.

@diwu1989 that would be an attack on performance, not on liveness. If it takes longer to execute a block, it might get to round 3 or 4 before it is accepted, which is bad, but not a stall scenario. And it gets fixed by changing the gas cost of, for example, cold storage reads (which has its own specific gas cost).

And a reminder that if you actually have instructions for a vulnerability that can be exploited to stall the network, posting them in the public forum is maybe not the best way to go :slight_smile: Of course, discussing issues or possible problems with future configurations is very welcome! But a more private channel is usually used for these kinds of things, at least until there is a fix or confirmation that it’s a non-issue.

Yeah, you’re right. When Baklava is back healthy and ready for stress testing, I’ll stress test it with various workloads for us privately. Thanks.


Given that we’re in, or coming up on, vacation season for many parts of the world, Tessellated Geometry would rather rip the band-aid off and just do option 2. I believe C-Labs or the Foundation circulated a validator contact form to entities, so it should be easy to coordinate quorum in a non-emergency situation.

I realize we had trouble doing this on Baklava, but I’d be inclined to think that has more to do with the low stakes of whether or not that network is operating (as opposed to mainnet, where there are real financial penalties if the network comes up and your node isn’t live).


I think I prefer the consensus fork (2); in the past, hard forks were implemented with no node staying behind at the desired block. It would probably make the implementation easier and prevent the bugs (and testing) that come from supporting both message types at the same time.

What about making this a new p2p.Protocol (istanbul/68)? It would be almost exactly the same as the current protocol (istanbul/67), except that it would encode the consensus messages in a more efficient manner. Nodes would then run both protocols, and through the built-in protocol negotiation, each peer-to-peer connection would run the latest protocol supported by both sides.

This is somewhat like the “Two-phase rollout” suggestion, except that it avoids increasing the bandwidth and doesn’t require a second phase in order to enact the fix; instead, the network moves towards a complete fix in a transparent manner as validators update.
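For context, advertising both versions would look roughly like this with the go-ethereum p2p.Protocol type that celo-blockchain inherits; the handlers and the message count here are placeholders, not the real ones:

```go
// Rough sketch of advertising two Istanbul protocol versions so that devp2p
// negotiation picks the highest version shared by both peers on each connection.
package istanbul

import "github.com/ethereum/go-ethereum/p2p"

// handleV67 / handleV68 are hypothetical stand-ins for the per-version message
// loops; the real handlers live in the Istanbul backend.
func handleV67(peer *p2p.Peer, rw p2p.MsgReadWriter) error { return nil }
func handleV68(peer *p2p.Peer, rw p2p.MsgReadWriter) error { return nil }

func istanbulProtocols() []p2p.Protocol {
	return []p2p.Protocol{
		{Name: "istanbul", Version: 67, Length: 22, Run: handleV67}, // old encoding
		{Name: "istanbul", Version: 68, Length: 22, Run: handleV68}, // slim encoding
	}
}
```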

This would work for the RoundChange messages, but not for the RoundChangeCertificate.

Imagine the scenario where exactly 50% of the validators have upgraded.

  1. The proposer of the current round does not propose (it may be down, delayed, or whatever).
  2. Everyone sends a RoundChange message.
  3. The new proposer is on the 68 (new) protocol. It will receive 50% of the RoundChange messages in the new format and 50% in the old format.

How does it create the RoundChangeCertificate for the old protocol? If it doesn’t have the old messages, it can’t, since the old RCC needs at least a quorum of old RoundChange messages. This stalls the network, since a new RCC can never be created.
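To put numbers on it, assuming Celo’s Istanbul quorum of ceil(2N/3):

```go
// Worked example of why a 50/50 split stalls the old-format RCC.
package istanbul

// minQuorumSize returns ceil(2N/3), the quorum assumed in the lead-in.
func minQuorumSize(validators int) int {
	return (2*validators + 2) / 3
}

// Example: with 100 validators the quorum is 67 messages, but an upgraded
// proposer only ever receives ~50 old-format RoundChanges, so the old-format
// RCC can never be assembled.
```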

This is the reason both have to be sent: to be able to craft both the old and the new RCC with a quorum. (Granted, a more careful approach could send the double message only to the new nodes, but it’s mostly the same.)

Given the replies, we will go with the consensus fork approach. I’ll post in Discord when we start testing it, and then we’ll discuss dates.

I think option 1 seems like the safer and more decentralized way.