A Short Note on Celo's Current Uptime Requirement

A Celo validator’s uptime score takes a hit if the validator is down for a minute or longer. This feels overly restrictive to me.

An uptime requirement should be reasonably calibrated to the impact downtime has on the network as a whole, i.e. liveness and/or security impact. The current one-minute threshold isn’t long enough to restart a proxy or validator without incurring a penalty.

Validators are subject to this strict requirement. At the same time, the network only supports a single-proxy configuration. This feels quite inconsistent. We are told that the multi-proxy option is “coming soon”. In the meantime, real-world circumstances put a heavy burden on validators to compensate for the protocol shortcoming. A validator has to think twice about making any change to their infrastructure that would require a restart. The available key rotation solution is an unsatisfactory and expensive workaround.

This is an example of how a super-strict liveness requirement may compromise network security. If validators make the decision to either delay or not make infrastructure improvements to avoid the uptime hit, network security may suffer.

In contrast, I’ve spent the past couple days optimizing my Cosmos multi-tier, multi-sentry (i.e. proxy) node setup. Because of the multi-sentry node setup in Cosmos, I’m able to fine tune that setup without fearing a downtime penalty. (As long as it’s done right, of course…!)

My suggestion would be to relax the 1 minute downtime hit parameter, at least until multi-proxy support is available. Over the long term, the uptime measurement should happen over a longer window. The measurement can then be calibrated to the actual impact on the network.
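To make the difference concrete, here is a minimal sketch comparing a short lookback rule with a measurement over a longer window. The 5-second block time, the 12-block lookback (roughly a minute), the 720-block window and the 95% threshold are all assumptions chosen for illustration. This is not the protocol's actual scoring code.

```python
from typing import List

BLOCK_TIME_S = 5             # assumed block time
LOOKBACK_BLOCKS = 12         # assumed: ~1 minute at 5 s per block
LONG_WINDOW_BLOCKS = 720     # hypothetical: ~1 hour
MIN_SIGNED_FRACTION = 0.95   # hypothetical threshold


def penalized_blocks_short_lookback(signed: List[bool],
                                    lookback: int = LOOKBACK_BLOCKS) -> int:
    """Count blocks at which none of the last `lookback` blocks were signed."""
    penalized = 0
    for i in range(len(signed)):
        window = signed[max(0, i - lookback + 1): i + 1]
        if len(window) == lookback and not any(window):
            penalized += 1
    return penalized


def penalized_long_window(signed: List[bool],
                          window: int = LONG_WINDOW_BLOCKS,
                          min_fraction: float = MIN_SIGNED_FRACTION) -> bool:
    """True if the signed fraction over the most recent `window` blocks is too low."""
    recent = signed[-window:]
    if not recent:
        return False
    return sum(recent) / len(recent) < min_fraction


# One hour of blocks containing a single 90-second restart (18 missed blocks).
hour_with_restart = [True] * 351 + [False] * 18 + [True] * 351

short_rule_hits = penalized_blocks_short_lookback(hour_with_restart)
long_rule_hit = penalized_long_window(hour_with_restart)
print(short_rule_hits, long_rule_hit)
```

Under these assumed numbers, the short lookback rule flags the restart, while the hour-long window still sees 97.5% of blocks signed and does not.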

As a related point, at least to date, validators are generally a responsible set of actors, who value their reputation. This is motivating enough to help ensure we maintain the highest levels of uptime we’re able to maintain, given the resources available to us.

(First published on the Chainflow blog here.)

6 Likes

Second this sentiment and think it’s a worthy discussion to relax the downtime hit parameter, based both on the current limitations of Celo that Chris mentioned and, more generally, on the incentives that the current configuration sets.

For the first part, I think the reasons Chris brought up make a lot of sense. In our case, our validator node did recently suffer from a downtime impacting its score that would have been mitigated if we were able to use a multi-proxy setup.

For the second part, I feel that such a strict requirement might lead to unwanted centralization and potentially less secure setups. In general, if a relatively short downtime results in loss of rewards, setups that heavily favor liveness (e.g. not using a sentry/proxy architecture) could become preferred, which might incentivize validators to use less secure infrastructure. In addition, if we think about the connectivity between validating nodes, it might well be the case that a node in, say, Sub-Saharan Africa has higher latency and misses more blocks than a well-connected validator run from NYC. So an overly high uptime requirement might, in an extreme case, lead to increasing centralization among well-connected, geographically concentrated, or large validators. Would like to hear how people think about this.

4 Likes

Thanks Chris for writing this up, you make a lot of really great points.

At first glance, I think I agree that the strict liveness requirement may currently be counter-productive. As far as I can tell, this stems from the fact that validator maintenance with zero downtime and zero risk of double signing is not convenient to do, since the key rotation workaround is, as you mentioned, a PITA.

That said, I wonder if it’s worth considering all the ways this problem can be addressed before deciding on which solution to take instead of jumping directly to increasing the number of blocks a validator can be down without being penalized.

Here are some options but would love to hear more ideas if folks have them:

  1. Increase the number of blocks that validators can be down for without incurring a penalty. Unfortunately, this parameter is not currently governable, and if we collectively decide that’s worth changing, it will take some time to get that change implemented and deployed. Making this (and other non-governable parameters) governable was actually a cLabs internship project I proposed, but I don’t think it ever got picked up.

  2. Support multiple proxies. This feature is nearly done, and should make upgrading proxies with zero downtime relatively straightforward.

  3. RPCs to start/stop signing at a certain block. This would allow validators to upgrade by bringing up a second validator with the same key material, and protect against double signing by forcing the old validator to stop signing at block X and the new validator to start signing at block X+1. This is just one of a handful of options for solutions that are not hard forks and relatively simple to implement.

  4. Support threshold signatures for validators. In addition to availability and security improvements, this would make zero downtime validator upgrades possible, as each “shard” can be upgraded one at a time without losing the ability to sign. This would take quite some time to build. More details here: https://docs.google.com/document/d/1_Hxg_xB-VF4v_-P8Woi_uqBQr7P40WpxJXj8ZMXovf0/edit?usp=sharing
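As a rough illustration of option 3, the block-height handoff could be modeled as below. This is only a sketch under assumptions: the `Signer` class, `schedule_handoff` and the field names are hypothetical and do not correspond to any existing RPC. The point is just that if the old node stops at block X and the new node starts at X+1, no height has two signers and no height has none.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Signer:
    """Hypothetical model of a validator node holding the shared signing key."""
    name: str
    start_block: Optional[int] = None  # None = signing disabled until scheduled
    stop_block: Optional[int] = None   # None = no scheduled stop

    def may_sign(self, height: int) -> bool:
        if self.start_block is None or height < self.start_block:
            return False
        if self.stop_block is not None and height > self.stop_block:
            return False
        return True


def schedule_handoff(old: Signer, new: Signer, x: int) -> None:
    """Old node signs up to and including block x; new node signs from x + 1."""
    old.stop_block = x
    new.start_block = x + 1


def signers_per_block(nodes: List[Signer],
                      heights: Iterable[int]) -> Dict[int, List[str]]:
    """List which nodes would sign at each height."""
    return {h: [n.name for n in nodes if n.may_sign(h)] for h in heights}


old_node = Signer("old", start_block=0)
new_node = Signer("new")  # brought up with signing disabled
schedule_handoff(old_node, new_node, x=100)

schedule = signers_per_block([old_node, new_node], range(98, 103))
print(schedule)
# Exactly one signer per height means no double signing and no missed blocks.
assert all(len(names) == 1 for names in schedule.values())
```

The design choice worth noting in this sketch is that the new node defaults to not signing at all until a start block is set, so bringing it up early with the same key material is safe.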

2 Likes

I generally agree that the strict liveness penalty is likely hurting the network (by reducing security) more than it’s helping the network (by incentivizing liveness) today. Nearly all past double signing on other staking networks has come from operational errors by validators that were over-optimizing for liveness (e.g. hot spares).

But the network is also only 85 days old and I’d rather keep pushing towards a great long-term solution than jump straight to reducing the liveness criteria, particularly because the cost of a small (1-2m) outage is near zero.

I think we all agree #2 (multiproxy) from Asa will go a long way to addressing this.

Improving container restart speed would also be a nice optimization.

I also agree that Asa’s threshold signatures could be fantastic to optimize both security and availability.

2 Likes