A Celo validator’s uptime score takes a hit if the validator is down for a minute or longer. This feels overly restrictive to me.
An uptime requirement should be reasonably calibrated to the impact downtime has on the network as a whole, i.e. its liveness and/or security impact. One minute isn’t long enough to restart a proxy or validator without incurring a penalty.
Validators are held to this strict requirement while the network supports only a single-proxy configuration, which feels quite inconsistent. We are told that the multi-proxy option is “coming soon”. In the meantime, real-world circumstances put a heavy burden on validators to compensate for the protocol’s shortcoming. A validator has to think twice about making any infrastructure change that would require a restart. The available key rotation solution is an unsatisfactory and expensive workaround.
This is an example of how a super-strict liveness requirement may compromise network security. If validators delay or skip infrastructure improvements to avoid the uptime hit, network security may suffer.
In contrast, I’ve spent the past couple of days optimizing my Cosmos multi-tier, multi-sentry (i.e. proxy) node setup. Because of the multi-sentry node setup in Cosmos, I’m able to fine-tune that setup without fearing a downtime penalty. (As long as it’s done right, of course…!)
My suggestion would be to relax the 1 minute downtime hit parameter, at least until multi-proxy support is available. Over the long term, uptime should be measured over a longer window. The measurement can then be calibrated to the actual impact on the network.
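To make the difference concrete, here is a minimal Python sketch comparing the two approaches. It is not Celo’s actual scoring logic. The function names, the one-day window and the 99% uptime target are all hypothetical values chosen for illustration.

```python
from typing import List

def strict_penalties(outages_s: List[int], threshold_s: int = 60) -> int:
    """Count outages lasting at least threshold_s seconds (the strict rule)."""
    return sum(1 for duration in outages_s if duration >= threshold_s)

def windowed_uptime(outages_s: List[int], window_s: int) -> float:
    """Fraction of the measurement window during which the validator was up."""
    down_s = min(sum(outages_s), window_s)  # downtime cannot exceed the window
    return 1.0 - down_s / window_s

def windowed_penalty(outages_s: List[int], window_s: int, min_uptime: float) -> bool:
    """Penalize only if uptime over the whole window falls below min_uptime."""
    return windowed_uptime(outages_s, window_s) < min_uptime

if __name__ == "__main__":
    DAY_S = 24 * 60 * 60   # hypothetical one-day measurement window
    outages = [90, 90]     # two 90-second restarts, e.g. a proxy and a validator
    print(strict_penalties(outages))               # each restart counts as a hit
    print(windowed_penalty(outages, DAY_S, 0.99))  # hypothetical 99% target
```

Under the strict rule, both 90-second restarts count against the validator. Measured over the hypothetical one-day window, the same three minutes of downtime leaves uptime well above the 99% target, so routine maintenance goes unpenalized while a prolonged outage still would not.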
As a related point, at least to date, validators are generally a responsible set of actors, who value their reputation. This is motivating enough to help ensure we maintain the highest levels of uptime we’re able to maintain, given the resources available to us.
(First published on the Chainflow blog here.)