Hey @thezviad!
Thanks for bringing up these concerns. I can add more information on the status of the oracles.
Afaik, reporting oracle code is still not open source.
I agree that ideally the code should be open source. We’re currently waiting for a third-party security firm to audit the oracle code before it gets open sourced. It’s hard to know timelines exactly, but if I had to guess, it will be a couple of weeks before the code is released.
Who/how many people have access to Azure account where HSM keys are stored?
Who/how many people have direct access to oracle machines?
Who has access to deploy new code/image to oracle machines?
There’s a separate Azure subscription that only includes a small group of relevant engineers and is meant for production services (most of cLabs is on GCP, not Azure). The resource groups that house the mainnet oracle infrastructure are locked down such that only the relevant on-call engineers (6 people) have read access, and in order to change the infrastructure one of these on-call engineers has to submit a request for more permissions (we use Privileged Identity Management to grant just-in-time, time-boxed permissions). Pushing a new Docker image to the container registry also requires a PIM request to get sufficient permissions. Oracles are deployed in two AKS clusters split across 2 regions. In the coming days we plan to stand up a Kubernetes cluster in a third region with full nodes ready to go that we can fall back to in case of downtime in one region. Support for AWS (which also offers secp256k1 HSM support) is also being worked on so we don’t depend too heavily on Azure.
What does internal or external auditing of this system look like?
An internal code review was done by a separate group of engineers who did not work on any of the oracle development. There was also an internal review of the security of the infrastructure prior to the mainnet deployment. Deployments are always made to Baklava as a staging environment prior to Mainnet.
From a monitoring point of view, there are dashboards with metrics exposed by the oracle clients themselves as well as relevant on-chain data. There are also a number of alerts for oracle client and on-chain issues (e.g. a client has an error or hasn’t reported, the number of on-chain reports is low, the on-chain rates are dramatically different, etc.).
It’s also clear that the worst case isn’t necessarily oracles ceasing to report values (this results in buckets not updating, which protects the reserve, and stops on-chain exchanges from occurring if there is an expired report), but oracles reporting bad values that could deplete the reserve or result in a depeg. Given that, there are a number of “safeguards” in the code, mostly related to verifying that exchange data is robust enough for use. These are things like only considering an exchange if the bid/ask spread is small enough, refusing to report a price if fewer than N exchanges have provided exchange rates that are robust enough, or refusing to report if the price has changed a lot in a short period. Because this means there can be cases in which oracles not sending reports is actually intended, there’s been some talk of sharing some oracle metric charts (or a status page of some kind) to make it publicly clear what the state of the oracles is. Some ideas have been to share a Stackdriver or Grafana dashboard, but I’d love to hear any thoughts you have on this topic.
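To make the three safeguards above a bit more concrete, here is a rough sketch in Python. It is not the actual oracle code (which isn’t public yet): the `Ticker` shape, the function name, all threshold values, and the choice to aggregate with a median are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence


@dataclass(frozen=True)
class Ticker:
    """Top-of-book quote from one exchange (hypothetical shape)."""
    exchange: str
    bid: float
    ask: float


# Illustrative thresholds only; these are assumptions, not production values.
MAX_SPREAD = 0.005       # max (ask - bid) / mid for an exchange to be considered
MIN_EXCHANGES = 3        # the "N" in "fewer than N exchanges"
MAX_PRICE_CHANGE = 0.10  # max relative move versus the last reported price


def price_to_report(
    tickers: Sequence[Ticker], last_reported: Optional[float]
) -> Optional[float]:
    """Return a price to report, or None when the oracle should stay silent."""
    mids = []
    for t in tickers:
        if t.bid <= 0 or t.ask < t.bid:
            continue  # malformed quote, ignore this exchange
        mid = (t.bid + t.ask) / 2
        # Safeguard 1: only consider exchanges with a tight bid/ask spread.
        if (t.ask - t.bid) / mid <= MAX_SPREAD:
            mids.append(mid)

    # Safeguard 2: refuse to report with too few robust exchanges.
    if len(mids) < MIN_EXCHANGES:
        return None

    # Aggregating with a median is an assumption of this sketch.
    price = median(mids)

    # Safeguard 3: refuse to report if the price moved too much too quickly.
    if last_reported is not None and last_reported > 0:
        if abs(price - last_reported) / last_reported > MAX_PRICE_CHANGE:
            return None
    return price
```

In this sketch a `None` result is the “intentionally not reporting” case, which is the kind of state a public status page or dashboard would need to distinguish from an actual outage.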
Is there an audit trail for all potential actions that might cause changes in Oracle operation?
On the Azure level, changes in permissions and resources are logged. On the K8s level, audit logs are recorded.
Let me know if you have any other questions. Thanks for bringing these points up!