NEAR Sharding Rollout and Aurora IDO Postmortems
December 3, 2021 — NEAR Team
Between November 8th and November 19th, multiple incidents related to Near Inc RPC services caused severe degradation of user experience. The Near Inc team takes this kind of issue seriously and aims to continually improve its services.
The following postmortems on the NEAR sharding rollout and Aurora IDO summarize the incidents. They provide root cause analyses, mitigations, and the future improvements we are working on to prevent similar incidents from happening again.
Incident 1 – November 10th (12:00 UTC – 15:00 UTC) NEAR Inc RPC Service degraded during Boca Chica Aurora token sale
During the Boca Chica Aurora token sale, community members reported an inability to access the service to buy the Aurora lottery ticket. The cause: too much load on the RPC endpoint. Due to higher than expected demand during the sale, the RPC became overloaded in the EU region.
We’ve increased our RPC capacity by rolling out new RPC nodes across all regions. In particular, we’ve bumped up the number of nodes in the EU region. In the meantime, we’ve made and are working on several other improvements to increase the RPC throughput and reliability.
Incident 2 – November 16th (19:00 UTC – 19:30 UTC) Near Inc Services down due to global cloud provider outage
All Near Inc services stopped working to some extent for a short period of time including RPC, Indexers, Explorer, and Wallet. Issues were reported from across the community and internally all alerts started to fire.
We rapidly identified the root cause: an outage affecting the load balancing stack hit our cloud provider, which impacted RPC and therefore made all other dependent services unavailable as well. Unfortunately, there was no quick mitigation we could do on our side. We had to wait for the cloud provider to resolve the issue to get our services operational again. The cloud provider was able to quickly detect and restore service in under 20 minutes.
Currently, we use one cloud provider for the services we host. We do this for simplicity of maintenance. Due to the decentralized nature of NEAR, anyone is able to run their own RPC services, providing alternative ways to access the network. Making it easy for users to switch RPC nodes is something that both wallets and developer tooling will be looking into.
To prevent issues like this from happening to Near Inc services, we are also looking into implementing a multi-cloud deployment with client-side fallbacks in case the primary service goes down.
Incident 3 – November 17th(03:00 UTC – 19:00 UTC)Near Inc services degraded including Skyward during Aurora IDO, Wallet, and Explorer
During the sharding rollout on mainnet we encountered two issues caused by state splitting as part of the protocol upgrade: 1) high disk usage affecting the RPC service, and 2) the inability of archival nodes to split state within one epoch.
Let’s discuss them separately.
The first issue was caused by high disk usage (IOPS) during the process of splitting state. Even though the RPC traffic was very low across the network, performance of our RPC nodes drastically deteriorated and we observed that RPC latency in some regions jumped from 1s to 60s.
This was not a capacity problem: each RPC request takes longer to respond and adding more nodes wouldn’t help much. Most services were almost unavailable, but due to client side retries they could be used with high response times. Most affected users were in Europe and Asia.
The second issue was caused by archival nodes being unable to finish state splitting within one epoch as expected and, therefore, they got stuck when the new epoch arrived. This issue arose unexpectedly, as we ran simulations beforehand for regular RPC for the sharding upgrade using mainnet data. The edge case we missed was running the simulation on archival nodes, which took significantly longer. The issue had not been previously identified on testnet, as archival data is smaller there.
The failure of archival nodes affected all services depending on them: Indexer, Explorer, Wallet, Aurora, etc. The Infrastructure team was able to rapidly redirect traffic to non-archival nodes as a patch until archival nodes were restored. We waited until archival nodes finished splitting their states, and once the first node synced we generated a backup and started the rest of the archival nodes from there. This was a failure of not testing for all cases and we plan to invest more time and effort in making sure releases go as smoothly as previous ones.
Some Final Thoughts
It’s important to note that, despite all the aforementioned incidents, the network was always functioning as expected and only RPC services were affected, causing dependent services to have issues.
We want to be transparent about all past and future issues we face. We believe the community members will understand that “NEAR stands for iteration” and we are doing our best to prevent such incidents from happening in the future.
Join the community: