Microsoft Azure had a major downtime event on April 1, 2021, which affected their East US and Central regions. Companies running their production systems on one of these services had no service availability for nearly two hours.
This extended outage directly affected a Parametrix customer with mission-critical systems on Azure’s East US regions. This technology company, which suffered financial, reputational, and productivity damages, received the monetary payout triggered by the outage in just six (6) days!
Why did the Azure downtime happen?
On April 1st starting around 9 PM GMT and ending nearly two hours later, the Parametrix Monitoring System (PMS) indicated that Microsoft Azure was experiencing downtime due to DNS server errors that affected the Azure VM (Virtual Machines) and the Azure SQL services.
Azure’s DNS servers provide domain name resolution using the Microsoft Azure infrastructure. This service is provided natively as part of the cloud as a network Protocol with multiple Azure services relying on these DNS servers.
Both the VM computational service and SQL database service were impacted by these errors. Azure customers running their production systems on one of these services in either the East US or Central US regions had no service available for the duration of the downtime incident.
What did the Parametrix monitoring system detect?
The Parametrix Monitoring System identified the outage as soon as it began and recorded the Azure VM and Azure SQL services that were affected in the East US and Central US regions. The deep granularity of the system allows it to capture exactly which services were down during the entire event, and to specify exactly what didn’t work on each service.
The system detected Service Unavailable Status on Azure VM instances in the East US region which had 100% error rates, and in the Central US region which had 50% error rates. The Azure SQL service outage only impacted management operations and not instance operations or running operations.
The graph below shows the uptime rate of the Azure VM service instances in all regions (connectivity from 0-1 which equates to 0-100%). The blue line represents the East US region which had 0% availability during the downtime event, while the purple line represents the Central US region which had a little over 50% availability during the same period.
As you can see from the graph below, the monitoring system also detected management error rates during the incident and identified a peak error rate of over 80% in management operations (create, delete, etc.) of the Azure SQL service in the East US region (purple). Management of other services in this region including VM Management (blue) and Storage Management (yellow) was not affected during the outage period.
What did the Parametrix customer experience?
The customer validated the serious effects of the downtime incident on their operations and finances. Their mission-critical systems, which reside on Azure’s East US regions, went down for nearly two hours. They started receiving customer support tickets almost immediately, requiring them to mobilize many of their support and technical team members in order to address these issues.
The downtime incident was damaging not only to their employee productivity but also to their reputation with customers whose operations were directly impacted. Just six (6!) days later, Parametrix covered their financial damage through the monetary payout which was triggered by the outage.
If you want to make sure you are covered for future cloud outages, please contact us or request a quote today!