Microsoft’s latest cloud authentication outage: What went wrong
Microsoft has published a preliminary root cause analysis of its March 15 Azure Active Directory outage, which took down Office, Teams, Dynamics 365, Xbox Live and other Microsoft and third-party apps that depend on Azure AD for authentication. The roughly 14-hour outage affected a “subset” of Microsoft customers worldwide, officials said.
Microsoft’s preliminary analysis of the incident, published March 16 to its Azure Status History page, indicated that “an error occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols for cryptographic signing operations.”
Officials said that as part of normal security practices, an automated system removes keys that are no longer in use. Over the past few weeks, however, a key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug that caused the retained key to be removed. Microsoft publishes metadata about the signing keys to a global location, its analysis notes. Once that metadata changed around 3 p.m. ET (the start of the outage), applications using these protocols with Azure AD started picking up the new metadata and stopped trusting tokens/assertions that were signed with the removed key.
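To see why removing a key from the published metadata breaks authentication, consider a minimal sketch of how a relying application might check a token against that metadata. This is an illustration under stated assumptions, not Microsoft's implementation: the names `sign`, `is_trusted`, `key-old` and `key-new` are hypothetical, and HMAC stands in for the asymmetric signatures that real OpenID deployments use, only so the example runs on the Python standard library.

```python
import hashlib
import hmac


def sign(payload: bytes, key_id: str, secret: bytes) -> dict:
    """Issue a toy token: the payload, the ID of the signing key, and a signature."""
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return {"kid": key_id, "payload": payload, "sig": sig}


def is_trusted(token: dict, published_keys: dict) -> bool:
    """Accept a token only if its signing key is still listed in the metadata."""
    secret = published_keys.get(token["kid"])
    if secret is None:
        # The key is gone from the published metadata, so the token is rejected
        # even though it was validly signed when it was issued.
        return False
    expected = hmac.new(secret, token["payload"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"])


# Hypothetical metadata: two keys are published, and a token is signed with one.
published = {"key-old": b"secret-old", "key-new": b"secret-new"}
token = sign(b"user=alice", "key-old", published["key-old"])
trusted_before = is_trusted(token, published)

# Simulate the retained key being removed from the published metadata.
after_removal = {k: v for k, v in published.items() if k != "key-old"}
trusted_after = is_trusted(token, after_removal)
```

In this sketch the token itself never changes. Only the published key list does, which is enough to flip the application's decision from trusted to untrusted.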
Microsoft engineers rolled back the system to its prior state around 5 p.m. ET, but applications take time to pick up the rolled-back metadata and refresh with the correct version. A subset of storage resources required an update to invalidate the incorrect entries and force a refresh.
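The lag between the rollback and full recovery is consistent with applications holding a local copy of the key metadata and refreshing it only periodically. The sketch below shows that general pattern, assuming a simple time-to-live cache. The class `KeyMetadataCache`, the one-hour TTL and the key names are all hypothetical and are not taken from Microsoft's analysis.

```python
import time


class KeyMetadataCache:
    """Hold fetched key metadata and refetch it only after a TTL expires."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that returns the current metadata
        self._ttl = ttl_seconds      # how long a fetched copy is considered fresh
        self._clock = clock          # injectable clock, so the demo needs no sleep
        self._keys = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._keys is None or now - self._fetched_at >= self._ttl:
            self._keys = self._fetch()
            self._fetched_at = now
        return self._keys

    def invalidate(self):
        """Drop the cached copy so the next get() fetches fresh metadata."""
        self._keys = None


# Demo with a fake clock and a hypothetical metadata source.
now = [0.0]
source = {"keys": ["key-new"]}  # bad metadata: the retained key is missing
cache = KeyMetadataCache(lambda: list(source["keys"]),
                         ttl_seconds=3600, clock=lambda: now[0])

first = cache.get()                      # picks up the bad metadata
source["keys"] = ["key-old", "key-new"]  # the publisher rolls back
now[0] = 600.0                           # ten minutes later, TTL not yet expired
stale = cache.get()                      # still the bad copy

cache.invalidate()                       # explicit invalidation forces a refresh
fresh = cache.get()
```

Under these assumptions, a rollback at the source fixes nothing for a client until its cached copy expires or is explicitly invalidated, which mirrors the forced refresh described above.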
Microsoft’s post explains that Azure AD is undergoing a multi-phase effort to apply additional protections to the back-end Safe Deployment Process to prevent these kinds of problems. The remove-key component is in the second phase of the process, which isn’t scheduled to be finished until mid-year. Microsoft officials said the Azure AD authentication outage that happened at the end of September belongs to the same class of risks, which they believe they will avoid once the multi-phase project is complete.
“We understand how incredibly impactful and unacceptable this is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future,” the blog post said.
A full root-cause analysis will be published once the investigation is complete, officials said.