Google Explains YouTube, Gmail, Cloud Service Outage
Google has blamed a bug in its global authentication system for last week’s outage that affected Gmail, Calendar, YouTube, Meet and multiple other Google services.
The 47-minute outage last Monday, which severely affected operations at workplaces and schools globally, was caused by a bug in an automated quota management system that powers the Google User ID Service.
In a root cause incident report, Google explained that the Google User ID Service maintains a unique identifier for every account and handles authentication credentials for OAuth tokens and cookies. This account data is stored in a distributed database, which uses Paxos protocols to coordinate updates.
For security reasons, this service is programmed to reject requests when it detects outdated data.
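To illustrate the idea of refusing to serve outdated account data, here is a minimal sketch. It is not Google's implementation: the `AccountRecord` type, its `version` field, and the `authenticate` function are hypothetical names, and it assumes freshness can be judged by comparing a record's version against the latest version the service knows about.

```python
from dataclasses import dataclass


class StaleDataError(Exception):
    """Raised when account data is too old to be trusted (hypothetical)."""


@dataclass
class AccountRecord:
    user_id: str
    version: int  # assumed: increases with every committed update


def authenticate(record: AccountRecord, latest_known_version: int) -> str:
    """Return the user ID only if the record is up to date.

    Failing closed means an outdated replica produces an error
    instead of a possibly wrong authentication decision.
    """
    if record.version < latest_known_version:
        raise StaleDataError(
            f"record at version {record.version}, "
            f"expected at least {latest_known_version}"
        )
    return record.user_id
```

The trade-off in a design like this is that anything preventing the data store from staying current, such as a lack of quota for writes, turns into rejected requests rather than silently stale answers.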
Google said a bug in one of the automated tools it uses to manage resource quotas for its services caused errors in authentication results, leading to the outage.
“As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0. An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident,” the company explained.
“Existing safety checks exist to prevent many unintended quota changes, but at the time they did not cover the scenario of zero reported load for a single service,” Google added.
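The failure mode Google describes can be sketched as follows. This is an illustration under stated assumptions, not Google's quota system: the `next_quota` function, the `headroom` multiplier, and the `min_fraction` floor are all hypothetical. It shows an automated adjustment that waits for a grace period to expire, and a safety check that treats zero reported usage as suspect rather than as a reason to cut quota.

```python
import math


def next_quota(
    current_quota: int,
    reported_usage: int,
    grace_period_expired: bool,
    headroom: float = 1.5,      # assumed: quota is sized above observed usage
    min_fraction: float = 0.5,  # assumed: never cut by more than half per step
) -> int:
    """Return the quota to enforce for one service (hypothetical sketch)."""
    # During the grace period, no quota change is enforced.
    if not grace_period_expired:
        return current_quota

    # Safety check: zero reported load for a live service more likely
    # signals broken usage reporting than real idleness, so keep the quota.
    if reported_usage <= 0:
        return current_quota

    proposed = math.ceil(reported_usage * headroom)
    floor = int(current_quota * min_fraction)
    # Limit how far a single automated adjustment can reduce the quota.
    return max(proposed, floor)
```

Without the zero-usage check, a service whose usage is wrongly reported as 0 would have its quota reduced as soon as the grace period ended, which matches the sequence of events in Google's account.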
The problem “was immediately clear as the new quotas took effect.” At the height of the incident, Google could not verify that user requests were authenticated and the company confirmed it was seeing 5xx errors on virtually all authenticated traffic.
“The majority of authenticated services experienced similar control plane impact: elevated error rates across all Google Cloud Platform and Google Workspace APIs and Consoles,” the company said.