On November 19th we notified our clients that Microsoft’s Azure Active Directory Multi-Factor Authentication (MFA) service was down for a large number of users. We notified particular clients because Office 365 and Dynamics users authenticate through this service, so they were affected as well.
Microsoft’s Azure team has since published the root cause it found during its internal investigation into the November 19th worldwide MFA outage. The team states that it uncovered three independent root causes.
“Yes, as of 14:25 UTC today, MFA was having problems. But it’s ok – it’s only a ‘subset’ of customers,” stated a Microsoft engineer.
The tech giant went on to warn that customers with MFA required by policy might experience intermittent issues signing in to Azure resources, including Azure Active Directory (AD).
There is no question that MFA is a good thing, as it requires users to present at least one more factor besides a password, such as a phone, a dongle or biometrics. That assumes the MFA service is actually running, of course.
The issue, which is worldwide, comes hot on the heels of the publication of a root cause analysis into the incident last week, which saw a trio of failures that led to users being unable to access their beloved Office 365 services.
Engineers are actively investigating an ongoing issue affecting Azure Active Directory, when Multi-Factor Authentication is required by policy. Please refer to https://t.co/Dw19fIoS5H for updates.
— Azure Support (@AzureSupport) November 19, 2018
There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time. The first two root causes were identified as issues on the MFA frontend server, both introduced in a roll-out of a code update that began in some datacenters (DCs) on Tuesday, 13 November 2018 and completed in all DCs by Friday, 16 November 2018. The issues were later determined to be activated once a certain traffic threshold was exceeded which occurred for the first time early Monday (UTC) in the Azure West Europe (EU) DCs. Morning peak traffic characteristics in the West EU DCs were the first to cross the threshold that triggered the bug. The third root cause was not introduced in this rollout and was found as part of the investigation into this event.
1. The first root cause manifested as a latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger the second root cause.
2. The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes which can trigger additional latency and the third root cause (below) on the MFA backend.
3. The third identified root cause was a previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes accumulation of processes on the MFA backend, leading to resource exhaustion on the backend, at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.
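The third root cause describes a backend that could no longer process requests yet still looked healthy to monitoring. As an illustration only, here is a minimal Python sketch of the difference between a shallow liveness check and a health check that sends a synthetic request end to end. Nothing here reflects Microsoft's actual monitoring: `SimulatedBackend`, `deep_health_check`, `ProbeResult` and the latency limit are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    healthy: bool
    reason: str


class SimulatedBackend:
    """Toy stand-in for a backend whose worker slots can get stuck (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.stuck_workers = 0

    def process_alive(self) -> bool:
        # Shallow check: the process exists, which says nothing about free capacity.
        return True

    def leak_workers(self, count: int) -> None:
        # Simulates accumulated processes that never finish.
        self.stuck_workers += count

    def handle_request(self) -> bool:
        # Requests fail once every worker slot is occupied by a stuck process.
        if self.stuck_workers >= self.capacity:
            raise RuntimeError("no free workers")
        return True


def deep_health_check(
    send_probe: Callable[[], bool],
    max_latency_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Judge health by the outcome of one synthetic request, not by liveness.

    Assumption: send_probe returns or raises on its own. A real probe would
    also need a hard timeout so that a hung backend cannot block the check.
    """
    start = clock()
    try:
        accepted = send_probe()
    except Exception as exc:
        return ProbeResult(False, f"probe raised {type(exc).__name__}")
    elapsed = clock() - start
    if not accepted:
        return ProbeResult(False, "probe request was rejected")
    if elapsed > max_latency_s:
        return ProbeResult(False, f"probe took {elapsed:.2f}s")
    return ProbeResult(True, "ok")
```

In this toy model, calling `backend.leak_workers(3)` on a backend with a capacity of 3 leaves `backend.process_alive()` returning `True`, while `deep_health_check(backend.handle_request)` reports the backend as unhealthy.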
Microsoft also said it will take the following steps to avoid such issues in the future:
- Review our update deployment procedures to better identify similar issues during our development and testing cycles (completion by Dec 2018)
- Review the monitoring services to identify ways to reduce detection time and quickly restore service (completion by Dec 2018)
- Review our containment process to avoid propagating an issue to other datacenters (completion by Jan 2019)
- Update communications process to the Service Health Dashboard and monitoring tools to detect publishing issues immediately during incidents (completion by Dec 2018)
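The containment step in this list, avoiding the propagation of an issue to other datacenters, is commonly approached with a staged rollout that stops as soon as one stage looks unhealthy. The following Python sketch shows that general idea under stated assumptions; it is not Microsoft's deployment process. The function `staged_rollout`, its callbacks and the error-rate threshold are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def staged_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,
) -> Tuple[List[str], Optional[str]]:
    """Deploy one region at a time and halt at the first unhealthy region.

    Returns the regions that passed the gate and the region that tripped it
    (None if every region passed). Hypothetical helper for illustration.
    """
    completed: List[str] = []
    for region in regions:
        deploy(region)
        # Gate: measure the region after the update before moving on.
        # Assumption: error_rate reflects traffic that is representative
        # of peak load for that region.
        if error_rate(region) > max_error_rate:
            return completed, region
        completed.append(region)
    return completed, None
```

A gate like this only helps if it observes the conditions that trigger the fault. According to the root cause analysis above, the first two issues appeared only once a traffic threshold was exceeded, so a check made right after deployment during a quiet period would likely have passed.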
Situation update from Microsoft
“CURRENT MITIGATION: Engineers are currently in the process of cycling backend services responsible for processing MFA requests. This mitigation step is being rolled out region by region with a number of regions already completed. Engineers are reassessing impact after each region completes. Engineers have also determined a Domain Name System (DNS) issue caused sign-in requests to fail, but this issue is mitigated and engineers are restarting the authentication infrastructure.