2024-02-02 Summary of Outage (Resolved)
Event Description: Outage caused by DNS
Event Start Time: 2024-02-02 23:22 EST
Event End Time: 2024-02-03 01:05 EST
RFO Issue Date: 2024-02-06
Affected Services:
- Inbound and outbound calling
- Phone and device registration
- Access to Manager Portal
Event Summary
The registration for the domain ucaasnetwork.com lapsed, leaving DNS resolvers unable to resolve any records under that domain. Services failed as DNS caches refreshed with missing or incorrect information. Once the issue was identified, the domain registration was renewed and services came back online quickly. Over the next 48 hours, some endpoints remained affected because their DNS providers took longer than expected to refresh their caches.
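The failure mode described above (names under the domain no longer resolving) can be probed directly. The following is a minimal Python sketch, not the monitoring actually used during this event. The function name `unresolvable_hosts` and its injectable `resolve` parameter are illustrative assumptions. It checks resolution only through the local resolver, so a cached answer can mask a problem that other resolvers already see.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hostnames: Iterable[str],
    resolve: Callable[..., object] = socket.getaddrinfo,
) -> List[str]:
    """Return the hostnames that the resolver could not resolve.

    `resolve` defaults to the system resolver; it is injectable so the
    check can be exercised without touching the network.
    """
    failed = []
    for name in hostnames:
        try:
            # The port is irrelevant here; only name resolution matters.
            resolve(name, None)
        except socket.gaierror:
            # Raised by the system resolver for address-related errors,
            # including names that do not resolve.
            failed.append(name)
    return failed
```

Called with a list of service hostnames, the function returns an empty list while everything resolves. A sudden jump to every name under one domain failing at once would point at the domain itself rather than at an individual server.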
Event Timeline
2024-02-02 23:22 EST - First alert received from Insight regarding a failed HTTPS health check to core1-atl.
2024-02-02 23:36 EST - Steven responded immediately and began checking Apache, NMS, Manager Portal accessibility, and Insight monitoring. Noticed a drop in registrations on all servers.
2024-02-02 23:44 EST - Contacted additional support.
2024-02-03 00:06 EST - Nodeping alerts began coming in. First indication that this was related to DNS. Began checking Constellix.
2024-02-03 00:21 EST - Garrett noticed the ucaasnetwork.com domain was expired.
2024-02-03 00:25 EST - Steven began attempting to contact Kevin and Ray to declare a major incident.
2024-02-03 00:30 EST - Attempts to log in to the domain registrar were hindered by MFA codes being sent to a non-public email address.
2024-02-03 00:48 EST - Jack retrieved the renewal email and MFA code directly from Exchange.
2024-02-03 00:54 EST - War Room/Major Incident created by Jack.
2024-02-03 00:56 EST - Steven successfully renewed ucaasnetwork.com.
2024-02-03 01:05 EST - Announcement posted to Discord, Uptime Robot, and Partner Central by Jack.
Root Cause
Domain registration for ucaasnetwork.com lapsed due to misconfigured alerting. Renewal alerts were being sent only to a legacy email address and were not copied to the new workflows.
Future Preventative Action
An internal project has been created to isolate the monitoring lapse and develop long-term, scalable solutions. Additional criteria and alerting options will also be explored as part of the project.
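As one illustration of an additional alerting criterion, the sketch below classifies a domain by how many days remain before its registration expires. It is a hedged Python example, not the project's actual design. The function name `expiry_status` and the 30-day and 7-day thresholds are hypothetical, and the expiration date is assumed to be obtained separately (for example, from the registrar).

```python
from datetime import date
from typing import Optional

# Hypothetical thresholds, in days before expiry; tune to local policy.
WARNING_DAYS = 30
CRITICAL_DAYS = 7


def expiry_status(expires_on: date, today: date) -> Optional[str]:
    """Classify a domain registration by days remaining until expiry.

    Returns "expired", "critical", "warning", or None when no alert
    is needed yet.
    """
    remaining = (expires_on - today).days
    if remaining < 0:
        return "expired"
    if remaining <= CRITICAL_DAYS:
        return "critical"
    if remaining <= WARNING_DAYS:
        return "warning"
    return None
```

Running a check like this on a schedule and routing any non-None result through the same channels as service alerts would avoid relying on a single mailbox for renewal notices.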