2022-09-03 - SJE Outage / Inbound/Outbound Calls/Device Registration (Resolved)
Event Description: A hardware failure caused voice services on the SJE voice server to fail, causing failover of services to the ATL voice servers.
Event Start Time: 2022-09-03 02:47 EST
RFO Issue Date: 2022-09-08
Affected Services:
- Inbound calling
- Outbound calling
- Device Registration
- API based services
Event Summary
Our SJE core server suffered a catastrophic hardware failure. This caused inbound and outbound calls to fail over to the ATL core servers.
Device registration to the SJE server would have failed, causing the phones to fail over to the ATL core servers.
API services were temporarily affected while we redirected the API traffic to an alternate server.
During this time, there were intermittent call quality issues due to the extra connections that had failed over from the SJE server.
Event Timeline
September 3, 2022
04:47 EST - OIT engineers received alerts that core1-sje services were offline.
04:54 EST - Engineers confirmed services were offline and began investigating the root cause.
05:09 EST - Hardware failure was confirmed as the root cause of the service failure.
05:32 EST - Engaged server data center (DC) support after attempts to bypass the hardware failure were unsuccessful.
05:44 EST - Confirmed that phone registration and inbound calls were failing over correctly.
06:28 EST - Updated the Discord announcement channel.
06:30 EST - DC support started investigating the hardware failure and confirmed the OIT engineers' findings.
08:56 EST - Updated the status page.
13:00 EST - DC support began replacing the failed hardware.
14:23 EST - DC support confirmed the hardware replacement was complete.
14:35 EST - Received alerts that configuration files were not generating correctly for phones previously set to register to the SJE server.
14:40 EST - Asked DC support to begin reinstalling the server operating system.
15:33 EST - Updated the DNS record for API calls to redirect to the alternate server (ATL).
15:40 EST - Confirmed configuration files were generating successfully after the DNS update.
15:58 EST - Added DNS failover for the DNS record responsible for API calls between the servers.
21:34 EST - DC support confirmed the request to reinstall the server operating system.
September 5, 2022
08:40 EST - Updated partner central with the outage information.
16:08 EST - DC support indicated that the requested operating system had been installed.
18:50 EST - After internal discussion, requested that an alternate operating system be installed.
September 6, 2022
08:48 EST - DC support advised that the server operating system and all other base settings were set up and configured.
08:50 EST - Updated API DNS records to force failover to the alternate server (ATL), as SJE was starting to come back online.
08:53 EST - OIT engineers started preparing the server for installation of the voice services applications.
10:19 EST - OIT staff received reports of poor call quality.
10:55 EST - OIT completed preparation of the server and engaged the voice services vendor to complete installation of the voice applications.
11:21 EST - Confirmed call quality degradation was due to the additional calls and device connections failing over from SJE.
12:09 EST - OIT staff received reports of devices not registering.
12:15 EST - Determined that the device registration failures were caused by the delayed application of a setting change on the ATL firewall, which had been made to help alleviate the call quality concerns.
13:14 EST - Received additional reports of devices failing to register.
13:17 EST - Found that the SBC-WEST DNS entry was still set to SJE and, during the voice service installation, was attempting to return devices to SJE before it was fully functional. Updated the sbc-west DNS record to have ATL as the primary server.
13:51 EST - Received reports of the Manager portal showing delayed updates to calls and BLF statuses not reporting correctly.
13:56 EST - Confirmed that the Manager portal update delays and BLF status delays were due to the excess congestion on the ATL server.
18:59 EST - All voice services were reinstalled and confirmed to be functioning correctly, and data replication between the servers was operating as expected.
September 7, 2022
00:00 EST - Data migration from the ATL servers to the SJE server started.
04:12 EST - Updated sbc-west.ucaasnetwork.com DNS records to point back to the SJE server.
04:24 EST - Data migration completed and functionality of voice services and device registration confirmed.
08:52 EST - Updated NDP to accept first-time provisioning for phones pointing to the SJE server.
Root Cause
The SJE core server suffered a catastrophic hardware failure, which required a full replacement of all RAID storage drives and reinstallation of the operating system and voice service applications.
Call quality issues for devices that had failed over to the ATL voice servers were caused by extreme CPU load on the ATL firewall from the extra SJE device connections.
Device registration failures were due to the SBC-West DNS records automatically failing back to SJE once the operating system was reinstalled. Manually updating the DNS record to point to the ATL server allowed devices to register successfully again.
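The automatic failback described above can be illustrated with a small sketch. It is not the DNS provider's actual failover logic or our production tooling. It only shows one way to hold traffic on the secondary site until the recovering primary has passed several consecutive application-level checks and an engineer has released a manual hold. The class name, the threshold, and the idea of a "probe" (for example, a successful test registration rather than a ping) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailbackGate:
    """Decides whether DNS may point back at a recovering primary site.

    Hypothetical helper for illustration only; not a real provider API.
    """
    required_successes: int = 5      # assumed threshold, tune per environment
    consecutive_successes: int = 0
    manually_held: bool = True       # start held while the server is rebuilt

    def record_probe(self, healthy: bool) -> None:
        # A single failed probe resets the streak, so a host that only
        # answers intermittently never becomes eligible for failback.
        if healthy:
            self.consecutive_successes += 1
        else:
            self.consecutive_successes = 0

    def release_hold(self) -> None:
        # Called by an engineer once the voice applications are installed.
        self.manually_held = False

    def may_fail_back(self) -> bool:
        return (not self.manually_held
                and self.consecutive_successes >= self.required_successes)


def choose_target(gate: FailbackGate, primary: str = "SJE",
                  secondary: str = "ATL") -> str:
    """Return the site that the DNS record should currently point to."""
    return primary if gate.may_fail_back() else secondary
```

With a gate like this, a server that merely responds after an operating system reinstall stays out of rotation: `choose_target` keeps returning the secondary site until the hold is released and the probe streak reaches the threshold.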
Future Preventative Action
- Added logging specific to the hardware that failed.
- Added additional failover points to allow quicker and smoother failover of services in the event of hardware failure.
- Disabled unnecessary services on the ATL firewall to reduce CPU load in the event of device failover.
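As a rough companion to the firewall item above, the sketch below shows the kind of headroom check that could be run ahead of time to estimate whether one site can absorb the other site's connections. It is a simplified model under stated assumptions. It treats load as proportional to connection count, which real firewall CPU usage may not follow, and the function names, the 0.8 ceiling, and all sample numbers are hypothetical placeholders rather than measurements from this incident.

```python
def projected_utilization(local_connections: int,
                          failed_over_connections: int,
                          capacity: int) -> float:
    """Fraction of capacity used if the other site's connections arrive."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (local_connections + failed_over_connections) / capacity


def has_failover_headroom(local_connections: int,
                          failed_over_connections: int,
                          capacity: int,
                          max_utilization: float = 0.8) -> bool:
    """True if the surviving site stays at or under the assumed ceiling."""
    utilization = projected_utilization(
        local_connections, failed_over_connections, capacity)
    return utilization <= max_utilization


# Placeholder numbers only, not figures from the outage.
print(has_failover_headroom(300, 300, 1000))  # 60% projected load
print(has_failover_headroom(600, 500, 1000))  # 110% projected load
```

A check of this shape is only as good as its capacity figure, so it complements load testing rather than replacing it.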