2022-09-22 - SJE & NYJ Outage (Resolved)
Event Description: Spikes in CPU usage on firewalls at both NYJ & SJE caused both cores to become unreachable. The spikes occurred randomly over the course of several weeks, causing repeated intermittent outages.
Event Start Time: 2022-09-22 19:01 EST
RFO Issue Date: 2022-11-07
Affected Services:
- Inbound calling
- Outbound calling
- Device Registration
- Call Quality on SJE, NYJ & ATL
Event Summary
Following the catastrophic hardware failure of our SJE core (for more information, see the 2022-09-03 Major Incident Report), the firewalls located at SJE and NYJ began experiencing extreme spikes in CPU usage. Both firewalls rebooted several times over the course of eight weeks as a result of dataplane CPU exhaustion.
During these reboots, SJE and NYJ would become unreachable, device registration to the SJE & NYJ servers would fail, and phones would fail over to the ATL core servers. There were also intermittent call quality issues due to the extra connections that had failed over from the SJE server.
After communicating with both the data center and firewall vendors and completing multiple maintenance windows to upgrade the firewall firmware, adjust settings, and troubleshoot the cause of the reboots, it was determined that the firewalls at both NYJ and SJE were not appropriately specced for the load and needed to be replaced. A final maintenance was completed to bypass the physical firewalls in favor of UFW on the servers, which resolved both the registration and call quality issues.
Event Timeline
September 22, 2022
- 19:01 EST - Engineers received alerts for extreme latency in the 10000ms+ range for both NYJ and SJE servers and began investigating the cause.
- 19:12 EST - Connectivity to SJE & NYJ was restored as firewalls came back online.
- 19:19 EST - Engineers engaged datacenter support to investigate root cause
- 20:51 EST - Engineers engaged firewall support, who determined that the dataplane CPU was stuck at 100% on both firewalls. Initial reports pointed to high bursts of traffic as the cause, but this was later found to be inaccurate.
September 23, 2022
- 16:19 EST - Devices homed to NYJ lost registration. Immediately after, high latency started to the same core. Devices failed over to ATL as expected
- 16:34 EST - Engineers engaged data center & firewall vendor support to investigate
- 16:45 EST - Engineers failed over NYJ to secondary firewall. Packet loss and latency immediately returned to nominal, and devices began registering back to NYJ.
- 17:27 EST - Engineers continued monitoring NYJ to ensure no further downtime occurred while continuing to investigate the root cause. After troubleshooting with datacenter and firewall support, a bug was identified that could spike CPU usage to 100% when a packet became stuck in the dataplane processor while being inspected for routing.
October 03, 2022
- 09:54 EST - Engineers were alerted to spikes in traffic to ATL, NYJ, and SJE, possibly originating at NYJ, and began investigating.
- 10:36 EST - Datacenter support was engaged for further assistance
- 11:16 EST - Latency returned to nominal metrics, all devices on NYJ registered to the proper core. Further investigation confirmed that the firewall at NYJ had spiked to 100% usage again.
October 15, 2022
- 02:00 EST - Planned maintenance was completed on the standby firewalls for NYJ & SJE to upgrade them to the latest firmware addressing the bug identified on September 23rd. During the maintenance, the failover to the standby firewall required to upgrade the active firewall at NYJ failed, and the SJE firewall became unresponsive.
- 02:46 EST - Engineers began observing extreme latency of 6000ms+ on both the NYJ and SJE firewalls.
- 02:51 EST - Engineers engaged datacenter and firewall support
- 07:30 EST - After consulting datacenter support, it was confirmed that CPU usage had spiked to 100% again. SJE was failed over to the standby firewall indefinitely, and NYJ firewalls were rebooted. After reboot and failover, both cores were stable.
October 20, 2022
- 13:15 EST - Engineers were alerted that SJE's firewall had rebooted again
- 13:18 EST - Engaged datacenter support
- 13:30 EST - The firewall came back online; however, there were still significant latency spikes.
- 13:31 EST - Datacenter support confirmed the dataplane CPU was exhausted, causing the firewall to reboot.
- 17:57 EST - Engineers engaged firewall support and began reviewing logs to investigate the root cause. During the investigation, it was found that the firewalls at NYJ and SJE were not properly specced for the load and internet connection. Discussions began regarding firewall replacements and alternatives.
October 26, 2022
- 10:46 EST - We received alerts of the SJE firewall CPU peaking at 100% again for 15 minutes, which cascaded to NYJ. Both servers recovered on their own during this timeframe; however, both NYJ and SJE continued to experience increased latency while attempting to resync with ATL.
- 10:53 EST - Engineers engaged datacenter and firewall support for a root cause analysis. Confirmed dataplane CPU was exhausted, causing the firewalls to reboot.
- 11:35 EST - Reports of poor call quality and de-registration continued. Users on ATL also began reporting poor call quality as devices failed over to ATL
- 11:51 EST - Latency on NYJ and SJE began to drop to normal levels as syncing was completed.
- 12:21 EST - Further monitoring on all three cores revealed intermittent spikes in latency beginning at NYJ and SJE, causing poor call quality and de-registration during the spikes.
October 27-28, 2022
- During this time, latency on NYJ & SJE continued to spike; however, the root cause had already been determined to be the faulty firewalls, which were scheduled for replacement in the coming days.
October 29, 2022
- 12:14 EST - Engineers were alerted that SJE's firewall had rebooted again, and significant latency was occurring on NYJ
- 12:18 EST - Engaged datacenter support. The firewall came back online; however, there were still significant latency spikes.
- 13:58 EST - Latency on SJE began returning to normal.
- 14:29 EST - After discussing internally, engineers decided to take SJE offline and force a failover to ATL in an attempt to minimize the outages until the firewall replacements were completed.
October 30-31, 2022
- Users on SJE, ATL, and NYJ experienced significant call quality and registration failures during this time due to the additional load from SJE being taken offline.
November 1, 2022
- 00:00 EST - The physical firewall at SJE was bypassed in favor of the UFW rules already in place on the SJE server, and SJE was brought back online. Device registration and call quality were restored immediately, and SJE and ATL remained stable for the remainder of the day.
November 2, 2022
- 00:00 EST - The physical firewall at NYJ was bypassed in favor of the UFW rules already in place on the NYJ server, and NYJ was brought back online. Device registration and call quality were restored immediately, and NYJ remained stable for the remainder of the day.
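For reference, the server-side UFW policy used in place of the physical firewalls follows the usual default-deny pattern. The rules below are a minimal sketch assuming a typical SIP/RTP deployment; the ports, protocols, and address range shown are illustrative assumptions, not the actual production rule set.

    # Illustrative sketch only - not the production policy
    ufw default deny incoming
    ufw default allow outgoing
    # SIP signaling (hypothetical ports; actual values depend on the deployment)
    ufw allow 5060/udp
    ufw allow 5061/tcp
    # RTP media range (hypothetical)
    ufw allow 10000:20000/udp
    # Management access limited to an internal range (hypothetical)
    ufw allow from 10.0.0.0/8 to any port 22 proto tcp
    ufw enable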
Root Cause
The repeated outages at SJE and NYJ were caused by dataplane CPU exhaustion on firewalls that were not appropriately specced for the load, which forced the firewalls to reboot and left both cores unreachable. Call quality concerns for devices that had failed over to the ATL voice servers were due to extreme CPU load on the ATL firewall, caused by the extra connections from the SJE & NYJ devices.
Future Preventative Action
- Disabled unnecessary services on the ATL firewall to reduce CPU load in the event of device failover
- Reviewing replacement firewall specs
- Reviewing additional datacenters as further points of failover