2022-09-22 - SJE & NYJ Outage (Resolved)
Event Description: Spikes in CPU usage on firewalls at both NYJ & SJE caused both cores to become unreachable. The spikes occurred randomly over the course of several weeks, causing repeated intermittent outages.
Event Start Time: 2022-09-22 19:01 EST
RFO Issue Date: 2022-11-07
Affected Services:
- Inbound calling
- Outbound calling
- Device Registration
- Call Quality on SJE, NYJ & ATL
Event Summary
Following the catastrophic hardware failure of our SJE core (for more information, see the 2022-09-03 Major Incident Report), the firewalls located at SJE and NYJ began experiencing extreme spikes in CPU usage. Both firewalls rebooted several times over the course of eight weeks as a result of dataplane CPU exhaustion.
During these reboots, SJE and NYJ would become unreachable, device registration to the SJE & NYJ servers would fail, and phones would fail over to the ATL core servers. There were also intermittent call quality issues due to the extra connections that had failed over from the SJE server.
After communicating with both the data center and firewall vendors and completing multiple maintenance windows to upgrade the firewall firmware, adjust settings, and troubleshoot the cause of the reboots, it was determined that the firewalls at both NYJ and SJE were not appropriately specced for the load and needed to be replaced. A final maintenance was completed to bypass the physical firewalls in favor of UFW on the servers, which resolved both the registration and call quality issues.
Event Timeline
September 22, 2022
- 19:01 EST - Engineers received alerts for extreme latency in the 10000ms+ range for both NYJ and SJE servers and began investigating the cause.
- 19:12 EST - Connectivity to SJE & NYJ was restored as firewalls came back online.
- 19:19 EST - Engineers engaged datacenter support to investigate root cause
- 20:51 EST - Engineers engaged firewall support, who determined that the dataplane CPU was stuck at 100% on both firewalls. Initial reports pointed to high bursts of traffic as the cause, but this was later found to be inaccurate.
September 23, 2022
- 16:19 EST - Devices homed to NYJ lost registration. Immediately after, high latency started to the same core. Devices failed over to ATL as expected
- 16:34 EST - Engineers engaged data center & firewall vendor support to investigate
- 16:45 EST - Engineers failed over NYJ to secondary firewall. Packet loss and latency immediately returned to nominal, and devices began registering back to NYJ.
- 17:27 EST - Engineers continued monitoring NYJ to ensure no further downtime occurred while continuing to investigate the root cause. After troubleshooting with datacenter and firewall support, a bug was identified that could spike CPU usage to 100% when a packet became stuck in the dataplane processor while being inspected for routing.
October 03, 2022
- 09:54 EST - Engineers were alerted to spikes in traffic to ATL, NYJ, and SJE, possibly originating at NYJ, and began investigating.
- 10:36 EST - Datacenter support was engaged for further assistance
- 11:16 EST - Latency returned to nominal metrics, all devices on NYJ registered to the proper core. Further investigation confirmed that the firewall at NYJ had spiked to 100% usage again.
October 15, 2022
- 02:00 EST - Planned maintenance was completed on the standby firewalls for NYJ & SJE to upgrade them to the latest firmware addressing the bug identified on September 23rd. During the maintenance, the failover to the standby firewall required to upgrade the active firewall at NYJ failed, and the SJE firewall became unresponsive.
- 02:46 EST - Engineers began observing extreme latency of 6000ms+ on both the NYJ and SJE firewalls.
- 02:51 EST - Engineers engaged datacenter and firewall support
- 07:30 EST - After consulting datacenter support, it was confirmed that CPU usage had spiked to 100% again. SJE was failed over to the standby firewall indefinitely, and NYJ firewalls were rebooted. After reboot and failover, both cores were stable.
October 20, 2022
- 13:15 EST - Engineers were alerted that SJE's firewall had rebooted again
- 13:18 EST - Engaged datacenter support
- 13:30 EST - The firewall came back online; however, there were still significant latency spikes.
- 13:31 EST - Datacenter support confirmed the dataplane CPU was exhausted, causing the firewall to reboot.
- 17:57 EST - Engineers engaged firewall support and began reviewing logs to investigate the root cause. During the investigation, it was found that the firewalls at NYJ and SJE were not properly specced for the load and internet connection. Discussions began regarding firewall replacements and alternatives.
October 26, 2022
- 10:46 EST - We received alerts of the SJE firewall CPU peaking at 100% again for 15 minutes, which cascaded to NYJ. Both servers recovered on their own during this timeframe; however, both NYJ and SJE continued to experience increased latency while attempting to resync with ATL.
- 10:53 EST - Engineers engaged datacenter and firewall support for a root cause analysis. Confirmed dataplane CPU was exhausted, causing the firewalls to reboot.
- 11:35 EST - Reports of poor call quality and de-registration continued. Users on ATL also began reporting poor call quality as devices failed over to ATL
- 11:51 EST - Latency on NYJ and SJE began to drop to normal levels as syncing was completed.
- 12:21 EST - Further monitoring on all three cores revealed intermittent spikes in latency beginning at NYJ and SJE, causing poor call quality and de-registration during the spikes.
October 27-28, 2022
- During this time, latency on NYJ & SJE continued to spike; however, the root cause had already been determined to be the faulty firewalls, which were scheduled for replacement in the coming days.
October 29, 2022
- 12:14 EST - Engineers were alerted that SJE's firewall had rebooted again, and significant latency was occurring on NYJ
- 12:18 EST - Engaged datacenter support. The firewall came back online; however, there were still significant latency spikes.
- 13:58 EST - Latency on SJE began returning to normal.
- 14:29 EST - After discussing internally, engineers decided to take SJE offline and force a failover to ATL in an attempt to minimize the outages until the firewall replacements were completed.
October 30-31, 2022
- Users on SJE, ATL, and NYJ experienced significant call quality and registration failures during this time due to the additional load from SJE being taken offline.
November 1, 2022
- 00:00 EST - The physical firewall at SJE was bypassed in favor of the UFW rules already in place on the SJE server, and SJE was brought back online. Device registration and call quality were restored immediately, and SJE and ATL remained stable for the remainder of the day.
November 2, 2022
- 00:00 EST - The physical firewall at NYJ was bypassed in favor of the UFW rules already in place on the NYJ server, and NYJ was brought back online. Device registration and call quality were restored immediately, and NYJ remained stable for the remainder of the day.
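For reference, the server-side UFW policy used in place of the physical firewalls follows the usual default-deny pattern. The rules below are a minimal sketch assuming a typical SIP/RTP deployment; the ports, protocols, and address range shown are illustrative assumptions, not the actual production rule set.

    # Illustrative sketch only - not the production policy
    ufw default deny incoming
    ufw default allow outgoing
    # SIP signaling (hypothetical ports; actual values depend on the deployment)
    ufw allow 5060/udp
    ufw allow 5061/tcp
    # RTP media range (hypothetical)
    ufw allow 10000:20000/udp
    # Management access limited to an internal range (hypothetical)
    ufw allow from 10.0.0.0/8 to any port 22 proto tcp
    ufw enable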
Root Cause
The repeated outages at SJE and NYJ were caused by dataplane CPU exhaustion on firewalls that were not appropriately specced for the load, which forced the firewalls to reboot and left both cores unreachable. Call quality concerns for devices that had failed over to the ATL voice servers were due to extreme CPU load on the ATL firewall, caused by the extra connections from the SJE & NYJ devices.
Future Preventative Action
- Disabled unnecessary services on the ATL firewall to reduce CPU load in the event of device failover
- Reviewing replacement firewall specs
- Reviewing additional datacenters as further points of failover