2019-11-22 - ACS001 Core Failure (Resolved)
This article describes the ACS001 core failure that began on 2019-11-22 and explains how it was resolved.
Event Description: ACS001 Crash
Event Start Time: 2019-11-22 11:42 PM EST
Event End Time: 2019-11-23 01:29 AM EST
RFO Issue Date: 2019-11-25
Affected Services
Phones registered to Atlanta lost registration. Devices configured for SRV or UDP failed over. Devices configured for TCP or manually registered to core1-atl did not regain registration.
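The SRV failover behavior described above can be sketched as follows. This is a minimal illustration only, not the logic of any particular phone firmware. It assumes SRV records are tried in ascending priority order, as the DNS SRV convention specifies, and it ignores weight-based selection among records of equal priority. The hostnames, the record values, and the `is_reachable` callback are hypothetical.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int  # lower value is tried first
    weight: int    # ignored in this simplified sketch
    target: str


def pick_target(records: Iterable[SrvRecord],
                is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable target in ascending priority order, or None."""
    for record in sorted(records, key=lambda r: r.priority):
        if is_reachable(record.target):
            return record.target
    return None


# Hypothetical records: Atlanta preferred, another cluster as backup.
records = [
    SrvRecord(priority=10, weight=100, target="core1-atl.example.net"),
    SrvRecord(priority=20, weight=100, target="core1-nyj.example.net"),
]

# Simulate the Atlanta core being down.
down = {"core1-atl.example.net"}
chosen = pick_target(records, lambda host: host not in down)
```

A device pinned to a single hostname has no second record to fall back to, which is consistent with devices manually registered to core1-atl not regaining registration.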
Event Summary
On November 22nd, 2019, at 23:42 EST, the ACS cluster began crashing repeatedly. Several phones lost registration and the ability to make or receive calls.
Event Timeline (All times 24-hour format, EST)
November 22nd, 2019
- 23:42 First crash reported by monitoring systems. Phones lost registration. Notice placed in partner server.
- 23:43 Failover verified to SJE and NYJ clusters
November 23rd, 2019
- 00:00 Issue verified isolated to Atlanta servers
- 00:25 Rolled back 40.2 updates in case they had caused the issue
- 00:57 Services to Atlanta resumed and functional. Endpoints configured for UDP or SRV registered back to the Atlanta cluster. Cause not yet determined.
- 01:29 Atlanta cluster remained online. All UDP and SRV phones remained registered. Some reports of call history not functioning. Investigation to continue in the morning.
- 17:12 Call history page restored. All services functional.
Root Cause Analysis
In troubleshooting with NetSapiens and our own senior engineers, we determined that malformed TCP packets were causing a crash in the TCP stack. We believe the packets came from a single device, but more testing is needed. Normally a crash of this kind would not affect services. However, the repeated crashes and the resulting core dumps quickly filled the server's storage. Once storage was full, normal functions could no longer run.
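A storage check of the following kind is one way a fill-up like this could be noticed early. It is a minimal sketch using only the Python standard library, not the monitoring actually in place. The 90% threshold and the function names are assumptions chosen for illustration.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 0.90) -> bool:
    """Return True when the used fraction meets or exceeds the threshold.

    The default threshold of 0.90 is an assumed value for illustration.
    """
    return usage_fraction(path) >= threshold
```

A scheduled job could call `is_nearly_full` on the volume that receives core dumps and raise an alert before the volume fills.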
Future Preventative Action
While not a permanent fix, the decision was made to block TCP device registration. This affected fewer than 1% of our total registered devices. Devices that were configured for TCP must be reconfigured to use UDP. We are continuing to work with NetSapiens senior engineering staff to determine how a single errant device could crash the stack. Finally, we have instituted additional safeguards that immediately move core dump files off the server to keep storage from filling again.
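The core dump safeguard could look something like the sketch below. This is an illustration under stated assumptions, not the safeguard actually deployed: the directory arguments and the `core*` file name pattern are hypothetical, and a real deployment would likely move the files to a separate volume or host.

```python
import shutil
from pathlib import Path


def move_core_dumps(dump_dir, archive_dir, pattern="core*"):
    """Move files matching pattern from dump_dir to archive_dir.

    Returns the names of the files that were moved. The default
    pattern "core*" is an assumed naming convention for core dumps.
    """
    source = Path(dump_dir)
    destination = Path(archive_dir)
    destination.mkdir(parents=True, exist_ok=True)

    moved = []
    for dump in sorted(source.glob(pattern)):
        if dump.is_file():  # skip directories that happen to match
            shutil.move(str(dump), str(destination / dump.name))
            moved.append(dump.name)
    return moved
```

Run on a short interval, a job like this keeps crash artifacts available for analysis without letting them accumulate on the server's own storage.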
Update 11/26/19
Worked with NetSapiens (NS) engineering to isolate the issue to TCP connections in SIP trunks only. SIP trunk TCP functionality was left disabled. TCP functionality for endpoints was restored and devices re-registered successfully. Systems have been stable since. NS will continue to follow up regarding SIP trunk TCP.