2021-11-29 System Restart SJE and ATL (Resolved)
Event Description: Core services restarted in SJE and ATL nodes
Event Start Time: 2021-11-29 12:12 PM EST
Event End Time: 2021-11-29 1:48 PM EST
RFO Issue Date: 2021-11-30
Affected Services
Media services for greetings, ringing, voicemail recordings, Auto Attendants, and any other recorded message. Calls were unable to complete during the restart of services. Loss of portal access during reconvergence.
Event Timeline (All times 24-hour format, EST)
November 29th, 2021
- 12:12 Core1-SJE: Our systems alerted on a restart of the NmsMedRecMgr service following a crash of that service. We reviewed call stats and conducted test calls. There did not appear to be any loss of service, and we received no reports from users. We decided to investigate and monitor.
- 12:31 We could not find documentation on the NmsMedRecMgr service. T4 reached out to Netsapiens engineering for clarification.
- 13:45 Core1-ATL: Our systems alerted on a crash of the same NmsMedRecMgr service. The service restarted immediately, as designed. This time we received client reports of SIP 486 (Busy Here) responses, internal and external calls failing to complete, and other loss of voice service. The restart took less than 60 seconds. Most devices were able to resume calling immediately afterward; some required a reboot to re-establish registration. While data synced between the cores, the portals were briefly unavailable.
- 13:51 Netsapiens engineering identified the issue and began working on a patch.
- 17:58 Netsapiens engineering provided a software patch that required service restarts. Maintenance was scheduled for the following morning.
- 11/30/21 05:30 Patches were applied successfully to all cores. Services show nominal. We will continue to monitor for 24 hours before marking the incident as resolved.
Root Cause Analysis
The architectural changes in v42 included several performance enhancements. Among these, the service responsible for media playback was made multi-threaded. It was also moved to run as a sub-service of NMS, which is core to calling, registration, and other major functions.
An issue was identified where a high volume of calls needing this media could crash the service. Because the service ran inside NMS, the crash also brought down core calling features. Netsapiens engineering provided a software patch that lowered the thread count, and their testing showed that the patch prevented further crashes. We applied the same patch to all cores on 11/30/2021.
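As a rough illustration of what capping a thread count looks like in general, the Python sketch below runs media playback requests on a bounded worker pool and contains a failure to the single call that caused it. This is not Netsapiens code and says nothing about how the actual patch is implemented. The names `MAX_MEDIA_THREADS`, `play_media`, and `handle_calls`, and the cap of 4, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap on concurrent media playback workers (illustrative value only).
MAX_MEDIA_THREADS = 4


def play_media(call_id: str, prompt: str) -> str:
    """Stand-in for media playback; a real service would stream audio here."""
    if not prompt:
        raise ValueError(f"no prompt for call {call_id}")
    return f"{call_id}:{prompt}"


def handle_calls(requests: list[tuple[str, str]]) -> dict[str, str]:
    """Run playback requests on a bounded pool and record per-call outcomes."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=MAX_MEDIA_THREADS) as pool:
        # Extra requests queue up instead of spawning more threads.
        futures = {pool.submit(play_media, cid, p): cid for cid, p in requests}
        for future, cid in futures.items():
            try:
                results[cid] = future.result()
            except Exception as exc:  # contain the failure to this one call
                results[cid] = f"failed: {exc}"
    return results
```

With a bounded pool, a burst of calls waits in the queue rather than increasing concurrency, which is the general trade-off behind lowering a thread count: less parallelism in exchange for more predictable load.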
Future Preventative Action
We will continue to monitor for 24 hours to ensure the efficacy of the patch. If all remains nominal we will mark the incident as resolved.
Update 11/30/21: No further issues were experienced. Marking issue as resolved.
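The 24-hour monitoring rule described above can be expressed as a simple check. The sketch below is a hypothetical helper, not part of any monitoring tool we use: `can_mark_resolved` and `MONITOR_WINDOW` are illustrative names, and the crash timestamps are assumed to come from whatever alerting system records service restarts.

```python
from datetime import datetime, timedelta

MONITOR_WINDOW = timedelta(hours=24)  # observation period stated in this report


def can_mark_resolved(patched_at: datetime, crash_times: list[datetime], now: datetime) -> bool:
    """Return True once the full window has elapsed since the patch with no crash after it."""
    if now - patched_at < MONITOR_WINDOW:
        return False  # observation window is still open
    # Any crash at or after patch time means the patch did not hold.
    return not any(patched_at <= t <= now for t in crash_times)
```

Crashes that occurred before the patch, such as the two on November 29th, do not count against the window; only events at or after the patch time do.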