2024-07-15 Atlanta Core Server Outage (Resolved)
Event Description: The Atlanta core server that provides voice services and device registration became unreachable due to hardware failures related to the system memory.
Event Start Time: 2024-07-15 17:45 EST
Event End Time: 2024-07-17 18:02 EST
RFO Issue Date: 2024-07-22
Affected Services:
- Inbound and outbound calling
- Phone and device registration
- Ability to generate configuration files (briefly)
Event Summary
The Atlanta core server that provides voice services and device registration became unreachable due to hardware failures related to the system memory. Server memory had to be removed to restore system stability. During this time, calls and device registrations successfully failed over to alternate servers. After confirming system stability, the server was put back into service and an announcement was made.
Event Timeline
July 15, 2024
- 5:45p EST CSE team received alerts from our monitoring systems for Health Check failures for Core1-atl.
- 5:46p EST CSE team logged in to the monitoring systems to verify the alerts. Subsequently, the alerts were verified with initial basic checks.
- 5:47p EST We logged into the server's management system and found the server had crashed and was stuck on the server boot-up screen. Reboots of the system were attempted and a Major Incident was declared.
- 6:05p EST We advised in the Discord support channel that we were aware of the loss of connectivity to the server and that an announcement would be made momentarily.
- 6:06p EST The Core1-atl server had finished rebooting, but we observed that it crashed again after being up for a few minutes. We investigated the logs further and found memory module failures for DIMMs B4, B8, and B12.
- 6:14p EST We determined that memory failure was the cause of the server crash and started the process of engaging the DC smart hands for assistance.
- 6:21p EST The first official announcement was sent out.
- 6:21p EST While waiting for a call from the Data Center smart hands to begin replacing memory modules, the CSE team verified that there were no previous alerts/alarms in the server's logs before the system crashed.
- 6:45p EST An announcement was sent out with an update on the status of the failover.
- 7:05p EST The CSE team received a call back from the Data Center onsite technicians and began the process of replacing the alarmed memory modules. At this time, we identified that there were no spare memory modules and determined we had to remove some memory modules from one of the other servers.
- 7:15p EST We powered off the QoS server in Atlanta and removed multiple memory modules to be used in the Core server.
- 7:23p EST An announcement was sent out regarding the server shutdown to facilitate the hardware replacement.
- 7:29p EST Memory modules B4, B8, and B12 were replaced in Core1-atl, and we began the process of powering the server back on.
- 7:41p EST Core1-atl crashed again, with DIMMs B8 and B12 in an alarm state. We powered off Core1-atl again to replace B8 and B12 with another set of memory modules.
- 7:58p EST We powered on Core1-atl again and were able to successfully put the server into 503 mode to ensure phones would not attempt to fail back to it (a brief sketch of this 503 maintenance behavior follows this day's timeline).
- 8:03p EST Core1-atl crashed again. We powered down Core1-atl again and instructed the DC smart hands to completely remove DIMMs B12 and A12.
- 8:20p EST We powered on Core1-atl again and began monitoring again. It was at this time that we were informed by the smart hands that DIMMs B12 and A12 had not been removed but rather swapped again.
- 8:24p EST An announcement was sent out indicating we were still troubleshooting the affected hardware with Data Center technicians.
- 8:41p EST Core1-atl was observed to be stable.
- 9:00p EST Core1-atl crashed again. We observed that the crashes were occurring at a threshold of roughly 73 GB of memory utilization.
- We powered down Core1-atl again and ensured DIMM B12 was removed.
- 9:30p EST We observed prolonged stability in the Core1-atl server but determined we needed to remove DIMM A12 as well. We powered down Core1-atl to remove DIMM A12.
- 9:32p EST An announcement was sent out indicating that, while the server was showing prolonged stability, Core1-atl would remain in 503 mode until we could perform further testing. This was the last announcement of the day.
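For context on the 503 mode referenced in this timeline: placing the server in 503 mode means it answers SIP requests with a 503 Service Unavailable response, so phones remain on the alternate servers instead of failing back. The snippet below is a minimal, hypothetical Python sketch of that behavior over a bare UDP socket; it is not the platform's actual mechanism, and the port and Retry-After value are assumptions.

    # Toy illustration of "503 mode": a SIP endpoint that refuses service so
    # devices fail over to (or stay on) an alternate server. Not the
    # platform's real implementation; port and Retry-After are assumptions.
    import socket

    BIND_ADDR = ("0.0.0.0", 5060)  # standard SIP UDP port (assumption)

    def build_503(request: str) -> str:
        """Echo the mandatory headers from the request into a 503 response."""
        wanted = ("via:", "from:", "to:", "call-id:", "cseq:")
        echoed = [line for line in request.split("\r\n")
                  if line.lower().startswith(wanted)]
        return "\r\n".join(
            ["SIP/2.0 503 Service Unavailable"]
            + echoed
            + ["Retry-After: 3600",  # ask devices to stay away during maintenance
               "Content-Length: 0", "", ""])

    def main() -> None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(BIND_ADDR)
        while True:
            data, peer = sock.recvfrom(65535)
            text = data.decode(errors="replace")
            # Reply only to SIP requests (REGISTER, INVITE, ...), not responses.
            if text and not text.startswith("SIP/2.0"):
                sock.sendto(build_503(text).encode(), peer)

    if __name__ == "__main__":
        main()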
July 16, 2024
- 9:01a EST We checked on the stability of the server and found that it was remaining stable and had consumed 109 GB of memory, well above the previously observed 73 GB threshold that had been triggering the server crashes.
- 9:46a EST We began making a change to the NDP WLP preferred server configurations to allow for better failover in the portal. This was performed due to an observation from when the server was unreachable the day prior.
- 10:05a EST An announcement was put out with an update on hardware stability, noting that additional testing would be performed throughout the day and that the server would remain out of service until the tests were completed.
- 10:15a EST The NDP changes were completed.
- 10:21a EST We powered off Core1-atl to begin performing server hardware tests.
- 11:39a EST Basic hardware tests completed with no errors. We began running extended tests.
- 1:08p EST Extended tests passed with no errors. We rebooted the server to allow it to boot into the operating system normally.
- 2:46p EST An announcement was put out providing an update that all hardware tests were completed and successful. During this time the CSE team was actively monitoring the server.
- 3:30p EST We observed prolonged stability on the Core1-atl server and removed the 503, allowing calls and device registrations to return to the Core1-atl server.
- 3:33p EST An announcement was put out indicating that the 503 was removed from the Atlanta server and that we were observing device registrations and calls returning to the server successfully.
- 4:19p EST An announcement was put out indicating that the server was remaining stable and that we continued to actively monitor it.
- 8:26p EST The server remained stable, and we confirmed that the alerting systems were actively monitoring it.
July 17, 2024
- 8:30a EST We observed continued stability with no alarms going off overnight.
- 6:00p EST We confirmed system stability and put out an announcement indicating that the systems were working as intended and remained stable.
Root Cause
A review of system logs and hardware testing determined that the B12 DIMM slot on the server's mainboard had failed, causing memory corruption and system instability.
Removing the B12 and A12 memory modules allowed the system to stabilize and be returned to service.
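As a general illustration of how failing memory modules can surface before a crash (not the exact tooling used during this incident), the sketch below reads the corrected and uncorrected ECC error counters that the Linux EDAC driver exposes under /sys/devices/system/edac. Availability of these counters depends on the platform and driver, so treat this as a sketch rather than the team's actual procedure.

    # Hedged sketch: report ECC memory error counters from the Linux EDAC
    # driver. Paths exist only when the EDAC driver is loaded on the host.
    from pathlib import Path

    EDAC_ROOT = Path("/sys/devices/system/edac/mc")

    def read_count(path: Path) -> int:
        """Return a counter value, or 0 if the attribute is missing/unreadable."""
        try:
            return int(path.read_text().strip())
        except (OSError, ValueError):
            return 0

    def main() -> None:
        if not EDAC_ROOT.exists():
            print("EDAC driver not loaded; no ECC counters to report.")
            return
        for mc in sorted(EDAC_ROOT.glob("mc*")):
            ce = read_count(mc / "ce_count")  # corrected (ECC-recovered) errors
            ue = read_count(mc / "ue_count")  # uncorrected errors -- crash risk
            note = "  <-- investigate/replace DIMMs" if (ce or ue) else ""
            print(f"{mc.name}: corrected={ce} uncorrected={ue}{note}")

    if __name__ == "__main__":
        main()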
Future Preventative Action
Additional items identified to facilitate a more seamless failover include the following; a brief sketch of SRV-based DNS lookup follows the list.
- Fanvil - Adding a system default to instruct the phones to utilize SRV records instead of an A record for registration.
- Grandstream - Updating the DNS failover setting to send calls to the server the device is registered to, rather than to the device's primary server when that server is in a failed/maintenance state.
- Snom - Adding a system default to explicitly set SRV lookup for device registration.
- Yealink - Investigating possible optimizations to ensure failover both on server unreachability and on 503 response.
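The vendor items above center on resolving the registration target through DNS SRV records, which carry priority and weight and therefore give devices an ordered failover list, rather than a single A record. The sketch below illustrates the difference using the third-party dnspython library and a hypothetical domain; it is not vendor configuration, just the lookup behavior those settings rely on.

    # Hedged sketch: why SRV records help failover. "example-voip.com" is a
    # hypothetical domain used only for illustration.
    import dns.resolver  # third-party: pip install dnspython

    DOMAIN = "example-voip.com"  # hypothetical registration domain

    def show_a_record(domain: str) -> None:
        # A record: bare address(es), no ordering or failover hints.
        for rr in dns.resolver.resolve(domain, "A"):
            print(f"A    {domain} -> {rr.address}")

    def show_srv_records(domain: str) -> None:
        # SRV records: priority/weight give devices an ordered failover list.
        answers = dns.resolver.resolve(f"_sip._udp.{domain}", "SRV")
        for rr in sorted(answers, key=lambda r: (r.priority, -r.weight)):
            print(f"SRV  prio={rr.priority} weight={rr.weight} "
                  f"{rr.target.to_text()}:{rr.port}")

    if __name__ == "__main__":
        show_a_record(DOMAIN)
        show_srv_records(DOMAIN)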