Event Description: Inbound and outbound faxes intermittently failed to be transmitted or received.
Event Start Time: 2024-06-22 23:51 EDT
Event End Time: 2024-06-24 14:21 EDT
RFO Issue Date: 2024-06-26
Affected Services:
- Native Fax: inbound and outbound transmission of faxes.
Event Summary
Early in the major incident, outbound faxes failed with the error “30236 Gateway not responding to dial request” when users attempted to send. During the investigation, the software began reporting a second error, “Port Server Busy”, and inbound faxes were also found not to be reaching the Native Fax software. Working with our software vendor, we isolated the issue to the “Port Server” service. Multiple restarts of that service were required to restore Native Fax functionality.
Event Timeline
June 22, 2024
- 23:51 EDT - First case related to faxing failure submitted.
June 24, 2024
- 08:30 EDT - Investigation into the fax failures started for existing cases and all newly identified fax-related cases.
- 11:44 EDT - Major incident is declared.
- 11:52 EDT - First announcement posted "Major Incident Announcement: We have identified consistent failures with outbound faxing via Native Fax. This is impacting both analog and digital native fax. Outbound faxes are failing with the error "30236 Gateway not responding to dial request". Inbound faxes do not appear to be impacted at this time. Our engineering team is working with our fax vendor to investigate and determine a fix as soon as possible."
- 12:38 EDT - Our engineering team isolated the problem to a malfunctioning transmission service within the Native Fax server software.
- 12:52 EDT - Announcement posted “Major Incident Update: Outbound Fax Failure on Native Fax Our engineering team has been able to isolate the cause of the failures and is continuing to work to resolve it as quickly as possible. In order to troubleshoot, we will be restarting the transmission services on the fax server immediately. This will cause inbound faxes on Native Fax and email to fax to fail for 1-2 minutes during the reboot.”
- 13:09 EDT - Restart of all Native Fax services.
- 13:11 EDT - New error message “Port Server Busy” registered in logs.
- 13:15-13:45 EDT - Faxes began to send successfully; however, the majority still failed with “Port Server Busy”, and inbound faxes began failing intermittently with the same error. Services were restarted again, but faxes continued to generate errors.
- 13:35 EDT - Confirmed that inbound faxes were also intermittently failing due to the same “Port Server Busy” error message.
- 13:43 EDT - Announcement posted "Major Incident Update: Outbound Faxes on Native Fax Failing We were successfully able to restart transmission services on our fax server. At this time, outbound faxes are retrying. Due to the quantity of outbound faxes retrying at once, inbound & outbound faxes may receive a port busy error. Our engineering team is actively monitoring. We will provide an update as soon as queued faxes have returned to nominal levels."
- 13:57 EDT - Further escalation with the Native Fax software vendor.
- 14:03 EDT - Power cycled the Native Fax server operating system.
- 14:04 EDT - Announcement posted: "Major Incident Update: Outbound Faxes on Native Fax Failing At this time, the queued faxes are continuing to overwhelm our port services, causing outbound faxes to fail intermittently. Our team has determined that the best course of action is to restart our fax server entirely to remove the queued faxes. Inbound and outbound faxes will continue to [be] unavailable for at least 15 minutes while services restart. We apologize for the inconvenience."
- 14:05 EDT - Connected with Native Fax technical support and began troubleshooting with them.
- 14:10 EDT - NSX Port Server service restarted.
- 14:19 EDT - NSX Port Server service restarted.
- 14:21 EDT - Logs indicate successful transmissions of faxes.
- 14:33 EDT - Announcement posted: "Major Incident Update: Inbound & Outbound Faxes Failing After rebooting our fax server entirely, all services are functioning as expected. Inbound and outbound faxes are now going through successfully. Our engineers are still investigating the cause of this outage and a Major Incident Report will be posted within 48 hours at https://voipdocs.io/"
Root Cause
Both error messages, “30236 Gateway not responding to dial request” and “Port Server Busy”, were caused by the “NSX Port Server” service being unable to process requests because of memory conflicts with a NetSapiens integration module.
Future Preventative Action
We have documented a temporary fix for the error messages described above, and we will work with the Native Fax server software developer to permanently correct the integration module.
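
As an illustration only, the two error messages from this incident could be watched for in the fax server logs so that the documented temporary fix is applied sooner. The Python sketch below is hypothetical: the log line layout (a leading "YYYY-MM-DD HH:MM:SS" timestamp), the threshold, the time window, and the function name are all assumptions and do not describe the Native Fax software or our production monitoring.

```python
from collections import deque
from datetime import datetime, timedelta

# The two error strings quoted in this report. Everything else is hypothetical.
ERROR_SIGNATURES = (
    "30236 Gateway not responding to dial request",
    "Port Server Busy",
)


def should_alert(log_lines, threshold=5, window=timedelta(minutes=5)):
    """Return True if `threshold` matching errors fall within `window`.

    Assumes lines are in chronological order and that each line begins with
    a "YYYY-MM-DD HH:MM:SS" timestamp; the real log layout may differ.
    """
    recent = deque()  # timestamps of matching errors still inside the window
    for line in log_lines:
        if not any(sig in line for sig in ERROR_SIGNATURES):
            continue
        stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        recent.append(stamp)
        # Discard errors that are older than the window relative to this one.
        while stamp - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False
```

A check like this would only flag the condition for an engineer; whether to restart the service remains a manual decision under the documented temporary fix.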