2020-09-17 - Portal, API, SNAPmobile Outage (Resolved)
Event Description: ACS001 HDD Full
Event Start Time: 2020-09-17 10:31 EST
Event End Time: 2020-09-17 11:58 EST
RFO Issue Date: 2020-09-18
Affected Services
Portals and SNAPmobile
- Users were unable to log in to SNAPmobile and Portal.
Event Summary
On September 17th, 2020, Portal and SNAPmobile were unavailable. Inbound and outbound calls were processed normally. Other mobile apps and services were not affected.
Event Timeline (All times 24-hour format, EST)
September 17th, 2020
- 10:31 NOC identified ACS portals not responding
- 10:32 Tier 3 began reviewing logs
- 10:33 Portals redirected to alternate servers
- 10:35 Partner notifications sent out
- 10:40 Restarted web services, restoring access to portals, but users were still unable to log in
- 10:43 Tier 4 engaged
- 11:10 Identified errant databases causing corruption
- 11:15 Engaged vendor support
- 11:30 Moved DB entries and logs that were causing corruption. Restored services
- 11:34 Confirmed Portals and SNAPmobile were restored to full functionality
Root Cause Analysis
One of the databases required for portal and API functionality became overloaded and generated log files that rapidly filled the available storage.
Future Preventative Action
Configuration changes were made to offload older tables, and the frequency of pruning was adjusted. Additional monitors were created to notify our NOC earlier if these symptoms arise again.
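
As an illustration of the kind of early-warning monitor described above, the following is a minimal sketch of a disk-usage check in Python. It is not the monitor deployed by our NOC. The threshold values, the function names, and the checked path are all assumptions chosen for the example.

```python
import shutil

# Hypothetical thresholds for illustration; the values actually used
# by the NOC monitors are not stated in this report.
WARN_PERCENT = 80.0
CRIT_PERCENT = 90.0


def classify_usage(used_percent, warn=WARN_PERCENT, crit=CRIT_PERCENT):
    """Map a disk-usage percentage to an alert level."""
    if used_percent >= crit:
        return "CRITICAL"
    if used_percent >= warn:
        return "WARNING"
    return "OK"


def check_disk(path="."):
    """Return (alert level, percent used) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    return classify_usage(used_percent), used_percent


if __name__ == "__main__":
    # The path is an assumption; point it at the volume holding the
    # database and its log files.
    level, pct = check_disk(".")
    print(f"{level}: {pct:.1f}% used")
```

Run on a schedule, a check like this would raise a warning while storage is still available, rather than after the volume is already full.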