Major service outage
Incident Report for iHook
Postmortem

Summary

On Sunday July 26th, 2020 we experienced a major service outage between 2020-07-26 23:06 UTC to 2020-07-27 01:08 UTC that caused:

  • 2 hours delay for scheduled task executions and related notifications
  • Login failure to admin dashboard‌

  • Permanent schedule change for tasks with longer than 2 hour interval and were scheduled to execute during this window. Cron based tasks are not affected.

Root Cause

The outage is caused by network connectivity issue between our admin dashboard servers, task execution servers and our primary database. The connectivity issue is caused by an improper VPC network configuration between the app servers and the primary database.

Lessons Learned

  • During the investigation, we found that some of our monitoring tests was wrongly configured to detect the connectivity failure in a timely manner, this has been fixed. We will further enhance our monitoring tool to help us detect the issue. Also we will setup a secondary monitoring service from a different vendor to ensure the availability and validity of our monitoring component.
  • We will require all ops operation to be tested in our staging environment prior to be executed in production environment.
Posted Jul 30, 2020 - 18:42 UTC

Resolved
Due to primary database connectivity issue, from 2020-07-26 23:06 UTC to 2020-07-27 01:08 UTC, all logins were blocked, scheduled tasks and related notifications were delayed. The services are now back to normal, root cause analysis to follow.
Posted Jul 26, 2020 - 23:00 UTC