Summary of the Incident
On December 27, 2024, at 6:29 AM NZT, a customer reported that form submissions on their website were not being processed. The problem was later confirmed to have started three days earlier.
Escalation of the incident followed promptly:
- 6:41 AM: The issue was escalated to the CTO.
- 8:17 AM: The CTO escalated the ticket to the on-call engineer, who began the investigation.
Root Cause Analysis
The investigation revealed that the failure originated from an AWS ECS task update for the hooks-worker service:
- The ECS hooks-worker task failed to initialize following an AWS update/restart.
- The failure was due to a CannotPullContainerError, indicating that the required image in the Elastic Container Registry (ECR) was no longer available due to lifecycle policy limitations.
Relevant log:
CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref ...: not found
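The failure mode above (a task definition pointing at an image that ECR has already expired) can be checked for ahead of a restart. The following is a minimal sketch, not the tooling used during the incident. It assumes the images live in private ECR repositories in the same account, and the helper names parse_ecr_image and missing_images are hypothetical. The ECR client is passed in so the logic can be exercised without AWS access.

```python
def parse_ecr_image(uri):
    """Split an ECR image URI into (repository name, ECR imageId dict)."""
    # Everything before the first "/" is the registry host.
    _registry, _, path = uri.partition("/")
    if "@" in path:
        repo, _, digest = path.partition("@")
        return repo, {"imageDigest": digest}
    repo, _, tag = path.partition(":")
    return repo, {"imageTag": tag or "latest"}


def missing_images(task_definition, ecr_client):
    """Return image URIs in a task definition that ECR can no longer resolve."""
    missing = []
    for container in task_definition["containerDefinitions"]:
        uri = container["image"]
        if ".dkr.ecr." not in uri:
            continue  # not an ECR image, nothing to check here
        repo, image_id = parse_ecr_image(uri)
        response = ecr_client.batch_get_image(
            repositoryName=repo, imageIds=[image_id]
        )
        # ECR reports images it cannot find in the "failures" list.
        if response.get("failures"):
            missing.append(uri)
    return missing


# Usage against real AWS (requires boto3 and credentials; the task
# definition family name "hooks-worker" is an assumption):
#   import boto3
#   task_def = boto3.client("ecs").describe_task_definition(
#       taskDefinition="hooks-worker")["taskDefinition"]
#   print(missing_images(task_def, boto3.client("ecr")))
```

Run on a schedule or before planned maintenance, a check like this would surface a dangling image reference while the old task is still running, rather than when a restart fails.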
Immediate Actions Taken
- A new deployment of the hooks-worker pipeline using an updated image was approved and executed.
- ECS tasks were scaled from 1 to 4 to process the backlog of approximately 7,770 messages.
- By 8:56 AM, the backlog was cleared, and the ECS tasks were scaled back to 1.
- Additional services potentially impacted by similar issues were redeployed with updated task definitions by 11:18 AM.
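The scaling decision above can be sanity-checked with simple arithmetic. This is a rough sketch under stated assumptions: it treats the drain window as the 30 minutes between the 8:26 AM deployment approval and the 8:56 AM clearance, assumes all four tasks ran for that whole window, and ignores messages that arrived in the meantime. The function names are hypothetical.

```python
import math


def per_task_rate(messages, tasks, minutes):
    """Average messages handled per task per minute over a drain window."""
    return messages / (tasks * minutes)


def tasks_needed(backlog, rate_per_task, target_minutes):
    """Smallest task count that drains `backlog` within `target_minutes`."""
    return max(1, math.ceil(backlog / (rate_per_task * target_minutes)))


# Figures from this incident: ~7,770 messages, 4 tasks, ~30 minutes (assumed).
observed = per_task_rate(messages=7770, tasks=4, minutes=30)
print(f"{observed:.2f} messages per task per minute")

# Tasks required to clear the same backlog in half the time, at that rate.
print(tasks_needed(backlog=7770, rate_per_task=observed, target_minutes=15))
```

The derived rate is only an estimate of average throughput during recovery, so it is best used to pick a starting task count and then adjusted against the live queue depth.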
Long-Term Improvements
- ECR Repository Lifecycle Policies
  - Issue: The ECR lifecycle policy retained only 30 images, which was too few once additional ARM image builds were being pushed, so images still in use were cleaned up prematurely.
  - Actions:
    - Tripled the number of images retained by the ECR lifecycle policy.
    - Planned a review of the ARM build process to reduce noise and optimize repository usage.
- Alarm Noise Reduction
  - Issue: Alerts for the oldest message age were received but overlooked due to excessive noise in the alert channel.
  - Actions:
    - Enhanced monitoring to ensure critical alerts are highlighted.
    - Improved visibility of alarm graphs and on-call escalation processes.
- OpsGenie Alert Escalation Rules
  - Issue: A prior unresolved alert from October prevented escalation of new alerts.
  - Actions:
    - Updated the rules to ensure alerts are resolved correctly by CloudWatch signals.
    - Conducted a full audit of existing alert configurations to prevent recurrence.
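As an illustration of the lifecycle policy change, the sketch below builds a count-based ECR lifecycle policy document. It is not the policy actually deployed: it assumes that tripling the 30-image retention means keeping 90 images, that a single rule covering all images is used, and that the repository is named hooks-worker. The helper name build_lifecycle_policy is hypothetical.

```python
import json


def build_lifecycle_policy(max_images):
    """Build an ECR lifecycle policy that expires images beyond `max_images`."""
    return {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Keep only the most recent {max_images} images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": max_images,
                },
                "action": {"type": "expire"},
            }
        ]
    }


# Assumption: the previous limit of 30 images, tripled.
policy_text = json.dumps(build_lifecycle_policy(30 * 3), indent=2)
print(policy_text)

# Applying it with boto3 (requires credentials; repository name assumed):
#   import boto3
#   boto3.client("ecr").put_lifecycle_policy(
#       repositoryName="hooks-worker", lifecyclePolicyText=policy_text)
```

With a count-based rule, every pushed image counts toward the limit. If each build pushes one image per architecture, the same limit covers proportionally fewer builds, which is worth factoring in when choosing the number.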
Key Lessons Learned
- Lifecycle Management: Ensure sufficient ECR lifecycle retention for critical services, particularly when introducing new build artifacts.
- Alert Management: Reduce alert fatigue by refining notification channels and escalation policies.
- Proactive Monitoring: Implement additional safeguards to detect and address issues before they escalate to customer impact.
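One way to act on the alerting lessons is an alarm on message age that notifies on both the alarm and the recovery transition, so that the paging tool can close the alert when the queue recovers. The sketch below only builds the parameters for CloudWatch's put_metric_alarm. It assumes the backlog sits in an SQS queue (the report does not name the queue technology), and the queue name, threshold, and SNS topic ARN are hypothetical placeholders.

```python
def oldest_message_alarm(queue_name, threshold_seconds, topic_arn):
    """Build put_metric_alarm parameters for an SQS oldest-message-age alarm."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # evaluate in 5-minute windows
        "EvaluationPeriods": 3,   # require 15 minutes above the threshold
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],  # notify when the alarm fires
        "OKActions": [topic_arn],     # notify on recovery so the alert can close
    }


# Hypothetical values for illustration only.
params = oldest_message_alarm(
    queue_name="hooks-worker-queue",
    threshold_seconds=900,
    topic_arn="arn:aws:sns:us-east-1:123456789012:oncall-critical",
)
print(params["AlarmName"])

# Creating the alarm with boto3 (requires credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Routing this alarm to a dedicated topic rather than a shared channel is one way to keep it from being lost in unrelated notifications.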
Timeline
- December 24, 2024, 5:39 AM: ECS hooks-worker task update/restart initiated.
- December 27, 2024, 6:29 AM: Incident reported by the customer.
- December 27, 2024, 8:26 AM: Deployment approved and backlog processing initiated.
- December 27, 2024, 8:56 AM: Backlog cleared.
- December 27, 2024, 11:18 AM: Additional impacted services redeployed.
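The intervals implied by the timeline can be computed directly. This small sketch assumes all timestamps are in the same local time zone (NZT) and that the failure began at the 5:39 AM update/restart.

```python
from datetime import datetime

failed = datetime(2024, 12, 24, 5, 39)    # task update/restart initiated
reported = datetime(2024, 12, 27, 6, 29)  # customer report
cleared = datetime(2024, 12, 27, 8, 56)   # backlog cleared


def fmt(delta):
    """Format a timedelta as whole hours and minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


print("Failure to customer report:", fmt(reported - failed))   # 72h 50m
print("Report to backlog cleared: ", fmt(cleared - reported))  # 2h 27m
print("Total processing outage:   ", fmt(cleared - failed))    # 75h 17m
```

Most of the outage elapsed before anyone was aware of it, which is why the alerting changes carry as much weight as the lifecycle policy fix.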
Closing Remarks
We sincerely apologize for the inconvenience caused by this incident. We are committed to learning from this event and implementing the necessary changes to prevent similar issues in the future.