Form Submission Processing Delays

Incident Report for CloudCannon

Postmortem

Summary of the Incident

On December 27, 2024, at 6:29 AM NZT, a customer reported that form submissions on their website were not being processed. The problem was later confirmed to have started three days earlier, on December 24.

Escalation of the incident followed promptly:

6:41 AM: The issue was escalated to the CTO.
8:17 AM: The CTO escalated the ticket to the on-call engineer, who began the investigation.

Root Cause Analysis

The investigation revealed that the failure originated from an AWS ECS task update for the hooks-worker service:

The ECS hooks-worker task failed to initialize following an AWS update/restart.
The failure was a CannotPullContainerError: the image the task required was no longer available in the Elastic Container Registry (ECR) because the repository's lifecycle policy had already removed it.

Relevant log:

CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref ...: not found
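
The failure mode above is a task definition that still points at an image the registry no longer holds. A minimal sketch of a pre-restart check for that condition follows. The function name, service names, and tags are hypothetical, and in practice the two inputs would have to be gathered from the ECS and ECR APIs rather than hard-coded.

```python
from typing import Dict, List, Set


def find_unpullable_images(task_images: Dict[str, str], registry_refs: Set[str]) -> List[str]:
    """Return the services whose pinned image reference is absent from the registry."""
    return sorted(
        service for service, ref in task_images.items() if ref not in registry_refs
    )


# Hypothetical data: image reference pinned by each service's task definition.
task_images = {
    "hooks-worker": "hooks-worker:build-101",
    "other-service": "other-service:build-140",
}

# Hypothetical data: image references that still exist after lifecycle cleanup.
registry_refs = {"hooks-worker:build-131", "other-service:build-140"}

missing = find_unpullable_images(task_images, registry_refs)
print(missing)
```

A service in the returned list would fail with a pull error on its next restart, even though it may be running normally at the moment of the check.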

Immediate Actions Taken

A new deployment of the hooks-worker pipeline using an updated image was approved and executed.
ECS tasks were scaled from 1 to 4 to process the backlog of approximately 7,770 messages.
By 8:56 AM, the backlog was cleared, and the ECS tasks were scaled back to 1.
Additional services potentially impacted by similar issues were redeployed with updated task definitions by 11:18 AM.
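
As a rough illustration of why the tasks were scaled up, the figures in this report allow a back-of-the-envelope throughput estimate. The sketch below assumes all four tasks ran for the whole window between 8:26 AM and 8:56 AM, that new submissions during that window were negligible, and that throughput scales linearly with task count. None of these assumptions is confirmed by the report, and the function name is illustrative.

```python
from datetime import datetime

BACKLOG_MESSAGES = 7770  # approximate backlog from the report
TASKS = 4                # task count during backlog processing

# Times taken from the report's timeline (NZT).
started = datetime(2024, 12, 27, 8, 26)
cleared = datetime(2024, 12, 27, 8, 56)

minutes = (cleared - started).total_seconds() / 60
total_rate = BACKLOG_MESSAGES / minutes  # messages per minute, all tasks
per_task_rate = total_rate / TASKS       # assumes linear scaling


def drain_minutes(backlog: int, tasks: int, rate_per_task: float) -> float:
    """Estimate minutes needed to clear a backlog at a fixed per-task rate."""
    return backlog / (tasks * rate_per_task)


single_task_minutes = drain_minutes(BACKLOG_MESSAGES, 1, per_task_rate)
print(f"{total_rate:.0f} msg/min overall, {per_task_rate:.2f} msg/min per task")
print(f"a single task would need about {single_task_minutes:.0f} minutes")
```

Under these assumptions, a single task would have taken roughly four times as long to clear the same backlog.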

Long-Term Improvements

ECR Repository Lifecycle Policies

Issue: The ECR lifecycle policy retained only 30 images. The additional ARM image builds filled that quota faster than before, so the image still required by hooks-worker was cleaned up prematurely.
Actions:
- Increased the ECR lifecycle policy retention threefold.
- Planned a review of the ARM build process to reduce noise and optimize repository usage.
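
To illustrate the retention change, the sketch below builds a count-based ECR lifecycle policy document and shows how extra images per deployment shrink the number of deployments a fixed image count covers. The value 90 assumes that "threefold" means three times the original 30. The figure of two images per deployment (one per architecture) is also an assumption, and the function names are illustrative. The actual policy used is not shown in this report.

```python
import json


def lifecycle_policy(max_images: int) -> str:
    """Build an ECR lifecycle policy that expires images beyond the newest max_images."""
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Keep the most recent {max_images} images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": max_images,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    return json.dumps(policy, indent=2)


def deploys_retained(max_images: int, images_per_deploy: int) -> int:
    """Number of whole deployments whose images all fit within the retention count."""
    return max_images // images_per_deploy


# Assumed: adding ARM builds doubles the images pushed per deployment.
print(deploys_retained(30, 1))  # original limit, one image per deployment
print(deploys_retained(30, 2))  # original limit, with ARM builds added
print(deploys_retained(90, 2))  # assumed new limit, with ARM builds added
print(lifecycle_policy(90))
```

The point of the arithmetic is that a count-based rule measures images, not deployments, so any change that adds artifacts per build silently shortens the retention window for long-running services.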

Alarm Noise Reduction

Issue: Alerts for the oldest message age were received but overlooked due to excessive noise in the alert channel.
Actions:
- Enhanced monitoring to ensure critical alerts are highlighted.
- Improved visibility of alarm graphs and on-call escalation processes.

OpsGenie Alert Escalation Rules

Issue: A prior unresolved alert from October prevented escalation of new alerts.
Actions:
- Updated the rules to ensure alerts are resolved correctly by CloudWatch signals.
- Conducted a full audit of existing alert configurations to prevent recurrence.
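
One plausible mechanism for a stale alert blocking new escalations is alias-based deduplication: while an alert with a given alias remains open, later triggers with the same alias are folded into it instead of creating a new alert. The sketch below is a simplified model of that behavior, not the OpsGenie API. The class, method names, and alias are hypothetical, and the report does not confirm that this was the exact mechanism.

```python
from typing import Dict


class AlertStore:
    """Simplified model of alias-based alert deduplication (an assumption)."""

    def __init__(self) -> None:
        self._open_counts: Dict[str, int] = {}  # alias -> times triggered while open
        self.escalations = 0

    def trigger(self, alias: str) -> bool:
        """Handle an ALARM signal. Returns True only if a new alert was escalated."""
        if alias in self._open_counts:
            # An open alert with this alias exists: count it, but do not escalate.
            self._open_counts[alias] += 1
            return False
        self._open_counts[alias] = 1
        self.escalations += 1
        return True

    def resolve(self, alias: str) -> None:
        """Handle an OK signal by closing the open alert, if any."""
        self._open_counts.pop(alias, None)


# Without a resolve signal, the later trigger is swallowed.
stale = AlertStore()
october = stale.trigger("oldest-message-age")
december = stale.trigger("oldest-message-age")
print(october, december)

# With the alert resolved in between, the later trigger escalates again.
healthy = AlertStore()
healthy.trigger("oldest-message-age")
healthy.resolve("oldest-message-age")
december_after_resolve = healthy.trigger("oldest-message-age")
print(december_after_resolve)
```

In this model, the fix described above corresponds to making sure `resolve` is reliably called when the monitoring signal returns to a healthy state.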

Key Lessons Learned

Lifecycle Management: Ensure sufficient ECR lifecycle retention for critical services, particularly when introducing new build artifacts.
Alert Management: Reduce alert fatigue by refining notification channels and escalation policies.
Proactive Monitoring: Implement additional safeguards to detect and address issues before they escalate to customer impact.

Timeline

December 24, 2024, 5:39 AM: ECS hooks-worker task update/restart initiated.
December 27, 2024, 6:29 AM: Incident reported by a customer.
December 27, 2024, 8:26 AM: Deployment approved and backlog processing initiated.
December 27, 2024, 8:56 AM: Backlog cleared.
December 27, 2024, 11:18 AM: Additional impacted services redeployed.

Closing Remarks

We sincerely apologize for the inconvenience caused by this incident. We are committed to learning from this event and implementing the necessary changes to prevent similar issues in the future.

Posted Jan 22, 2025 - 20:29 UTC

Resolved

The incident has been resolved and all backlog has been processed.

Posted Dec 26, 2024 - 20:35 UTC

Monitoring

The backlog of data is being processed now. We are continuing to monitor it.

Posted Dec 26, 2024 - 19:56 UTC

Identified

We have identified and fixed the issue. The backlog of form submissions is being processed now.

Posted Dec 26, 2024 - 19:46 UTC

Investigating

Our form submission processing infrastructure is running behind. We are investigating the issue.

Posted Dec 26, 2024 - 19:33 UTC