Syncing downtime
Incident Report for CloudCannon
Postmortem

At 5:30 am NZT, our syncer controller service restarted, causing downtime for all syncing across the app until 8:40 am NZT. The cause was an upcoming release of a microservice going out ahead of schedule and without approval. A bug in the release pipeline triggered the release with no human interaction.

The CloudCannon infrastructure is managed with a combination of AWS CodePipeline and CloudFormation scripts. This helps us keep changes controlled and reproducible. When a service is first set up on the platform, CloudFormation scripts configure it to use the latest Docker image. When a deploy from a CodePipeline occurs, the service is pinned to a Docker image that matches the commit hash of the current deploy (e.g. ag5123s). The latest tag on the Docker image is updated whenever a new image is created, which happens after every git push that passes unit tests.

As part of a recent update, the syncer controller was recreated using CloudFormation, which reset its Docker tag from a pinned commit to latest. Because this service requires few updates, it had not since gone through a CodePipeline deploy that would pin it to a commit build again. Later, the syncer controller encountered a situation that caused it to restart and pull the new latest image. This image contained breaking changes from the staging environment, which caused the syncing downtime. These changes were due to go out later in the week with a careful migration process. Running the migration once the issue was detected resolved it, and syncing returned to normal.
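The sequence above can be illustrated with a small simulation. This is only a sketch of the general mechanism, not our actual tooling: the `Registry` and `Service` classes and the commit hashes are hypothetical names made up for the example.

```python
class Registry:
    """Toy image registry: maps a tag to the commit the image was built from."""

    def __init__(self):
        self.tags = {}

    def push(self, commit):
        # Each new image is tagged with its commit hash, and the
        # "latest" tag is moved to point at it.
        self.tags[commit] = commit
        self.tags["latest"] = commit

    def resolve(self, tag):
        return self.tags[tag]


class Service:
    """Toy service that resolves its configured tag every time it starts."""

    def __init__(self, registry, tag="latest"):
        # A freshly created service starts on "latest" (assumed default).
        self.registry = registry
        self.tag = tag
        self.running = registry.resolve(tag)

    def deploy(self, commit):
        # A pipeline deploy pins the service to a specific commit build.
        self.tag = commit
        self.running = self.registry.resolve(commit)

    def restart(self):
        # A restart pulls whatever the configured tag points at right now.
        self.running = self.registry.resolve(self.tag)


registry = Registry()
registry.push("aaa111")          # production build

pinned = Service(registry)
pinned.deploy("aaa111")          # pinned by a pipeline deploy

recreated = Service(registry)    # recreated, never deployed: still on "latest"

registry.push("bbb222")          # unreleased build moves "latest"

pinned.restart()
recreated.restart()
print(pinned.running)     # aaa111
print(recreated.running)  # bbb222
```

The pinned service keeps running the approved build after a restart, while the recreated service silently picks up the unreleased one.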

The initial configuration for the syncer controller service

Syncer controller event log from the time of the incident

To prevent this kind of error from recurring in the short term, we have identified all other services that have been recently updated but have not had a production deployment since then, and have manually pinned those services to the latest production image. In the long term, we will be reworking our deployment process to use separate production and staging tagged images, instead of the latest tag, so that newly set up services will be consistent with their environment by default.
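A check of this kind could be automated along the following lines. This is a minimal sketch under stated assumptions, not the process we used: the service names, image references, and helper functions are hypothetical, and the input is assumed to be a plain mapping of service name to Docker image reference.

```python
from typing import Dict, List


def image_tag(image_ref: str) -> str:
    """Return the tag of a Docker image reference, defaulting to 'latest'."""
    # Ignore any digest suffix (name@sha256:...).
    name = image_ref.split("@", 1)[0]
    # Only a colon in the final path segment marks a tag; an earlier
    # colon belongs to a registry port such as example.com:5000.
    last_segment = name.rsplit("/", 1)[-1]
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return "latest"


def is_pinned(image_ref: str) -> bool:
    """Treat digest references and any tag other than 'latest' as pinned."""
    if "@" in image_ref:
        return True
    return image_tag(image_ref) != "latest"


def unpinned_services(services: Dict[str, str]) -> List[str]:
    """Return the names of services whose image is not pinned, sorted."""
    return sorted(name for name, ref in services.items() if not is_pinned(ref))


# Hypothetical inventory, for illustration only.
services = {
    "syncer-controller": "example.com/syncer-controller:latest",
    "builder": "example.com:5000/builder:ag5123s",
    "api": "example.com/api",
}
print(unpinned_services(services))  # ['api', 'syncer-controller']
```

A reference with no tag at all is flagged as well, since Docker treats a missing tag as latest.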

Posted May 19, 2022 - 02:13 UTC

Resolved
Syncing not working
Posted May 18, 2022 - 17:30 UTC