Increased rate of 503 errors

Incident Report for CloudCannon

Postmortem

At 9:34 pm (24th February NZT), our servers reported a Redis out-of-memory error. This error caused issues across different areas of the application, with the primary issue being 503 errors when uncached sites were requested. Our team started an investigation at 9:44 pm, and by 10 pm, we had identified the Redis issue.

Our immediate solution was to increase the memory allocated to the Redis machine. A larger machine was made available at 10:14 pm. Once the servers returned to normal, clients that had previously received errors started reconnecting. This created a large influx of connections, increasing the load on our servers, so we temporarily scaled up our servers to handle the extra load. Our team continued to monitor performance and deemed the service back to normal by 10:57 pm. We kept the increased memory allocation in place until we could implement a longer-term solution.

Upon reviewing our Redis database, we found a number of databases that were not expiring correctly. Some of these databases are now expiring correctly, and existing items have been updated with the same TTL. The remaining databases, which contained larger keys, have been migrated to use S3 files. This creates a sustainable solution in which memory usage maintains itself without manual intervention. At 4:47 pm (25th February NZT), we returned the Redis server to its original size.
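As an illustration of the TTL update described above, the following is a minimal sketch of how existing items without an expiry could be given one. It is not the procedure CloudCannon ran: the function name `backfill_ttl`, the `cache:*` key pattern, and the one-hour TTL are all hypothetical. The client is assumed to expose `scan_iter`, `ttl`, and `expire` with the same behaviour as the redis-py client; a small in-memory stand-in is included so the sketch can be run without a Redis server.

```python
import fnmatch


def backfill_ttl(client, pattern, ttl_seconds):
    """Set a TTL on every key matching `pattern` that currently has no expiry.

    Returns the number of keys that were updated.
    """
    updated = 0
    # SCAN walks the keyspace incrementally instead of blocking the server like KEYS.
    for key in client.scan_iter(match=pattern, count=500):
        # TTL reports -1 for a key that exists but has no expiry set.
        if client.ttl(key) == -1 and client.expire(key, ttl_seconds):
            updated += 1
    return updated


class InMemoryClient:
    """Hypothetical stand-in for a Redis client, for local dry runs only."""

    def __init__(self, ttls):
        # Maps key -> remaining TTL in seconds, or -1 for "no expiry".
        self.ttls = dict(ttls)

    def scan_iter(self, match="*", count=None):
        return [k for k in self.ttls if fnmatch.fnmatchcase(k, match)]

    def ttl(self, name):
        # Mirrors Redis: -2 means the key does not exist.
        return self.ttls.get(name, -2)

    def expire(self, name, time):
        if name not in self.ttls:
            return False
        self.ttls[name] = time
        return True


if __name__ == "__main__":
    demo = InMemoryClient({"cache:a": -1, "cache:b": 120, "session:c": -1})
    print(backfill_ttl(demo, "cache:*", 3600), demo.ttls)
```

Keys that already have an expiry are left alone, so the function can be rerun safely. Against a real server, the same function could be passed a redis-py client in place of the stand-in.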

Posted Feb 25, 2025 - 21:40 UTC

Resolved

This incident has been resolved.

Posted Feb 24, 2025 - 09:59 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 24, 2025 - 09:21 UTC

Identified

The issue has been identified as a Redis cache under load. We are working to increase the resources allocated.

Posted Feb 24, 2025 - 09:01 UTC

Investigating

We are currently investigating this issue.

Posted Feb 24, 2025 - 08:51 UTC

This incident affected: Host (Hosting (sites.cloudcannon.com)).