At 9:34 pm (24th February NZT), our servers reported a Redis out-of-memory error. This error caused issues across different areas of the application, with the primary issue being 503 errors when uncached sites were requested. Our team started an investigation at 9:44 pm, and by 10 pm, we had identified the Redis issue.
Our immediate solution was to increase the memory allocated to the Redis machine. A larger machine was made available at 10:14 pm. Once the servers returned to normal, the clients that had previously errored started reconnecting. This created a large influx of connections, increasing the load on our servers, so we temporarily scaled up our servers to handle the extra load. Our team continued to monitor performance and deemed the service back to normal by 10:57 pm. CloudCannon maintained the increased memory allocation until we could implement a longer-term solution.
Upon reviewing our Redis database, we found a number of databases that were not expiring correctly. Some of these databases are now expiring correctly, and their existing items have been updated with the same TTL. The remaining databases, which contained larger keys, have been migrated to use S3 files. This is a sustainable solution because memory usage now maintains itself without manual cleanup. At 4:47 pm (25th February NZT), we returned the Redis server to its original size.
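To illustrate what updating existing items with a TTL can look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not necessarily the script used during this incident. It assumes a redis-py style client object exposing `scan_iter`, `ttl` and `expire`; the function name `backfill_ttl`, the key pattern and the TTL value are hypothetical.

```python
def backfill_ttl(client, pattern, ttl_seconds):
    """Set a TTL on every key matching `pattern` that has no expiry.

    `client` is assumed to behave like a redis-py client. Returns the
    number of keys that were given a TTL.
    """
    updated = 0
    # SCAN walks the keyspace incrementally, so it does not block the
    # server for the whole traversal the way KEYS can.
    for key in client.scan_iter(match=pattern, count=1000):
        # TTL returns -1 for a key that exists but has no expiry set.
        if client.ttl(key) == -1:
            # EXPIRE reports whether the timeout was actually applied.
            if client.expire(key, ttl_seconds):
                updated += 1
    return updated
```

With redis-py installed, a call might look like `backfill_ttl(redis.Redis(), "cache:*", 86400)`, where the pattern and the one-day TTL are placeholder values. Keys that already have a TTL are left untouched, so the function can be re-run safely.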