Duration: Recurring over approximately one week, with individual episodes lasting 30-90 minutes before self-recovering.
Impact: During episodes, some requests to sites hosted on CloudCannon returned error responses. Both static assets and HTML pages were affected. Sites recovered without manual intervention, but the disruption recurred throughout the week.
Over the past week, a significant increase in bot and crawler traffic drove higher-than-usual request volumes to our file serving infrastructure. This elevated load exposed a weakness in how our application managed database connections, leading to periodic service degradation.
Our file serving layer uses a pool of database connections to resolve site configuration for each incoming request. Under the increased traffic load, a combination of network conditions between our application servers and the database caused some of these connections to silently break mid-use.
Under normal traffic levels, the small number of affected connections would have had minimal impact. However, the sustained increase in request volume meant the connection pool was under constant pressure, and broken connections accumulated faster than the system could recover from them. Once enough dead connections built up, the pool became saturated, and new requests could not be served.
The service self-recovered each time as the underlying connections were eventually recycled, but the recovery was slower than acceptable.
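The accumulation dynamic above can be illustrated with a toy model: at the same per-request breakage rate, higher traffic breaks connections faster than a fixed recycling rate can replace them, until no healthy connections remain. This is a minimal sketch for illustration only, not CloudCannon's actual code; all names are hypothetical.

```python
class Pool:
    """Toy model of a fixed-size connection pool with no health checks."""

    def __init__(self, size):
        self.healthy = size  # connections believed usable
        self.dead = 0        # silently broken connections still occupying pool slots

    def handle_request(self, breaks):
        """Serve one request; `breaks` marks a connection silently dying mid-use."""
        if self.healthy == 0:
            return False     # pool saturated: the request cannot be served
        if breaks:
            self.healthy -= 1
            self.dead += 1
        return True

    def recycle(self, rate):
        """Background recycling slowly returns dead connections to service."""
        recovered = min(self.dead, rate)
        self.dead -= recovered
        self.healthy += recovered


def simulate(pool, requests_per_tick, break_every, ticks):
    """Count failed requests when one in every `break_every` requests
    breaks its connection and one connection is recycled per tick."""
    failed = n = 0
    for _ in range(ticks):
        for _ in range(requests_per_tick):
            n += 1
            if not pool.handle_request(breaks=(n % break_every == 0)):
                failed += 1
        pool.recycle(rate=1)
    return failed


# Normal traffic: one breakage per tick is matched by recycling, so no failures.
low = simulate(Pool(20), requests_per_tick=10, break_every=10, ticks=50)
# Sustained high traffic: ten breakages per tick outpace recycling; the pool saturates.
high = simulate(Pool(20), requests_per_tick=100, break_every=10, ticks=50)
```

At normal volume the failure count stays at zero; at ten times the volume, with the same one-in-ten breakage rate, the pool empties within a few ticks and most subsequent requests fail.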
The database connection pool lacked client-side safeguards for detecting and recovering from broken connections. Under the increased traffic volume, this gap was exposed more frequently and with greater impact.
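Two common client-side safeguards of this kind are validating a connection before handing it out and retiring connections past a maximum lifetime. The sketch below assumes a driver exposing `connect()` and a cheap `ping()` health check; these names are placeholders, not a specific library's API or CloudCannon's implementation.

```python
import time


class GuardedPool:
    """Sketch of a pool with validate-before-use and bounded connection lifetime."""

    def __init__(self, connect, size, max_lifetime=300.0):
        self._connect = connect
        self._max_lifetime = max_lifetime
        # Each idle entry records when the connection was opened.
        self._idle = [(connect(), time.monotonic()) for _ in range(size)]

    def acquire(self):
        while self._idle:
            conn, born = self._idle.pop()
            # Safeguard 1: retire connections past their maximum lifetime,
            # so long-lived connections cannot quietly rot in the pool.
            if time.monotonic() - born > self._max_lifetime:
                continue
            # Safeguard 2: cheap health check before use, so a silently
            # broken connection is discarded instead of failing a request.
            if conn.ping():
                return conn, born
        # No healthy idle connection: open a fresh one rather than failing.
        return self._connect(), time.monotonic()

    def release(self, conn, born):
        self._idle.append((conn, born))
```

With these checks, a connection that broke mid-use is caught and replaced at the next acquisition instead of being handed back to a request.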
We have deployed a series of changes to harden the connection pool and ensure rapid recovery from broken connections.
We are also working on broader architectural improvements to reduce per-request database dependency through better caching and conditional responses, which will provide additional resilience under high traffic conditions.
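One way such caching reduces per-request database dependency is a short-TTL cache in front of the configuration lookup, so a burst of crawler traffic costs one database round trip per site per TTL window rather than one per request. This is an illustrative sketch under that assumption; the names and TTL are hypothetical, not CloudCannon's architecture.

```python
import time


class ConfigCache:
    """Sketch of a short-TTL cache in front of a per-request config lookup."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self._fetch = fetch    # fallback lookup, e.g. the database query
        self._ttl = ttl
        self._clock = clock
        self._entries = {}     # hostname -> (config, fetched_at)

    def get(self, hostname):
        hit = self._entries.get(hostname)
        if hit is not None and self._clock() - hit[1] < self._ttl:
            return hit[0]      # fresh enough: no database round trip
        config = self._fetch(hostname)
        self._entries[hostname] = (config, self._clock())
        return config
```

Even a TTL of a few seconds collapses a crawler's repeated hits on one hostname into a single database query, easing pressure on the connection pool during traffic spikes.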
We have deployed fixes that we believe address the root cause and are actively monitoring. We will continue to update this page as we confirm the effectiveness of these changes over the coming days.