Elevated 503 Error Rate on Hosting

Incident Report for CloudCannon

Postmortem

Incident: Intermittent 503 errors on hosted sites

Duration: Recurring over approximately one week, with individual episodes lasting 30-90 minutes before self-recovering.

Impact: During episodes, some requests to sites hosted on CloudCannon returned 503 error responses. Both static assets and HTML pages were affected. Sites recovered without manual intervention, but the disruption recurred throughout the week.

Summary

Over the past week, a significant increase in bot and crawler traffic drove higher-than-usual request volumes to our file serving infrastructure. This elevated load exposed a weakness in how our application managed database connections, leading to periodic service degradation.

What happened

Our file serving layer uses a pool of database connections to resolve site configuration for each incoming request. Under the increased traffic load, a combination of network conditions between our application servers and the database caused some of these connections to silently break mid-use.

Under normal traffic levels, the small number of affected connections would have had minimal impact. However, the sustained increase in request volume meant the connection pool was under constant pressure, and broken connections accumulated faster than the system could recover from them. Once enough dead connections built up, the pool became saturated, and new requests could not be served.
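The accumulation dynamic above can be sketched with a toy model: broken connections appear at some rate but are only recycled after a long OS-level timeout, so the backlog of dead connections trends toward roughly break rate × recycle time. All numbers below are illustrative, not measurements from the incident.

```python
def dead_connections(break_rate_per_min, recycle_after_min, minutes):
    """Toy model: dead connections accumulate per minute and are only
    slowly recycled. Steady state ~ break_rate * recycle_time."""
    dead = 0.0
    for _ in range(minutes):
        dead += break_rate_per_min        # new breaks this minute
        dead -= dead / recycle_after_min  # slow background recycling
    return dead

# With a pool of 20 connections and a ~60-minute recycle timeout
# (illustrative figures): a low break rate stays well under the pool
# size, while a 10x higher break rate saturates it.
low = dead_connections(break_rate_per_min=0.05, recycle_after_min=60, minutes=600)
high = dead_connections(break_rate_per_min=0.5, recycle_after_min=60, minutes=600)
```

The point is that the same recycle timeout that is harmless at low traffic becomes a saturation mechanism once the break rate rises with load.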

The service self-recovered each time as the underlying connections were eventually recycled, but the recovery was slower than acceptable.

Root cause

The database connection pool lacked client-side safeguards for detecting and recovering from broken connections. Specifically:

  • There were no client-side timeouts on database queries, so a query sent over a broken connection would wait indefinitely for a response that would never arrive.
  • TCP keepalive was not enabled, so broken connections were not detected until the operating system's own timeout expired, which could take over an hour.
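For reference, the keepalive gap is the kind of thing fixed with a few socket options on the client side. A minimal sketch in Python (option names are platform-specific, hence the guards; this is not CloudCannon's actual code):

```python
import socket

def enable_keepalive(sock, idle=30, interval=10, count=3):
    """Turn on TCP keepalive so a dead peer is detected within
    roughly idle + interval * count seconds, instead of waiting
    for the OS default (often hours)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning knobs; guarded because names vary by platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Failed probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

With these settings a silently broken connection surfaces as a socket error within about a minute, rather than blocking a pool slot until the OS gives up.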

Under the increased traffic volume, these gaps were exposed more frequently and with greater impact.

Resolution

We have deployed a series of changes to harden the connection pool and ensure rapid recovery:

  • Client-side query timeouts ensure that broken connections are detected and evicted from the pool within seconds, rather than blocking indefinitely.
  • TCP keepalive is now enabled on all database connections, allowing the operating system to actively detect broken connections.
  • Connection acquisition timeouts ensure that requests fail fast with a clear error rather than queuing indefinitely when the pool is under pressure.
  • Improved error observability now captures which specific operations are affected, enabling faster diagnosis.
  • Graceful shutdown handling prevents request drops during deployments.
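The first and third items above amount to two behaviors in the pool: evict connections that fail a health check rather than handing them out, and fail fast when no connection is available within a bound. A minimal sketch, with a stand-in connection class (illustrative only, not our production pool):

```python
import queue
import time

class FakeConn:
    """Stand-in for a real database connection (illustrative only)."""
    def __init__(self, alive=True):
        self.alive = alive
    def is_alive(self):
        # A real check would be a cheap ping with a short client-side timeout.
        return self.alive

class Pool:
    """Pool that evicts dead connections and bounds acquisition time."""
    def __init__(self, conns, acquire_timeout=1.0):
        self._q = queue.Queue()
        for conn in conns:
            self._q.put(conn)
        self.acquire_timeout = acquire_timeout

    def acquire(self):
        deadline = time.monotonic() + self.acquire_timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                # Fail fast with a clear error rather than queuing indefinitely.
                raise TimeoutError("connection pool exhausted")
            try:
                conn = self._q.get(timeout=remaining)
            except queue.Empty:
                raise TimeoutError("connection pool exhausted")
            if conn.is_alive():
                return conn
            # Broken connection: evict it (drop the reference) and retry.

    def release(self, conn):
        self._q.put(conn)
```

A pool holding one dead and one live connection hands out only the live one; an exhausted pool raises within `acquire_timeout` instead of hanging the request.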

We are also working on broader architectural improvements to reduce per-request database dependency through better caching and conditional responses, which will provide additional resilience under high traffic conditions.
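One form the conditional-response idea can take is ETag validation: compute a strong validator for a response body and answer 304 Not Modified when the client already holds it, skipping the per-request work (including the database lookup) behind a full response. A minimal sketch with hypothetical helper names:

```python
import hashlib

def make_etag(body):
    """Derive a strong validator from the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body, if_none_match=None):
    """Return (status, body, etag); 304 with an empty body when the
    client's cached copy (sent via If-None-Match) is still current."""
    tag = make_etag(body)
    if if_none_match == tag:
        return 304, b"", tag  # client cache is still fresh
    return 200, body, tag
```

Under heavy crawler traffic, every 304 is a response served without rebuilding the body, which reduces pressure on the layers behind the file server.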

Current status

We have deployed fixes that we believe address the root cause and are actively monitoring. We will continue to update this page as we confirm the effectiveness of these changes over the coming days.

Posted Feb 13, 2026 - 03:02 UTC

Resolved

We are seeing increased 503 errors on hosted sites. We have identified the issue and are deploying a fix.
Posted Feb 12, 2026 - 07:00 UTC