Observation
On 2026-05-11 22:22-22:32 UTC, New Relic recorded a burst of 440 django.db.utils:OperationalError events in production:
- 22:22 UTC: 4 errors
- 22:27 UTC: 152 errors
- 22:32 UTC: 284 errors
- all other hours: 0
Error messages:
- 402× connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections
- 22× connection failed: FATAL: sorry, too many clients already
- 15× combination of both
Faceted by transactionName:
- 435 / 440 at WebTransaction/Function/django.middleware.csrf:CsrfViewMiddleware.process_view
- 4 at OtherTransaction/Celery/ami.jobs.tasks.update_async_services_seen_for_pipelines
- 1 at WebTransaction/Function/ami.main.api.views:SourceImageViewSet.list
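As a sanity check on the figures above, the three breakdowns can be tallied against the 440 total. This is a minimal sketch that only restates the counts quoted in this note; the dictionary keys are shorthand labels, not New Relic field names.

```python
# Counts copied from the observation above; labels are shorthand only.
TOTAL = 440

buckets = {"22:22": 4, "22:27": 152, "22:32": 284}
messages = {"reserved slots": 402, "too many clients": 22, "both": 15}
transactions = {"csrf process_view": 435, "celery task": 4, "SourceImageViewSet.list": 1}

for name, counts in [("buckets", buckets), ("messages", messages), ("transactions", transactions)]:
    subtotal = sum(counts.values())
    # A non-zero difference means the breakdown does not fully cover the burst.
    print(f"{name}: {subtotal} (difference from total: {TOTAL - subtotal})")
```

The time buckets and the transaction facet both sum to 440. The message breakdown sums to 439, so one event is not covered by the three message categories and may be worth rechecking in New Relic.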
Interpretation (hedged)
CSRF middleware runs early in the Django request lifecycle, and resolving request.user opens a DB connection. Errors landing on CSRF rather than a downstream view means most of these requests never reached a queryset — they couldn't open a connection to start with. That points at a pool-sizing / connection-lifecycle issue, not a slow-endpoint cause.
The curve shape (4 → 152 → 284 over 10 minutes, growing) is not a restart blip. Restarts produce spike-and-drop. Sustained growth implies real load contention.
Suspected contributors (none verified)
- hostCount = 16 × WEB_CONCURRENCY=4 (per .envs/.production/.django-example line 130) = 64+ web workers plus Celery (CELERY_WORKER_CONCURRENCY=16, same file line 28) on the worker host = ~80 processes against PG.
- psycopg[binary]==3.1.9 + Django default CONN_MAX_AGE=0 (settings not searched for an override) = every request opens and closes a connection. Under uvicorn ASGI (async workers can handle many concurrent requests), the simultaneous-connection ceiling is much higher than WEB_CONCURRENCY alone suggests.
- PG default max_connections=100 with superuser_reserved_connections=3 = ~97 usable. Easily exhausted by the above.
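The arithmetic behind these suspicions can be written out as a small sketch. Every number below is one of the suspected values quoted in this note (none measured on the live system), and the per-worker concurrency factor in `demand` is a purely hypothetical knob to illustrate the ASGI point.

```python
# Back-of-the-envelope connection budget. All inputs are the suspected
# values from this note, not measurements from production.
HOST_COUNT = 16                  # New Relic hostCount
WEB_CONCURRENCY = 4              # from the .django-example env file
CELERY_WORKER_CONCURRENCY = 16   # same file
MAX_CONNECTIONS = 100            # PostgreSQL default
SUPERUSER_RESERVED = 3           # PostgreSQL default

web_workers = HOST_COUNT * WEB_CONCURRENCY
total_processes = web_workers + CELERY_WORKER_CONCURRENCY
usable_slots = MAX_CONNECTIONS - SUPERUSER_RESERVED
headroom = usable_slots - total_processes


def demand(connections_per_web_worker: int) -> int:
    """Peak connections if each web worker holds this many at once (hypothetical factor)."""
    return web_workers * connections_per_web_worker + CELERY_WORKER_CONCURRENCY


print(f"web workers:     {web_workers}")
print(f"total processes: {total_processes}")
print(f"usable slots:    {usable_slots}")
print(f"headroom at one connection per process: {headroom}")
print(f"demand at two connections per web worker: {demand(2)}")
```

With one connection per process, the budget leaves 17 free slots. If async web workers averaged two simultaneous connections each, demand would reach 144, well past the ~97 usable slots.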
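What we still need to verify
- CONN_MAX_AGE setting in config/settings/production.py. If 0 (default), the CONN_MAX_AGE suspicion above is correct.
- … docker-compose.yml, but production may differ).
- max_connections from the live server (SHOW max_connections;).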
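One thing to verify is the connection limit and current usage on the live server. The sketch below shows one way that check might be scripted against any DB-API cursor. The function name `connection_headroom` is hypothetical, and the `backend_type` filter assumes PostgreSQL 10 or later.

```python
def connection_headroom(cursor):
    """Read connection limits and current usage through a DB-API cursor.

    Hypothetical helper: `cursor` only needs execute() and fetchone().
    """
    cursor.execute("SHOW max_connections")
    max_connections = int(cursor.fetchone()[0])

    cursor.execute("SHOW superuser_reserved_connections")
    reserved = int(cursor.fetchone()[0])

    # Count client sessions only; backend_type exists in PostgreSQL 10+.
    cursor.execute(
        "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend'"
    )
    in_use = int(cursor.fetchone()[0])

    return {
        "max_connections": max_connections,
        "reserved": reserved,
        "in_use": in_use,
        "free_for_app": max_connections - reserved - in_use,
    }


# Possible usage from a Django shell (requires a working DB connection):
#   from django.db import connection
#   with connection.cursor() as cur:
#       print(connection_headroom(cur))
```

Running this during normal load, not only during an incident, would show how close steady-state usage already sits to the limit.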
Directions to discuss (ordered by effort/risk)
- CONN_MAX_AGE=60 in production settings: Django persistent connections. Single-line change, large effect on per-request connection churn. Risk: if a worker holds a stale connection across a PG restart, the request errors once. Acceptable trade-off.
- pgbouncer in transaction-pooling mode in front of PG: caps the connection ceiling regardless of worker count. Bigger infra change but standard for Django + Celery setups.
- Lower WEB_CONCURRENCY until pool is sized: quick bandage if neither of the above is fast to ship.
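For the first direction, the settings change might look like the sketch below. It is not the project's actual configuration: the environment variable names are assumptions, and `CONN_HEALTH_CHECKS` assumes Django 4.1 or later. Django's documentation also cautions about persistent connections under ASGI, so this would need checking against the uvicorn deployment before adoption.

```python
import os

# Sketch of a production DATABASES entry with persistent connections.
# Environment variable names here are assumed, not taken from the repo.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("POSTGRES_DB", ""),
        "USER": os.environ.get("POSTGRES_USER", ""),
        "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        "HOST": os.environ.get("POSTGRES_HOST", "localhost"),
        "PORT": os.environ.get("POSTGRES_PORT", "5432"),
        # Reuse a connection for up to 60 seconds instead of
        # opening and closing one on every request.
        "CONN_MAX_AGE": 60,
        # Django 4.1+: verify a reused connection before each request,
        # which reduces the stale-connection risk after a PG restart.
        "CONN_HEALTH_CHECKS": True,
    }
}
```

Persistent connections reduce connect/disconnect churn but do not cap the total number of open connections, which is why the pgbouncer option remains the one that bounds the ceiling.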
Source
Surfaced from a 30-min NR window after PR #1299 (agent 9.6.0 → 12.1.0 + tuned function_trace) shipped. The agent upgrade is what made databaseCallCount queryable on >50% of Transactions (vs ~2.8% before), which is how this burst became visible. The error itself almost certainly predates the upgrade.