Investigate PG connection-slot exhaustion bursts on CsrfViewMiddleware #1302

@mihow

Observation

On 2026-05-11 22:22-22:32 UTC, New Relic recorded a burst of 440 django.db.utils:OperationalError events in production:

22:22 UTC — 4 errors
22:27 UTC — 152 errors
22:32 UTC — 284 errors
all other hours — 0

Error messages:

  • 402× connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections
  • 22× connection failed: FATAL: sorry, too many clients already
  • 15× combination of both
  • (these three counts total 439, so one of the 440 events is not covered by this breakdown)

Faceted by transactionName:

  • 435 / 440 at WebTransaction/Function/django.middleware.csrf:CsrfViewMiddleware.process_view
  • 4 at OtherTransaction/Celery/ami.jobs.tasks.update_async_services_seen_for_pipelines
  • 1 at WebTransaction/Function/ami.main.api.views:SourceImageViewSet.list

Interpretation (hedged)

CSRF middleware runs early in the Django request lifecycle, and resolving request.user opens a DB connection. Errors landing on CsrfViewMiddleware.process_view rather than on a downstream view mean that most of these requests never reached a queryset: they could not open a connection to begin with. That points to a pool-sizing or connection-lifecycle issue rather than a slow endpoint.

The curve shape (4 → 152 → 284 over 10 minutes, growing) is not a restart blip. Restarts produce spike-and-drop. Sustained growth implies real load contention.

Suspected contributors (none verified)

  1. hostCount = 16 hosts × WEB_CONCURRENCY=4 (per .envs/.production/.django-example line 130) = 64+ web workers. Add Celery on the worker host (CELERY_WORKER_CONCURRENCY=16, same file line 28) and the total is ~80 processes connecting to PG.
  2. psycopg[binary]==3.1.9 + Django default CONN_MAX_AGE=0 (settings not searched for an override) = every request opens and closes a connection. Under uvicorn ASGI (async workers can handle many concurrent requests), the simultaneous-connection ceiling is much higher than WEB_CONCURRENCY alone suggests.
  3. PG default max_connections=100 with superuser_reserved_connections=3 = ~97 usable. Easily exhausted by the above.
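
The arithmetic behind contributors 1 to 3 can be sketched as below. The inputs are the figures quoted in this issue, not measured production values. The per_web_worker parameter is a hypothetical knob for how many DB connections one ASGI worker holds at once, and the sketch assumes each Celery process holds exactly one connection.

```python
# Back-of-the-envelope connection budget from the figures quoted in this issue.
# None of these are verified against the live production environment.

HOST_COUNT = 16                 # hostCount reported by New Relic
WEB_CONCURRENCY = 4             # .envs/.production/.django-example
CELERY_WORKER_CONCURRENCY = 16  # same file, worker host
PG_MAX_CONNECTIONS = 100        # PostgreSQL default, not yet confirmed for prod
PG_SUPERUSER_RESERVED = 3       # PostgreSQL default


def usable_slots(max_connections: int, reserved: int) -> int:
    """Slots available to non-superuser clients."""
    return max_connections - reserved


def process_count(hosts: int, web_concurrency: int, celery_concurrency: int) -> int:
    """Number of OS processes that can each open at least one connection."""
    return hosts * web_concurrency + celery_concurrency


def peak_connections(hosts: int, web_concurrency: int,
                     celery_concurrency: int, per_web_worker: int) -> int:
    """Worst-case simultaneous connections.

    per_web_worker is a hypothetical figure: concurrent DB connections held
    by a single ASGI worker while it serves overlapping requests.
    """
    return hosts * web_concurrency * per_web_worker + celery_concurrency


if __name__ == "__main__":
    slots = usable_slots(PG_MAX_CONNECTIONS, PG_SUPERUSER_RESERVED)
    for per_worker in (1, 2, 4):
        peak = peak_connections(HOST_COUNT, WEB_CONCURRENCY,
                                CELERY_WORKER_CONCURRENCY, per_worker)
        print(f"per_web_worker={per_worker}: peak={peak}, usable={slots}, "
              f"over_budget={peak > slots}")
```

With one connection per web worker the estimate stays under the default limit (80 against 97), but two overlapping requests per worker would already exceed it (144 against 97). Under these assumptions, this is why the ASGI concurrency in contributor 2 matters more than the raw process count.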

What we still need to verify

  • Confirm CONN_MAX_AGE setting in config/settings/production.py. If 0 (default), suspected contributor 2 is correct.
  • Confirm whether pgbouncer fronts Postgres in production (no entry in docker-compose.yml, but production may differ).
  • Get PG max_connections from the live server (SHOW max_connections;).
  • Correlate the 22:22 burst with deploy logs — did the agent upgrade cause a brief connection spike during worker rollout? Or was it coincident with user-triggered load (e.g. a large job submission)?
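
The max_connections check above could be run from a Django shell with a helper along these lines. This is a minimal sketch: connection_headroom is a hypothetical name, it accepts any DB-API cursor, and the backend_type filter assumes PostgreSQL 10 or later. The free_for_app figure is only approximate, because in_use also counts any superuser sessions.

```python
def connection_headroom(cursor):
    """Return the configured connection limits and current client usage.

    `cursor` is any DB-API cursor connected to the PostgreSQL server,
    for example one obtained from django.db.connection.cursor().
    """
    cursor.execute("SHOW max_connections")
    max_connections = int(cursor.fetchone()[0])

    cursor.execute("SHOW superuser_reserved_connections")
    reserved = int(cursor.fetchone()[0])

    # Count only client sessions, not background workers (PostgreSQL 10+).
    cursor.execute(
        "SELECT count(*) FROM pg_stat_activity "
        "WHERE backend_type = 'client backend'"
    )
    in_use = int(cursor.fetchone()[0])

    return {
        "max_connections": max_connections,
        "reserved": reserved,
        "in_use": in_use,
        "free_for_app": max_connections - reserved - in_use,
    }


# Possible usage inside `python manage.py shell`:
#   from django.db import connection
#   with connection.cursor() as cur:
#       print(connection_headroom(cur))
```

Sampling this a few times during normal traffic would show how close steady-state usage sits to the limit, which helps separate a chronic sizing problem from a one-off burst.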

Directions to discuss (ordered by effort/risk)

  1. CONN_MAX_AGE=60 in production settings — Django persistent connections. Single-line change, large effect on per-request connection churn. Risk: if a worker holds a stale connection across a PG restart, the request errors once. Acceptable trade-off.
  2. pgbouncer in transaction-pooling mode in front of PG — caps the connection ceiling regardless of worker count. Bigger infra change but standard for Django + Celery setups.
  3. Lower WEB_CONCURRENCY until pool is sized — quick bandage if neither above is fast to ship.
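
Directions 1 and 2 might look roughly like this in the production settings. This is a sketch, not the project's actual configuration: the environment variable names, their defaults, and the USE_PGBOUNCER flag are all hypothetical. CONN_HEALTH_CHECKS assumes Django 4.1 or later and is included to address the stale-connection risk noted in direction 1. Django's database documentation also carries a caveat about persistent connections under ASGI, which is worth reading before shipping direction 1 on uvicorn.

```python
import os

# Hypothetical flag: set to "1" once pgbouncer (transaction pooling) fronts PG.
USE_PGBOUNCER = os.environ.get("USE_PGBOUNCER", "0") == "1"

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        # Environment variable names below are placeholders for illustration.
        "NAME": os.environ.get("POSTGRES_DB", ""),
        "USER": os.environ.get("POSTGRES_USER", ""),
        "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        "HOST": os.environ.get("POSTGRES_HOST", "localhost"),
        "PORT": os.environ.get("POSTGRES_PORT", "5432"),
        # Direction 1: reuse each connection for up to 60 seconds
        # instead of opening and closing one per request.
        "CONN_MAX_AGE": 60,
        # Check a reused connection at the start of each request so a
        # connection dropped by a PG restart is replaced (Django 4.1+).
        "CONN_HEALTH_CHECKS": True,
        # Direction 2: server-side cursors do not survive transaction
        # pooling, so Django needs them disabled behind pgbouncer.
        "DISABLE_SERVER_SIDE_CURSORS": USE_PGBOUNCER,
    }
}
```

Persistent connections reduce churn but do not lower the ceiling: each worker can still hold a connection, so direction 1 alone would not cap the total the way pgbouncer does.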

Source

Surfaced from a 30-min NR window after PR #1299 (agent 9.6.0 → 12.1.0 + tuned function_trace) shipped. The agent upgrade is what made databaseCallCount queryable on >50% of Transactions (vs ~2.8% before), which is how this burst became visible. The error itself almost certainly predates the upgrade.
