Investigate PG connection-slot exhaustion bursts on CsrfViewMiddleware #1302

## Observation

On 2026-05-11 22:22-22:32 UTC, New Relic recorded a burst of 440 `django.db.utils:OperationalError` events in production:

```
22:22 UTC — 4 errors
22:27 UTC — 152 errors
22:32 UTC — 284 errors
all other hours — 0
```

Error messages:
- 402× `connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections`
- 22× `connection failed: FATAL: sorry, too many clients already`
- 15× combination of both

Faceted by `transactionName`:
- **435 / 440 at `WebTransaction/Function/django.middleware.csrf:CsrfViewMiddleware.process_view`**
- 4 at `OtherTransaction/Celery/ami.jobs.tasks.update_async_services_seen_for_pipelines`
- 1 at `WebTransaction/Function/ami.main.api.views:SourceImageViewSet.list`
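
As a quick cross-check of the tallies above, the sketch below re-adds the three breakdowns. All numbers are copied from this issue; the dictionary keys are shortened labels chosen for illustration. The per-message counts as listed add up to 439, one short of the 440 total, so that breakdown may be worth re-pulling from New Relic.

```python
# Counts copied from the Observation section; keys are shortened labels.
buckets = {"22:22": 4, "22:27": 152, "22:32": 284}
messages = {
    "remaining connection slots are reserved": 402,
    "too many clients already": 22,
    "combination of both": 15,
}
facets = {
    "CsrfViewMiddleware.process_view": 435,
    "update_async_services_seen_for_pipelines": 4,
    "SourceImageViewSet.list": 1,
}

total = sum(buckets.values())
csrf_share = facets["CsrfViewMiddleware.process_view"] / total

print(f"bucket total:  {total}")                   # 440
print(f"facet total:   {sum(facets.values())}")    # 440
print(f"message total: {sum(messages.values())}")  # 439, one short of 440
print(f"CSRF share:    {csrf_share:.1%}")          # 98.9%
```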

## Interpretation (hedged)

CSRF middleware runs early in the Django request lifecycle, and resolving `request.user` opens a DB connection. That the errors land on CSRF rather than on a downstream view suggests that **most of these requests never reached a queryset**: they could not open a connection in the first place. That points to a pool-sizing or connection-lifecycle issue rather than a slow endpoint.

The curve shape (4 → 152 → 284 over 10 minutes, growing) does **not** look like a restart blip. Restarts produce a spike followed by a drop. Sustained growth suggests real load contention.

## Suspected contributors (none verified)

1. `hostCount = 16` × `WEB_CONCURRENCY=4` (per `.envs/.production/.django-example` line 130) = **64+ web workers**, plus Celery (`CELERY_WORKER_CONCURRENCY=16`, same file line 28) on the worker host = ~80 processes against PG.
2. `psycopg[binary]==3.1.9` + Django default `CONN_MAX_AGE=0` (settings not searched for an override) = every request opens and closes its own connection. Under uvicorn ASGI, where one worker can serve many requests concurrently, the simultaneous-connection ceiling is much higher than `WEB_CONCURRENCY` alone suggests.
3. PG default `max_connections=100` with `superuser_reserved_connections=3` = ~97 usable. The ~80 processes fit under that at one connection each, so exhaustion needs some workers to hold several connections at once, which #2 would allow.
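
The arithmetic behind these three points can be sketched as below. The host, worker, and PG figures are the unverified ones from this list; the `Fleet` class, the one-connection-per-Celery-process assumption, and the `per_web_worker` multipliers are hypothetical and only illustrate how quickly concurrent requests per ASGI worker would eat the headroom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    """Unverified figures from the list above; treat all as assumptions."""
    hosts: int = 16
    web_concurrency: int = 4
    celery_concurrency: int = 16
    max_connections: int = 100
    superuser_reserved: int = 3

    @property
    def usable_slots(self) -> int:
        return self.max_connections - self.superuser_reserved

    @property
    def web_workers(self) -> int:
        return self.hosts * self.web_concurrency

    def peak_connections(self, per_web_worker: int = 1) -> int:
        # Assumes each Celery process holds exactly one connection.
        return self.web_workers * per_web_worker + self.celery_concurrency


fleet = Fleet()
for per_worker in (1, 2, 4):  # hypothetical concurrent connections per web worker
    peak = fleet.peak_connections(per_worker)
    print(f"{per_worker} conn/worker: peak={peak}, headroom={fleet.usable_slots - peak}")
```

At one connection per web worker the fleet fits (80 of 97 slots), but two concurrent connections per worker would already exceed the limit.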

## What we still need to verify

- [ ] Confirm the `CONN_MAX_AGE` setting in `config/settings/production.py`. If it is 0 (the default), suspected contributor #2 holds.
- [ ] Confirm whether pgbouncer fronts Postgres in production (no entry in `docker-compose.yml`, but production may differ).
- [ ] Get PG `max_connections` from the live server (`SHOW max_connections;`).
- [ ] Correlate the 22:22 burst with deploy logs — did the agent upgrade cause a brief connection spike during worker rollout? Or was it coincident with user-triggered load (e.g. a large job submission)?
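
For the `max_connections` item, a minimal sketch of a headroom check follows. The function name `connection_headroom` is hypothetical. It assumes any DB-API cursor (for example Django's `connection.cursor()`) and PG 10 or later, where `pg_stat_activity` has a `backend_type` column. Note that the check itself needs a free slot, so during a burst it may only work from a superuser session.

```python
def connection_headroom(cursor) -> dict:
    """Read slot limits and current client usage through a DB-API cursor."""

    def scalar(sql: str) -> int:
        cursor.execute(sql)
        return int(cursor.fetchone()[0])  # SHOW returns text, count(*) an int

    max_conn = scalar("SHOW max_connections")
    reserved = scalar("SHOW superuser_reserved_connections")
    # Count only client sessions, not background workers (PG 10+ column).
    in_use = scalar(
        "SELECT count(*) FROM pg_stat_activity "
        "WHERE backend_type = 'client backend'"
    )
    usable = max_conn - reserved
    return {
        "max_connections": max_conn,
        "usable": usable,
        "in_use": in_use,
        "headroom": usable - in_use,
    }


# Possible use from a Django shell (not executed here):
#   from django.db import connection
#   with connection.cursor() as cursor:
#       print(connection_headroom(cursor))
```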

## Directions to discuss (ordered by effort/risk)

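1. **`CONN_MAX_AGE=60`** in production settings (Django persistent connections). Single-line change, large effect on per-request connection churn. Risk: if a worker holds a stale connection across a PG restart, the next request on it errors once; `CONN_HEALTH_CHECKS=True` (Django 4.1+) is meant to avoid that. Acceptable trade-off. Caveat: Django's documentation advises disabling persistent connections under ASGI, so confirm how production is served before relying on this.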
2. **pgbouncer in transaction-pooling mode** in front of PG — caps the connection ceiling regardless of worker count. Bigger infra change but standard for Django + Celery setups.
3. **Lower `WEB_CONCURRENCY`** until pool is sized — quick bandage if neither above is fast to ship.
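
Directions 1 and 2 might look roughly like this in settings. This is a sketch only: the helper `database_options` is hypothetical, the real `DATABASES` entry in this project has not been inspected, and setting `CONN_MAX_AGE` to 0 behind pgbouncer is one possible choice, not a requirement. `DISABLE_SERVER_SIDE_CURSORS` is the Django setting documented for transaction-pooling mode.

```python
def database_options(behind_pgbouncer: bool = False) -> dict:
    """Connection-lifecycle options for the two directions above (sketch)."""
    if behind_pgbouncer:
        return {
            # Assumption: let pgbouncer own pooling; Django reconnects per request.
            "CONN_MAX_AGE": 0,
            # Needed with transaction pooling, per the Django database docs.
            "DISABLE_SERVER_SIDE_CURSORS": True,
        }
    return {
        "CONN_MAX_AGE": 60,          # keep connections open for up to 60 s
        "CONN_HEALTH_CHECKS": True,  # Django 4.1+: check a reused connection first
    }


# Hypothetical settings fragment; host, name, and credentials are omitted.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        **database_options(behind_pgbouncer=False),
    }
}
```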

## Source

Surfaced from a 30-min NR window after PR #1299 (agent 9.6.0 → 12.1.0 + tuned `function_trace`) shipped. The agent upgrade is what made `databaseCallCount` queryable on >50% of Transactions (vs ~2.8% before), which is how this burst became visible. The error itself almost certainly predates the upgrade.
