Sefaria Status Monitor

Real-time uptime monitoring and a public status page for Sefaria's critical services — live at status.sefaria.org.

A small, self-contained Django application that checks Sefaria's services on a fixed interval, records every result, confirms outages before alerting (to filter out brief blips), posts rich Slack notifications when a service goes down or recovers, and renders a public, SEO-optimized status page.

Why this exists

Sefaria runs several public services (the main site, an MCP server, an AI chatbot, and the Linker API). When one degrades, the team needs to (a) find out fast, and (b) give users a single trustworthy place to check. This project does both:

Fast, accurate alerts to a Slack channel, with the real outage start time and total downtime on recovery.
A public status page that anyone can check during an incident, so support volume drops and users aren't left guessing.

The design goal is low false-positive rate: a single failed request never pages anyone. A service must fail N consecutive check cycles (configurable per service) before it is reported as down, both in Slack and on the status page.

What it monitors

Services are declared in config/settings/base.py (MONITORED_SERVICES). Each URL can be overridden via an environment variable.

Service	Check	Method	Expects	Failure threshold	Env override
sefaria.org	`…/healthz`	GET	`200`	2 cycles	`SEFARIA_HEALTH_URL`
MCP Server	`mcp.sefaria.org/healthz`	GET	`200`	2 cycles	`MCP_HEALTH_URL`
AI Chatbot	`chat.sefaria.org/api/health`	GET	`200`	2 cycles	`AI_CHATBOT_HEALTH_URL`
Linker	`…/api/find-refs`	POST	`202` + async result	3 cycles	`LINKER_HEALTH_URL`

The Linker uses a two-phase async check (see below) and has a higher threshold because it is the noisiest service.

How it works

A single check cycle runs every HEALTH_CHECK_INTERVAL seconds (default 60):

Check — All services are checked in parallel (ThreadPoolExecutor) so one slow/down service never blocks the others. Each request is retried up to HEALTH_CHECK_RETRIES times with HEALTH_CHECK_RETRY_DELAY seconds between attempts.
Persist — Every result is written to the HealthCheck table (status, HTTP code, response time, error).
Detect transitions — A StateTracker compares each result against the last known state and decides whether a reportable transition occurred.
Alert — On a confirmed went_down or recovered transition, a Slack Block Kit message is sent.

Confirmation logic (the important part)

A service is only reported DOWN after it fails failure_threshold consecutive cycles. The first failures in a streak are counted but stay silent.
When a service is confirmed down, an Outage record is opened with the timestamp of the first failure in the streak — so the "Since" time in Slack and the measured downtime are accurate, not the time the threshold was crossed.
Recovery fires immediately on the first successful check after a confirmed outage. The open Outage is closed (end_time, resolved=True) and its duration drives the "Downtime" field in the recovery alert.
A blip that self-resolves before hitting the threshold produces no alert and no Outage.

The tracker is a process-global singleton (get_state_tracker()) that rebuilds its in-memory state from the database on first use, so it survives process restarts without re-alerting on an already-known outage.

Two-phase async check (Linker)

A plain 202 Accepted from the Linker only means "task queued" — it doesn't prove the background worker, ML model, or ElasticSearch actually work. The Linker check therefore:

Phase 1 — POSTs a real reference ("Job 1:1"), expects 202 and extracts a task_id.
Phase 2 — Polls …/api/async/<task_id> until the task reaches SUCCESS with a non-empty result. A FAILURE state, an empty result, or a polling timeout all count as down.

This catches end-to-end failures a shallow check would miss.

Architecture

The system runs as two long-lived processes plus an on-demand maintenance job, all from the same image:

                          ┌──────────────────────────────────────────┐
                          │              PostgreSQL                   │
                          │   HealthCheck · Outage · Message tables   │
                          └──────────────────────────────────────────┘
                              ▲                              ▲
              writes results  │                              │  reads latest state
                              │                              │
┌─────────────────────────────────────────┐   ┌──────────────────────────────────────┐
│   SCHEDULER process (run_checks)         │   │   WEB process (gunicorn)             │
│                                          │   │                                      │
│   APScheduler                            │   │   Django                             │
│    ├─ every 60s → health check cycle     │   │    ├─ GET /         → status page    │
│    │     check_all_services (parallel)   │   │    ├─ GET /robots.txt                │
│    │     → StateTracker (UP/DOWN detect) │   │    └─ GET /sitemap.xml               │
│    │     → Slack alerter (Block Kit)     │   │    └─ GET /admin/   → incident mgmt  │
│    └─ daily 03:00 UTC → cleanup old rows  │   │                                      │
└─────────────────────────────────────────┘   └──────────────────────────────────────┘
                              │
                              ▼
                       ┌─────────────┐
                       │    Slack    │  🔴 down / 🟢 recovered
                       └─────────────┘

The scheduler does all the checking, alerting, and daily cleanup. It owns the StateTracker.
The web process only reads from the database to render the status page and serve the Django admin (used by operators to post incident messages). It never performs checks.
Both connect to the same PostgreSQL database, which is the single source of truth.

Key modules:

File	Responsibility
`monitoring/services/checker.py`	Performs HTTP checks (standard + async two-phase), retries, parallel execution
`monitoring/services/state.py`	`StateTracker` — consecutive-failure logic, transition detection, `Outage` lifecycle
`monitoring/services/alerter.py`	Builds and sends Slack Block Kit alerts
`monitoring/services/scheduler.py`	APScheduler wiring; the `run_health_check_cycle` and cleanup jobs
`monitoring/views.py`	Status page, status-aware quotes, robots.txt, sitemap.xml
`monitoring/models.py`	`HealthCheck`, `Outage`, `Message`

Data model

Model	Purpose	Notes
`HealthCheck`	One row per service per check cycle	The raw time-series. Pruned after `HEALTH_CHECK_RETENTION_DAYS`.
`Outage`	One row per confirmed downtime period	`start_time` = first failure in the streak; closed on recovery. Drives accurate Slack downtime. Viewable in the admin and can be force-resolved if stuck (see runbook).
`Message`	An operator-authored incident note shown on the status page	Severity `high` / `medium` / `resolved`; managed in Django admin.

Quick start (local development)

Prerequisites: Python 3.12. Local development uses SQLite — no PostgreSQL or Docker required.

git clone <repository-url>
cd "Sefaria Down Detector"

# Create and activate a virtual environment
python -m venv .venv
.venv\Scripts\activate            # Windows (PowerShell/CMD)
# source .venv/bin/activate       # macOS / Linux

# Install dependencies
pip install -r requirements.txt

# Initialize the database and an admin user
python manage.py migrate
python manage.py createsuperuser

# Terminal 1 — run the web/status page (http://localhost:8000)
python manage.py runserver

# Terminal 2 — run one check cycle and exit (no Slack needed)
python manage.py run_checks --once

manage.py defaults to config.settings.development (SQLite, DEBUG=True). To run the full scheduler loop locally, use python manage.py run_checks (Ctrl+C to stop).

Without a SLACK_WEBHOOK_URL, checks still run and persist — alerts are simply skipped with a log line. This is the normal local-dev setup.

Configuration

Settings are split by environment under config/settings/ and selected with DJANGO_SETTINGS_MODULE:

Module	Used by	Database	Debug
`config.settings.development`	`manage.py` default	SQLite	`True`
`config.settings.production`	Docker / Coolify	PostgreSQL (`DATABASE_URL`)	`False`
`config.settings.test`	pytest (`pytest.ini`)	in-memory SQLite	`False`

Environment variables

Copy .env.example to .env and edit. All monitoring tunables read from the environment via django-environ.

Variable	Description	Default
`SECRET_KEY`	Django secret key	dev placeholder (set in prod)
`DEBUG`	Enable debug mode	`False`
`ALLOWED_HOSTS`	Comma-separated hosts	`status.sefaria.org` (prod)
`DATABASE_URL`	PostgreSQL connection URL (prod)	SQLite (dev)
`SLACK_WEBHOOK_URL`	Slack incoming webhook; alerts are skipped if empty	`""`
`SLACK_CHANNEL`	Informational only — the webhook itself determines the channel	`sefaria-down`
`STATUS_PAGE_URL`	Public URL used in Slack links and the sitemap	`https://status.sefaria.org`
`HEALTH_CHECK_INTERVAL`	Seconds between check cycles	`60`
`HEALTH_CHECK_RETRIES`	Retry attempts per request	`3`
`HEALTH_CHECK_RETRY_DELAY`	Seconds between retries	`10`
`ALERT_AFTER_CONSECUTIVE_FAILURES`	Default consecutive-failure threshold (per-service values override this)	`2`
`HEALTH_CHECK_RETENTION_DAYS`	Days of `HealthCheck` history to keep	`60`
`SEFARIA_HEALTH_URL` / `MCP_HEALTH_URL` / `AI_CHATBOT_HEALTH_URL` / `LINKER_HEALTH_URL`	Per-service URL overrides	see base.py

Tuning a monitored service

Each entry in MONITORED_SERVICES accepts:

{
    "name": "Linker",                       # display name + DB key
    "url": "https://www.sefaria.org/api/find-refs",
    "method": "POST",                       # default GET
    "expected_status": 202,                 # status that means "healthy"
    "timeout": 20,                          # per-request seconds
    "follow_redirects": False,              # default False
    "failure_threshold": 3,                 # consecutive cycles before DOWN
    "check_type": "async_two_phase",        # omit for a standard check
    "request_body": {"text": {"title": "", "body": "Job 1:1"}},
    "async_verification": {
        "base_url": "https://www.sefaria.org/api/async/",
        "max_poll_attempts": 10,
        "poll_interval": 1,                 # seconds between polls
    },
}

To add a service, append a dict here — no migration or code change is needed. The status page and alerter pick it up automatically.

Management commands

# Run the scheduler loop: checks every interval + daily cleanup at 03:00 UTC
python manage.py run_checks

# Run a single check cycle and exit (handy for testing/CI)
python manage.py run_checks --once

# Delete HealthCheck rows older than the retention window
python manage.py cleanup_old_checks
python manage.py cleanup_old_checks --days 14   # override retention
python manage.py cleanup_old_checks --dry-run   # preview only

Cleanup runs automatically inside the scheduler; the standalone command exists for manual/maintenance use.

The status page

monitoring/templates/monitoring/status.html renders a self-refreshing public page:

Overall banner — All Systems Operational / Partial Issues / Major Outage, computed from confirmed service states and active incident severity.
Per-service list — Operational / Down / Unknown, with last response time. A service shows "Down" only when its last failure_threshold checks all failed (mirrors the alert logic so the page and Slack never disagree).
Status-aware Tanakh verse — A Hebrew + English verse, with a deep link to Sefaria, chosen from a pool that matches the current status (reassuring when up, hopeful during an outage). Defined in views.py.
Incidents — Operator-posted Message records (active + recent history), authored in the Django admin.
Dynamic favicon — A colored status dot is drawn onto the favicon client-side.
Auto-refresh — The page reloads every 60s; the view itself is cached for 30s (@cache_page).
SEO — Open Graph + Twitter cards, JSON-LD, robots.txt, and sitemap.xml, targeting the query "is Sefaria down".

To post an incident: log into /admin/, add a Message (severity high/medium), and it appears immediately. Mark it resolved via the bulk admin action.

Deployment

Production runs in Docker, orchestrated by Coolify (which provides the Traefik reverse proxy and TLS). The Dockerfile is a multi-stage build that runs as a non-root user and serves static files via WhiteNoise + gunicorn.

docker-compose.yml defines four services:

Service	Role	Command
`db`	PostgreSQL 16	—
`web`	Status page + admin	`migrate` then `gunicorn`
`scheduler`	The health-check loop	`python manage.py run_checks`
`cleanup`	One-shot maintenance (profile `maintenance`)	`python manage.py cleanup_old_checks`

cp .env.example .env          # set SECRET_KEY, DB_PASSWORD, SLACK_WEBHOOK_URL, ALLOWED_HOSTS
docker compose up -d --build
docker compose logs -f scheduler

docker-compose.override.yml (git-ignored) adds local port mapping and a source mount; in production Coolify routes traffic to the web service, so no published ports are needed.

Testing

83 tests cover the checker, state machine, alerter, scheduler, models, admin, cleanup, views, and SEO.

# All tests (uses config.settings.test via pytest.ini)
.venv/Scripts/python -m pytest tests/ -v

# With coverage
.venv/Scripts/python -m pytest tests/ --cov=monitoring --cov-report=html

Tests mock all outbound HTTP and Slack calls, so they make no network requests. time-machine is used to test time-dependent outage-duration logic.

Project structure

config/
  settings/          base · development · production · test
  urls.py            admin/ + monitoring routes
monitoring/
  models.py          HealthCheck · Outage · Message
  views.py           status page, quotes, robots.txt, sitemap.xml
  admin.py           HealthCheck + Outage (read-only) · Message (CRUD)
  services/
    checker.py       HTTP checks, retries, parallelism, async two-phase
    state.py         StateTracker — transitions, Outage lifecycle
    alerter.py       Slack Block Kit alerts
    scheduler.py     APScheduler jobs (checks + cleanup)
  management/commands/
    run_checks.py        scheduler entrypoint
    cleanup_old_checks.py
  templates/ static/ migrations/
tests/               83 tests + factories + fixtures
Dockerfile  docker-compose.yml  requirements.txt  .env.example

Operations & runbook

A service is red but I think it's fine — Check the scheduler logs (docker compose logs scheduler). The page only goes red after failure_threshold consecutive failures; confirm the health endpoint really returns the expected status.
No Slack alerts — Verify SLACK_WEBHOOK_URL is set in the scheduler service's environment; an empty value logs "skipping alert" and sends nothing.
Post an incident banner — /admin/ → Incident Messages → add (severity high → "Major Outage", medium → "Partial Issues").
A recovered outage is stuck open (e.g. a recovery alert was missed) — /admin/ → Outages → select it → "Force-resolve selected outages". The scheduler reconciles with the database on its next cycle (≤ HEALTH_CHECK_INTERVAL): it clears its in-memory down state, and if the service is in fact still failing it opens a fresh outage and re-alerts. You never need to restart the scheduler to fix a dangling outage.
Database growing — HealthCheck rows are pruned daily; adjust HEALTH_CHECK_RETENTION_DAYS or run cleanup_old_checks manually.
Tune noise — Raise a service's failure_threshold in base.py if it flaps; lower it for faster paging.

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sefaria Status Monitor

Table of Contents

Why this exists

What it monitors

How it works

Confirmation logic (the important part)

Two-phase async check (Linker)

Architecture

Data model

Quick start (local development)

Configuration

Environment variables

Tuning a monitored service

Management commands

The status page

Deployment

Testing

Project structure

Operations & runbook

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
config		config
monitoring		monitoring
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
manage.py		manage.py
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Sefaria Status Monitor

Table of Contents

Why this exists

What it monitors

How it works

Confirmation logic (the important part)

Two-phase async check (Linker)

Architecture

Data model

Quick start (local development)

Configuration

Environment variables

Tuning a monitored service

Management commands

The status page

Deployment

Testing

Project structure

Operations & runbook

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages