Real-time uptime monitoring and a public status page for Sefaria's critical services — live at status.sefaria.org.
A small, self-contained Django application that checks Sefaria's services on a fixed interval, records every result, confirms outages before alerting (to filter out brief blips), posts rich Slack notifications when a service goes down or recovers, and renders a public, SEO-optimized status page.
- Why this exists
- What it monitors
- How it works
- Architecture
- Data model
- Quick start (local development)
- Configuration
- Management commands
- The status page
- Deployment
- Testing
- Project structure
- Operations & runbook
- License
Sefaria runs several public services (the main site, an MCP server, an AI chatbot, and the Linker API). When one degrades, the team needs to (a) find out fast, and (b) give users a single trustworthy place to check. This project does both:
- Fast, accurate alerts to a Slack channel, with the real outage start time and total downtime on recovery.
- A public status page that anyone can check during an incident, so support volume drops and users aren't left guessing.
The design goal is a low false-positive rate: a single failed request never pages anyone. A service must fail N consecutive check cycles (configurable per service) before it is reported as down, both in Slack and on the status page.
Services are declared in config/settings/base.py (MONITORED_SERVICES). Each URL can be overridden via an environment variable.
| Service | Check | Method | Expects | Failure threshold | Env override |
|---|---|---|---|---|---|
| sefaria.org | …/healthz | GET | 200 | 2 cycles | SEFARIA_HEALTH_URL |
| MCP Server | mcp.sefaria.org/healthz | GET | 200 | 2 cycles | MCP_HEALTH_URL |
| AI Chatbot | chat.sefaria.org/api/health | GET | 200 | 2 cycles | AI_CHATBOT_HEALTH_URL |
| Linker | …/api/find-refs | POST | 202 + async result | 3 cycles | LINKER_HEALTH_URL |
The Linker uses a two-phase async check (see below) and has a higher threshold because it is the noisiest service.
A single check cycle runs every HEALTH_CHECK_INTERVAL seconds (default 60):
- Check: All services are checked in parallel (ThreadPoolExecutor) so one slow/down service never blocks the others. Each request is retried up to HEALTH_CHECK_RETRIES times with HEALTH_CHECK_RETRY_DELAY seconds between attempts.
- Persist: Every result is written to the HealthCheck table (status, HTTP code, response time, error).
- Detect transitions: A StateTracker compares each result against the last known state and decides whether a reportable transition occurred.
- Alert: On a confirmed went_down or recovered transition, a Slack Block Kit message is sent.
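The "Check" step above can be sketched as follows. This is a minimal illustration, not the project's checker.py: the function names, the result dict shape, and the injected `probe` callables (standing in for real HTTP requests) are all assumptions, and it treats the retry count as the total number of attempts.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def check_with_retries(name: str, probe: Callable[[], int], expected: int,
                       retries: int = 3, retry_delay: float = 10.0) -> dict:
    """Call `probe` up to `retries` times; healthy as soon as it returns `expected`."""
    error: Optional[str] = None
    for attempt in range(1, retries + 1):
        try:
            status = probe()  # hypothetical callable returning an HTTP status code
            if status == expected:
                return {"service": name, "up": True, "attempts": attempt, "error": None}
            error = f"unexpected status {status}"
        except Exception as exc:  # timeouts, connection errors, ...
            error = str(exc)
        if attempt < retries:
            time.sleep(retry_delay)  # wait before the next attempt
    return {"service": name, "up": False, "attempts": retries, "error": error}


def check_all(services: dict, retries: int = 3, retry_delay: float = 10.0) -> list:
    """Run each service check in its own thread so a slow one cannot block the rest.

    `services` maps a service name to a (probe, expected_status) pair.
    """
    with ThreadPoolExecutor(max_workers=max(len(services), 1)) as pool:
        futures = [
            pool.submit(check_with_retries, name, probe, expected, retries, retry_delay)
            for name, (probe, expected) in services.items()
        ]
        return [future.result() for future in futures]
```

Because every service gets its own worker, the cycle takes roughly as long as the slowest single check rather than the sum of all checks.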
- A service is only reported DOWN after it fails failure_threshold consecutive cycles. The first failures in a streak are counted but stay silent.
- When a service is confirmed down, an Outage record is opened with the timestamp of the first failure in the streak, so the "Since" time in Slack and the measured downtime are accurate rather than reflecting the time the threshold was crossed.
- Recovery fires immediately on the first successful check after a confirmed outage. The open Outage is closed (end_time, resolved=True) and its duration drives the "Downtime" field in the recovery alert.
- A blip that self-resolves before hitting the threshold produces no alert and no Outage.
The tracker is a process-global singleton (get_state_tracker()) that rebuilds its in-memory state from the database on first use, so it survives process restarts without re-alerting on an already-known outage.
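The consecutive-failure rules above can be condensed into a small in-memory sketch. It is illustrative only: `TinyTracker`, `ServiceState`, and the event tuples are hypothetical names, and the real StateTracker also persists Outage rows and rebuilds its state from the database, which this sketch leaves out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ServiceState:
    failures: int = 0                          # length of the current failure streak
    first_failure: Optional[datetime] = None   # when the streak began
    confirmed_down: bool = False               # True once the threshold was reached


class TinyTracker:
    """Hypothetical stand-in for the transition logic; not the project's class."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.states: dict = {}

    def record(self, service: str, up: bool, at: datetime) -> Optional[tuple]:
        """Return ("went_down", outage_start), ("recovered", downtime) or None."""
        state = self.states.setdefault(service, ServiceState())
        if up:
            event = None
            if state.confirmed_down:
                # Downtime (a timedelta) is measured from the first failure in the streak.
                event = ("recovered", at - state.first_failure)
            self.states[service] = ServiceState()  # any success resets the streak
            return event
        if state.failures == 0:
            state.first_failure = at
        state.failures += 1
        if not state.confirmed_down and state.failures >= self.threshold:
            state.confirmed_down = True
            return ("went_down", state.first_failure)
        return None  # still below the threshold, or already reported as down
```

With a threshold of 2 and a 60-second interval, the first failure is silent, the second produces `went_down` stamped with the first failure's time, and the next success produces `recovered` with the full downtime.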
A plain 202 Accepted from the Linker only means "task queued"; it doesn't prove the background worker, ML model, or Elasticsearch actually work. The Linker check therefore:
- Phase 1: POSTs a real reference ("Job 1:1"), expects 202 and extracts a task_id.
- Phase 2: Polls …/api/async/<task_id> until the task reaches SUCCESS with a non-empty result. A FAILURE state, an empty result, or a polling timeout all count as down.
This catches end-to-end failures a shallow check would miss.
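A rough sketch of the two-phase flow, under stated assumptions: the HTTP calls are injected as `submit` and `poll` callables so the logic can be read (and exercised) without a network, and the response keys `"state"` and `"result"` are guesses at the payload shape rather than the documented Linker API.

```python
import time
from typing import Callable


def two_phase_check(
    submit: Callable[[], tuple],      # hypothetical: returns (http_status, json_body)
    poll: Callable[[str], dict],      # hypothetical: returns the task's JSON for a task_id
    max_poll_attempts: int = 10,
    poll_interval: float = 1.0,
) -> tuple:
    """Return (is_up, reason) for an async endpoint that queues a background task."""
    # Phase 1: the request must be accepted and hand back a task id.
    status, body = submit()
    task_id = body.get("task_id")
    if status != 202 or not task_id:
        return False, f"phase 1 failed (status {status})"

    # Phase 2: poll until the task finishes or we run out of attempts.
    for _ in range(max_poll_attempts):
        task = poll(task_id)
        state = task.get("state")  # assumed key name
        if state == "SUCCESS":
            if task.get("result"):  # assumed key name; empty results count as down
                return True, "ok"
            return False, "empty result"
        if state == "FAILURE":
            return False, "task failed"
        time.sleep(poll_interval)
    return False, "polling timed out"
```

Returning a reason alongside the boolean keeps the failure mode available for the error column of a stored check result.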
The system runs as two long-lived processes plus an on-demand maintenance job, all from the same image:
                     ┌──────────────────────────────────────────┐
                     │                PostgreSQL                │
                     │  HealthCheck · Outage · Message tables   │
                     └──────────────────────────────────────────┘
                              ▲                        ▲
               writes results │                        │ reads latest state
                              │                        │
┌─────────────────────────────────────────┐   ┌──────────────────────────────────────┐
│ SCHEDULER process (run_checks)          │   │ WEB process (gunicorn)               │
│                                         │   │                                      │
│ APScheduler                             │   │ Django                               │
│ ├─ every 60s → health check cycle       │   │ ├─ GET /       → status page         │
│ │    check_all_services (parallel)      │   │ ├─ GET /robots.txt                   │
│ │    → StateTracker (UP/DOWN detect)    │   │ ├─ GET /sitemap.xml                  │
│ │    → Slack alerter (Block Kit)        │   │ └─ GET /admin/ → incident mgmt       │
│ └─ daily 03:00 UTC → cleanup old rows   │   │                                      │
└─────────────────────────────────────────┘   └──────────────────────────────────────┘
                    │
                    ▼
             ┌─────────────┐
             │    Slack    │  🔴 down / 🟢 recovered
             └─────────────┘
- The scheduler does all the checking, alerting, and daily cleanup. It owns the StateTracker.
- The web process only reads from the database to render the status page and serve the Django admin (used by operators to post incident messages). It never performs checks.
- Both connect to the same PostgreSQL database, which is the single source of truth.
Key modules:
| File | Responsibility |
|---|---|
| monitoring/services/checker.py | Performs HTTP checks (standard + async two-phase), retries, parallel execution |
| monitoring/services/state.py | StateTracker: consecutive-failure logic, transition detection, Outage lifecycle |
| monitoring/services/alerter.py | Builds and sends Slack Block Kit alerts |
| monitoring/services/scheduler.py | APScheduler wiring; the run_health_check_cycle and cleanup jobs |
| monitoring/views.py | Status page, status-aware quotes, robots.txt, sitemap.xml |
| monitoring/models.py | HealthCheck, Outage, Message |
| Model | Purpose | Notes |
|---|---|---|
| HealthCheck | One row per service per check cycle | The raw time-series. Pruned after HEALTH_CHECK_RETENTION_DAYS. |
| Outage | One row per confirmed downtime period | start_time = first failure in the streak; closed on recovery. Drives accurate Slack downtime. Viewable in the admin and can be force-resolved if stuck (see runbook). |
| Message | An operator-authored incident note shown on the status page | Severity high / medium / resolved; managed in Django admin. |
Prerequisites: Python 3.12. Local development uses SQLite — no PostgreSQL or Docker required.
git clone <repository-url>
cd "Sefaria Down Detector"
# Create and activate a virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows (PowerShell/CMD)
# source .venv/bin/activate # macOS / Linux
# Install dependencies
pip install -r requirements.txt
# Initialize the database and an admin user
python manage.py migrate
python manage.py createsuperuser
# Terminal 1 — run the web/status page (http://localhost:8000)
python manage.py runserver
# Terminal 2 — run one check cycle and exit (no Slack needed)
python manage.py run_checks --once
manage.py defaults to config.settings.development (SQLite, DEBUG=True). To run the full scheduler loop locally, use python manage.py run_checks (Ctrl+C to stop).
Without a SLACK_WEBHOOK_URL, checks still run and persist; alerts are simply skipped with a log line. This is the normal local-dev setup.
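The skip-when-unset behaviour can be pictured with a short sketch. Everything named here is hypothetical (`build_down_payload`, `send_alert`, the injected `post` transport, and the exact message layout); only the Block Kit block types (`header`, `section` with `mrkdwn` fields) are standard Slack structures.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Callable

logger = logging.getLogger("alerter_sketch")


def build_down_payload(service: str, since: datetime, status_page_url: str) -> dict:
    """Build a Block Kit payload for a confirmed outage (layout is illustrative)."""
    return {
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": f"🔴 {service} is DOWN"}},
            {"type": "section",
             "fields": [
                 {"type": "mrkdwn", "text": f"*Since:*\n{since:%Y-%m-%d %H:%M:%S} UTC"},
                 {"type": "mrkdwn", "text": f"*Status page:*\n<{status_page_url}|Open>"},
             ]},
        ]
    }


def send_alert(payload: dict, webhook_url: str, post: Callable[[str, str], object]) -> bool:
    """Send the payload unless no webhook is configured; returns True if sent.

    `post(url, body)` is an injected transport (an assumption made for this sketch),
    for example a thin wrapper around an HTTP client's POST call.
    """
    if not webhook_url:
        logger.info("SLACK_WEBHOOK_URL not set; skipping alert")
        return False
    post(webhook_url, json.dumps(payload))
    return True
```

Injecting the transport keeps the "no webhook, no request" path easy to verify in a unit test.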
Settings are split by environment under config/settings/ and selected with DJANGO_SETTINGS_MODULE:
| Module | Used by | Database | Debug |
|---|---|---|---|
| config.settings.development | manage.py default | SQLite | True |
| config.settings.production | Docker / Coolify | PostgreSQL (DATABASE_URL) | False |
| config.settings.test | pytest (pytest.ini) | in-memory SQLite | False |
Copy .env.example to .env and edit. All monitoring tunables read from the environment via django-environ.
| Variable | Description | Default |
|---|---|---|
| SECRET_KEY | Django secret key | dev placeholder (set in prod) |
| DEBUG | Enable debug mode | False |
| ALLOWED_HOSTS | Comma-separated hosts | status.sefaria.org (prod) |
| DATABASE_URL | PostgreSQL connection URL (prod) | SQLite (dev) |
| SLACK_WEBHOOK_URL | Slack incoming webhook; alerts are skipped if empty | "" |
| SLACK_CHANNEL | Informational only; the webhook itself determines the channel | sefaria-down |
| STATUS_PAGE_URL | Public URL used in Slack links and the sitemap | https://status.sefaria.org |
| HEALTH_CHECK_INTERVAL | Seconds between check cycles | 60 |
| HEALTH_CHECK_RETRIES | Retry attempts per request | 3 |
| HEALTH_CHECK_RETRY_DELAY | Seconds between retries | 10 |
| ALERT_AFTER_CONSECUTIVE_FAILURES | Default consecutive-failure threshold (per-service values override this) | 2 |
| HEALTH_CHECK_RETENTION_DAYS | Days of HealthCheck history to keep | 60 |
| SEFARIA_HEALTH_URL / MCP_HEALTH_URL / AI_CHATBOT_HEALTH_URL / LINKER_HEALTH_URL | Per-service URL overrides | see base.py |
Each entry in MONITORED_SERVICES accepts:
{
    "name": "Linker",                    # display name + DB key
    "url": "https://www.sefaria.org/api/find-refs",
    "method": "POST",                    # default GET
    "expected_status": 202,              # status that means "healthy"
    "timeout": 20,                       # per-request seconds
    "follow_redirects": False,           # default False
    "failure_threshold": 3,              # consecutive cycles before DOWN
    "check_type": "async_two_phase",     # omit for a standard check
    "request_body": {"text": {"title": "", "body": "Job 1:1"}},
    "async_verification": {
        "base_url": "https://www.sefaria.org/api/async/",
        "max_poll_attempts": 10,
        "poll_interval": 1,              # seconds between polls
    },
}
To add a service, append a dict here; no migration or code change is needed. The status page and alerter pick it up automatically.
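One way to picture how the documented defaults and URL overrides might combine is the sketch below. `normalize_service` is a hypothetical helper, and how base.py actually wires each override variable to its service is not shown in this README, so the `env_var` parameter is an assumption.

```python
import os
from typing import Mapping, Optional


def normalize_service(
    entry: dict,
    env_var: Optional[str] = None,          # hypothetical: name of the URL override variable
    environ: Mapping[str, str] = os.environ,
    default_threshold: int = 2,             # mirrors ALERT_AFTER_CONSECUTIVE_FAILURES
) -> dict:
    """Fill in the documented defaults, then apply an optional URL override."""
    service = {
        "method": "GET",
        "follow_redirects": False,
        "failure_threshold": default_threshold,
        **entry,  # explicit per-service values win over the defaults
    }
    if env_var and environ.get(env_var):
        service["url"] = environ[env_var]  # a non-empty override replaces the configured URL
    return service
```

Under this reading, an entry that omits `method` and `failure_threshold` ends up as a GET check with the global threshold, while the Linker entry keeps its explicit POST and threshold of 3.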
# Run the scheduler loop: checks every interval + daily cleanup at 03:00 UTC
python manage.py run_checks
# Run a single check cycle and exit (handy for testing/CI)
python manage.py run_checks --once
# Delete HealthCheck rows older than the retention window
python manage.py cleanup_old_checks
python manage.py cleanup_old_checks --days 14 # override retention
python manage.py cleanup_old_checks --dry-run # preview only
Cleanup runs automatically inside the scheduler; the standalone command exists for manual/maintenance use.
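The retention logic behind the cleanup command amounts to a cutoff comparison. The sketch below works on plain dicts rather than the Django ORM, and the `checked_at` field name is an assumption; it only illustrates how `--days` and `--dry-run` would interact.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cleanup_old_checks(rows: list, days: int = 60, dry_run: bool = False,
                       now: Optional[datetime] = None) -> tuple:
    """Return (rows kept, count of rows that were or would be deleted)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    stale = [row for row in rows if row["checked_at"] < cutoff]  # assumed field name
    if dry_run:
        return rows, len(stale)  # report only; nothing is removed
    kept = [row for row in rows if row["checked_at"] >= cutoff]
    return kept, len(stale)
```

A dry run returns the same count as a real run, which makes it a safe way to preview the effect of a shorter retention window.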
monitoring/templates/monitoring/status.html renders a self-refreshing public page:
- Overall banner: All Systems Operational / Partial Issues / Major Outage, computed from confirmed service states and active incident severity.
- Per-service list: Operational / Down / Unknown, with last response time. A service shows "Down" only when its last failure_threshold checks all failed (this mirrors the alert logic so the page and Slack never disagree).
- Status-aware Tanakh verse: A Hebrew + English verse, with a deep link to Sefaria, chosen from a pool that matches the current status (reassuring when up, hopeful during an outage). Defined in views.py.
- Incidents: Operator-posted Message records (active + recent history), authored in the Django admin.
- Dynamic favicon: A colored status dot is drawn onto the favicon client-side.
- Auto-refresh: The page reloads every 60s; the view itself is cached for 30s (@cache_page).
- SEO: Open Graph + Twitter cards, JSON-LD, robots.txt, and sitemap.xml, targeting the query "is Sefaria down".
To post an incident: log into /admin/, add a Message (severity high/medium), and it appears immediately. Mark it resolved via the bulk admin action.
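The per-service and banner rules can be sketched as two small functions. The "Down only when the last failure_threshold checks all failed" rule comes from the description above. The banner rule for service states (all down means "Major Outage", some down means "Partial Issues") is a guess, since this README only states the severity mapping for incidents. The function names are hypothetical.

```python
def display_state(recent_results: list, failure_threshold: int) -> str:
    """`recent_results` is newest-first; True means the check passed."""
    if not recent_results:
        return "Unknown"  # no checks recorded yet
    window = recent_results[:failure_threshold]
    if len(window) == failure_threshold and not any(window):
        return "Down"  # the last `failure_threshold` checks all failed
    return "Operational"  # unconfirmed failures are not shown as Down


def overall_banner(states: list, active_severities: list) -> str:
    """Combine per-service states with active incident severities."""
    down = [state for state in states if state == "Down"]
    # Assumed rule: every service down, or a "high" incident, is a major outage.
    if "high" in active_severities or (states and len(down) == len(states)):
        return "Major Outage"
    if "medium" in active_severities or down:
        return "Partial Issues"
    return "All Systems Operational"
```

Because `display_state` uses the same threshold window as alerting, a single failed check leaves the service shown as Operational, matching what Slack would report.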
Production runs in Docker, orchestrated by Coolify (which provides the Traefik reverse proxy and TLS). The Dockerfile is a multi-stage build that runs as a non-root user and serves static files via WhiteNoise + gunicorn.
docker-compose.yml defines four services:
| Service | Role | Command |
|---|---|---|
| db | PostgreSQL 16 | n/a |
| web | Status page + admin | migrate then gunicorn |
| scheduler | The health-check loop | python manage.py run_checks |
| cleanup | One-shot maintenance (profile maintenance) | python manage.py cleanup_old_checks |
cp .env.example .env # set SECRET_KEY, DB_PASSWORD, SLACK_WEBHOOK_URL, ALLOWED_HOSTS
docker compose up -d --build
docker compose logs -f scheduler
docker-compose.override.yml (git-ignored) adds local port mapping and a source mount; in production Coolify routes traffic to the web service, so no published ports are needed.
83 tests cover the checker, state machine, alerter, scheduler, models, admin, cleanup, views, and SEO.
# All tests (uses config.settings.test via pytest.ini)
.venv/Scripts/python -m pytest tests/ -v
# With coverage
.venv/Scripts/python -m pytest tests/ --cov=monitoring --cov-report=html
Tests mock all outbound HTTP and Slack calls, so they make no network requests. time-machine is used to test time-dependent outage-duration logic.
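To illustrate the "no network requests" approach, here is a self-contained sketch using only the standard library. It is not taken from the project's test suite: `is_healthy` is a hypothetical checker built on `urllib` as a stand-in for whatever HTTP client the project uses, and the URL is a placeholder.

```python
import urllib.error
import urllib.request
from unittest import mock


def is_healthy(url: str, expected_status: int = 200, timeout: float = 5.0) -> bool:
    """Hypothetical checker: True if `url` answers with `expected_status`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == expected_status
    except (urllib.error.URLError, TimeoutError):
        return False  # connection errors, HTTP errors and timeouts count as unhealthy


def test_is_healthy_without_network():
    # A fake response object that works as a context manager.
    fake_response = mock.MagicMock()
    fake_response.__enter__.return_value.status = 200

    with mock.patch("urllib.request.urlopen", return_value=fake_response) as urlopen:
        assert is_healthy("https://example.invalid/healthz") is True
        urlopen.assert_called_once()

    # Simulate an unreachable service without touching the network.
    with mock.patch("urllib.request.urlopen",
                    side_effect=urllib.error.URLError("connection refused")):
        assert is_healthy("https://example.invalid/healthz") is False
```

Patching at the transport boundary keeps the tests fast and deterministic, since both the success and the failure path are driven entirely by the mock.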
config/
  settings/              base · development · production · test
  urls.py                admin/ + monitoring routes
monitoring/
  models.py              HealthCheck · Outage · Message
  views.py               status page, quotes, robots.txt, sitemap.xml
  admin.py               HealthCheck + Outage (read-only) · Message (CRUD)
  services/
    checker.py           HTTP checks, retries, parallelism, async two-phase
    state.py             StateTracker: transitions, Outage lifecycle
    alerter.py           Slack Block Kit alerts
    scheduler.py         APScheduler jobs (checks + cleanup)
  management/commands/
    run_checks.py        scheduler entrypoint
    cleanup_old_checks.py
  templates/  static/  migrations/
tests/                   83 tests + factories + fixtures
Dockerfile  docker-compose.yml  requirements.txt  .env.example
- A service is red but I think it's fine: Check the scheduler logs (docker compose logs scheduler). The page only goes red after failure_threshold consecutive failures; confirm the health endpoint really returns the expected status.
- No Slack alerts: Verify SLACK_WEBHOOK_URL is set in the scheduler service's environment; an empty value logs "skipping alert" and sends nothing.
- Post an incident banner: /admin/ → Incident Messages → add (severity high → "Major Outage", medium → "Partial Issues").
- A recovered outage is stuck open (e.g. a recovery alert was missed): /admin/ → Outages → select it → "Force-resolve selected outages". The scheduler reconciles with the database on its next cycle (≤ HEALTH_CHECK_INTERVAL): it clears its in-memory down state, and if the service is in fact still failing it opens a fresh outage and re-alerts. You never need to restart the scheduler to fix a dangling outage.
- Database growing: HealthCheck rows are pruned daily; adjust HEALTH_CHECK_RETENTION_DAYS or run cleanup_old_checks manually.
- Tune noise: Raise a service's failure_threshold in base.py if it flaps; lower it for faster paging.
MIT