GitHub - firefly-operationOS/flycanon: Operational Knowledge Repository -- versioned, provenance-tracked canonical knowledge with hybrid retrieval (BM25 + vector + RRF + rerank), grounded RAG with citations, knowledge graph, multi-turn conversations, async ingest, PII guardrail and a billing + corpus snapshot. Part of Firefly OperationOS.

Operational Knowledge Repository

The living source of truth for canonical operational knowledge. Universal ingestion, hybrid retrieval, retrieval-augmented answering with citations — all behind a single HTTP service.

In a hurry? Jump to the 10-minute Quickstart → · SDK paths: Python · Java / Spring Boot · Wire payloads: Payload reference →

Why this service exists

Every operations team builds the same workflow underneath process docs, compliance runbooks, vendor agreements, internal policies, and post-incident retrospectives:

"Take this document, file it under the right thing, keep the version chain honest, and let me ask it questions later — with citations I can audit."

Doing that with a wiki is a losing game: layouts change, formats mutate (DOCX, PDF, scanned PDF, HEIC, ZIP bundles, .eml threads…), and the team ends up hand-copying snippets into another tool every time a single thing moves.

flycanon collapses the whole workflow into a single HTTP service. You ship any file format, declare the metadata you care about, and the service hands back a structured SourceRecord whose content is parsed, normalised, chunked, embedded, and indexed — ready for hybrid retrieval and grounded RAG answers. Knowledge is never edited in place: every revision appends a new version row, the previous one transitions to superseded, and the provenance graph travels with it.

It is built to drop into a production back-office stack: idempotent APIs, event-driven downstream notifications via a durable Postgres outbox, observability out of the box, and clean failure isolation per pipeline stage.
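The append-only rule described above (a revision adds a version, the previous one becomes superseded) can be sketched in a few lines of Python. This is a minimal in-memory illustration, not flycanon's implementation: `KnowledgeItem`, `KnowledgeVersion`, and `revise` are hypothetical names, the status strings are borrowed from the lifecycle listed in this README, and the sketch assumes a new version is published immediately.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeVersion:
    number: int
    content: str
    status: str  # "published" or "superseded" (assumed subset of the lifecycle)


@dataclass
class KnowledgeItem:
    item_id: str
    versions: list[KnowledgeVersion] = field(default_factory=list)

    def revise(self, content: str) -> KnowledgeVersion:
        """Append a new version; earlier content is never edited in place."""
        if self.versions:
            # Only the status of the previous version changes.
            self.versions[-1].status = "superseded"
        version = KnowledgeVersion(len(self.versions) + 1, content, "published")
        self.versions.append(version)
        return version

    @property
    def current(self) -> KnowledgeVersion:
        # Assumes at least one version exists.
        return self.versions[-1]


item = KnowledgeItem("kn-1")  # hypothetical id
item.revise("Refunds are approved within 5 days.")
item.revise("Refunds are approved within 3 days.")
print([(v.number, v.status) for v in item.versions])
```

Because old versions keep their content, a diff or provenance lookup between any two versions stays possible after any number of revisions.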

What you get back

You give the service one HTTP request. The response is a single JSON object that carries, for every interaction:

Layer	What it tells you
Sources	`SourceRecord` per ingested artefact (id, kind, status, content sha256, chunk count, ancestry chain when the artefact came out of a bundle). Bulk + async ingest variants return per-item results / a job id; `PUT /api/v1/sources/{id}` re-ingests in place and preserves the row id.
Knowledge items	Canonical pointer (status, current version, domain, jurisdiction). Updates append a new version; the previous one flips to `superseded`. `GET /api/v1/knowledge/{id}/diff` returns a unified diff + field changes + citation set deltas between any two versions.
Knowledge versions	Append-only revisions of a knowledge item. Citations to source chunks travel with the edge.
Knowledge graph	Typed edges between items (`related` / `depends_on` / `conflicts_with` / `replaces`) over `/api/v1/knowledge/{id}/relations`, plus a whole-canon view at `/api/v1/knowledge:graph` (JSON or `Accept: text/vnd.mermaid`). Conflict detection materialises `conflicts_with` edges automatically.
Candidates	Pre-canonical LLM proposals tied to a source. Accept / reject lifecycle materialises them into the knowledge chain.
Hybrid retrieval	`SearchResponse` with BM25 (Postgres `tsvector` + GIN) + dense vectors (`pgvector`) fused via Reciprocal Rank Fusion (RRF), optional cross-encoder rerank (Cohere / Voyage), and optional LLM query expansion. Each hit carries `chunk_id`, `source_id`, `source_filename`, `source_title`, `source_kind`, `source_uri`, `section_path`, `page`, the matching `content`, and the fused `score` — UIs can render citation labels without a second `GET /api/v1/sources/{id}`.
Grounded RAG answers	`AnswerResponse` with the answer + citation list (same enriched `Hit` shape, with filename / title / kind / section / page populated), `model`, `elapsed_ms`. `POST /api/v1/query:stream` emits the same payload as Server-Sent Events. A grounded "I don't know" is `answer == ""` with empty citations: flycanon returns an empty answer rather than one it cannot cite.
Conversations	Multi-turn threads at `/api/v1/conversations/...` with rolling summary + last-N-turn context windowing. Each turn returns the same enriched citation set as `/query`; `:suggest` proposes 3-5 grounded follow-up questions.
Provenance	Resolved citation graph for one knowledge version plus the source summaries it touches plus the version chain of its item.
Async ingest jobs	`IngestJob` row + SSE event stream for any large or bulk ingest. Status / stage / progress / source id / RFC 7807 error envelope all surface through `GET /api/v1/ingest-jobs/{id}` and `GET /api/v1/ingest-jobs/{id}/stream` (cursor-resumable).
Knowledge quality	`GET /api/v1/knowledge:stale` returns per-item staleness scores (cosine vs fresh sources, 6h cached); `POST /api/v1/knowledge:detect-conflicts` runs an LLM-judged pairwise conflict scan, queues confirmed conflicts as candidates, and auto-creates the matching `conflicts_with` edges.
PII guardrail	Configurable regex scanner with four policies (`disabled` / `warn` / `redact` / `reject`). Runs on every intake path (initial submit, bulk, async, replace). `reject` returns RFC 7807 + `findings[]` so callers can surface a precise diagnostic.
Billing + cost stream	`/api/v1/billing` aggregates spend; `/events` drills into per-call breadcrumbs (correlation id, subject, latency); `/summary` returns 24h / 7d / 30d snapshots; `/top` and `/by-subject` answer "who" and "where did it go"; `/latency` returns p50 / p95 / p99 from the same cost-event stream.
Corpus inventory	`GET /api/v1/stats` -- one-shot snapshot covering sources (by kind + status + bytes), knowledge items (by status + domain), versions, candidates, chunks (embedded coverage), ingest jobs (by status + avg attempts), and the cost headline (24h / 30d).
Append-only audit log	Every mutation (`/api/v1/audit`) with correlation id, actor, payload, and W3C trace context.
EDA topics	Three durable topics published via the Postgres outbox: `flycanon.ingest`, `flycanon.knowledge`, `flycanon.audit`.
RFC 7807 error envelope	Every non-2xx response is a `ProblemDetail` (singular) payload from `flycanon.web.conventions` with a stable `code` field for branching. `type` URI base: `https://firefly.dev/problems/...`.
OpenAPI 3.1	Multi-paragraph DTO descriptions mixing business and technical context, served live at `/openapi.json` (Swagger UI at `/docs`, ReDoc at `/redoc`).
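To make the four PII guardrail policies in the table above concrete, here is a small sketch of how a regex scanner could branch on `disabled` / `warn` / `redact` / `reject`. The patterns, the `ScanResult` type, and the shape of each finding are assumptions for illustration; the service's real scanner and its `findings[]` schema may differ.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; a production scanner needs a reviewed set.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class ScanResult:
    text: str             # text to pass downstream (possibly redacted)
    findings: list[dict]  # hypothetical finding shape: kind + character span
    rejected: bool = False


def apply_pii_policy(text: str, policy: str) -> ScanResult:
    if policy not in {"disabled", "warn", "redact", "reject"}:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "disabled":
        return ScanResult(text, [])
    # Findings always refer to offsets in the original text.
    findings = [
        {"kind": kind, "start": m.start(), "end": m.end()}
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    if policy == "reject" and findings:
        # The caller would turn this into an RFC 7807 error response.
        return ScanResult(text, findings, rejected=True)
    if policy == "redact":
        for kind, pattern in PATTERNS.items():
            text = pattern.sub(f"[{kind.upper()}]", text)
    # "warn" keeps the text unchanged and only reports findings.
    return ScanResult(text, findings)
```

For example, `apply_pii_policy("Contact jane@example.com", "redact").text` yields `"Contact [EMAIL]"`, while the same input under `warn` passes through untouched with one finding attached.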

Universal ingestion

Submit any file format. flycanon detects the media type by combining stdlib mimetypes, a curated table of magic-byte headers, and ZIP central-directory inspection (to disambiguate Office formats from generic archives). It then routes the payload through a fixed routing matrix before the parse / chunk / embed / index pipeline runs:

Class	Examples	Strategy
Plain text	`text/plain`, `text/markdown`, `text/csv`, JSON, XML	Pass-through.
PDF — Full Digital Text	Born-digital PDFs (Word / LibreOffice / LaTeX exports, browser "Save as PDF", reporting pipelines)	Phase 1 (PyMuPDF text-layer): `page.get_text()` on each page returns the encoded text stream in reading order. No rendering, microseconds per page.
PDF — Image (scanned)	Scanned contracts, fax output, photographed pages, mobile-camera captures — pages are raster images of the original	Phase 2 (OCR fallback): pages under `_MIN_CHARS_PER_PAGE` rasterised by PyMuPDF at `_OCR_DPI` (200) and OCR'd via Tesseract (`pytesseract.image_to_string`) — engine selectable via `FLYCANON_PDF_OCR_ENGINE` (`tesseract` default; `docling` for layout-aware OCR with the `docling` extra). Languages default to `eng+spa`, override via `FLYCANON_OCR_LANG`.
PDF — Hybrid	Mixed: typed body + scanned signature page, or any blend of digital and image pages	Phase 1 runs on every page; Phase 2 only fires for pages flagged as image-only. The two phases compose page-by-page.
PDF — guard rail	Encrypted or corrupt PDFs	Rejected up-front by `PdfGuard` (lightweight `pypdf` pre-flight) with `error_code=encrypted_pdf` / `corrupt_source`.
Office	DOCX / XLSX / PPTX / ODT / ODS / ODP / RTF	`office_converter=none` (default) uses native per-format loaders (python-docx / openpyxl / python-pptx / odfpy / striprtf); `gotenberg` (HTTP sidecar) or `libreoffice` (in-container `soffice`) render to PDF first.
Raster images	PNG / JPG / WEBP	Pass-through to OCR (Tesseract, multi-language).
Converted images	HEIC / AVIF / TIFF / SVG / BMP	Pillow + pillow-heif + cairosvg → PNG, then OCR.
Archives	ZIP / 7Z / TAR / TAR.GZ / TAR.BZ2 / EPUB	Expanded recursively (capped at `binary_max_recursion_depth` and `binary_max_expanded_files`). Each child re-enters the normaliser.
Emails	EML / MSG	Body + each attachment exposed as a separate artefact carrying `parent_artifact` ancestry.
Web	HTML / XHTML	BeautifulSoup-backed `HtmlLoader`.
Transcripts	WebVTT / SRT	Cue-aware loader.
Unknown	everything else	`UnsupportedBinaryError` → `IngestionFailed` event with stable `code`.
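The detection step that feeds this routing matrix can be sketched with the standard library alone. The header table and OOXML marker list below are deliberately tiny and hypothetical, and `detect_media_type` is an illustrative name rather than flycanon's function; the point is the order of checks: magic bytes first, ZIP central-directory inspection for `PK` containers, and the extension-based `mimetypes` guess as a fallback.

```python
import io
import mimetypes
import zipfile

# Small illustrative header table: (magic prefix, media type).
MAGIC_HEADERS = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]

# Entries whose presence marks a ZIP container as an Office (OOXML) document.
OOXML_MARKERS = {
    "word/document.xml": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_media_type(data: bytes, filename: str = "") -> str:
    # 1. Magic bytes at the start of the payload.
    for magic, media_type in MAGIC_HEADERS:
        if data.startswith(magic):
            return media_type
    # 2. ZIP local-file header: read the central directory to tell
    #    Office documents apart from generic archives.
    if data.startswith(b"PK\x03\x04"):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            names = set()
        for marker, media_type in OOXML_MARKERS.items():
            if marker in names:
                return media_type
        return "application/zip"
    # 3. Fall back to the filename extension.
    guessed, _ = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"
```

Checking content before the filename means a DOCX uploaded as `report.zip` is still classified as a Word document under this sketch.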

Multi-artefact intakes (archives, multi-attachment emails) are merged into a single Markdown document with ## Artifact: <filename> section markers, so chunks remain attributable via metadata.parent_artifact.

Postgres-native retrieval

The BM25 corpus is co-located with the dense projection so hybrid retrieval is a single-host, Postgres-native operation:

BM25 rides on a tsvector + GIN index on canon_chunks.tsv (a Postgres GENERATED column derived from content). No extra service, no SQLite file. Text-search config is simple by default (multilingual); switch to english / spanish / … via FLYCANON_BM25_TEXT_SEARCH_CONFIG.
Dense vectors live in a pluggable backend. pgvector (default) keeps them in the same operational Postgres — an HNSW index on vector_cosine_ops, tuneable m / ef_construction. qdrant and chroma use the adapters that ship in fireflyframework-agentic.

FLYCANON_VECTOR_STORE selects the dense backend (BM25 always stays on Postgres):

Backend	Use case
`pgvector` (default)	PostgreSQL + pgvector extension. HNSW on `vector_cosine_ops`, tuneable `m` / `ef_construction`. Same operational Postgres as the canonical store AND the BM25 projection, with DB-enforced Row-Level Security per scope.
`qdrant`	Self-hosted or Qdrant Cloud (`uv sync --extra qdrant`). Good filtering + scaling when you want the dense index off Postgres.
`chroma`	ChromaDB, in-process or server (`uv sync --extra chroma`). Simplest external option.

Every backend is wrapped in a tenant/workspace-scoped layer, so reads and writes are confined to (tenant_id, workspace_id) via a canonical t/<tenant>/w/<workspace> namespace. Fusion always happens via Reciprocal Rank Fusion over the two channels.
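Reciprocal Rank Fusion scores each chunk by summing 1 / (k + rank) over every ranking in which it appears, so only rank positions matter and the BM25 and cosine scores never need to be put on a common scale. The sketch below shows the technique in isolation; the function name is illustrative and `k = 60` is the conventional default from the RRF literature, not a value confirmed for flycanon.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk ids; ranks are 1-based."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["c1", "c2", "c3"]   # lexical channel, best first
dense_hits = ["c3", "c1", "c4"]  # vector channel, best first
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear in both channels (`c1`, `c3`) outrank chunks found by only one, which is the behaviour hybrid retrieval relies on.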

Public surface

Concern	Endpoint(s)
Source intake (any format, bytes / base64 / URL)	`POST /api/v1/sources`
Bulk + async intake (jobs + SSE progress)	`POST /api/v1/sources:bulk`, `:async`, `GET /api/v1/ingest-jobs/{id}/stream`
Source re-ingest (preserves the row id)	`PUT /api/v1/sources/{id}`
Source lookup / pagination	`GET /api/v1/sources[/{id}]`
Knowledge-item lifecycle (draft / published / superseded / retired)	`/api/v1/knowledge/...`
Versioned diff between two knowledge versions	`GET /api/v1/knowledge/{id}/diff`
Knowledge graph (typed edges + JSON / Mermaid view)	`/api/v1/knowledge/{id}/relations`, `GET /api/v1/knowledge:graph`
Hybrid retrieval (+ optional rerank + query expansion)	`POST /api/v1/search`
RAG answer with citations (+ token streaming)	`POST /api/v1/query`, `POST /api/v1/query:stream`
Multi-turn conversations + suggested follow-ups	`/api/v1/conversations/...`
Candidate proposals (pre-canonical)	`/api/v1/candidates/...`
Provenance graph	`GET /api/v1/knowledge/{id}/provenance`
Quality scans (staleness + conflict detection)	`GET /api/v1/knowledge:stale`, `POST /api/v1/knowledge:detect-conflicts`
Cost / billing rollups	`GET /api/v1/billing` (aggregate)
Cost drill-down -- per-call events	`GET /api/v1/billing/events`
Cost drill-down -- 24h / 7d / 30d snapshot	`GET /api/v1/billing/summary`
Cost drill-down -- top-N consumers	`GET /api/v1/billing/top`
Cost drill-down -- per-subject attribution	`GET /api/v1/billing/by-subject`
Cost drill-down -- latency percentiles (p50/p95/p99)	`GET /api/v1/billing/latency`
Corpus + queue + cost inventory snapshot	`GET /api/v1/stats`
Append-only audit log	`GET /api/v1/audit`
Taxonomy (domain + jurisdiction)	`/api/v1/taxonomy/...`
Agent-token CRUD (user-tier; mint returns secret ONCE)	`/api/v1/agent-tokens`
Agent surface (`X-Agent-Token`-protected, 8 endpoints)	`/api/v1/agent/sources`, `.../query`, `.../query/stream`, `.../search`, `.../knowledge/{id}`, `.../knowledge/{id}/provenance`, `.../candidates:propose`
Identity / model info	`GET /api/v1/version`
Health / readiness / liveness	`/actuator/health/...`
OpenAPI 3.1	`/openapi.json`, `/docs`, `/redoc`

Quickstart

Want the 10-minute curl tour instead? See QUICKSTART.md — task docker:up:test + one curl call against a mock LLM, no API keys.

git clone https://github.com/firefly-operationOS/flycanon.git
cd flycanon
task deps:install          # uv sync --extra dev (pins .venv)
task docker:up             # api + worker + postgres(pgvector) + redis
curl -fsS http://localhost:8500/actuator/health | jq .

Ingest a sample DOCX (the binary normaliser handles every format — this is just the simplest curl):

curl -fsS -X POST http://localhost:8500/api/v1/sources \
  -F "file=@./tests/fixtures/sample.docx" \
  -F 'metadata={"title":"Sample","domain":"process_owner"};type=application/json' \
  | jq .

Search the corpus:

curl -fsS -X POST http://localhost:8500/api/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query":"what does the document say about scope","top_k":5}' | jq .

Ask a grounded question:

curl -fsS -X POST http://localhost:8500/api/v1/query \
  -H 'Content-Type: application/json' \
  -d '{"question":"Summarise the scope section in three sentences."}' | jq .

A grounded "I don't know" looks like {"answer":"","citations":[]}: when flycanon cannot ground an answer in citations, it returns an empty one instead of guessing.
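The same query call can be made from Python with only the standard library. This sketch mirrors the curl example above and assumes the response fields named in this README (`answer`, `citations`, and per-hit `source_filename` / `source_id` / `page`); the helper names are illustrative and are not part of the official SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8500"  # quickstart default from the curl examples


def build_query_request(question: str) -> urllib.request.Request:
    """Build the POST /api/v1/query request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_grounded_refusal(answer: dict) -> bool:
    """True when the service returned an empty answer with no citations."""
    return answer.get("answer", "") == "" and not answer.get("citations")


def citation_labels(answer: dict) -> list[str]:
    """Render one human-readable label per citation."""
    labels = []
    for hit in answer.get("citations", []):
        label = hit.get("source_filename") or hit.get("source_id", "unknown")
        if hit.get("page") is not None:
            label += f", p. {hit['page']}"
        labels.append(label)
    return labels


def ask(question: str) -> dict:
    """Send the question to a running service and decode the JSON reply."""
    with urllib.request.urlopen(build_query_request(question), timeout=30) as resp:
        return json.load(resp)
```

With the quickstart stack running, `ask("Summarise the scope section in three sentences.")` returns the decoded response; checking `is_grounded_refusal` first lets a UI show "no grounded answer" instead of an empty box.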

Local development

task dev:db              # Postgres (pgvector/pg16) + Redis only
task dev:migrate         # alembic upgrade head
task dev:serve           # FastAPI hot-reload on :8500
task dev:worker          # EDA worker in a separate terminal

Smoke the running service:

task health              # /actuator/health
task version             # /api/v1/version
task openapi             # /openapi.json

SDKs

Both SDKs pin their version to the service's CalVer (26.5.6), so the client and server upgrade in lockstep.

SDK	Highlights
Python	Async-first, `httpx` + Pydantic. Python ≥ 3.11.
Java	Spring Boot 3.5.9 + Spring `RestClient` + Jackson. Java 25 (LTS). `groupId = com.firefly`. Ships an `@AutoConfiguration` so a `CanonClient` bean is wired straight from `flycanon.*` properties.

Java consumers just declare the dependency and inject the bean:

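import org.springframework.stereotype.Service;
// CanonClient is provided by the flycanon Java SDK; import it from the SDK's package.

@Service
public class CopilotService {
    private final CanonClient canon;

    // Constructor injection: Spring passes in the auto-configured CanonClient bean.
    public CopilotService(CanonClient canon) { this.canon = canon; }
    // canon.submitSource(...), canon.search(...), canon.answer(...)
}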

Documentation

Document	Read it when…
QUICKSTART.md	You want your first ingest + search + answer in ten minutes (HTTP / curl).
docs/architecture.md	You need the data model, the binary-normaliser routing matrix, the Postgres-native retrieval design, the dependency arrows.
docs/pipeline.md	You're touching the orchestrator, adding a new stage, or chasing a slow ingest.
docs/api-reference.md	You're integrating with the HTTP API and need every endpoint, shape, and status code.
docs/payload-reference.md	You're composing the request payload — every field, option, and example.
docs/eda-events.md	You're subscribing to the `flycanon.ingest` / `flycanon.knowledge` / `flycanon.audit` / `canon.workspaces.v1` topics.
docs/consumers.md	You're building or auditing a service that consumes flycanon -- agent token scopes, workspace events, retry posture, wire-contract stability.
docs/deployment.md	You're running this in production — env vars, topologies, OCR engines, embedding providers, auth, observability, sizing.
docs/cicd.md	You're cutting a release or wiring CI/CD — the three GitHub Actions workflows, release cookbook, required secrets.
docs/troubleshooting.md	The service / ingest / search / answer surface is misbehaving — symptom → root cause → fix.
docs/glossary.md	You need a precise definition for a term the API or docs use.
sdks/python/README.md	You're integrating from Python — async-first SDK with Pydantic typing.
sdks/java/README.md	You're integrating from Java / Spring Boot — Spring Boot 3.5.9, `com.firefly` groupId, `@AutoConfiguration`.

The OpenAPI 3.1 document is served live by the running service at /openapi.json, with Swagger UI at /docs and ReDoc at /redoc.

Repository layout

flycanon/
├─ Dockerfile                # Multi-stage build with the binary-normaliser system deps
├─ Taskfile.yml              # Canonical dev-loop interface
├─ docker-compose.yml        # api + worker + postgres (pgvector) + redis (optional gotenberg)
├─ docker-compose.test.yml   # Adds the mock LLM for integration tests
├─ pyfly.yaml                # pyfly application configuration
├─ alembic.ini               # Migration runner config
├─ env_template              # Reference environment file (.env is gitignored)
├─ migrations/               # Alembic versions
├─ src/flycanon/
│  ├─ app.py                 # @pyfly_application + scan_packages
│  ├─ main.py                # ASGI entry consumed by uvicorn
│  ├─ cli.py                 # `flycanon {serve,worker,migrate}`
│  ├─ config.py              # CanonSettings (FLYCANON_* env)
│  ├─ core/                  # @configuration + services + binary normaliser + mappers
│  ├─ interfaces/            # Public DTOs + enums
│  ├─ models/                # SQLAlchemy entities + repositories
│  ├─ resources/prompts/     # YAML prompt templates
│  └─ web/                   # @rest_controller + @controller_advice
├─ sdks/
│  ├─ python/                # Async-first Python SDK (Apache-2.0)
│  └─ java/                  # Spring Boot Java SDK (Apache-2.0, com.firefly)
├─ docs/                     # Architecture, payload reference, API reference, EDA events, glossary
└─ tests/
   ├─ unit/
   └─ integration/

License

The bundled SDKs under sdks/python and sdks/java ship their own Apache License 2.0 files.

Part of Firefly OperationOS. Platform-agnostic by design.
