firefly-operationOS/flycanon

flycanon — operational knowledge repository

Operational Knowledge Repository

The living source of truth for canonical operational knowledge. Universal ingestion, hybrid retrieval, retrieval-augmented answering with citations — all behind a single HTTP service.

Python 3.13 Java 25 pyfly agentic OpenAPI vector store Version License: Apache 2.0


In a hurry?  Jump to the 10-minute Quickstart →  ·  SDK paths: Python · Java / Spring Boot  ·  Wire payloads: Payload reference →


Why this service exists

Every operations team builds the same workflow underneath process docs, compliance runbooks, vendor agreements, internal policies, and post-incident retrospectives:

"Take this document, file it under the right thing, keep the version chain honest, and let me ask it questions later — with citations I can audit."

Doing that with a wiki is a losing game: layouts change, formats mutate (DOCX, PDF, scanned PDF, HEIC, ZIP bundles, .eml threads…), and the team ends up hand-copying snippets into another tool every time a single thing moves.

flycanon collapses the whole workflow into a single HTTP service. You ship any file format, declare the metadata you care about, and the service hands back a structured SourceRecord whose content is parsed, normalised, chunked, embedded, and indexed — ready for hybrid retrieval and grounded RAG answers. Knowledge is never edited in place: every revision appends a new version row, the previous one transitions to superseded, and the provenance graph travels with it.
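The append-only rule can be pictured with a small in-memory model. This is a minimal sketch, not flycanon's data model: the class names, the `revise` method, and the choice of `published` as the status of a fresh version are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeVersion:
    number: int
    content: str
    status: str = "published"  # assumed status for a fresh version in this sketch


@dataclass
class KnowledgeItem:
    versions: list[KnowledgeVersion] = field(default_factory=list)

    def revise(self, content: str) -> KnowledgeVersion:
        """Append a new version; the content of earlier versions is never edited."""
        if self.versions:
            # Only the status of the previous head changes.
            self.versions[-1].status = "superseded"
        version = KnowledgeVersion(number=len(self.versions) + 1, content=content)
        self.versions.append(version)
        return version

    @property
    def current(self) -> KnowledgeVersion:
        """The newest version (raises IndexError if nothing was ever appended)."""
        return self.versions[-1]


item = KnowledgeItem()
item.revise("Refunds are approved by the team lead.")
item.revise("Refunds over 500 EUR are approved by finance.")
print([(v.number, v.status) for v in item.versions])
# [(1, 'superseded'), (2, 'published')]
```

The first version keeps its original text, so a diff or audit between any two revisions stays possible.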

It is built to drop into a production back-office stack: idempotent APIs, event-driven downstream notifications via a durable Postgres outbox, observability out of the box, and clean failure isolation per pipeline stage.


What you get back

You give the service one HTTP request. The response is a single JSON object that carries, for every interaction:

| Layer | What it tells you |
| --- | --- |
| Sources | SourceRecord per ingested artefact (id, kind, status, content sha256, chunk count, ancestry chain when the artefact came out of a bundle). Bulk + async ingest variants return per-item results / a job id; PUT /api/v1/sources/{id} re-ingests in place and preserves the row id. |
| Knowledge items | Canonical pointer (status, current version, domain, jurisdiction). Updates append a new version; the previous one flips to superseded. GET /api/v1/knowledge/{id}/diff returns a unified diff + field changes + citation set deltas between any two versions. |
| Knowledge versions | Append-only revisions of a knowledge item. Citations to source chunks travel with the edge. |
| Knowledge graph | Typed edges between items (related / depends_on / conflicts_with / replaces) over /api/v1/knowledge/{id}/relations, plus a whole-canon view at /api/v1/knowledge:graph (JSON or Accept: text/vnd.mermaid). Conflict detection materialises conflicts_with edges automatically. |
| Candidates | Pre-canonical LLM proposals tied to a source. Accept / reject lifecycle materialises them into the knowledge chain. |
| Hybrid retrieval | SearchResponse with BM25 (Postgres tsvector + GIN) + dense vectors (pgvector) fused via Reciprocal Rank Fusion (RRF), optional cross-encoder rerank (Cohere / Voyage), and optional LLM query expansion. Each hit carries chunk_id, source_id, source_filename, source_title, source_kind, source_uri, section_path, page, the matching content, and the fused score, so UIs can render citation labels without a second GET /api/v1/sources/{id}. |
| Grounded RAG answers | AnswerResponse with the answer + citation list (same enriched Hit shape, with filename / title / kind / section / page populated), model, elapsed_ms. POST /api/v1/query:stream emits the same payload as Server-Sent Events. A grounded "I don't know" is answer == "" with empty citations; flycanon never hallucinates. |
| Conversations | Multi-turn threads at /api/v1/conversations/... with rolling summary + last-N-turn context windowing. Each turn returns the same enriched citation set as /query; :suggest proposes 3-5 grounded follow-up questions. |
| Provenance | Resolved citation graph for one knowledge version plus the source summaries it touches plus the version chain of its item. |
| Async ingest jobs | IngestJob row + SSE event stream for any large or bulk ingest. Status / stage / progress / source id / RFC 7807 error envelope all surface through GET /api/v1/ingest-jobs/{id} and GET /api/v1/ingest-jobs/{id}/stream (cursor-resumable). |
| Knowledge quality | GET /api/v1/knowledge:stale returns per-item staleness scores (cosine vs fresh sources, 6h cached); POST /api/v1/knowledge:detect-conflicts runs an LLM-judged pairwise conflict scan, queues confirmed conflicts as candidates, and auto-creates the matching conflicts_with edges. |
| PII guardrail | Configurable regex scanner with four policies (disabled / warn / redact / reject). Runs on every intake path (initial submit, bulk, async, replace). reject returns RFC 7807 + findings[] so callers can surface a precise diagnostic. |
| Billing + cost stream | /api/v1/billing aggregates spend; /events drills into per-call breadcrumbs (correlation id, subject, latency); /summary returns 24h / 7d / 30d snapshots; /top and /by-subject answer "who" and "where did it go"; /latency returns p50 / p95 / p99 from the same cost-event stream. |
| Corpus inventory | GET /api/v1/stats: one-shot snapshot covering sources (by kind + status + bytes), knowledge items (by status + domain), versions, candidates, chunks (embedded coverage), ingest jobs (by status + avg attempts), and the cost headline (24h / 30d). |
| Append-only audit log | Every mutation (/api/v1/audit) with correlation id, actor, payload, and W3C trace context. |
| EDA topics | Three durable topics published via the Postgres outbox: flycanon.ingest, flycanon.knowledge, flycanon.audit. |
| RFC 7807 error envelope | Every non-2xx response is a ProblemDetail (singular) payload from flycanon.web.conventions with a stable code field for branching. type URI base: https://firefly.dev/problems/.... |
| OpenAPI 3.1 | Multi-paragraph DTO descriptions mixing business and technical context, served live at /openapi.json (Swagger UI at /docs, ReDoc at /redoc). |
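The four PII guardrail policies from the table can be sketched as a regex scan followed by a policy switch. This is an illustration under assumptions, not the service's scanner: the two patterns, the `PiiRejected` exception, and the `[EMAIL]`-style placeholder are hypothetical, and the real rules are configurable.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}
POLICIES = {"disabled", "warn", "redact", "reject"}


class PiiRejected(Exception):
    """Raised under the 'reject' policy; carries the findings for the caller."""

    def __init__(self, findings: list[dict]):
        super().__init__(f"{len(findings)} PII finding(s)")
        self.findings = findings


def scan(text: str) -> list[dict]:
    """Return one finding per regex match, with character offsets."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"kind": kind, "start": match.start(), "end": match.end()})
    return findings


def apply_policy(text: str, policy: str) -> tuple[str, list[dict]]:
    """Apply one of the four policies and return (possibly redacted text, findings)."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "disabled":
        return text, []
    findings = scan(text)
    if policy == "reject" and findings:
        raise PiiRejected(findings)
    if policy == "redact":
        # Replace right-to-left so earlier offsets stay valid.
        for f in sorted(findings, key=lambda f: f["start"], reverse=True):
            text = text[: f["start"]] + f"[{f['kind'].upper()}]" + text[f["end"]:]
    return text, findings  # 'warn' passes the text through with findings attached


redacted, found = apply_policy("Contact ana@example.com for access", "redact")
print(redacted)  # Contact [EMAIL] for access
```

Under `reject`, a caller would catch the exception and surface `findings`, which mirrors the findings[] field the table describes on the RFC 7807 response.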

Universal ingestion

Submit any file format. flycanon detects the media type from the magic bytes (stdlib mimetypes + a curated header table + ZIP central-directory inspection to disambiguate Office formats from generic archives) and routes the payload through a fixed routing matrix before the parse / chunk / embed / index pipeline runs:

| Class | Examples | Strategy |
| --- | --- | --- |
| Plain text | text/plain, text/markdown, text/csv, JSON, XML | Pass-through. |
| PDF: Full Digital Text | Born-digital PDFs (Word / LibreOffice / LaTeX exports, browser "Save as PDF", reporting pipelines) | Phase 1 (PyMuPDF text-layer): pymupdf.get_text() per page returns the encoded text stream in reading order. No rendering, microseconds per page. |
| PDF: Image (scanned) | Scanned contracts, fax output, photographed pages, mobile-camera captures; pages are raster images of the original | Phase 2 (OCR fallback): pages under _MIN_CHARS_PER_PAGE are rasterised by PyMuPDF at _OCR_DPI (200) and OCR'd via Tesseract (pytesseract.image_to_string). The engine is selectable via FLYCANON_PDF_OCR_ENGINE (tesseract default; docling for layout-aware OCR with the docling extra). Languages default to eng+spa, override via FLYCANON_OCR_LANG. |
| PDF: Hybrid | Mixed: typed body + scanned signature page, or any blend of digital and image pages | Phase 1 runs on every page; Phase 2 only fires for pages flagged as image-only. The two phases compose page-by-page. |
| PDF: guard rail | Encrypted or corrupt PDFs | Rejected up-front by PdfGuard (lightweight pypdf pre-flight) with error_code=encrypted_pdf / corrupt_source. |
| Office | DOCX / XLSX / PPTX / ODT / ODS / ODP / RTF | office_converter=none (default) uses native per-format loaders (python-docx / openpyxl / python-pptx / odfpy / striprtf); gotenberg (HTTP sidecar) or libreoffice (in-container soffice) render to PDF first. |
| Raster images | PNG / JPG / WEBP | Pass-through to OCR (Tesseract, multi-language). |
| Converted images | HEIC / AVIF / TIFF / SVG / BMP | Pillow + pillow-heif + cairosvg → PNG, then OCR. |
| Archives | ZIP / 7Z / TAR / TAR.GZ / TAR.BZ2 / EPUB | Expanded recursively (capped at binary_max_recursion_depth and binary_max_expanded_files). Each child re-enters the normaliser. |
| Emails | EML / MSG | Body + each attachment exposed as a separate artefact carrying parent_artifact ancestry. |
| Web | HTML / XHTML | BeautifulSoup-backed HtmlLoader. |
| Transcripts | WebVTT / SRT | Cue-aware loader. |
| Unknown | everything else | UnsupportedBinaryError → IngestionFailed event with a stable code. |
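DOCX, XLSX, PPTX, ODF files, and EPUBs all start with the same ZIP magic bytes as a plain archive, which is why the detector looks inside the central directory. The sketch below shows one way to do that with the standard library. It is not flycanon's detector: the function name and the returned labels are hypothetical, and how each label maps onto the routing matrix is the service's decision.

```python
import io
import zipfile

# Top-level folders that distinguish the three OOXML formats.
OOXML_PARTS = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}


def classify_zip(payload: bytes) -> str:
    """Return a coarse label for a ZIP payload based on its entry names."""
    if not payload.startswith(b"PK"):
        raise ValueError("not a ZIP payload")
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        names = zf.namelist()  # read from the central directory
        if "[Content_Types].xml" in names:
            for prefix, label in OOXML_PARTS.items():
                if any(name.startswith(prefix) for name in names):
                    return label
        if "mimetype" in names:
            # ODF and EPUB containers declare their type in a 'mimetype' entry.
            declared = zf.read("mimetype").decode("ascii", "replace").strip()
            if declared == "application/epub+zip":
                return "epub"
            if declared.startswith("application/vnd.oasis.opendocument."):
                return "odf"
    return "archive"


def make_zip(entries: dict[str, str]) -> bytes:
    """Build an in-memory ZIP so the sketch can be tried without fixture files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


docx_like = make_zip({"[Content_Types].xml": "<Types/>", "word/document.xml": "<w/>"})
print(classify_zip(docx_like))                      # docx
print(classify_zip(make_zip({"notes.txt": "hi"})))  # archive
```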

Multi-artefact intakes (archives, multi-attachment emails) are merged into a single Markdown document with ## Artifact: <filename> section markers, so chunks remain attributable via metadata.parent_artifact.


Postgres-native retrieval

The BM25 corpus is co-located with the dense projection so hybrid retrieval is a single-host, Postgres-native operation:

  • BM25 rides on a tsvector + GIN index on canon_chunks.tsv (a Postgres GENERATED column derived from content). No extra service, no SQLite file. Text-search config is simple by default (multilingual); switch to english / spanish / … via FLYCANON_BM25_TEXT_SEARCH_CONFIG.
  • Dense vectors live in a pluggable backend. pgvector (default) keeps them in the same operational Postgres — an HNSW index on vector_cosine_ops, tuneable m / ef_construction. qdrant and chroma use the adapters that ship in fireflyframework-agentic.

FLYCANON_VECTOR_STORE selects the dense backend (BM25 always stays on Postgres):

| Backend | Use case |
| --- | --- |
| pgvector (default) | PostgreSQL + pgvector extension. HNSW on vector_cosine_ops, tuneable m / ef_construction. Same operational Postgres as the canonical store AND the BM25 projection, with DB-enforced Row-Level Security per scope. |
| qdrant | Self-hosted or Qdrant Cloud (uv sync --extra qdrant). Good filtering + scaling when you want the dense index off Postgres. |
| chroma | ChromaDB, in-process or server (uv sync --extra chroma). Simplest external option. |

Every backend is wrapped in a tenant/workspace-scoped layer, so reads and writes are confined to (tenant_id, workspace_id) via a canonical t/<tenant>/w/<workspace> namespace. Fusion always happens via Reciprocal Rank Fusion over the two channels.
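Reciprocal Rank Fusion needs only the rank of each hit in each channel, not the raw scores: a chunk's fused score is the sum of 1 / (k + rank) over the channels that returned it. The sketch below shows the standard formula. The constant k = 60 is the commonly used default and an assumption here, since the text does not state flycanon's value; the chunk ids are made up.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists; rank 1 is the best hit in each channel."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


bm25_hits = ["c1", "c2", "c3"]   # hypothetical BM25 channel output
dense_hits = ["c3", "c1", "c4"]  # hypothetical dense channel output
fused = rrf_fuse([bm25_hits, dense_hits])
print([chunk_id for chunk_id, _ in fused])
# ['c1', 'c3', 'c2', 'c4']
```

Chunks found by both channels (c1, c3) outrank chunks found by only one, which is the property that makes rank-based fusion robust to the two channels' incomparable score scales.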


Public surface

| Concern | Endpoint(s) |
| --- | --- |
| Source intake (any format, bytes / base64 / URL) | POST /api/v1/sources |
| Bulk + async intake (jobs + SSE progress) | POST /api/v1/sources:bulk, :async, GET /api/v1/ingest-jobs/{id}/stream |
| Source re-ingest (preserves the row id) | PUT /api/v1/sources/{id} |
| Source lookup / pagination | GET /api/v1/sources[/{id}] |
| Knowledge-item lifecycle (draft / published / superseded / retired) | /api/v1/knowledge/... |
| Versioned diff between two knowledge versions | GET /api/v1/knowledge/{id}/diff |
| Knowledge graph (typed edges + JSON / Mermaid view) | /api/v1/knowledge/{id}/relations, GET /api/v1/knowledge:graph |
| Hybrid retrieval (+ optional rerank + query expansion) | POST /api/v1/search |
| RAG answer with citations (+ token streaming) | POST /api/v1/query, POST /api/v1/query:stream |
| Multi-turn conversations + suggested follow-ups | /api/v1/conversations/... |
| Candidate proposals (pre-canonical) | /api/v1/candidates/... |
| Provenance graph | GET /api/v1/knowledge/{id}/provenance |
| Quality scans (staleness + conflict detection) | GET /api/v1/knowledge:stale, POST /api/v1/knowledge:detect-conflicts |
| Cost / billing rollups | GET /api/v1/billing (aggregate) |
| Cost drill-down: per-call events | GET /api/v1/billing/events |
| Cost drill-down: 24h / 7d / 30d snapshot | GET /api/v1/billing/summary |
| Cost drill-down: top-N consumers | GET /api/v1/billing/top |
| Cost drill-down: per-subject attribution | GET /api/v1/billing/by-subject |
| Cost drill-down: latency percentiles (p50/p95/p99) | GET /api/v1/billing/latency |
| Corpus + queue + cost inventory snapshot | GET /api/v1/stats |
| Append-only audit log | GET /api/v1/audit |
| Taxonomy (domain + jurisdiction) | /api/v1/taxonomy/... |
| Agent-token CRUD (user-tier; mint returns secret ONCE) | /api/v1/agent-tokens |
| Agent surface (X-Agent-Token-protected, 8 endpoints) | /api/v1/agent/sources, .../query, .../query/stream, .../search, .../knowledge/{id}, .../knowledge/{id}/provenance, .../candidates:propose |
| Identity / model info | GET /api/v1/version |
| Health / readiness / liveness | /actuator/health/... |
| OpenAPI 3.1 | /openapi.json, /docs, /redoc |
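The two streaming endpoints in the table (/api/v1/query:stream and /api/v1/ingest-jobs/{id}/stream) speak Server-Sent Events, so a client has to reassemble events from `event:`, `data:`, and `id:` lines. The parser below follows the generic SSE wire format. The event names and payload fields in the sample are invented for illustration, and whether flycanon resumes from the last `id` through a header or a query parameter is not stated here.

```python
import json
from collections.abc import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per dispatched event from an iterable of raw SSE lines."""
    event, data, last_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it carries data.
            if data:
                yield {"event": event, "data": "\n".join(data), "id": last_id}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if name == "event":
            event = value
        elif name == "data":
            data.append(value)
        elif name == "id":
            last_id = value  # remembered as the resume cursor


sample = [  # hypothetical stream content
    "id: 1\n", "event: stage\n", 'data: {"stage": "parse"}\n', "\n",
    ": keep-alive\n",
    "id: 2\n", "event: done\n", 'data: {"source_id": "src_123"}\n', "\n",
]
events = list(parse_sse(sample))
print([(e["id"], e["event"]) for e in events])  # [('1', 'stage'), ('2', 'done')]
print(json.loads(events[-1]["data"]))           # {'source_id': 'src_123'}
```

Keeping the last seen `id` is what makes a dropped connection recoverable: the client reconnects and asks for events after that cursor.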

Quickstart

Want the 10-minute curl tour instead?  See QUICKSTART.md: task docker:up:test plus one curl call against a mock LLM, no API keys.

git clone https://github.com/firefly-operationOS/flycanon.git
cd flycanon
task deps:install          # uv sync --extra dev (pins .venv)
task docker:up             # api + worker + postgres(pgvector) + redis
curl -fsS http://localhost:8500/actuator/health | jq .

Ingest a sample DOCX (the binary normaliser handles every format — this is just the simplest curl):

curl -fsS -X POST http://localhost:8500/api/v1/sources \
  -F "file=@./tests/fixtures/sample.docx" \
  -F 'metadata={"title":"Sample","domain":"process_owner"};type=application/json' \
  | jq .

Search the corpus:

curl -fsS -X POST http://localhost:8500/api/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query":"what does the document say about scope","top_k":5}' | jq .

Ask a grounded question:

curl -fsS -X POST http://localhost:8500/api/v1/query \
  -H 'Content-Type: application/json' \
  -d '{"question":"Summarise the scope section in three sentences."}' | jq .

A grounded "I don't know" looks like {"answer":"","citations":[]} — flycanon never hallucinates.
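The same call can be made from Python with the standard library alone. This is a minimal sketch that assumes the Quickstart's local port and the request and response fields shown in the curl examples; the helper names are hypothetical and this is not the Python SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8500"  # the Quickstart's local service


def build_query_request(question: str) -> urllib.request.Request:
    """Build the POST /api/v1/query request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_grounded_refusal(answer: dict) -> bool:
    """True when the service returned an empty answer with no citations."""
    return answer.get("answer") == "" and not answer.get("citations")


def ask(question: str) -> dict:
    """Send the question to a running service and decode the JSON response."""
    with urllib.request.urlopen(build_query_request(question), timeout=30) as resp:
        return json.load(resp)


print(is_grounded_refusal({"answer": "", "citations": []}))  # True
```

Checking `is_grounded_refusal(ask(...))` before rendering lets a UI show a "no grounded answer" state instead of an empty bubble.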


Local development

task dev:db              # Postgres (pgvector/pg16) + Redis only
task dev:migrate         # alembic upgrade head
task dev:serve           # FastAPI hot-reload on :8500
task dev:worker          # EDA worker in a separate terminal

Smoke the running service:

task health              # /actuator/health
task version             # /api/v1/version
task openapi             # /openapi.json

SDKs

Both SDKs pin their version to the service's CalVer (26.5.6), so the client and server upgrade in lockstep.

| SDK | Highlights |
| --- | --- |
| Python | Async-first, httpx + Pydantic. Python ≥ 3.11. |
| Java | Spring Boot 3.5.9 + Spring RestClient + Jackson. Java 25 (LTS). groupId = com.firefly. Ships an @AutoConfiguration so a CanonClient bean is wired straight from flycanon.* properties. |

Java consumers just declare the dependency and inject the bean:

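import org.springframework.stereotype.Service;
// CanonClient is provided by the flycanon Java SDK; import it from the SDK's package.

@Service
public class CopilotService {
    private final CanonClient canon;

    // Constructor injection: the SDK's auto-configuration supplies the CanonClient bean.
    public CopilotService(CanonClient canon) { this.canon = canon; }
    // canon.submitSource(...), canon.search(...), canon.answer(...)
}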

Documentation

| Document | Read it when… |
| --- | --- |
| QUICKSTART.md | You want your first ingest + search + answer in ten minutes (HTTP / curl). |
| docs/architecture.md | You need the data model, the binary-normaliser routing matrix, the Postgres-native retrieval design, the dependency arrows. |
| docs/pipeline.md | You're touching the orchestrator, adding a new stage, or chasing a slow ingest. |
| docs/api-reference.md | You're integrating with the HTTP API and need every endpoint, shape, and status code. |
| docs/payload-reference.md | You're composing the request payload: every field, option, and example. |
| docs/eda-events.md | You're subscribing to the flycanon.ingest / flycanon.knowledge / flycanon.audit / canon.workspaces.v1 topics. |
| docs/consumers.md | You're building or auditing a service that consumes flycanon: agent token scopes, workspace events, retry posture, wire-contract stability. |
| docs/deployment.md | You're running this in production: env vars, topologies, OCR engines, embedding providers, auth, observability, sizing. |
| docs/cicd.md | You're cutting a release or wiring CI/CD: the three GitHub Actions workflows, release cookbook, required secrets. |
| docs/troubleshooting.md | The service / ingest / search / answer surface is misbehaving: symptom → root cause → fix. |
| docs/glossary.md | You need a precise definition for a term the API or docs use. |
| sdks/python/README.md | You're integrating from Python: async-first SDK with Pydantic typing. |
| sdks/java/README.md | You're integrating from Java / Spring Boot: Spring Boot 3.5.9, com.firefly groupId, @AutoConfiguration. |

The OpenAPI 3.1 document is served live by the running service at /openapi.json, with Swagger UI at /docs and Redoc at /redoc.


Repository layout

flycanon/
├─ Dockerfile                # Multi-stage build with the binary-normaliser system deps
├─ Taskfile.yml              # Canonical dev-loop interface
├─ docker-compose.yml        # api + worker + postgres (pgvector) + redis (optional gotenberg)
├─ docker-compose.test.yml   # Adds the mock LLM for integration tests
├─ pyfly.yaml                # pyfly application configuration
├─ alembic.ini               # Migration runner config
├─ env_template              # Reference environment file (.env is gitignored)
├─ migrations/               # Alembic versions
├─ src/flycanon/
│  ├─ app.py                 # @pyfly_application + scan_packages
│  ├─ main.py                # ASGI entry consumed by uvicorn
│  ├─ cli.py                 # `flycanon {serve,worker,migrate}`
│  ├─ config.py              # CanonSettings (FLYCANON_* env)
│  ├─ core/                  # @configuration + services + binary normaliser + mappers
│  ├─ interfaces/            # Public DTOs + enums
│  ├─ models/                # SQLAlchemy entities + repositories
│  ├─ resources/prompts/     # YAML prompt templates
│  └─ web/                   # @rest_controller + @controller_advice
├─ sdks/
│  ├─ python/                # Async-first Python SDK (Apache-2.0)
│  └─ java/                  # Spring Boot Java SDK (Apache-2.0, com.firefly)
├─ docs/                     # Architecture, payload reference, API reference, EDA events, glossary
└─ tests/
   ├─ unit/
   └─ integration/

License

flycanon is released under the Apache License 2.0 — copyright 2024-2026 Firefly Software Foundation.

The bundled SDKs under sdks/python and sdks/java ship their own Apache License 2.0 files.


Part of Firefly OperationOS.  Platform-agnostic by design.

About

Operational Knowledge Repository -- versioned, provenance-tracked canonical knowledge with hybrid retrieval (BM25 + vector + RRF + rerank), grounded RAG with citations, knowledge graph, multi-turn conversations, async ingest, PII guardrail and a billing + corpus snapshot. Part of Firefly OperationOS.
