Meridian

A forensic measurement framework for RAG systems: deterministic retrieval diagnostics that separate "the retriever missed" from "the model misread." The framework is validated on legal benchmarks (LegalBench-RAG); the retrieval stack alone is also tested on a medical benchmark (NFCorpus). The 10-phase agent is the proving ground; the measurement layer is the contribution.

The framework's own verify-before-trust discipline caught four silent measurement bugs before they shipped wrong numbers. That is the proof it works: a ruler that checks itself.

Choose your depth

Format	Time	What it is
Condensed Autopsy (PDF)	~6 min	The 5-page record: thesis, the four caught bugs, headline numbers, architecture, limitations. Start here.
Full Report	~15 min	Complete methodology: all 10 phases evaluated, measurement layer design, end-to-end statistics, caveats.
47 Findings	reference	The evidence trail — every A/B, correction, and preclusion. The raw record.

The headline

4 corpora × 194 queries × 2 arms on a combined 11,524-chunk index.
+7.9pp correctness / +4.2pp faithfulness (controlled config-stack delta).
Ruler calibrated against LegalBench-RAG (arXiv 2408.10343) on 3/4 corpora.
4 silent measurement bugs caught before shipping — the thesis in practice.
47 findings with 7 corrections — the evidence trail IS the methodology.

Architecture

Sequential pipeline spine: query enters at the top, flows through ingestion, indexing, routing (document top-k filter), retrieval (dense + sparse channels), CC fusion (score-weighted merge), selector (LLM chunk promotion from the wider pool), synthesis (LLM answer + citations), verification (deterministic citation check), and out as an answer. The selector taps the wider retrieval pool (ranks 9-30) and promotes chunks into the top-8 context on evidence of unsupported claims. The diagram shows a verification-to-synthesis loop — this is a capability; all headline numbers use single-shot (Finding 18: loop is marginal, +1.8pp at +68% compute).

The measurement layer observes at two points via one-way arrows: Layer 1 (deterministic span taxonomy) observes the fusion/retrieval output, Layer 2 (LLM-judged correctness and faithfulness) observes the synthesized answer. The pipeline never reads from measurement — this separation is the architecture's point. When a number moves, you know which side changed.

The measurement layer

This is the contribution. Everything else supports it.

Layer 1 — Deterministic retrieval taxonomy (no LLM judges)

Every retrieved span is classified against ground-truth character offsets into one of six categories: five failure types plus OK.

Code	Category	What it tells you
DRM	Document retrieval miss	Retrieved from the wrong document entirely
CBF	Chunk boundary failure	Right document, answer straddles a chunk split
SGP	Span gap	Right document, only part of a multi-span answer found
ICR	Incorrect region	Right document, wrong section
OVR	Over-retrieval	Right region, chunk much coarser than the evidence
OK	Correct	Retrieved span covers the ground-truth evidence
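
The table above can be read as a decision procedure over character offsets. The sketch below is one possible reading, not the code in core/measurement/: the Span type, the order of the checks, and the over_ratio threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    doc_id: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def overlap(a: Span, b: Span) -> int:
    """Shared characters between two spans (assumed to be in the same document)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def classify(retrieved: Span, truth: list[Span], over_ratio: float = 4.0) -> str:
    """Assign one taxonomy code to a retrieved span using only integer arithmetic.

    over_ratio is a hypothetical threshold: a chunk more than over_ratio times
    longer than the evidence it covers is labeled OVR.
    """
    same_doc = [t for t in truth if t.doc_id == retrieved.doc_id]
    if not same_doc:
        return "DRM"  # wrong document entirely
    shared = [overlap(retrieved, t) for t in same_doc]
    if sum(shared) == 0:
        return "ICR"  # right document, wrong region
    covered = [s == (t.end - t.start) for s, t in zip(shared, same_doc)]
    if not any(covered):
        return "CBF"  # evidence only partly inside the chunk
    if not all(covered):
        return "SGP"  # some evidence spans covered, others missing
    evidence_len = sum(t.end - t.start for t in same_doc)
    chunk_len = retrieved.end - retrieved.start
    return "OVR" if chunk_len > over_ratio * evidence_len else "OK"


truth = [Span("d1", 100, 150)]
print(classify(Span("d2", 0, 512), truth))     # different document
print(classify(Span("d1", 120, 632), truth))   # evidence straddles the chunk start
print(classify(Span("d1", 90, 200), truth))    # evidence fully inside a tight chunk
```

Because every branch is a comparison of integers, two runs over the same index and the same ground truth cannot disagree, which is the property the deterministic layer relies on.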

Classification is pure arithmetic on character indices — no embeddings, no model calls, no judgment. Scoring is per-span (not merged-character-set) to correctly handle multi-span evidence (43% of LegalBench-RAG queries). P@k and R@k are character-overlap ratios computed from these spans.
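
As a rough illustration of character-overlap P@k and R@k, the sketch below sums overlap between each of the top-k retrieved spans and each ground-truth span. It assumes spans are (doc_id, start, end) tuples with exclusive ends, and that spans within each list do not overlap one another. The function names are hypothetical and the exact per-span scoring in the repository may differ.

```python
SpanT = tuple[str, int, int]  # (doc_id, start, end), end exclusive


def overlap_chars(a: SpanT, b: SpanT) -> int:
    """Characters shared by two spans; zero when the documents differ."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def precision_recall_at_k(
    retrieved: list[SpanT], truth: list[SpanT], k: int
) -> tuple[float, float]:
    """Character-overlap precision and recall over the top-k retrieved spans."""
    top = retrieved[:k]
    hit = sum(overlap_chars(r, t) for r in top for t in truth)
    retrieved_len = sum(end - start for _, start, end in top)
    truth_len = sum(end - start for _, start, end in truth)
    precision = hit / retrieved_len if retrieved_len else 0.0
    recall = hit / truth_len if truth_len else 0.0
    return precision, recall


retrieved = [("d1", 0, 100), ("d1", 100, 200), ("d2", 0, 100)]
truth = [("d1", 50, 120)]
for k in (1, 2, 3):
    print(k, precision_recall_at_k(retrieved, truth, k))
```

Note that a span whose offsets were defaulted to (0, 0) contributes no length and no overlap here, so both ratios quietly fall toward zero rather than raising an error.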

Why deterministic matters: LLM judges drift with model updates, temperature, and prompt wording. A deterministic anchor eliminates that variable. When a number moves, you know whether the pipeline changed or the measurement changed — a capability LLM-only evaluation lacks.

Layer 2 — LLM-judged answer quality (separate, clearly labeled)

Two metrics, both judged by a pinned model (DeepSeek-v4-flash, temperature 0, thinking disabled) held constant across all comparisons:

Correctness — does the answer convey the same information as the ground-truth evidence? Span-informed: the judge sees both the system's answer and the golden evidence text.
Faithfulness — is each answer claim entailed by the retrieved context? Holistic groundedness (RAGAS definition): each claim judged against the full retrieved context.

Layer 2 never contaminates Layer 1. They are reported separately and measure different things: Layer 1 measures retrieval, Layer 2 measures reasoning/generation.
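
One way to hold a judge constant across comparisons is to fingerprint its configuration and refuse to compare runs whose fingerprints differ. The sketch below is an assumption about how that could look: JudgeConfig, its prompt_version field, and assert_same_judge are hypothetical names, not the repository's API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeConfig:
    model: str
    temperature: float
    thinking: bool
    prompt_version: str  # hypothetical field: identifies the judge prompt text

    def fingerprint(self) -> str:
        """Short stable hash of every setting that can change a verdict."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def assert_same_judge(a: JudgeConfig, b: JudgeConfig) -> None:
    """Raise if two result sets were scored by differently configured judges."""
    if a.fingerprint() != b.fingerprint():
        raise ValueError(
            f"judge mismatch: {a.fingerprint()} vs {b.fingerprint()}; "
            "scores from these runs are not comparable"
        )


arm0_judge = JudgeConfig("DeepSeek-v4-flash", 0.0, False, "v2")
arm1_judge = JudgeConfig("DeepSeek-v4-flash", 0.0, False, "v2")
assert_same_judge(arm0_judge, arm1_judge)  # passes silently
```

Storing the fingerprint next to each score makes it possible to check, after the fact, that both arms of a comparison saw the same judge.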

Ruler calibration

Before any external comparison, the measurement layer was empirically calibrated against the published LegalBench-RAG baseline (arXiv 2408.10343, Table 5). The paper's exact stack was replicated (RCTS 500-char, text-embedding-3-large, dense-only cosine, sqlite-vec) and run through our measurement layer. Three of four corpora reproduced within a ~2-3pp embedding-drift floor. The ruler computes correctly (Finding 44).
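
The calibration check reduces to comparing a reproduced metric with a published one under a tolerance. A minimal sketch follows, assuming metrics are stored as ratios in [0, 1] and taking 3pp as the upper edge of the drift floor. The function names are hypothetical, and the numbers in the usage lines are placeholders chosen to exercise the code, not results from the paper or from Meridian.

```python
def within_drift_floor(reproduced: float, published: float, floor_pp: float = 3.0) -> bool:
    """True when two ratio-valued metrics differ by at most floor_pp percentage points."""
    return abs(reproduced - published) * 100.0 <= floor_pp


def calibration_report(
    reproduced: dict[str, float], published: dict[str, float], floor_pp: float = 3.0
) -> dict[str, bool]:
    """Per-corpus pass/fail for a replicated baseline against its published value."""
    return {
        name: within_drift_floor(reproduced[name], value, floor_pp)
        for name, value in published.items()
    }


# Placeholder values only, to show the shape of the check.
published = {"corpus_a": 0.10, "corpus_b": 0.20}
reproduced = {"corpus_a": 0.12, "corpus_b": 0.27}
print(calibration_report(reproduced, published))
```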

The four caught bugs — the thesis in practice

The measurement framework's value is proven by what it caught:

Pro model-ID alias — PRO_MODEL="deepseek-chat" silently routed to flash, not Pro. Every prior "Pro" test was actually flash-vs-flash. Caught by billing check + served-model logging. (Finding 34)
BM25 channel mismatch — combined-index dense searched mini-doc chunks while BM25 searched the full 96K-chunk parquet. Two channels measuring different pools. Caught by crash on CUAD + post-mortem. (Finding 45)
Zero-span offset bug — Qdrant payloads don't store character spans; the combined-index builder defaulted to (0,0). Every P@k and R@k value came out as ~0: a plausible-looking near-zero, not an obvious crash. Caught by sanity-check against calibration. (Finding 45)
Chunk-size inconsistency — MAUD uses 2048-char chunks while the other three use 512-char. Previously undocumented. Confounds MAUD's external P@k comparison. Caught during external-baseline verification. (Finding 46)

Each was caught by systematic verification, not by the numbers looking wrong. A wrong number from a silent bug looks exactly like a correct number from a real measurement — the only defense is checking the instrument, not just the reading.
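
Checks of this kind can be written as cheap guards that run before any metric is reported. The sketch below covers three of the four bugs above (model alias, zero spans, mismatched channel pools). It is an illustration only: instrument_problems and its parameters are hypothetical, and the model names in the usage lines are placeholders.

```python
def instrument_problems(
    requested_model: str,
    served_model: str,
    spans: list[tuple[int, int]],
    dense_pool: set[str],
    sparse_pool: set[str],
) -> list[str]:
    """Return human-readable problems with the measuring setup; empty means clean."""
    problems = []
    # Model alias: the API may serve a different model than the one requested.
    if served_model != requested_model:
        problems.append(
            f"model alias: asked for {requested_model!r}, served {served_model!r}"
        )
    # Zero spans: defaulted offsets make overlap metrics collapse to about zero.
    if spans and all(start == 0 and end == 0 for start, end in spans):
        problems.append("every span is (0, 0): offsets were probably defaulted")
    # Channel mismatch: dense and sparse retrieval must search the same chunk ids.
    if dense_pool != sparse_pool:
        problems.append(
            f"channel pools differ by {len(dense_pool ^ sparse_pool)} chunk ids"
        )
    return problems


issues = instrument_problems(
    requested_model="placeholder-pro",
    served_model="placeholder-flash",
    spans=[(0, 0), (0, 0)],
    dense_pool={"c1", "c2"},
    sparse_pool={"c1"},
)
for issue in issues:
    print(issue)
```

Returning a list instead of raising on the first failure lets a run report every instrument problem at once, which helps when two bugs mask each other.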

Results

Config-stack delta (controlled, same index both sides)

Corpus	Correctness Arm 0 → Arm 1	Faithfulness Arm 0 → Arm 1
ContractNLI	62.9% → 71.1% (+8.2pp)	89.1% → 94.1% (+5.0pp)
PrivacyQA	49.0% → 55.7% (+6.7pp)	94.4% → 97.9% (+3.5pp)
CUAD	61.9% → 73.7% (+11.8pp)	92.1% → 96.0% (+3.9pp)
MAUD	68.0% → 72.7% (+4.7pp)	91.3% → 95.5% (+4.2pp)
Average	60.5% → 68.3% (+7.9pp)	91.7% → 95.9% (+4.2pp)

Both arms use the same combined index (11,524 chunks, 72 documents, 4 corpora pooled), the same embedder, and the same judge. Arm 0 is the RRF baseline and Arm 1 is the full config stack, so the delta isolates CC fusion + routing + selector over RRF.
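
For readers unfamiliar with the two fusion rules being compared, the sketch below contrasts them. RRF is the standard reciprocal rank formula with the common constant k=60. The CC side assumes "CC" means a convex combination of min-max normalized channel scores with a single weight alpha; that reading, the normalization choice, and the function names are assumptions, not a description of the code in core/retrieval/.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion: each channel contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one channel's scores to [0, 1] so channels are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def cc_fusion(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5
) -> dict[str, float]:
    """Score-weighted merge: alpha * dense + (1 - alpha) * sparse after scaling."""
    d, s = minmax(dense), minmax(sparse)
    return {
        chunk_id: alpha * d.get(chunk_id, 0.0) + (1.0 - alpha) * s.get(chunk_id, 0.0)
        for chunk_id in d.keys() | s.keys()
    }


dense = {"a": 0.9, "b": 0.5, "c": 0.1}
sparse = {"b": 12.0, "c": 3.0, "d": 6.0}
fused = cc_fusion(dense, sparse, alpha=0.5)
print(sorted(fused, key=fused.get, reverse=True))
```

The practical difference is that RRF discards score magnitudes and keeps only ranks, while a score-weighted merge preserves how far apart candidates were within each channel.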

External comparison (system-vs-system)

Corpus	Meridian P@1	RCTS P@1	Meridian R@8	RCTS R@8
ContractNLI	0.422	0.066	0.810	0.250
CUAD	0.394	0.020	0.814	0.317
PrivacyQA	0.297	0.144	0.579	0.424

Published baselines from arXiv 2408.10343 Table 5 (RCTS, text-embedding-3-large, dense-only). This compares the full Meridian stack against a bare baseline — the advantage bundles embedder + hybrid retrieval + fusion + routing. It is system-level evidence, not a single-component ablation. MAUD excluded (chunk-granularity confound). ContractNLI caveated (benchmark-file provenance).

Routing boundary

Routing is a concentrated-relevance technique, not universal. It holds at 100% on distinctive corpora (CUAD, MAUD), degrades to 76% on homogeneous cross-corpus retrieval (ContractNLI NDAs confused with CUAD contracts), and is correctly self-disabled on dispersed-relevance medical text (NFCorpus). The sweep auto-detects when routing helps — consistent across Findings 23, 45, and 47.
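
To make the document top-k filter concrete, the sketch below scores each document by its best chunk, keeps the top documents, and drops chunks from every other document. It also shows a hit rate defined as the share of queries whose gold document survives the filter. Both the max-score aggregation and that hit-rate definition are assumptions; the repository may aggregate and measure differently.

```python
Chunk = tuple[str, str, float]  # (doc_id, chunk_id, score)


def route(chunks: list[Chunk], topk_docs: int = 3) -> list[Chunk]:
    """Keep only chunks from the topk_docs documents with the best chunk score."""
    best: dict[str, float] = {}
    for doc_id, _, score in chunks:
        best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    keep = set(sorted(best, key=best.get, reverse=True)[:topk_docs])
    return [chunk for chunk in chunks if chunk[0] in keep]


def routing_hit_rate(routed_docs: list[set[str]], gold_docs: list[str]) -> float:
    """Share of queries whose gold document is among the routed documents."""
    hits = sum(gold in docs for docs, gold in zip(routed_docs, gold_docs))
    return hits / len(gold_docs)


chunks = [
    ("d1", "c1", 0.9),
    ("d2", "c2", 0.8),
    ("d3", "c3", 0.7),
    ("d1", "c4", 0.2),
    ("d4", "c5", 0.1),
]
print([chunk_id for _, chunk_id, _ in route(chunks, topk_docs=2)])
```

A filter like this helps when relevance concentrates in a few documents and hurts when evidence is spread thinly across many, which matches the boundary described above.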

NFCorpus transfer (retrieval stack only)

nDCG@10 = 0.399, above the classic BEIR baselines (BM25 0.325, BM25+CE 0.350, Contriever 0.328). CC fusion transfers as a general improvement (+5.6pp over RRF). The span-forensic measurement framework was not exercised, because BEIR provides document-level relevance, not character spans. This is a retrieval-transfer result; the routing-boundary characterization is the real finding.
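
For reference, nDCG@k over document-level relevance labels can be computed as below. This sketch uses linear gain with a log2 discount, which is the convention I understand trec_eval-style tooling to follow; treat that choice as an assumption, since the README does not say which variant produced the number above.

```python
import math


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain: position i (1-based) is divided by log2(i + 1)."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranked_ids: list[str], qrels: dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query, given graded relevance labels per document id."""
    gains = [float(qrels.get(doc_id, 0)) for doc_id in ranked_ids[:k]]
    ideal = sorted((float(rel) for rel in qrels.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


qrels = {"a": 2, "b": 1}
print(ndcg_at_k(["a", "b", "x"], qrels))  # ideal ordering
print(ndcg_at_k(["x", "a", "b"], qrels))  # relevant documents pushed down one rank
```

Because the labels are per document rather than per character span, nothing in this metric can feed the span taxonomy, which is why the measurement framework was not exercised on NFCorpus.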

What Meridian is not

Not an autonomous research agent. That direction was explored in v1 and deliberately abandoned. Phase 10's loop is per-query only.
Not a SOTA-everywhere claim. External multipliers are system-vs-system (full stack vs bare baseline). NFCorpus beats classic baselines, not modern dense SOTA.
Not a framework-transfer claim. The span-forensic taxonomy requires character-span ground truth. It ran on LegalBench-RAG (which has spans) but not on NFCorpus (which doesn't). The framework's generality beyond legal corpora with span annotations is a claim, not yet a result.
Not a tutorial. It assumes familiarity with RAG, IR evaluation, and the LegalBench-RAG benchmark.

Honest limitations

Routing degrades to 76% on ContractNLI in the combined regime (NDA/commercial-contract confusion). 47/194 queries get zero correct-document chunks.
Routing hurts on dispersed relevance. Confirmed on NFCorpus — routing is a concentrated-relevance technique only.
Span framework requires character-span ground truth. Does not apply to document-level relevance benchmarks (most of BEIR, MS MARCO).
MAUD external comparison confounded by 2048-char vs 500-char chunk granularity.
Affirmative-only evaluation. All four LegalBench-RAG corpora measure recall of evidence that exists, not false-positive rate.
Synthesis variance band is ±2-4pp (Finding 40). Quality deltas below this are indistinguishable from nondeterminism.
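
The variance band gives a simple screening rule for any reported delta. The sketch below applies it to the per-corpus numbers from the config-stack table, using 4pp (the conservative edge of the band) as the threshold. Applying the band per corpus is an assumption; Finding 40 may define it at a different granularity.

```python
# (arm 0, arm 1) percentages copied from the config-stack table.
CORRECTNESS = {
    "ContractNLI": (62.9, 71.1),
    "PrivacyQA": (49.0, 55.7),
    "CUAD": (61.9, 73.7),
    "MAUD": (68.0, 72.7),
}
FAITHFULNESS = {
    "ContractNLI": (89.1, 94.1),
    "PrivacyQA": (94.4, 97.9),
    "CUAD": (92.1, 96.0),
    "MAUD": (91.3, 95.5),
}


def mean_delta(table: dict[str, tuple[float, float]]) -> float:
    """Unweighted mean of per-corpus deltas, in percentage points."""
    return sum(arm1 - arm0 for arm0, arm1 in table.values()) / len(table)


def outside_band(
    table: dict[str, tuple[float, float]], band_pp: float = 4.0
) -> dict[str, bool]:
    """True where a delta is larger than the synthesis variance band."""
    return {
        corpus: abs(arm1 - arm0) > band_pp for corpus, (arm0, arm1) in table.items()
    }


for name, table in (("correctness", CORRECTNESS), ("faithfulness", FAITHFULNESS)):
    print(name, round(mean_delta(table), 2), outside_band(table))
```

Any single-corpus delta at or below the chosen threshold would be flagged by this check and is best read alongside the pooled average rather than on its own.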

Quickstart

cp .env.example .env
# Fill in: VOYAGE_API_KEY, DEEPSEEK_API_KEY, QDRANT_URL, QDRANT_API_KEY

# Per-corpus eval
python scripts/run_corpus_eval.py --corpus contractnli --chunk-alpha 0.2 \
  --routing-topk 3 --routing-alpha 0.3 --workers 12 --output data/eval.jsonl

# Combined-index headline
python scripts/run_headline_combined.py --workers 8

# NFCorpus transfer
python scripts/run_nfcorpus_scout.py

# Judges
python scripts/judge_answers_v2.py --corpus contractnli --input data/eval.jsonl
python scripts/judge_faithfulness.py --corpus contractnli --input data/eval.jsonl

# Calibration (reproduces paper baseline through our ruler)
python scripts/run_calibration.py

Python 3.11+. Core: qdrant-client, voyageai, openai, instructor, rank-bm25, pandas, numpy, beir.

Repository layout

meridian/
├── Meridian_Autopsy.pdf               ← 5-page condensed record (start here)
├── figures/                            ← programmatic figures (reproducible)
│   └── make_figures.py
├── docs/
│   ├── REPORT.md                       ← full methodology + results
│   └── DECISIONS.md                    ← 47 findings, the evidence trail
├── core/
│   ├── measurement/                    ← Layer 1: taxonomy, metrics, span_overlap
│   ├── retrieval/                      ← dense, sparse, fusion, routing
│   ├── supervisor/                     ← pipeline phases 3-10
│   ├── ingestion/                      ← chunking, embedding
│   └── evaluation/                     ← ground-truth adapters, fingerprinting
├── scripts/
│   ├── run_corpus_eval.py              ← per-corpus eval harness
│   ├── run_headline_combined.py        ← combined-index headline
│   ├── run_nfcorpus_scout.py           ← NFCorpus transfer
│   ├── run_calibration.py              ← ruler calibration vs paper
│   ├── judge_answers_v2.py             ← Layer 2: correctness
│   └── judge_faithfulness.py           ← Layer 2: faithfulness
├── CLAUDE.md                           ← full project brief
└── .env.example                        ← config template (no secrets)

Related work

Production RAG Stack Forensics — same forensic methodology applied to a FastAPI documentation corpus. 4 failure modes documented, 1 judge bug caught.
aether — workflow reasoning engine with grounded output and refuse-rather-than-fabricate discipline. Retrieval primitives built bottom-up.
