
feat(m3): glyph extractor, generation profile, and generate pipeline#21

Merged
shaypal5 merged 3 commits into main from feat/m3-extractor
May 24, 2026

@shaypal5
Contributor

M3 sub-PR 2 — the first end-to-end extraction pipeline.

What's in this PR

extractor.py — CCA glyph extractor

  • extract_glyphs(image_path) — Otsu binarise → CCA → filter by MIN_GLYPH_PX=16 floor and 10%-of-page area ceiling → sort (y, -x) (Hebrew reading order)
  • crop_glyph(image_path, glyph) — crops from the binarised image, returns PNG bytes
  • ExtractionError base; opencv-python-headless is optional — raises with a clear install message if absent, never at import time
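
The filter and sort stages of `extract_glyphs` can be sketched without OpenCV. This is a minimal illustration, not the PR's implementation: it assumes the 16 px floor applies to both width and height, and that the 10% ceiling is measured on bounding-box area. The helper name `filter_and_sort` and the sample blobs are hypothetical.

```python
from dataclasses import dataclass

MIN_GLYPH_PX = 16                 # minimum blob dimension (from the PR)
DEFAULT_MAX_AREA_FRACTION = 0.10  # 10%-of-page area ceiling (from the PR)


@dataclass(frozen=True)
class Glyph:
    x: int
    y: int
    width: int
    height: int


def filter_and_sort(blobs, page_width, page_height,
                    min_dim=MIN_GLYPH_PX,
                    max_area_fraction=DEFAULT_MAX_AREA_FRACTION):
    """Drop tiny and oversized blobs, then sort by ascending y, descending x."""
    max_area = max_area_fraction * page_width * page_height
    kept = [
        g for g in blobs
        # Assumption: the floor must hold in BOTH dimensions.
        if g.width >= min_dim and g.height >= min_dim
        # Assumption: the ceiling is checked against bounding-box area.
        and g.width * g.height <= max_area
    ]
    # (y, -x): top rows first; within a row, rightmost blob first.
    return sorted(kept, key=lambda g: (g.y, -g.x))


# Hypothetical blobs on a 1000 x 1000 px page.
blobs = [
    Glyph(x=10, y=50, width=20, height=20),
    Glyph(x=200, y=50, width=20, height=20),
    Glyph(x=100, y=10, width=8, height=30),   # narrower than the floor
    Glyph(x=0, y=0, width=900, height=900),   # larger than 10% of the page
    Glyph(x=300, y=10, width=24, height=24),
]
ordered = filter_and_sort(blobs, page_width=1000, page_height=1000)
```

Sorting on exact `y` only groups blobs that share a top edge; real scans with uneven baselines would likely need row bucketing, which this sketch does not attempt.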

generate_profile.py — Human-curated generation config

  • GlyphAnnotation / ScanAnnotation / WriterAnnotation / GenerateProfile frozen dataclasses
  • Annotated bboxes + letter labels in the profile are the source of determinism — no CCA in the generate path
  • load_generate_profile(path) → (GenerateProfile, raw_dict); full validation, Hebrew letter check, unique writer/entry ID enforcement

generator.py — End-to-end pipeline

  • generate(profile, raw, output_dir, *, generated_at): pin → load entries → crop → write PNGs → build + validate letter_set.v1 → write JSON
  • Output: <output_dir>/<writer_id>/letter_set.json + glyphs/<letter>/<entry_id>__<letter>__<x>_<y>_<w>_<h>.png
  • variant_id and asset_path derived deterministically from entry_id + letter + bbox
  • _resolve_scan_path() probes files[role='original']; includes a TODO hook for future ALTO/hOCR sidecar probe (open question from sub-PR 1)
  • Non-fatal issues emit GeneratorWarning; zero-glyph-extracted raises GeneratorError
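
To illustrate what "derived deterministically" could look like, here is a small sketch that builds the asset path from the layout stated above. The function name `asset_path` and the sample entry ID are hypothetical; only the path pattern is taken from the PR.

```python
from pathlib import PurePosixPath


def asset_path(entry_id: str, letter: str, bbox: tuple) -> PurePosixPath:
    """Build glyphs/<letter>/<entry_id>__<letter>__<x>_<y>_<w>_<h>.png."""
    x, y, w, h = bbox
    filename = f"{entry_id}__{letter}__{x}_{y}_{w}_{h}.png"
    # Path is relative to <output_dir>/<writer_id>/.
    return PurePosixPath("glyphs") / letter / filename


# Same inputs always give the same path: no counters, timestamps, or hashes.
example = asset_path("ms-001", "\u05d0", (120, 48, 32, 40))
```

Because the path depends only on the entry ID, letter, and bounding box, re-running the pipeline on an unchanged profile should overwrite the same files rather than create new ones.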

cli.py — Wired commands

  • generate --profile <path> --output <dir> [--generated-at ISO8601]
  • New scan-blobs <image> [--min-dim N] [--max-area N] [--format text|json] — discovery tool for populating profiles
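
The command surface above maps naturally onto `argparse` subparsers. The sketch below is an assumption-laden illustration, not the project's `cli.py`: the program name, the argument types, and the defaults (in particular treating `--max-area` as a fraction of page area) are guesses based on the flags listed here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tool")  # hypothetical program name
    sub = parser.add_subparsers(dest="command", required=True)

    gen = sub.add_parser("generate", help="run the generate pipeline")
    gen.add_argument("--profile", required=True, help="generation profile path")
    gen.add_argument("--output", required=True, help="output directory")
    gen.add_argument("--generated-at", default=None,
                     help="ISO 8601 timestamp override for reproducible builds")

    blobs = sub.add_parser("scan-blobs", help="list candidate blobs in a scan")
    blobs.add_argument("image")
    # Assumed types and defaults; the PR only names the flags.
    blobs.add_argument("--min-dim", type=int, default=16)
    blobs.add_argument("--max-area", type=float, default=0.10)
    blobs.add_argument("--format", choices=("text", "json"), default="text")
    return parser


args = build_parser().parse_args(
    ["scan-blobs", "scan.png", "--min-dim", "20", "--format", "json"]
)
```

With `--profile` and `--output` marked `required=True`, calling `generate` without them exits with an argparse usage error before any pipeline code is imported.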

Resolves sub-PR 1 open questions

| Question | Resolution |
| --- | --- |
| Upstream annotation probe | _resolve_scan_path() has a TODO hook; currently all 60 entries have no sidecars, so CCA is always used |
| Deskew pre-processing | Omitted for M3 (low-severity failure mode); documented in code |
| Upper-area ceiling | 10% of image area by default; overridable via --max-area |
| Nikud handling | Emitted as separate blobs; merging deferred to M4 |

Tests

  • 144 tests pass, 91% coverage, mypy strict clean
  • test_extractor.py — 15 tests (detection, filtering, sorting, crop, determinism)
  • test_generate_profile.py — 27 tests (full validation, all error paths, Hebrew letter parametrize)
  • test_generator.py — 13 tests (end-to-end pipeline, schema validation, checksum integrity, determinism, CLI)

🤖 Generated with Claude Code

shaypal5 and others added 3 commits May 24, 2026 23:21
Implements M3 sub-PR 2 — the first end-to-end extraction pipeline.

extractor.py
- MIN_GLYPH_PX = 16 (issue #16 D2); _DEFAULT_MAX_AREA_FRACTION = 0.10
- Glyph frozen dataclass (x, y, width, height)
- extract_glyphs(): CCA via Otsu binarisation; filters by min dimension
  and upper-area ceiling; sorts blobs (y, -x) — Hebrew reading order
- crop_glyph(): crops from binarised image; returns PNG bytes
- ExtractionError base; opencv-python-headless is an optional dep;
  when it is absent, extractor calls raise with a clear install message
  instead of an ImportError at import time, so the rest of the package stays importable without CV

generate_profile.py
- GlyphAnnotation / ScanAnnotation / WriterAnnotation / GenerateProfile
  frozen dataclasses; human-curated bbox+letter annotations are the
  source of determinism — no CCA output in the generate path
- load_generate_profile(path) -> (GenerateProfile, raw_dict): full
  structural validation; Hebrew letter check; unique writer_id and
  entry_id enforcement; upstream_checkout resolved relative to profile
- GenerateProfileError(ValueError) with path context

generator.py
- generate(profile, raw, output_dir, *, generated_at): pins upstream
  checkout, loads+indexes eligible entries, crops each annotated glyph,
  writes PNG assets, builds and validates letter_set.v1 document, writes
  letter_set.json; returns list[Path] of written letter_set.json files
- Output tree: <output_dir>/<writer_id>/letter_set.json and
  glyphs/<letter>/<entry_id>__<letter>__<x>_<y>_<w>_<h>.png
- variant_id and asset_path are fully deterministic from bbox+entry_id
- _resolve_scan_path() probes upstream file list for role='original';
  includes a TODO hook for future ALTO/hOCR sidecar probe (deferred open
  question from sub-PR 1)
- GeneratorError / GeneratorWarning; skipped entries emit warnings, not
  hard failures, unless zero glyphs extracted for a writer
- generated_at defaults to utcnow(); pass --generated-at for
  reproducible builds

cli.py
- generate subcommand wired: --profile (required), --output (required),
  --generated-at (optional override for deterministic builds)
- New scan-blobs subcommand: runs extract_glyphs on a scan, outputs
  blob list as JSON or text; --min-dim, --max-area flags; useful for
  populating a generation profile manually
- Lazy imports of extractor/generator/generate_profile inside command
  handlers so the CLI stays importable without the CV stack

tests
- test_extractor.py: 15 tests covering blob detection, filtering,
  sorting, crop dimensions, binarisation, determinism, error cases
- test_generate_profile.py: 27 tests covering valid load, all
  validation paths, Hebrew letter acceptance, upstream_checkout
  resolution
- test_generator.py: 13 tests including full end-to-end pipeline,
  validation of output against schema, checksum integrity, determinism,
  and CLI integration for generate and scan-blobs
- test_cli.py: updated test_generate_subcommand_* to reflect that
  generate now requires --profile/--output (omitting either is an
  argparse usage error)

pyproject.toml: add [[tool.mypy.overrides]] for cv2 (no stubs)

144 tests pass, 91% coverage, mypy strict clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix 1 (crop_glyph perf): add binarize_scan()/crop_binary(); update
generator to call binarize_scan once per scan and crop_binary per glyph,
avoiding redundant image I/O and Otsu re-threshold on every crop.

Fix 2 (cv2 import dup): extract _require_cv2() helper used by all
extractor functions; eliminates the duplicated try/except ImportError.

Fix 3 (dead var glyph_dir): remove unused 'glyph_dir' assignment from
generator._process_writer (now _extract_variants).

Fix 4 (notes dropped): GlyphAnnotation.notes now propagated to
variant['notes'] in the output document when set.

Fix 5 (broad except): replace bare 'except Exception' in
generate() with 'except (UpstreamError, OSError)'.

Fix 6 (tuple return): load_generate_profile() now returns a plain
GenerateProfile (not a tuple); config_hash is embedded as a field
computed at load time.  All callers (cli.py, tests) updated.

Fix 7 (_process_writer decomp): split 160-line function into
_extract_variants() (I/O + PNG writes) and _build_document() (JSON
assembly); _process_writer() orchestrates them.

Fix 8 (microseconds): generated_at default now uses
datetime.now(UTC).replace(microsecond=0).isoformat() for reproducibility.
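
For reference, the expression in Fix 8 behaves as sketched below. The wrapper `default_generated_at` and its `now` parameter are hypothetical, added so the behaviour can be shown with a fixed instant; `timezone.utc` is used here because `datetime.UTC` is an alias for it that only exists on Python 3.11+.

```python
from datetime import datetime, timezone


def default_generated_at(now=None) -> str:
    """Return an ISO 8601 UTC timestamp truncated to whole seconds."""
    moment = now if now is not None else datetime.now(timezone.utc)
    # Dropping microseconds keeps the string stable at second granularity.
    return moment.replace(microsecond=0).isoformat()


fixed = datetime(2026, 5, 24, 21, 18, 3, 123456, tzinfo=timezone.utc)
stamp = default_generated_at(fixed)  # "2026-05-24T21:18:03+00:00"
```

Truncation only makes the default less noisy; byte-identical output across runs still depends on passing `--generated-at` explicitly.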

Fix 9 (variant_id separator): change separator between entry_id,
letter, and coords from __ to @ to avoid ambiguity with entry_ids
that themselves use __ as separators.
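
The ambiguity Fix 9 addresses can be shown with a short sketch. The exact `variant_id` layout is an assumption (coordinates joined with `_`, fields joined with `@`), and both function names are hypothetical.

```python
def make_variant_id(entry_id: str, letter: str, bbox: tuple) -> str:
    """Join entry_id, letter, and coords with '@' (assumed layout)."""
    coords = "_".join(str(v) for v in bbox)
    return f"{entry_id}@{letter}@{coords}"


def parse_variant_id(variant_id: str):
    """Split from the right so an entry_id containing '__' stays intact."""
    entry_id, letter, coords = variant_id.rsplit("@", 2)
    return entry_id, letter, tuple(int(v) for v in coords.split("_"))


# An entry_id that itself contains the old '__' separator.
vid = make_variant_id("scan__page__3", "\u05d0", (1, 2, 3, 4))
round_trip = parse_variant_id(vid)

# With '__' as the separator, a plain split yields too many fields.
old_style = "scan__page__3__\u05d0__1_2_3_4".split("__")
```

Here `old_style` has five parts instead of three, so a reader cannot tell where the entry ID ends without extra rules, while the `@` form parses back to the original fields.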

Fix 10 (import warnings inside fn): move 'import warnings as _warnings'
to top-level; rename local list variable to pending_warnings throughout.

Fix 11 (sort docstring): update Glyph docstring to say 'ascending y
then descending x' instead of the imprecise 'Hebrew reading order'.

Fix 12 (max area fraction public): rename _DEFAULT_MAX_AREA_FRACTION to
DEFAULT_MAX_AREA_FRACTION; add to __all__.

Fix 13 (connectivity comment): add inline comment explaining why
connectivity=8 is chosen over 4 (diagonal contacts in Hebrew cursive).

Fix 14 (mock tests): add 5 mock-based tests in test_generator.py that
exercise document assembly, validation, PNG writes, notes propagation,
and config_hash embedding without requiring real cv2 image I/O.

Fix 15 (CI smoke): install [test,cv] in CI test job (was [test] only)
so cv-dependent tests actually run; add scan-blobs CLI smoke step.

Pre-existing ruff issues also cleaned up: sorted __all__, unsorted
import block in generate_profile, ambiguous × in comments/strings,
unused ScanAnnotation/EXIT_NOT_IMPLEMENTED imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shaypal5 shaypal5 merged commit 9fd7d1b into main May 24, 2026
7 checks passed
@shaypal5 shaypal5 deleted the feat/m3-extractor branch May 24, 2026 21:18