feat(m3): glyph extractor, generation profile, and generate pipeline #21
Merged
Implements M3 sub-PR 2: the first end-to-end extraction pipeline.

extractor.py
- MIN_GLYPH_PX = 16 (issue #16 D2); _DEFAULT_MAX_AREA_FRACTION = 0.10
- Glyph frozen dataclass (x, y, width, height)
- extract_glyphs(): CCA via Otsu binarisation; filters by min dimension and upper-area ceiling; sorts blobs by (y, -x) (Hebrew reading order)
- crop_glyph(): crops from binarised image; returns PNG bytes
- ExtractionError base; opencv-python-headless is an optional dep; clear install message when absent; raises at call time rather than ImportError at import time, so the rest of the package stays importable without CV

generate_profile.py
- GlyphAnnotation / ScanAnnotation / WriterAnnotation / GenerateProfile frozen dataclasses; human-curated bbox+letter annotations are the source of determinism; no CCA output in the generate path
- load_generate_profile(path) -> (GenerateProfile, raw_dict): full structural validation; Hebrew letter check; unique writer_id and entry_id enforcement; upstream_checkout resolved relative to profile
- GenerateProfileError(ValueError) with path context

generator.py
- generate(profile, raw, output_dir, *, generated_at): pins upstream checkout, loads+indexes eligible entries, crops each annotated glyph, writes PNG assets, builds and validates letter_set.v1 document, writes letter_set.json; returns list[Path] of written letter_set.json files
- Output tree: <output_dir>/<writer_id>/letter_set.json and glyphs/<letter>/<entry_id>__<letter>__<x>_<y>_<w>_<h>.png
- variant_id and asset_path are fully deterministic from bbox+entry_id
- _resolve_scan_path() probes upstream file list for role='original'; includes a TODO hook for future ALTO/hOCR sidecar probe (deferred open question from sub-PR 1)
- GeneratorError / GeneratorWarning; skipped entries emit warnings, not hard failures, unless zero glyphs extracted for a writer
- generated_at defaults to utcnow(); pass --generated-at for reproducible builds

cli.py
- generate subcommand wired: --profile (required), --output (required), --generated-at (optional override for deterministic builds)
- New scan-blobs subcommand: runs extract_glyphs on a scan, outputs blob list as JSON or text; --min-dim, --max-area flags; useful for populating a generation profile manually
- Lazy imports of extractor/generator/generate_profile inside command handlers so the CLI stays importable without the CV stack

tests
- test_extractor.py: 15 tests covering blob detection, filtering, sorting, crop dimensions, binarisation, determinism, error cases
- test_generate_profile.py: 27 tests covering valid load, all validation paths, Hebrew letter acceptance, upstream_checkout resolution
- test_generator.py: 13 tests including full end-to-end pipeline, validation of output against schema, checksum integrity, determinism, and CLI integration for generate and scan-blobs
- test_cli.py: updated test_generate_subcommand_* to reflect that generate now requires --profile/--output (argparse usage error)

pyproject.toml: add [[tool.mypy.overrides]] for cv2 (no stubs)

144 tests pass, 91% coverage, mypy strict clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
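The filter and sort rules listed for extract_glyphs() can be sketched without OpenCV. This is a minimal illustration, not the code in extractor.py. The helper name filter_and_sort is hypothetical. Two behaviours are assumptions: that both sides of a box must reach the floor, and that the area ceiling is inclusive.

```python
from dataclasses import dataclass

MIN_GLYPH_PX = 16                   # minimum-dimension floor named in the commit message
_DEFAULT_MAX_AREA_FRACTION = 0.10   # upper-area ceiling as a fraction of the page


@dataclass(frozen=True)
class Glyph:
    x: int
    y: int
    width: int
    height: int


def filter_and_sort(boxes, page_width, page_height,
                    min_dim=MIN_GLYPH_PX,
                    max_area_fraction=_DEFAULT_MAX_AREA_FRACTION):
    """Hypothetical helper: keep plausible glyph boxes, then order them."""
    max_area = max_area_fraction * page_width * page_height
    kept = [
        Glyph(x, y, w, h)
        for (x, y, w, h) in boxes
        # Assumption: both sides must reach the floor; the ceiling is inclusive.
        if min(w, h) >= min_dim and w * h <= max_area
    ]
    # Ascending y, then descending x: among boxes sharing a y, rightmost first.
    return sorted(kept, key=lambda g: (g.y, -g.x))


boxes = [
    (10, 0, 20, 20),     # kept
    (100, 0, 20, 20),    # kept; same y, further right, so it sorts first
    (50, 40, 20, 20),    # kept; lower on the page
    (0, 0, 5, 30),       # dropped: width below the floor
    (0, 0, 900, 900),    # dropped: covers more than 10% of a 1000x1000 page
]
glyphs = filter_and_sort(boxes, page_width=1000, page_height=1000)
print([(g.x, g.y) for g in glyphs])  # [(100, 0), (10, 0), (50, 40)]
```

Note that the key orders right-to-left only among boxes with an identical y; boxes on the same text line whose top edges differ by a pixel are ordered by y first.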
Fix 1 (crop_glyph perf): add binarize_scan()/crop_binary(); update generator to call binarize_scan once per scan and crop_binary per glyph, avoiding redundant image I/O and Otsu re-threshold on every crop.
Fix 2 (cv2 import dup): extract _require_cv2() helper used by all extractor functions; eliminates the duplicated try/except ImportError.
Fix 3 (dead var glyph_dir): remove unused 'glyph_dir' assignment from generator._process_writer (now _extract_variants).
Fix 4 (notes dropped): GlyphAnnotation.notes now propagated to variant['notes'] in the output document when set.
Fix 5 (broad except): replace bare 'except Exception' in generate() with 'except (UpstreamError, OSError)'.
Fix 6 (tuple return): load_generate_profile() now returns a plain GenerateProfile (not a tuple); config_hash is embedded as a field computed at load time. All callers (cli.py, tests) updated.
Fix 7 (_process_writer decomp): split 160-line function into _extract_variants() (I/O + PNG writes) and _build_document() (JSON assembly); _process_writer() orchestrates them.
Fix 8 (microseconds): generated_at default now uses datetime.now(UTC).replace(microsecond=0).isoformat() for reproducibility.
Fix 9 (variant_id separator): change separator between entry_id, letter, and coords from __ to @ to avoid ambiguity with entry_ids that themselves use __ as separators.
Fix 10 (import warnings inside fn): move 'import warnings as _warnings' to top-level; rename local list variable to pending_warnings throughout.
Fix 11 (sort docstring): update Glyph docstring to say 'ascending y then descending x' instead of the imprecise 'Hebrew reading order'.
Fix 12 (max area fraction public): rename _DEFAULT_MAX_AREA_FRACTION to DEFAULT_MAX_AREA_FRACTION; add to __all__.
Fix 13 (connectivity comment): add inline comment explaining why connectivity=8 is chosen over 4 (diagonal contacts in Hebrew cursive).
Fix 14 (mock tests): add 5 mock-based tests in test_generator.py that exercise document assembly, validation, PNG writes, notes propagation, and config_hash embedding without requiring real cv2 image I/O.
Fix 15 (CI smoke): install [test,cv] in CI test job (was [test] only) so cv-dependent tests actually run; add scan-blobs CLI smoke step.

Pre-existing ruff issues also cleaned up: sorted __all__, unsorted import block in generate_profile, ambiguous × in comments/strings, unused ScanAnnotation/EXIT_NOT_IMPLEMENTED imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
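The ambiguity behind Fix 9 can be shown in a few lines. This is a sketch under stated assumptions: make_variant_id is a hypothetical helper, the sample entry_id is invented, and the x_y_w_h coordinate layout is carried over from the asset file name rather than confirmed for variant_id.

```python
def make_variant_id(entry_id: str, letter: str,
                    bbox: tuple[int, int, int, int], sep: str) -> str:
    """Hypothetical helper: join entry_id, letter, and bbox coords with `sep`."""
    x, y, w, h = bbox
    return sep.join([entry_id, letter, f"{x}_{y}_{w}_{h}"])


entry_id = "writer01__page03"   # invented entry_id that itself contains "__"
letter = "\u05d0"               # Hebrew letter alef, escaped to avoid bidi rendering issues
bbox = (120, 48, 32, 40)

old_id = make_variant_id(entry_id, letter, bbox, sep="__")
new_id = make_variant_id(entry_id, letter, bbox, sep="@")

# With "__" the entry_id is split apart, so the three fields cannot be recovered.
print(len(old_id.split("__")))  # 4
# With "@" the split yields exactly entry_id, letter, and coords.
print(len(new_id.split("@")))   # 3
```

The round trip only works as long as entry_ids never contain the separator; whether the profile loader enforces that for @ is not stated here.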
M3 sub-PR 2 — the first end-to-end extraction pipeline.
What's in this PR

extractor.py: CCA glyph extractor
- extract_glyphs(image_path): Otsu binarise → CCA → filter by MIN_GLYPH_PX=16 floor and 10%-of-page area ceiling → sort (y, -x) (Hebrew reading order)
- crop_glyph(image_path, glyph): crops from the binarised image, returns PNG bytes
- ExtractionError base; opencv-python-headless is optional: raises with a clear install message if absent, never at import time

generate_profile.py: Human-curated generation config
- GlyphAnnotation / ScanAnnotation / WriterAnnotation / GenerateProfile frozen dataclasses
- load_generate_profile(path) → (GenerateProfile, raw_dict); full validation, Hebrew letter check, unique writer/entry ID enforcement

generator.py: End-to-end pipeline
- generate(profile, raw, output_dir, *, generated_at): pin → load entries → crop → write PNGs → build + validate letter_set.v1 → write JSON
- <output_dir>/<writer_id>/letter_set.json + glyphs/<letter>/<entry_id>__<letter>__<x>_<y>_<w>_<h>.png
- variant_id and asset_path derived deterministically from entry_id + letter + bbox
- _resolve_scan_path() probes files[role='original']; includes a TODO hook for future ALTO/hOCR sidecar probe (open question from sub-PR 1)
- GeneratorWarning; zero-glyph-extracted raises GeneratorError

cli.py: Wired commands
- generate --profile <path> --output <dir> [--generated-at ISO8601]
- scan-blobs <image> [--min-dim N] [--max-area N] [--format text|json]: discovery tool for populating profiles

Resolves sub-PR 1 open questions
- _resolve_scan_path() has a TODO hook; currently all 60 entries have no sidecars so CCA is always used
- --max-area

Tests
- test_extractor.py: 15 tests (detection, filtering, sorting, crop, determinism)
- test_generate_profile.py: 27 tests (full validation, all error paths, Hebrew letter parametrize)
- test_generator.py: 13 tests (end-to-end pipeline, schema validation, checksum integrity, determinism, CLI)

🤖 Generated with Claude Code