feat(data): seed corpus with 6 Bialik letter crops #3

Merged
shaypal5 merged 4 commits into main from feat/data-seed-bialik-crops
May 12, 2026

Conversation

@shaypal5
Contributor

Summary

  • Adds the first real writer (chaim_nachman_bialik, verified) and 6 manual single-letter crops drawn from upstream scan commons__bialik_el_hazippor__p0001 — Bialik's autograph manuscript draft of El Hatzippor (upstream df07bd3).
  • Letters: lamed (ל), dalet (ד), vav (ו), he (ה), kaf (כ), bet (ב) — all cut from line 4 of the page, one variant per letter.
  • Rights inherited verbatim from upstream: license_expression: PDM-1.0, rights_basis: public_domain, no attribution required. verification_status: inherited_from_upstream.
  • Regenerated NOTICE.md, CITATION.cff, datapackage.json. datapackage.json::released_at advances to 2026-05-12T22:30:00Z (max of new extraction.extracted_at); CITATION.cff::date-released stays at 2026-05-12 per release_recipe.json::version_released_date, as designed.
  • Crop images committed under data/letters/chaim_nachman_bialik/<letter>/<entry_id>.png, all LFS-tracked (verified via git lfs ls-files).
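
The released_at rule above (the maximum of all extraction.extracted_at values) can be sketched as follows. This is a minimal illustration, not the code in scripts/generate_release_artifacts.py: the function name latest_extracted_at, the nested-dict entry layout, and the first timestamp in the sample data are assumptions made for the example.

```python
from datetime import datetime, timezone


def latest_extracted_at(entries):
    """Return the newest extraction.extracted_at as an ISO-8601 UTC string.

    Hypothetical helper: assumes each entry is a dict with a nested
    "extraction" mapping holding a "Z"-suffixed UTC timestamp.
    """
    stamps = [
        # fromisoformat() on older Pythons rejects a trailing "Z", so normalise it.
        datetime.fromisoformat(e["extraction"]["extracted_at"].replace("Z", "+00:00"))
        for e in entries
    ]
    return max(stamps).astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Sample data: the first timestamp is invented for illustration only.
entries = [
    {"extraction": {"extracted_at": "2026-05-12T21:05:00Z"}},
    {"extraction": {"extracted_at": "2026-05-12T22:30:00Z"}},
]
print(latest_extracted_at(entries))
```

Parsing the timestamps before comparing them avoids relying on string ordering, which would break if entries ever mixed UTC offsets.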

Type of change

  • New writer(s) / new per-letter image entries (ingest)
  • Schema or validator change
  • Release tooling / CI change
  • Documentation / policy
  • Refactor / chore (no behaviour change)

Pre-merge checklist

  • python3 scripts/validate_indexes.py passes locally — ok: 1 writers, 6 entries, 6 files verified.
  • python3 scripts/validate_indexes.py --upstream-path ../public-domain-hand-written-hebrew-scans passes — ok: 1 writers, 6 entries, 6 files verified, 6 upstream-cross-checked.
  • python3 scripts/generate_release_artifacts.py was re-run; regenerated NOTICE.md / CITATION.cff / datapackage.json are staged.
  • python3 scripts/generate_release_artifacts.py --check passes (ok: release artefacts are up to date).
  • python3 -m pytest passes locally — 62 passed, 1 skipped.
  • git diff --check shows no whitespace issues.
  • New image files are LFS-tracked (git lfs ls-files shows all six PNGs).

Rights / licensing

The upstream entry commons__bialik_el_hazippor__p0001 records license_expression: PDM-1.0, rights_basis: public_domain, attribution_required: false. Bialik died in 1934, so the work is public domain in Israel and other life+70 jurisdictions. The same rights propagate to all six derivative crops; per the inheritance table in LICENSE.md, no attribution metadata is required and verification_status: inherited_from_upstream is correct.

The license_expression / rights_basis pair is PDM-1.0 / public_domain, which matches LICENSE_BASIS_MAP in scripts/validate_indexes.py. Since no entry carries an attribution-required license, NOTICE.md's attribution section renders _No entries in this release require attribution._, which is the expected output for this corpus state.
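
For readers unfamiliar with the pairing check, here is a rough sketch of how a license/basis consistency test could look. It is not the validator's actual code: the map below is a hypothetical stand-in for LICENSE_BASIS_MAP containing only the one pair named in this PR, and check_rights is an invented helper.

```python
# Hypothetical stand-in for LICENSE_BASIS_MAP; only the PDM-1.0 pair
# is taken from this PR. The real map may contain more entries.
LICENSE_BASIS_MAP = {"PDM-1.0": "public_domain"}


def check_rights(entry):
    """Return a list of error strings; an empty list means the pair is consistent."""
    errors = []
    license_expression = entry["license_expression"]
    expected_basis = LICENSE_BASIS_MAP.get(license_expression)
    if expected_basis is None:
        errors.append(f"unknown license_expression: {license_expression!r}")
    elif entry["rights_basis"] != expected_basis:
        errors.append(
            f"rights_basis {entry['rights_basis']!r} does not match "
            f"{license_expression!r} (expected {expected_basis!r})"
        )
    return errors


entry = {"license_expression": "PDM-1.0", "rights_basis": "public_domain"}
print(check_rights(entry))
```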

Notes for reviewers

  • Letter identification confidence. lamed, kaf, bet are unambiguous from their cursive shapes. dalet, vav, he are best-guess identifications from cursive Hebrew — I could not fully read the surrounding words with certainty. Bboxes themselves are clean isolated single-letter crops, so any mislabel can be fixed by relabeling the entry without re-cropping. Cursive style flagged on each entry as cursive_ashkenazi.
  • Crop methodology. Coordinates picked manually by inspecting the upstream scan with Pillow (x, y, w, h of each bbox is recorded in upstream.bbox); each crop saved as PNG at native upstream resolution, then sha256/bytes/dims captured from disk. extraction.tool: "manual", extraction.tool_version: "0.0.0-manual" per the SemVer-prerelease regex.
  • Upstream pinning. All entries reference upstream commit df07bd3825405ed93c15fd61fe4d7967fc60885e (40-char SHA, not a tag).
  • Scope. Deliberately small (6 letters) to validate the end-to-end pipeline before bulk ingest. Out of scope: tagging v0.0.0-rc, hletterscriptgen integration, additional writers/scans.
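
As a rough illustration of the crop methodology above, the sketch below converts an (x, y, w, h) bbox into the (left, upper, right, lower) box that Pillow's Image.crop expects, and captures sha256, byte size, and dimensions from a PNG on disk. It uses only the standard library and reads the dimensions from the PNG IHDR header. The helper names bbox_to_box and fingerprint_png are assumptions, not functions from this repository.

```python
import hashlib
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def bbox_to_box(x, y, w, h):
    """Convert an (x, y, w, h) bbox to a (left, upper, right, lower) crop box."""
    return (x, y, x + w, y + h)


def fingerprint_png(path):
    """Capture sha256, byte size, and pixel dimensions of a PNG from disk.

    Hypothetical helper: width and height are read from the IHDR chunk,
    which always follows the 8-byte PNG signature.
    """
    data = Path(path).read_bytes()
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError(f"not a PNG file: {path}")
    width, height = struct.unpack(">II", data[16:24])
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
        "width": width,
        "height": height,
    }


# The kaf bbox from this PR's final commit, expressed as a crop box.
print(bbox_to_box(330, 203, 12, 16))
```

Hashing the file after it is written, rather than the in-memory image, keeps the recorded sha256 tied to the exact bytes that are committed.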

Validator output:

$ python3 scripts/validate_indexes.py --upstream-path ../public-domain-hand-written-hebrew-scans
ok: 1 writers, 6 entries, 6 files verified, 6 upstream-cross-checked

🤖 Generated with Claude Code

shaypal5 and others added 4 commits May 12, 2026 22:56
First real ingest PR: adds the writer chaim_nachman_bialik and 6 manual
single-letter crops (lamed, dalet, vav, he, kaf, bet) cut from line 4 of
the upstream Bialik manuscript scan commons__bialik_el_hazippor__p0001.
Rights inherited from upstream (PDM-1.0; no attribution required).
Regenerated NOTICE.md, CITATION.cff, and datapackage.json.

Validates end-to-end against the upstream clone:
  ok: 1 writers, 6 entries, 6 files verified, 6 upstream-cross-checked

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Serves a self-contained review page at http://localhost:8765/ that shows:
- upstream scan with bbox overlays at 3× zoom (click a bbox to jump to card)
- each cropped letter at native size with metadata
- per-entry verdict form (correct / wrong / uncertain / drop) + free-text notes

Feedback is auto-saved to .review_feedback.json (gitignored) via a
POST /feedback handler so Claude can read it back in-session.

Run:
  python3 scripts/review_crops.py --upstream-path /path/to/upstream-scans

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Local reviewer feedback file written by scripts/review_crops.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Human crop review (via scripts/review_crops.py) revealed:
- dalet → resh (dalet/resh confusion in Ashkenazi cursive)
- he → tav
- vav → mem (collapsed mem form; low legibility; hard HTR example)
- kaf bbox actually contained kaf+yod side by side

Split the kaf+yod bbox at the natural ink gap (x=329–330):
- kaf: x=330,y=203,w=12,h=16
- yod: x=324,y=203,w=7,h=16

Net result: 6→7 entries, all validated (validate_indexes + pytest green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
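
The kaf+yod split in the last commit was chosen by hand, but the idea of cutting a bbox at its natural ink gap can be sketched in code. The example below is an assumption-laden illustration rather than the method used in this PR: widest_gap and split_bbox are invented helpers, the ink profile is toy data, and the sketch yields non-overlapping boxes, whereas the committed kaf and yod boxes share a column.

```python
def widest_gap(ink_per_column):
    """Return (start, end) of the widest ink-free run of interior columns.

    `end` is exclusive. Runs touching either edge of the bbox are ignored,
    since they are margins rather than gaps between letters. Returns None
    if no interior gap exists.
    """
    best = None
    start = None
    for i, ink in enumerate(ink_per_column):
        if ink == 0:
            if start is None:
                start = i
        else:
            if start is not None and start > 0:
                if best is None or (i - start) > (best[1] - best[0]):
                    best = (start, i)
            start = None
    return best


def split_bbox(x, y, w, h, cut):
    """Split an (x, y, w, h) bbox at column offset `cut` into left and right boxes."""
    return (x, y, cut, h), (x + cut, y, w - cut, h)


# Toy profile: number of dark pixels in each column of a 12-pixel-wide crop.
ink = [0, 3, 5, 2, 0, 0, 4, 6, 6, 5, 3, 0]
gap = widest_gap(ink)
cut = (gap[0] + gap[1]) // 2  # cut in the middle of the gap
print(split_bbox(100, 50, 12, 16, cut))
```

Because Hebrew is written right to left, the right-hand box holds the earlier letter of the word.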
@shaypal5 shaypal5 merged commit 26397f3 into main May 12, 2026
1 check passed
@shaypal5 shaypal5 deleted the feat/data-seed-bialik-crops branch May 12, 2026 20:43