feat(data): seed corpus with 6 Bialik letter crops #3
Merged
First real ingest PR: adds the writer chaim_nachman_bialik and 6 manual single-letter crops (lamed, dalet, vav, he, kaf, bet) cut from line 4 of the upstream Bialik manuscript scan commons__bialik_el_hazippor__p0001. Rights are inherited from upstream (PDM-1.0; no attribution required). Regenerated NOTICE.md, CITATION.cff, and datapackage.json.

Validates end-to-end against the upstream clone:

ok: 1 writers, 6 entries, 6 files verified, 6 upstream-cross-checked

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Serves a self-contained review page at http://localhost:8765/ that shows:
- upstream scan with bbox overlays at 3× zoom (click a bbox to jump to its card)
- each cropped letter at native size with metadata
- per-entry verdict form (correct / wrong / uncertain / drop) + free-text notes

Feedback is auto-saved to .review_feedback.json (gitignored) via a POST /feedback handler so Claude can read it back in-session.

Run: python3 scripts/review_crops.py --upstream-path /path/to/upstream-scans

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
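As an illustration of the feedback loop described in this commit, here is a minimal sketch of how a POST /feedback handler could persist verdicts to a JSON file. The payload shape (`entry_id`, `verdict`, `notes`) and the names `save_feedback` and `FeedbackHandler` are assumptions made for this example; the actual scripts/review_crops.py may be structured differently.

```python
import json
from http.server import BaseHTTPRequestHandler
from pathlib import Path

# File name taken from the commit message; everything else here is hypothetical.
FEEDBACK_PATH = Path(".review_feedback.json")
VERDICTS = {"correct", "wrong", "uncertain", "drop"}


def save_feedback(path, entry_id, verdict, notes=""):
    """Merge one verdict into the JSON feedback file and return the full mapping."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    # Load whatever was saved earlier so repeated POSTs accumulate per entry.
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    data[entry_id] = {"verdict": verdict, "notes": notes}
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return data


class FeedbackHandler(BaseHTTPRequestHandler):
    """Accepts JSON bodies on POST /feedback (assumed payload shape)."""

    def do_POST(self):
        if self.path != "/feedback":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length))
            save_feedback(FEEDBACK_PATH, body["entry_id"], body["verdict"],
                          body.get("notes", ""))
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "invalid feedback payload")
            return
        self.send_response(204)  # saved; nothing to return
        self.end_headers()

# To serve locally (blocks until interrupted):
#   from http.server import HTTPServer
#   HTTPServer(("localhost", 8765), FeedbackHandler).serve_forever()
```

Keeping the persistence step in a plain function separate from the HTTP handler makes it easy to test without starting a server.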
Local reviewer feedback file written by scripts/review_crops.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Human crop review (via scripts/review_crops.py) revealed:
- dalet → resh (dalet/resh confusion in Ashkenazi cursive)
- he → tav
- vav → mem (collapsed mem form; low legibility; hard HTR example)
- kaf bbox actually contained kaf+yod side by side

Split the kaf+yod bbox at the natural ink gap (x=329–330):
- kaf: x=330,y=203,w=12,h=16
- yod: x=324,y=203,w=7,h=16

Net result: 6→7 entries, all validated (validate_indexes + pytest green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
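The split in this commit was done by hand, but the idea of cutting a two-letter bbox at an ink gap can be sketched in a few lines. The following is an illustrative example only: it assumes a greyscale crop given as a list of pixel rows, and the helper names (`column_ink`, `find_gap`, `split_bbox`) and the threshold are hypothetical, not part of the repository.

```python
def column_ink(pixels, threshold=128):
    """Count dark pixels per column; `pixels` is a list of rows of 0-255 grey values."""
    width = len(pixels[0])
    return [sum(1 for row in pixels if row[x] < threshold) for x in range(width)]


def find_gap(ink, margin=1):
    """Return the interior column with the least ink (leftmost on ties)."""
    candidates = range(margin, len(ink) - margin)
    return min(candidates, key=lambda x: ink[x])


def split_bbox(bbox, gap):
    """Split (x, y, w, h) at column offset `gap`; the gap column goes to the right box."""
    x, y, w, h = bbox
    if not 0 < gap < w:
        raise ValueError("gap must fall strictly inside the bbox")
    left = (x, y, gap, h)
    right = (x + gap, y, w - gap, h)
    return left, right


# Synthetic 7x4 crop: two dark columns, one blank column, four dark columns.
pixels = [[0, 0, 255, 0, 0, 0, 0] for _ in range(4)]
gap = find_gap(column_ink(pixels))
left, right = split_bbox((100, 50, 7, 4), gap)
```

Because Hebrew runs right to left, the right-hand box holds the earlier letter; in the commit above that is the kaf (x=330), with the yod to its left (x=324).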
Summary
- Adds the writer (`chaim_nachman_bialik`, verified) and 6 manual single-letter crops drawn from upstream scan `commons__bialik_el_hazippor__p0001`, Bialik's autograph manuscript draft of El Hatzippor (upstream `df07bd3`).
- Rights inherited from upstream: `license_expression: PDM-1.0`, `rights_basis: public_domain`, no attribution required. `verification_status: inherited_from_upstream`.
- Regenerated `NOTICE.md`, `CITATION.cff`, `datapackage.json`. `datapackage.json::released_at` advances to `2026-05-12T22:30:00Z` (max of new `extraction.extracted_at`); `CITATION.cff::date-released` stays at `2026-05-12` per `release_recipe.json::version_released_date`, as designed.
- `data/letters/chaim_nachman_bialik/<letter>/<entry_id>.png`, all LFS-tracked (verified via `git lfs ls-files`).

Type of change
Pre-merge checklist
- `python3 scripts/validate_indexes.py` passes locally: `ok: 1 writers, 6 entries, 6 files verified`.
- `python3 scripts/validate_indexes.py --upstream-path ../public-domain-hand-written-hebrew-scans` passes: `ok: 1 writers, 6 entries, 6 files verified, 6 upstream-cross-checked`.
- `python3 scripts/generate_release_artifacts.py` was re-run; regenerated `NOTICE.md` / `CITATION.cff` / `datapackage.json` are staged.
- `python3 scripts/generate_release_artifacts.py --check` passes (`ok: release artefacts are up to date`).
- `python3 -m pytest` passes locally: `62 passed, 1 skipped`.
- `git diff --check` shows no whitespace issues.
- `git lfs ls-files` shows all six PNGs.

Rights / licensing
The upstream entry `commons__bialik_el_hazippor__p0001` records `license_expression: PDM-1.0`, `rights_basis: public_domain`, `attribution_required: false`. Bialik died in 1934, so the work is public domain in Israel and other life+70 jurisdictions. The same rights propagate to all six derivative crops; per the inheritance table in `LICENSE.md`, no attribution metadata is required and `verification_status: inherited_from_upstream` is correct.

The `license_expression`/`rights_basis` pair is `PDM-1.0` → `public_domain`, which matches `LICENSE_BASIS_MAP` in `scripts/validate_indexes.py`. Since no entry carries an attribution-required license, `NOTICE.md`'s attribution section renders `_No entries in this release require attribution._`, which is the expected output for this corpus state.

Notes for reviewers
- `cursive_ashkenazi`.
- `x,y,w,h` of each bbox is recorded in `upstream.bbox`; each crop saved as PNG at native upstream resolution, then sha256/bytes/dims captured from disk.
- `extraction.tool: "manual"`, `extraction.tool_version: "0.0.0-manual"` per the SemVer-prerelease regex.
- `df07bd3825405ed93c15fd61fe4d7967fc60885e` (40-char SHA, not a tag).
- `v0.0.0-rc`, hletterscriptgen integration, additional writers/scans.

Validator output:
🤖 Generated with Claude Code
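The reviewer notes mention that sha256, byte count, and dimensions are captured from disk for each saved crop. A minimal sketch of such a capture step, using only the standard library, might look like the following. The function names and the returned field names are assumptions for illustration and are not taken from the repository's index schema.

```python
import hashlib
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Read (width, height) from the IHDR chunk, which is always the first PNG chunk."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then big-endian 32-bit width and height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def describe_crop(path):
    """Return checksum, size, and pixel dimensions for one crop file (hypothetical field names)."""
    data = Path(path).read_bytes()
    width, height = png_dimensions(data)
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
        "width": width,
        "height": height,
    }
```

Reading the values back from the file on disk, rather than from the in-memory image, means the recorded checksum and size describe exactly the bytes that get committed.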