feat(setup): scaffold dataset structure, schemas, CI, and release tooling #1
Merged
…oling

Initial scaffolding for the hletterscript repository: a dataset of sets of per-letter image crops of handwritten Hebrew letters, grouped by writer. The corpus is empty in this PR; the goal is to land the contracts and machinery needed before per-letter ingestion begins.

What this PR adds:
- `docs/dataset_structure.md`: layout, serialisation, rights-inheritance model, ingestion flow, and `upstream` cross-reference rules.
- `docs/letters.md`: canonical 27-form Hebrew letter enumeration (22 base letters + 5 final forms) with codepoints, glyphs, and slugs.
- `schemas/writer.schema.json`: writer/scribe record contract; defines the "set" of letter images attributed to one person.
- `schemas/entry.schema.json`: per-letter image record contract, including the `upstream` block that pins each crop to a specific scan in `public-domain-hand-written-hebrew-scans` by source_id, entry_id, sha256, commit, and bbox.
- `scripts/validate_indexes.py`: schema validation, referential integrity, Hebrew letter codepoint/name/form consistency, upstream repo URL pinning, local_path conventions, and on-disk file size/sha256 verification.
- `scripts/generate_release_artifacts.py` + `release_recipe.json`: deterministic generation of `NOTICE.md`, `CITATION.cff`, and `datapackage.json`. `released_at` derives from `max(extraction.extracted_at)`; for the empty initial-setup state it falls back to `release_recipe.json::initial_release_date`.
- `LICENSE` (CC0 1.0) + `LICENSE.md`: compound licensing policy. CC0 for repo-authored metadata; per-image rights inherited from the upstream scan, with explicit handling of CC-BY-SA-4.0 ShareAlike inheritance on crops as adaptations.
- `AGENTS.md`: pre-PR checklist, ingest rules, accepted/rejected licenses, naming conventions.
- `.github/workflows/ci.yml`: runs validate_indexes.py, generate_release_artifacts.py --check, and pytest on every PR.
- `tests/`: 33 unit tests covering schema rejection, referential integrity, letter consistency, upstream pinning, attribution gating, file-integrity checks, empty-corpus fallbacks, and Frictionless Data Package conformance.

Validation evidence (run from repo root):
$ python3 scripts/validate_indexes.py
ok: 0 writers, 0 entries, 0 files verified
$ python3 scripts/generate_release_artifacts.py --check
ok: release artefacts are up to date
$ python3 -m pytest
33 passed in 1.80s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses the 18-item code review on the initial scaffolding PR.
Critical (merge-blocking) fixes:
- `upstream.commit` is now sha-only (40 hex); tag-style refs that were
silently producing broken GitHub /blob/release:vX.Y.Z/ URLs in
NOTICE.md are rejected by the schema. Human-readable tag info moves
to an optional `upstream.release_tag` field.
- Split `CITATION.cff::date-released` (stable per version, from
recipe::version_released_date) from `datapackage.json::released_at`
(corpus state, from max(extraction.extracted_at)). Citations no
longer drift as entries accumulate.
- `data/letters/**/*.{png,jpg,jpeg,webp,tif,tiff}` are now tracked
via Git LFS from day one (`.gitattributes`). CI fetches LFS bytes
before validating.
- Generator now has fixture-driven tests on a synthetic non-empty
corpus (including one CC-BY-SA-4.0 attribution-required entry). The
entire NOTICE.md stanza-building path is now exercised in CI.
- `upstream.repo` per-row field is gone; the canonical upstream URL
lives in `scripts/release_recipe.json::upstream_repo` and is read
from there.
- `writer.schema.json::references` is no longer `minItems: 1`
unconditionally — the requirement is now conditional on
`status in {verified, rejected}`, so `candidate` and `needs_review`
writers can ship without fabricated references.
- `validate_indexes.py` now enforces `LICENSE_BASIS_MAP`:
`rights.rights_basis` must match `rights.license_expression`
(e.g. CC-BY-SA-4.0 ↔ cc_by_sa). Mismatched pairs hard-fail.
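Two of the critical checks above can be illustrated briefly. This is a minimal sketch under stated assumptions, not the repository's actual code: the function names are hypothetical, lowercase hex is assumed for the SHA, and every `LICENSE_BASIS_MAP` entry other than the `CC-BY-SA-4.0` ↔ `cc_by_sa` pair quoted above is a guess.

```python
import re

# upstream.commit must be a full 40-character hex SHA (lowercase assumed here);
# tag-style refs such as "release:v1.2.3" are rejected.
COMMIT_SHA_RE = re.compile(r"[0-9a-f]{40}")

# Only the CC-BY-SA-4.0 pair is taken from the commit message.
# The other entries are illustrative assumptions.
LICENSE_BASIS_MAP = {
    "CC-BY-SA-4.0": "cc_by_sa",
    "CC-BY-4.0": "cc_by",  # assumed
    "CC0-1.0": "cc0",      # assumed
}


def is_sha_only(commit: str) -> bool:
    """Return True when the ref is a bare 40-hex commit SHA."""
    return COMMIT_SHA_RE.fullmatch(commit) is not None


def rights_errors(rights: dict) -> list[str]:
    """Return error messages for a rights block; an empty list means it passes."""
    expression = rights.get("license_expression")
    basis = rights.get("rights_basis")
    expected = LICENSE_BASIS_MAP.get(expression)
    if expected is None:
        return [f"unknown license_expression {expression!r}"]
    if basis != expected:
        return [f"rights_basis {basis!r} does not match {expression} (expected {expected!r})"]
    return []
```

Keeping the mapping in one table means a mismatched pair fails in the validator rather than surfacing later as a wrong stanza in `NOTICE.md`.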
Important fixes:
- `extraction.tool_version` regex now accepts SemVer build metadata
AND `git describe --tags --dirty` output (e.g. `v1.2.3-3-gabc1234`).
- New `--upstream-path` flag: when set, the validator cross-checks
each entry's `upstream.sha256` against the live upstream file
record and verifies `upstream.bbox` fits inside the upstream scan
dimensions. CI sets this flag against a fresh checkout of
`HeOCR/public-domain-hand-written-hebrew-scans`.
- `image.background = "transparent"` now requires an alpha-capable
mime type (PNG / WebP / TIFF) — enforced by an `if/then` in the
schema. The background enum is expanded to include `black`, `gray`,
and `binarized`.
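The bbox and transparency rules above are simple to state in code. The sketch below is illustrative only: it assumes `upstream.bbox` is stored as `x`, `y`, `width`, `height` in pixels, which the commit message does not specify, and the function names are hypothetical. In the PR itself the transparency rule lives in the JSON Schema as an `if/then`, not in Python.

```python
# Mime types treated as alpha-capable, per the commit message (PNG / WebP / TIFF).
ALPHA_CAPABLE_MIME_TYPES = {"image/png", "image/webp", "image/tiff"}


def bbox_fits(bbox: dict, scan_width: int, scan_height: int) -> bool:
    """Return True when the crop rectangle lies fully inside the upstream scan.

    Assumes bbox has integer keys x, y, width, height (a hypothetical layout).
    """
    x, y = bbox["x"], bbox["y"]
    w, h = bbox["width"], bbox["height"]
    return (
        x >= 0 and y >= 0
        and w > 0 and h > 0
        and x + w <= scan_width
        and y + h <= scan_height
    )


def background_allowed(background: str, mime_type: str) -> bool:
    """A transparent background is only valid on an alpha-capable format."""
    return background != "transparent" or mime_type in ALPHA_CAPABLE_MIME_TYPES
```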
Polish:
- Drop the vestigial `letter.codepoint` regex (the cross-check against
the canonical LETTER_TABLE in the validator is the source of truth).
- Add optional `letter.style` enum for downstream syngen/HTR consumers
that filter on Hebrew handwriting style (Ashkenazi cursive, Sephardi
block, Rashi, etc.).
- `--repo-root` flag is now documented in AGENTS.md (it's a tests-only
override of the file-integrity check's repo root).
- Document Python 3.11+ floor in requirements-dev.txt and README.
- Add writer-disambiguation policy to AGENTS.md and
docs/dataset_structure.md: `<name>_<birth_year>`, with fallbacks
to death year / period start / authority ID.
- Standardise validator error format to
`<file>:<line>: <id>: <message>`.
Infrastructure:
- `.gitattributes` (LFS rules + LF eol enforcement).
- `.editorconfig` (UTF-8, LF, final newline; tabs-vs-spaces convention).
- `.github/pull_request_template.md` with the pre-merge checklist.
- `CHANGELOG.md` (Keep a Changelog, v0.0.0-rc entry).
- `docs/release_process.md` documents the two-timestamp model and the
version-bump runbook.
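The two-timestamp model can be sketched as follows. This is a minimal illustration, not the generator's real code: the function names and the in-memory shape of `entries` and `recipe` are assumptions, while the keys `extraction.extracted_at`, `initial_release_date`, and `version_released_date` come from the commit messages.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def corpus_released_at(entries: list[dict], recipe: dict) -> str:
    """datapackage.json::released_at tracks corpus state.

    It is the newest extraction timestamp, or the recipe fallback when
    the corpus is empty.
    """
    stamps = [parse_timestamp(e["extraction"]["extracted_at"]) for e in entries]
    if not stamps:
        return recipe["initial_release_date"]
    return max(stamps).isoformat().replace("+00:00", "Z")


def citation_date_released(recipe: dict) -> str:
    """CITATION.cff::date-released is fixed per version, independent of entries."""
    return recipe["version_released_date"]
```

Because the citation date reads only from the recipe, adding entries moves `released_at` but leaves the citation untouched.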
Validation evidence:
$ python3 scripts/validate_indexes.py
ok: 0 writers, 0 entries, 0 files verified
$ python3 scripts/generate_release_artifacts.py --check
ok: release artefacts are up to date
$ python3 -m pytest
63 passed in 4.57s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review-feedback follow-up commit
Pushed
Critical fixes
Important fixes
Polish
Infrastructure
Validation evidence
```text
$ python3 scripts/generate_release_artifacts.py --check
$ python3 -m pytest
```
The PR title is renamed from
🤖 Generated with Claude Code
Summary
Initial scaffolding for `hletterscript`: a dataset of sets of per-letter image crops of handwritten Hebrew letters, grouped by writer. The corpus is empty in this PR; the goal is to land the contracts and machinery needed before per-letter ingestion begins. The structure mirrors the upstream `public-domain-hand-written-hebrew-scans` repo so tooling and conventions stay parallel across the HeOCR org.

What this PR adds

Data model
- `schemas/writer.schema.json`: writer/scribe record contract. Each row defines the set of letter images attributed to one person (the "set" in the problem statement).
- `schemas/entry.schema.json`: per-letter image record contract. The load-bearing field is the `upstream` block, which pins each crop to a specific scan in `public-domain-hand-written-hebrew-scans` by `source_id`, `entry_id`, `sha256`, `commit`, and `bbox`, so rights can be inherited and the link is verifiable.
- `docs/dataset_structure.md`: full layout, serialisation, rights-inheritance model, and ingestion flow.
- `docs/letters.md`: canonical 27-form Hebrew letter enumeration (22 base letters + 5 final forms) with codepoints, glyphs, and ASCII slugs.

Validation
`scripts/validate_indexes.py` enforces:
- Schema validation of `writers.jsonl` and `entries.jsonl`.
- Referential integrity (`entry.writer_id` exists in `writers.jsonl`).
- `entry_id` shape: `<writer_id>__<letter.name>__v<NNNN>`.
- Hebrew letter codepoint/name/form consistency with `docs/letters.md`.
- `upstream.repo` pinned to the upstream URL; `upstream.entry_id` derived from `upstream.source_id`.
- `image.local_path` matches `data/letters/<writer_id>/<letter.name>/<entry_id>.<ext>`, with extension consistent with `image.mime_type`.
- `image.sha256` and `image.bytes` re-checked against the on-disk file on every run.
- Entries under attribution-requiring licenses (`CC-BY-4.0`, `CC-BY-SA-4.0`) must carry non-blank `attribution_text` and `attribution_url`.

Release infrastructure
`scripts/generate_release_artifacts.py` + `scripts/release_recipe.json` generate three artefacts deterministically:
- `NOTICE.md`: attribution roll-up for entries whose license requires attribution.
- `CITATION.cff`: Citation File Format 1.2.0.
- `datapackage.json`: Frictionless Data Package manifest with license breakdown, per-writer counts, per-letter counts, and writer-status breakdown.

`released_at` derives from `max(extraction.extracted_at)`; for the empty initial-setup state it falls back to `release_recipe.json::initial_release_date`. `--check` mode is non-mutating and is what CI runs.

Licensing policy
- `LICENSE`: Creative Commons CC0 1.0 Universal full legal text.
- `LICENSE.md`: compound licensing policy. CC0 for repo-authored metadata; per-image rights inherited from the upstream scan, with explicit handling of CC-BY-SA-4.0 ShareAlike inheritance on crops as adaptations.

CI
`.github/workflows/ci.yml` runs on every push to `main` and every PR:
- `validate_indexes.py`
- `generate_release_artifacts.py --check`
- `pytest`

Tests
- `tests/`: 33 unit tests covering, among other areas, `--check` mode, recipe validation, and Frictionless Data Package conformance.

Agent guidance
- `AGENTS.md`: pre-PR checklist, ingest rules, naming conventions, sha256/dimension helpers for macOS and Linux, accepted/rejected license lists, and what NOT to commit.

Initial release state
This PR ships `v0.0.0-rc`: an empty-corpus initial-setup release. `writers.jsonl` and `entries.jsonl` are empty. The release-artifact generator falls back to `release_recipe.json::initial_release_date` (`2026-05-12T00:00:00Z`) for the timestamp. Subsequent PRs (produced by `hletterscriptgen`) will populate the indexes.

Validation evidence
```text
$ python3 scripts/validate_indexes.py
ok: 0 writers, 0 entries, 0 files verified
$ python3 scripts/generate_release_artifacts.py --check
ok: release artefacts are up to date
$ python3 -m pytest
33 passed in 1.80s
```
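The all-zero counts above say little about the naming rules the validator will apply once entries arrive. As a hedged illustration of the `entry_id` shape and `local_path` layout described in this PR, the sketch below checks one record. The helper names, the zero-padded four-digit reading of `<NNNN>`, and the mime-to-extension table are assumptions, not the validator's actual implementation.

```python
# Assumed mapping from mime type to acceptable file extensions.
MIME_EXTENSIONS = {
    "image/png": {"png"},
    "image/jpeg": {"jpg", "jpeg"},
    "image/webp": {"webp"},
    "image/tiff": {"tif", "tiff"},
}


def expected_entry_id(writer_id: str, letter_name: str, version: int) -> str:
    """Build <writer_id>__<letter.name>__v<NNNN>, assuming NNNN is zero-padded."""
    return f"{writer_id}__{letter_name}__v{version:04d}"


def local_path_ok(entry: dict) -> bool:
    """Check data/letters/<writer_id>/<letter.name>/<entry_id>.<ext> and the extension."""
    prefix = (
        f"data/letters/{entry['writer_id']}/"
        f"{entry['letter']['name']}/{entry['entry_id']}."
    )
    path = entry["image"]["local_path"]
    if not path.startswith(prefix):
        return False
    extension = path[len(prefix):]
    return extension in MIME_EXTENSIONS.get(entry["image"]["mime_type"], set())
```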
Test plan
- Review the `upstream` block design against the kind of provenance data `hletterscriptgen` can realistically emit.
- Confirm `LICENSE.md` matches the org's intent for adaptation-bearing crops.
- Confirm the `initial_release_date` fallback is acceptable for the v0.0.0-rc state, or request a different fallback policy.

🤖 Generated with Claude Code