feat(setup): scaffold dataset structure, schemas, CI, and release tooling #1

Merged
shaypal5 merged 2 commits into main from chore/initial-setup on May 12, 2026

@shaypal5
Contributor

Summary

Initial scaffolding for hletterscript — a dataset of sets of per-letter image crops of handwritten Hebrew letters, grouped by writer. The corpus is empty in this PR; the goal is to land the contracts and machinery needed before per-letter ingestion begins. The structure mirrors the upstream public-domain-hand-written-hebrew-scans repo so tooling and conventions stay parallel across the HeOCR org.

What this PR adds

Data model

  • schemas/writer.schema.json — writer/scribe record contract. Each row defines the set of letter images attributed to one person (the "set" in the problem statement).
  • schemas/entry.schema.json — per-letter image record contract. The load-bearing field is the upstream block, which pins each crop to a specific scan in public-domain-hand-written-hebrew-scans by source_id, entry_id, sha256, commit, and bbox, so rights can be inherited and the link is verifiable.
  • docs/dataset_structure.md — full layout, serialisation, rights-inheritance model, and ingestion flow.
  • docs/letters.md — canonical 27-form Hebrew letter enumeration (22 base letters + 5 final forms) with codepoints, glyphs, and ASCII slugs.
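
As a rough illustration of the 27-form enumeration, the sketch below derives the table from Unicode itself: the Hebrew letters occupy U+05D0 through U+05EA, and the five final forms carry "FINAL" in their Unicode names. The slug scheme and the row layout are assumptions made for this example; the canonical values are whatever docs/letters.md lists.

```python
import unicodedata

# Unicode range of the Hebrew letters: alef (U+05D0) through tav (U+05EA).
FIRST, LAST = 0x05D0, 0x05EA


def build_letter_table():
    """Return one row per letter form. The slug scheme is hypothetical."""
    rows = []
    for cp in range(FIRST, LAST + 1):
        name = unicodedata.name(chr(cp))  # e.g. "HEBREW LETTER FINAL KAF"
        short = name.removeprefix("HEBREW LETTER ")
        rows.append({
            "codepoint": f"U+{cp:04X}",
            "glyph": chr(cp),
            "form": "final" if short.startswith("FINAL ") else "base",
            "slug": short.lower().replace(" ", "_"),  # assumed ASCII slug
        })
    return rows


rows = build_letter_table()
finals = [r for r in rows if r["form"] == "final"]
print(len(rows), len(finals))  # 27 forms in total, 5 of them final
```

Deriving the rows from `unicodedata` keeps codepoint, glyph, and form in agreement by construction, which is the same consistency the validator is described as cross-checking.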

Validation

  • scripts/validate_indexes.py enforces:
    • JSON Schema conformance for both writers.jsonl and entries.jsonl.
    • Referential integrity (entry.writer_id exists in writers.jsonl).
    • entry_id shape: <writer_id>__<letter.name>__v<NNNN>.
    • Hebrew letter codepoint/name/form cross-field consistency against the canonical table in docs/letters.md.
    • upstream.repo pinned to the upstream URL; upstream.entry_id derived from upstream.source_id.
    • image.local_path matches data/letters/<writer_id>/<letter.name>/<entry_id>.<ext>, with extension consistent with image.mime_type.
    • image.sha256 and image.bytes re-checked against the on-disk file on every run.
    • Attribution-required licenses (CC-BY-4.0, CC-BY-SA-4.0) must carry non-blank attribution_text and attribution_url.
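
The naming rules in the list above can be sketched as follows. This is not the code in scripts/validate_indexes.py; it assumes records are nested dicts keyed by the dotted field names used here, and the MIME-to-extension mapping and the sample writer ID are illustrative guesses.

```python
from pathlib import PurePosixPath

# Assumed mapping from image.mime_type to acceptable file extensions.
MIME_EXTENSIONS = {
    "image/png": {".png"},
    "image/jpeg": {".jpg", ".jpeg"},
    "image/webp": {".webp"},
    "image/tiff": {".tif", ".tiff"},
}


def check_entry(entry):
    """Return a list of naming/path problems for one entry record."""
    errors = []
    writer_id = entry["writer_id"]
    letter_name = entry["letter"]["name"]
    entry_id = entry["entry_id"]

    # entry_id must be <writer_id>__<letter.name>__v<NNNN>.
    prefix = f"{writer_id}__{letter_name}__v"
    version = entry_id[len(prefix):]
    shape_ok = (
        entry_id.startswith(prefix)
        and len(version) == 4
        and version.isascii()
        and version.isdigit()
    )
    if not shape_ok:
        errors.append("entry_id is not <writer_id>__<letter.name>__v<NNNN>")

    # local_path must be data/letters/<writer_id>/<letter.name>/<entry_id>.<ext>.
    path = PurePosixPath(entry["image"]["local_path"])
    expected_dir = PurePosixPath("data/letters") / writer_id / letter_name
    if path.parent != expected_dir or path.stem != entry_id:
        errors.append("image.local_path does not follow the layout convention")

    # The extension must agree with the declared MIME type.
    allowed = MIME_EXTENSIONS.get(entry["image"]["mime_type"], set())
    if path.suffix not in allowed:
        errors.append("file extension is inconsistent with image.mime_type")
    return errors


sample = {
    "entry_id": "cohen_1850__final_kaf__v0001",  # hypothetical IDs
    "writer_id": "cohen_1850",
    "letter": {"name": "final_kaf"},
    "image": {
        "local_path": "data/letters/cohen_1850/final_kaf/"
                      "cohen_1850__final_kaf__v0001.png",
        "mime_type": "image/png",
    },
}
print(check_entry(sample))
```

Building the expected prefix from the record's own fields avoids parsing on `__`, which would be ambiguous if writer IDs or letter names themselves contain underscores.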

Release infrastructure

  • scripts/generate_release_artifacts.py + scripts/release_recipe.json generate three artefacts deterministically:
    • NOTICE.md — attribution roll-up for entries whose license requires attribution.
    • CITATION.cff — Citation File Format 1.2.0.
    • datapackage.json — Frictionless Data Package manifest with license breakdown, per-writer counts, per-letter counts, and writer-status breakdown.
  • released_at derives from max(extraction.extracted_at); for the empty initial-setup state it falls back to release_recipe.json::initial_release_date.
  • --check mode is non-mutating and is what CI runs.
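
The released_at rule can be sketched as below. The function name and the in-memory record shapes are assumptions for illustration; only the field names extraction.extracted_at and initial_release_date come from this PR.

```python
from datetime import datetime


def parse_utc(timestamp):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def released_at(entries, recipe):
    """Latest extraction timestamp, or the recipe fallback for an empty corpus."""
    stamps = [e["extraction"]["extracted_at"] for e in entries]
    if not stamps:
        return recipe["initial_release_date"]
    # Compare parsed datetimes, not raw strings, so differing UTC offsets sort correctly.
    return max(stamps, key=parse_utc)


recipe = {"initial_release_date": "2026-05-12T00:00:00Z"}
print(released_at([], recipe))  # falls back to the recipe value
```

Because the result depends only on the indexes and the recipe, repeated runs over the same inputs yield the same timestamp, which is what makes a non-mutating --check comparison meaningful.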

Licensing policy

  • LICENSE — Creative Commons CC0 1.0 Universal full legal text.
  • LICENSE.md — compound licensing policy:
    • Repository-authored metadata: CC0 1.0.
    • Per-image crops: rights inherited from the upstream scan with an explicit inheritance table.
    • CC-BY-SA-4.0 ShareAlike propagates: a crop of a CC-BY-SA scan is an adaptation, so the crop carries CC-BY-SA-4.0. Bundles that merely aggregate without modifying do not need to be relicensed.
    • NC / ND / research-only / unknown-rights crops are not ingestable — stricter than upstream's exclusion list because this corpus only delivers value if downstream synthetic generators can redistribute and remix it.
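
A minimal sketch of the inheritance and ingestability rules, assuming SPDX-style license identifiers. The accepted set below is a guess for illustration; the authoritative accepted/rejected lists are the ones in LICENSE.md and AGENTS.md.

```python
# Licenses whose entries must carry attribution_text and attribution_url.
ATTRIBUTION_REQUIRED = {"CC-BY-4.0", "CC-BY-SA-4.0"}

# Assumed allowlist; anything else (NC, ND, research-only, unknown) is refused.
INGESTABLE = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}


def crop_license(upstream_license):
    """License a crop inherits from its upstream scan, or raise if not ingestable."""
    if upstream_license not in INGESTABLE:
        raise ValueError(f"not ingestable: {upstream_license!r}")
    # A crop is an adaptation, so ShareAlike terms carry over unchanged.
    return upstream_license


def attribution_problems(rights):
    """Return missing attribution fields for an attribution-required license."""
    if rights["license_expression"] not in ATTRIBUTION_REQUIRED:
        return []
    return [
        field
        for field in ("attribution_text", "attribution_url")
        if not str(rights.get(field) or "").strip()
    ]


print(crop_license("CC-BY-SA-4.0"))
print(attribution_problems({"license_expression": "CC-BY-4.0",
                            "attribution_text": "  "}))
```

An allowlist fails closed: a license string nobody anticipated is rejected rather than silently admitted, which matches the stricter-than-upstream stance described above.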

CI

  • .github/workflows/ci.yml runs on every push to main and every PR:
    • validate_indexes.py
    • generate_release_artifacts.py --check
    • pytest

Tests

  • 33 unit tests cover schema rejection, referential integrity, letter codepoint/name/form consistency, upstream URL/entry pinning, attribution gating, file-integrity checks, empty-corpus fallbacks, --check mode, recipe validation, and Frictionless Data Package conformance.

Agent guidance

  • AGENTS.md — pre-PR checklist, ingest rules, naming conventions, sha256/dimension helpers for macOS and Linux, accepted/rejected license lists, and what NOT to commit.
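
For the values that image.sha256 and image.bytes are checked against, a platform-independent helper might look like the following. It is a sketch and not the helper documented in AGENTS.md; the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes) for the file at path."""
    digest = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as handle:
        # Read in chunks so large scans do not have to fit in memory.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Comparing the returned pair with the recorded image.sha256 and image.bytes reproduces the on-disk integrity check, and it behaves the same on macOS and Linux.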

Initial release state

This PR ships v0.0.0-rc — an empty-corpus initial-setup release. writers.jsonl and entries.jsonl are empty. The release-artifact generator falls back to release_recipe.json::initial_release_date (2026-05-12T00:00:00Z) for the timestamp. Subsequent PRs (produced by hletterscriptgen) will populate the indexes.

Validation evidence

```text
$ python3 scripts/validate_indexes.py
ok: 0 writers, 0 entries, 0 files verified

$ python3 scripts/generate_release_artifacts.py --check
ok: release artefacts are up to date

$ python3 -m pytest
33 passed in 1.80s
```

Test plan

  • CI passes (validate, generate --check, pytest).
  • Reviewers sanity-check the upstream block design against the kind of provenance data hletterscriptgen can realistically emit.
  • Reviewers confirm the CC-BY-SA-4.0 inheritance rule in LICENSE.md matches the org's intent for adaptation-bearing crops.
  • Reviewers confirm the empty-corpus initial_release_date fallback is acceptable for the v0.0.0-rc state, or request a different fallback policy.

🤖 Generated with Claude Code

shaypal5 and others added 2 commits May 12, 2026 15:49
…oling

Initial scaffolding for the hletterscript repository — a dataset of
sets of per-letter image crops of handwritten Hebrew letters grouped by
writer. The corpus is empty in this PR; the goal is to land the
contracts and machinery needed before per-letter ingestion begins.

What this PR adds:

- `docs/dataset_structure.md` — layout, serialisation, rights-inheritance
  model, ingestion flow, and `upstream` cross-reference rules.
- `docs/letters.md` — canonical 27-form Hebrew letter enumeration
  (22 base letters + 5 final forms) with codepoints, glyphs, and slugs.
- `schemas/writer.schema.json` — writer/scribe record contract; defines
  the "set" of letter images attributed to one person.
- `schemas/entry.schema.json` — per-letter image record contract,
  including the `upstream` block that pins each crop to a specific
  scan in `public-domain-hand-written-hebrew-scans` by source_id,
  entry_id, sha256, commit, and bbox.
- `scripts/validate_indexes.py` — schema validation, referential
  integrity, Hebrew letter codepoint/name/form consistency, upstream
  repo URL pinning, local_path conventions, and on-disk file
  size/sha256 verification.
- `scripts/generate_release_artifacts.py` + `release_recipe.json` —
  deterministic generation of `NOTICE.md`, `CITATION.cff`, and
  `datapackage.json`. `released_at` derives from
  `max(extraction.extracted_at)`; for the empty initial-setup state it
  falls back to `release_recipe.json::initial_release_date`.
- `LICENSE` (CC0 1.0) + `LICENSE.md` — compound licensing policy:
  CC0 for repo-authored metadata; per-image rights inherited from
  upstream scan, with explicit handling of CC-BY-SA-4.0 ShareAlike
  inheritance on crops as adaptations.
- `AGENTS.md` — pre-PR checklist, ingest rules, accepted/rejected
  licenses, naming conventions.
- `.github/workflows/ci.yml` — runs validate_indexes.py,
  generate_release_artifacts.py --check, and pytest on every PR.
- `tests/` — 33 unit tests covering schema rejection, referential
  integrity, letter consistency, upstream pinning, attribution gating,
  file-integrity checks, empty-corpus fallbacks, and Frictionless
  Data Package conformance.

Validation evidence (run from repo root):

  $ python3 scripts/validate_indexes.py
  ok: 0 writers, 0 entries, 0 files verified

  $ python3 scripts/generate_release_artifacts.py --check
  ok: release artefacts are up to date

  $ python3 -m pytest
  33 passed in 1.80s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Addresses the 18-item code review on the initial scaffolding PR.
Critical (merge-blocking) fixes:

- `upstream.commit` is now sha-only (40 hex); tag-style refs that were
  silently producing broken GitHub /blob/release:vX.Y.Z/ URLs in
  NOTICE.md are rejected by the schema. Human-readable tag info moves
  to an optional `upstream.release_tag` field.
- Split `CITATION.cff::date-released` (stable per version, from
  recipe::version_released_date) from `datapackage.json::released_at`
  (corpus state, from max(extraction.extracted_at)). Citations no
  longer drift as entries accumulate.
- `data/letters/**/*.{png,jpg,jpeg,webp,tif,tiff}` are now tracked
  via Git LFS from day one (`.gitattributes`). CI fetches LFS bytes
  before validating.
- Generator now has fixture-driven tests on a synthetic non-empty
  corpus (including one CC-BY-SA-4.0 attribution-required entry). The
  entire NOTICE.md stanza-building path is now exercised in CI.
- `upstream.repo` per-row field is gone; the canonical upstream URL
  lives in `scripts/release_recipe.json::upstream_repo` and is read
  from there.
- `writer.schema.json::references` is no longer `minItems: 1`
  unconditionally — the requirement is now conditional on
  `status in {verified, rejected}`, so `candidate` and `needs_review`
  writers can ship without fabricated references.
- `validate_indexes.py` now enforces `LICENSE_BASIS_MAP`:
  `rights.rights_basis` must match `rights.license_expression`
  (e.g. CC-BY-SA-4.0 ↔ cc_by_sa). Mismatched pairs hard-fail.

Important fixes:

- `extraction.tool_version` regex now accepts SemVer build metadata
  AND `git describe --tags --dirty` output (e.g. `v1.2.3-3-gabc1234`).
- New `--upstream-path` flag: when set, the validator cross-checks
  each entry's `upstream.sha256` against the live upstream file
  record and verifies `upstream.bbox` fits inside the upstream scan
  dimensions. CI sets this flag against a fresh checkout of
  `HeOCR/public-domain-hand-written-hebrew-scans`.
- `image.background = "transparent"` now requires an alpha-capable
  mime type (PNG / WebP / TIFF) — enforced by an `if/then` in the
  schema. The background enum is expanded to include `black`, `gray`,
  and `binarized`.

Polish:

- Drop the vestigial `letter.codepoint` regex (the cross-check against
  the canonical LETTER_TABLE in the validator is the source of truth).
- Add optional `letter.style` enum for downstream syngen/HTR consumers
  that filter on Hebrew handwriting style (Ashkenazi cursive, Sephardi
  block, Rashi, etc.).
- `--repo-root` flag is now documented in AGENTS.md (it's a tests-only
  override of the file-integrity check's repo root).
- Document Python 3.11+ floor in requirements-dev.txt and README.
- Add writer-disambiguation policy to AGENTS.md and
  docs/dataset_structure.md: `<name>_<birth_year>`, with fallbacks
  to death year / period start / authority ID.
- Standardise validator error format to
  `<file>:<line>: <id>: <message>`.

Infrastructure:

- `.gitattributes` (LFS rules + LF eol enforcement).
- `.editorconfig` (UTF-8, LF, final newline; tabs-vs-spaces convention).
- `.github/pull_request_template.md` with the pre-merge checklist.
- `CHANGELOG.md` (Keep a Changelog, v0.0.0-rc entry).
- `docs/release_process.md` documents the two-timestamp model and the
  version-bump runbook.

Validation evidence:

  $ python3 scripts/validate_indexes.py
  ok: 0 writers, 0 entries, 0 files verified

  $ python3 scripts/generate_release_artifacts.py --check
  ok: release artefacts are up to date

  $ python3 -m pytest
  63 passed in 4.57s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shaypal5 changed the title from "chore(setup): scaffold dataset structure, schemas, CI, and release tooling" to "feat(setup): scaffold dataset structure, schemas, CI, and release tooling" on May 12, 2026
@shaypal5
Contributor Author

Review-feedback follow-up commit

Pushed dc94d45 addressing the self-review on this PR.

Critical fixes

  • Broken NOTICE.md URLs: upstream.commit now requires a 40-char SHA. Tag refs that produced /blob/release:v0.1.0-rc/ 404s are rejected by the schema. Human-readable tag info moves to optional upstream.release_tag.
  • Citation date drift: split CITATION.cff::date-released (now stable per version, from release_recipe.json::version_released_date) from datapackage.json::released_at (corpus state, from max(extraction.extracted_at)). Citations no longer drift as entries land.
  • Git LFS from day one: .gitattributes tracks data/letters/**/*.{png,jpg,jpeg,webp,tif,tiff} via LFS; CI does git lfs pull before validating; AGENTS.md documents the first-time setup.
  • No tests on non-empty corpus: added synthetic_corpus fixture-driven tests (one PDM + one CC-BY-SA entry). The NOTICE-stanza/URL/license-breakdown paths are now exercised in CI, including an explicit regression test that release: never appears inside a NOTICE URL.
  • Per-row upstream.repo: removed. The canonical upstream URL lives only in release_recipe.json::upstream_repo. Schema's additionalProperties: false now rejects a stray repo field.
  • Conditional references.minItems: candidate and needs_review writers can ship with zero references; only verified / rejected writers must have ≥1 reference.
  • rights_basis vs license_expression mismatch: validator now hard-fails when the pair doesn't match LICENSE_BASIS_MAP (e.g. CC-BY-SA-4.0 ↔ cc_by_sa). Unknown licenses are rejected with a pointer to the map.
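
To illustrate the first fix in this list, the sketch below shows why a SHA-only upstream.commit keeps generated links well-formed. The GitHub URL is inferred from the HeOCR org and upstream repo name, the lowercase-hex pattern is an assumption, and the function name and file path are hypothetical.

```python
import re

# Assumed pattern: a full 40-character lowercase hex commit SHA.
SHA40_RE = re.compile(r"[0-9a-f]{40}")

# Inferred from the org and repo names; the real value lives in the release recipe.
UPSTREAM_REPO = "https://github.com/HeOCR/public-domain-hand-written-hebrew-scans"


def blob_url(commit, path):
    """Build a permalink to an upstream file, refusing non-SHA refs."""
    if not SHA40_RE.fullmatch(commit):
        raise ValueError(f"upstream.commit must be a 40-hex SHA, got {commit!r}")
    return f"{UPSTREAM_REPO}/blob/{commit}/{path}"


print(blob_url("0123456789abcdef0123456789abcdef01234567", "scans/example.png"))
```

A ref such as `release:v0.1.0-rc` fails the pattern and raises, so it can never reach a NOTICE.md link.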

Important fixes

  • extraction.tool_version accepts SemVer build metadata and git describe --tags --dirty output (v1.2.3-3-gabc1234).
  • New --upstream-path PATH flag cross-checks each entry's upstream.sha256 against the live upstream file record and verifies upstream.bbox fits inside the upstream scan dimensions. CI runs the validator with this flag against a fresh checkout of the upstream repo.
  • image.background = transparent requires PNG/WebP/TIFF (enforced by schema if/then). The background enum is expanded to include black, gray, binarized.
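
Two of the checks above, written as plain Python predicates for illustration. In the PR the background rule is enforced by a JSON Schema if/then and the bbox rule by the validator; here the bbox layout of [x, y, width, height] in pixels is an assumption, as are the function names.

```python
# MIME types that can carry an alpha channel, per the rule above.
ALPHA_CAPABLE = {"image/png", "image/webp", "image/tiff"}


def background_ok(background, mime_type):
    """A transparent background is only valid with an alpha-capable MIME type."""
    return background != "transparent" or mime_type in ALPHA_CAPABLE


def bbox_fits(bbox, scan_width, scan_height):
    """True if the crop rectangle lies fully inside the upstream scan."""
    x, y, width, height = bbox  # assumed layout: [x, y, width, height]
    return (
        x >= 0
        and y >= 0
        and width > 0
        and height > 0
        and x + width <= scan_width
        and y + height <= scan_height
    )


print(background_ok("transparent", "image/jpeg"))  # JPEG has no alpha channel
print(bbox_fits([10, 20, 100, 50], 110, 70))       # touches the edges, still inside
```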

Polish

  • Drop vestigial letter.codepoint regex (the cross-check against LETTER_TABLE is the source of truth).
  • Add optional letter.style enum for downstream syngen/HTR filtering by handwriting style.
  • --repo-root documented in AGENTS.md as a tests-only override.
  • Document Python 3.11+ floor in requirements-dev.txt and README.
  • Writer-disambiguation policy in AGENTS.md and docs/dataset_structure.md: <name>_<birth_year>, with fallbacks.
  • Validator error format standardised to <file>:<line>: <id>: <message>.
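
The standardised error format could be produced along these lines when walking a JSONL index. The helper name and the callback interface are hypothetical; only the `<file>:<line>: <id>: <message>` shape comes from this PR.

```python
import json


def iter_errors(file_name, lines, check):
    """Yield '<file>:<line>: <id>: <message>' for each problem check() reports."""
    for line_no, raw in enumerate(lines, start=1):  # 1-based, like editors
        if not raw.strip():
            continue  # skip blank lines but keep the line count accurate
        record = json.loads(raw)
        record_id = record.get("entry_id") or record.get("writer_id") or "?"
        for message in check(record):
            yield f"{file_name}:{line_no}: {record_id}: {message}"


# Usage with a hypothetical check that flags records lacking an image block.
lines = ['{"entry_id": "a__alef__v0001", "image": {}}',
         '{"entry_id": "b__bet__v0001"}']
for error in iter_errors("entries.jsonl", lines,
                         lambda r: [] if "image" in r else ["missing image"]):
    print(error)
```

The `file:line:` prefix follows the convention most editors and CI log viewers recognise, so a reported failure can be opened directly at the offending row.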

Infrastructure

  • .gitattributes (LFS + LF eol enforcement).
  • .editorconfig.
  • .github/pull_request_template.md with the pre-merge checklist.
  • CHANGELOG.md (Keep a Changelog, v0.0.0-rc entry).
  • docs/release_process.md documents the two-timestamp model and version-bump runbook.

Validation evidence

```text
$ python3 scripts/validate_indexes.py
ok: 0 writers, 0 entries, 0 files verified

$ python3 scripts/generate_release_artifacts.py --check
ok: release artefacts are up to date

$ python3 -m pytest
63 passed in 4.57s
```

The PR title is renamed from chore to feat per the same review (conventional-commits semantics — this PR creates the project surface, it isn't a chore).

🤖 Generated with Claude Code

@shaypal5 shaypal5 merged commit 45b4d53 into main May 12, 2026
1 check passed
@shaypal5 shaypal5 deleted the chore/initial-setup branch May 12, 2026 15:30