diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 4cfa52e..6b5e473 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -29,7 +29,7 @@ jobs: - name: Check out upstream scans repo for cross-validation uses: actions/checkout@v4 with: - repository: HeOCR/public-domain-hand-written-hebrew-scans + repository: HeOCR/hash path: .upstream lfs: false - name: Install dev dependencies diff --git a/AGENTS.md b/AGENTS.md index 1cb9394..e7a3d11 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -11,7 +11,7 @@ policy. A dataset of **sets of per-letter images of handwritten Hebrew letters**, grouped by writer. Each set = one person/scribe. Each per-letter image is a **crop** of a permissively-licensed upstream scan from -[HeOCR/public-domain-hand-written-hebrew-scans][upstream], with rights +[HeOCR/hash][upstream], with rights inherited and recorded per image. Canonical layout, schema motivation, and ingestion model live in [`docs/dataset_structure.md`]\ (docs/dataset_structure.md). The Hebrew letter enumeration is in @@ -22,7 +22,7 @@ per-image rights inheritance) is described in [`schemas/entry.schema.json`](schemas/entry.schema.json). The release runbook is [`docs/release_process.md`](docs/release_process.md). -[upstream]: https://github.com/HeOCR/public-domain-hand-written-hebrew-scans +[upstream]: https://github.com/HeOCR/hash ## First-time setup @@ -72,7 +72,7 @@ against the live upstream entry records: ```bash python3 scripts/validate_indexes.py \ - --upstream-path ../public-domain-hand-written-hebrew-scans + --upstream-path ../hash ``` CI checks out the upstream repo as a sibling and runs the validator with diff --git a/CHANGELOG.md b/CHANGELOG.md index 64f9cc2..9ddc380 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -20,7 +20,7 @@ licensing policy needed to start ingesting. - Writer-level (`schemas/writer.schema.json`) and entry-level (`schemas/entry.schema.json`) record contracts. Each entry references - an upstream scan in `HeOCR/public-domain-hand-written-hebrew-scans` + an upstream scan in `HeOCR/hash` by `source_id`, `entry_id`, `sha256` (mutable-tag-free), `commit` (40-char SHA), and `bbox`. - `scripts/validate_indexes.py`: schema validation, referential diff --git a/CITATION.cff b/CITATION.cff index 2833cbd..2e93530 100644 --- a/CITATION.cff +++ b/CITATION.cff @@ -3,7 +3,7 @@ cff-version: 1.2.0 message: Please cite this dataset using the metadata below. type: dataset title: Hebrew Handwritten Per-Letter Image Dataset -abstract: Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/public-domain-hand-written-hebrew-scans, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL). Release 0.0.0-rc contains 48 per-letter image entries drawn from 2 verified writers (18 LicenseRef-Public-Domain-Israel, 30 PDM-1.0). +abstract: Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/hash, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL). Release 0.0.0-rc contains 48 per-letter image entries drawn from 2 verified writers (18 LicenseRef-Public-Domain-Israel, 30 PDM-1.0). authors: - name: Shay Palachy-Affek version: 0.0.0-rc diff --git a/LICENSE.md b/LICENSE.md index cb8e99e..82e5b3a 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -1,9 +1,9 @@ # Licensing Policy This repository is structured for compound licensing — the same model used -by [HeOCR/public-domain-hand-written-hebrew-scans][upstream]. +by [HeOCR/hash][upstream]. -[upstream]: https://github.com/HeOCR/public-domain-hand-written-hebrew-scans +[upstream]: https://github.com/HeOCR/hash ## Repository-authored metadata @@ -32,7 +32,7 @@ material is separately released under compatible terms. ## Per-letter image crops Per-letter image bytes are **derivatives** of upstream scans hosted in -[HeOCR/public-domain-hand-written-hebrew-scans][upstream]. They are not +[HeOCR/hash][upstream]. They are not automatically covered by the metadata license. Each crop carries its own entry-level rights record in `data/index/entries.jsonl`: diff --git a/NOTICE.md b/NOTICE.md index 586b31b..ae13fdf 100644 --- a/NOTICE.md +++ b/NOTICE.md @@ -4,7 +4,7 @@ This file is generated by `scripts/generate_release_artifacts.py` from `data/ind Repository-authored metadata is dedicated to the public domain under CC0 1.0 Universal. See [`LICENSE`](LICENSE) and [`LICENSE.md`](LICENSE.md) for the full compound-licensing policy. -Per-letter image crops are derivatives of upstream scans in [HeOCR/public-domain-hand-written-hebrew-scans](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans) and carry per-entry rights inherited from the source page. The entries listed below carry a license that requires attribution (currently CC-BY-4.0, CC-BY-SA-4.0). Anyone redistributing or reusing these crops must keep the listed credit and link to the source page on which the rights claim was verified. +Per-letter image crops are derivatives of upstream scans in [HeOCR/hash](https://github.com/HeOCR/hash) and carry per-entry rights inherited from the source page. The entries listed below carry a license that requires attribution (currently CC-BY-4.0, CC-BY-SA-4.0). Anyone redistributing or reusing these crops must keep the listed credit and link to the source page on which the rights claim was verified. - Corpus release: `0.0.0-rc` - Released at (corpus state): `2026-05-13T21:37:46Z` diff --git a/README.md b/README.md index 2d97347..2a4f709 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ cut from different scans by that writer. This repository is the downstream of: -- [HeOCR/public-domain-hand-written-hebrew-scans][upstream] — the +- [HeOCR/hash][upstream] (HASH — Hebrew Archive of Scanned Handwriting) — the canonical, permissively-licensed source of page-level scans. Every entry here cites its upstream scan. - [HeOCR/hletterscriptgen][gen] — the framework that turns page scans @@ -19,7 +19,7 @@ The intended downstream consumers are synthetic-document generators corpora they feed into ([HeOCR/HeOCRsynth][heocrsynth], [HeOCR/HeOCR][heocr]). -[upstream]: https://github.com/HeOCR/public-domain-hand-written-hebrew-scans +[upstream]: https://github.com/HeOCR/hash [gen]: https://github.com/HeOCR/hletterscriptgen [syngen]: https://github.com/HeOCR/hocrsyngen [heocrsynth]: https://github.com/HeOCR/HeOCRsynth diff --git a/datapackage.json b/datapackage.json index f31ecd0..4249d61 100644 --- a/datapackage.json +++ b/datapackage.json @@ -5,7 +5,7 @@ "title": "Shay Palachy-Affek" } ], - "description": "Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/public-domain-hand-written-hebrew-scans, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL).", + "description": "Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/hash, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL).", "homepage": "https://github.com/HeOCR/hletterscript", "keywords": [ "Hebrew", @@ -113,7 +113,7 @@ } }, "title": "Hebrew Handwritten Per-Letter Image Dataset", - "upstream_repo": "https://github.com/HeOCR/public-domain-hand-written-hebrew-scans", + "upstream_repo": "https://github.com/HeOCR/hash", "version": "0.0.0-rc", "version_released_date": "2026-05-12" } diff --git a/docs/dataset_structure.md b/docs/dataset_structure.md index 345fea5..3ffa8da 100644 --- a/docs/dataset_structure.md +++ b/docs/dataset_structure.md @@ -10,7 +10,7 @@ document or scan written by that writer. The corpus is the *downstream* product of two upstream things: -- [HeOCR/public-domain-hand-written-hebrew-scans] is the canonical source +- [HeOCR/hash] is the canonical source of page-level scans. Every per-letter image entry in this repo cites the upstream scan (`source_id`, `entry_id`, `sha256`) it was cut from. - [HeOCR/hletterscriptgen] is the framework that turns those page scans @@ -21,7 +21,7 @@ The intended downstream consumers are synthetic-document generators ([HeOCR/hocrsyngen]) and the synthetic / real Hebrew handwriting datasets they feed into ([HeOCR/HeOCRsynth], [HeOCR/HeOCR]). -[HeOCR/public-domain-hand-written-hebrew-scans]: https://github.com/HeOCR/public-domain-hand-written-hebrew-scans +[HeOCR/hash]: https://github.com/HeOCR/hash [HeOCR/hletterscriptgen]: https://github.com/HeOCR/hletterscriptgen [HeOCR/hocrsyngen]: https://github.com/HeOCR/hocrsyngen [HeOCR/HeOCRsynth]: https://github.com/HeOCR/HeOCRsynth @@ -155,7 +155,7 @@ operational form of this rule. Every per-letter image is a **crop / derivative** of an upstream scan whose rights have already been recorded in -`public-domain-hand-written-hebrew-scans/data/index/entries.jsonl`. +`hash/data/index/entries.jsonl`. Repository policy: - **Repository-authored metadata** in this repo is dedicated to the public diff --git a/schemas/entry.schema.json b/schemas/entry.schema.json index 004717d..99442c3 100644 --- a/schemas/entry.schema.json +++ b/schemas/entry.schema.json @@ -2,7 +2,7 @@ "$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "https://github.com/HeOCR/hletterscript/schemas/entry.schema.json", "title": "Handwritten Hebrew Per-Letter Image Entry", - "description": "One row per cropped per-letter image. Each entry is a derivative of a specific upstream scan in HeOCR/public-domain-hand-written-hebrew-scans.", + "description": "One row per cropped per-letter image. Each entry is a derivative of a specific upstream scan in HeOCR/hash.", "type": "object", "required": [ "entry_id", @@ -175,7 +175,7 @@ }, "upstream": { "type": "object", - "description": "Reference back to the scan in HeOCR/public-domain-hand-written-hebrew-scans this crop was extracted from. The upstream repository URL is recorded once in scripts/release_recipe.json (`upstream_repo`); it is not duplicated on every entry.", + "description": "Reference back to the scan in HeOCR/hash this crop was extracted from. The upstream repository URL is recorded once in scripts/release_recipe.json (`upstream_repo`); it is not duplicated on every entry.", "required": ["source_id", "entry_id", "sha256", "commit", "bbox"], "additionalProperties": false, "properties": { diff --git a/scripts/generate_release_artifacts.py b/scripts/generate_release_artifacts.py index a168d3f..29aa6f1 100644 --- a/scripts/generate_release_artifacts.py +++ b/scripts/generate_release_artifacts.py @@ -234,7 +234,7 @@ def _notice_stanza( for the full compound-licensing policy. Per-letter image crops are derivatives of upstream scans in \ -[HeOCR/public-domain-hand-written-hebrew-scans]({upstream_repo_url}) and \ +[HeOCR/hash]({upstream_repo_url}) and \ carry per-entry rights inherited from the source page. The entries \ listed below carry a license that requires attribution (currently \ {license_set}). Anyone redistributing or reusing these crops must keep \ diff --git a/scripts/release_recipe.json b/scripts/release_recipe.json index 6cf2eb1..f926e54 100644 --- a/scripts/release_recipe.json +++ b/scripts/release_recipe.json @@ -4,10 +4,10 @@ "version": "0.0.0-rc", "version_released_date": "2026-05-12", "initial_release_date": "2026-05-12T00:00:00Z", - "description": "Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/public-domain-hand-written-hebrew-scans, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL).", + "description": "Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/hash, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL).", "homepage": "https://github.com/HeOCR/hletterscript", "repository_code": "https://github.com/HeOCR/hletterscript", - "upstream_repo": "https://github.com/HeOCR/public-domain-hand-written-hebrew-scans", + "upstream_repo": "https://github.com/HeOCR/hash", "authors": [ {"name": "Shay Palachy-Affek", "role": "maintainer"} ], diff --git a/scripts/review_crops.py b/scripts/review_crops.py index 040e39c..4e42e98 100644 --- a/scripts/review_crops.py +++ b/scripts/review_crops.py @@ -113,7 +113,7 @@ def _build_html(entries: list[dict], upstream_root: Path | None) -> str:

Upstream scan: {upstream_entry_id}

Upstream scan not found locally. Run with - --upstream-path /path/to/public-domain-hand-written-hebrew-scans + --upstream-path /path/to/hash to display it.

""" @@ -391,7 +391,7 @@ def do_POST(self): def main() -> None: ap = argparse.ArgumentParser(description="Serve a crop-review page locally.") ap.add_argument("--upstream-path", metavar="PATH", - help="Path to a clone of HeOCR/public-domain-hand-written-hebrew-scans") + help="Path to a clone of HeOCR/hash") ap.add_argument("--output", metavar="FILE", help="Write the HTML to this file instead of serving it") ap.add_argument("--port", type=int, default=8765, diff --git a/scripts/validate_indexes.py b/scripts/validate_indexes.py index ef001d6..18bbbd7 100644 --- a/scripts/validate_indexes.py +++ b/scripts/validate_indexes.py @@ -392,7 +392,7 @@ def _load_upstream_entries(upstream_root: Path) -> dict[str, dict[str, Any]]: raise SystemExit(_err( upstream_entries_path, None, None, "upstream entries.jsonl not found; --upstream-path must point at a clone of " - "public-domain-hand-written-hebrew-scans", + "hash", )) by_id: dict[str, dict[str, Any]] = {} with upstream_entries_path.open("r", encoding="utf-8") as handle: @@ -493,7 +493,7 @@ def main() -> None: type=Path, default=None, help=( - "Path to a local clone of HeOCR/public-domain-hand-written-hebrew-scans. " + "Path to a local clone of HeOCR/hash. " "When set, the validator additionally cross-checks each entry's " "upstream.sha256 against the upstream file record and verifies " "upstream.bbox fits inside the upstream scan dimensions."