2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -29,7 +29,7 @@ jobs:
 - name: Check out upstream scans repo for cross-validation
 uses: actions/checkout@v4
 with:
-repository: HeOCR/public-domain-hand-written-hebrew-scans
+repository: HeOCR/hash
 path: .upstream
 lfs: false
 - name: Install dev dependencies
6 changes: 3 additions & 3 deletions AGENTS.md
@@ -11,7 +11,7 @@ policy.
 A dataset of **sets of per-letter images of handwritten Hebrew letters**,
 grouped by writer. Each set = one person/scribe. Each per-letter image
 is a **crop** of a permissively-licensed upstream scan from
-[HeOCR/public-domain-hand-written-hebrew-scans][upstream], with rights
+[HeOCR/hash][upstream], with rights
 inherited and recorded per image. Canonical layout, schema motivation,
 and ingestion model live in [`docs/dataset_structure.md`]\
 (docs/dataset_structure.md). The Hebrew letter enumeration is in
@@ -22,7 +22,7 @@ per-image rights inheritance) is described in
 [`schemas/entry.schema.json`](schemas/entry.schema.json). The release
 runbook is [`docs/release_process.md`](docs/release_process.md).

-[upstream]: https://github.com/HeOCR/public-domain-hand-written-hebrew-scans
+[upstream]: https://github.com/HeOCR/hash

 ## First-time setup

@@ -72,7 +72,7 @@ against the live upstream entry records:

 ```bash
 python3 scripts/validate_indexes.py \
---upstream-path ../public-domain-hand-written-hebrew-scans
+--upstream-path ../hash
 ```

 CI checks out the upstream repo as a sibling and runs the validator with
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -20,7 +20,7 @@ licensing policy needed to start ingesting.

 - Writer-level (`schemas/writer.schema.json`) and entry-level
 (`schemas/entry.schema.json`) record contracts. Each entry references
-an upstream scan in `HeOCR/public-domain-hand-written-hebrew-scans`
+an upstream scan in `HeOCR/hash`
 by `source_id`, `entry_id`, `sha256` (mutable-tag-free), `commit`
 (40-char SHA), and `bbox`.
 - `scripts/validate_indexes.py`: schema validation, referential
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -3,7 +3,7 @@ cff-version: 1.2.0
 message: Please cite this dataset using the metadata below.
 type: dataset
 title: Hebrew Handwritten Per-Letter Image Dataset
-abstract: Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/public-domain-hand-written-hebrew-scans, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL). Release 0.0.0-rc contains 48 per-letter image entries drawn from 2 verified writers (18 LicenseRef-Public-Domain-Israel, 30 PDM-1.0).
+abstract: Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/hash, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL). Release 0.0.0-rc contains 48 per-letter image entries drawn from 2 verified writers (18 LicenseRef-Public-Domain-Israel, 30 PDM-1.0).
 authors:
 - name: Shay Palachy-Affek
 version: 0.0.0-rc
6 changes: 3 additions & 3 deletions LICENSE.md
@@ -1,9 +1,9 @@
 # Licensing Policy

 This repository is structured for compound licensing — the same model used
-by [HeOCR/public-domain-hand-written-hebrew-scans][upstream].
+by [HeOCR/hash][upstream].

-[upstream]: https://github.com/HeOCR/public-domain-hand-written-hebrew-scans
+[upstream]: https://github.com/HeOCR/hash

 ## Repository-authored metadata

@@ -32,7 +32,7 @@ material is separately released under compatible terms.
 ## Per-letter image crops

 Per-letter image bytes are **derivatives** of upstream scans hosted in
-[HeOCR/public-domain-hand-written-hebrew-scans][upstream]. They are not
+[HeOCR/hash][upstream]. They are not
 automatically covered by the metadata license. Each crop carries its own
 entry-level rights record in `data/index/entries.jsonl`:

2 changes: 1 addition & 1 deletion NOTICE.md
@@ -4,7 +4,7 @@ This file is generated by `scripts/generate_release_artifacts.py` from `data/ind

 Repository-authored metadata is dedicated to the public domain under CC0 1.0 Universal. See [`LICENSE`](LICENSE) and [`LICENSE.md`](LICENSE.md) for the full compound-licensing policy.

-Per-letter image crops are derivatives of upstream scans in [HeOCR/public-domain-hand-written-hebrew-scans](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans) and carry per-entry rights inherited from the source page. The entries listed below carry a license that requires attribution (currently CC-BY-4.0, CC-BY-SA-4.0). Anyone redistributing or reusing these crops must keep the listed credit and link to the source page on which the rights claim was verified.
+Per-letter image crops are derivatives of upstream scans in [HeOCR/hash](https://github.com/HeOCR/hash) and carry per-entry rights inherited from the source page. The entries listed below carry a license that requires attribution (currently CC-BY-4.0, CC-BY-SA-4.0). Anyone redistributing or reusing these crops must keep the listed credit and link to the source page on which the rights claim was verified.

 - Corpus release: `0.0.0-rc`
 - Released at (corpus state): `2026-05-13T21:37:46Z`
4 changes: 2 additions & 2 deletions README.md
@@ -7,7 +7,7 @@ cut from different scans by that writer.

 This repository is the downstream of:

-- [HeOCR/public-domain-hand-written-hebrew-scans][upstream] — the
+- [HeOCR/hash][upstream] (HASH — Hebrew Archive of Scanned Handwriting) — the
 canonical, permissively-licensed source of page-level scans. Every
 entry here cites its upstream scan.
 - [HeOCR/hletterscriptgen][gen] — the framework that turns page scans
@@ -19,7 +19,7 @@ The intended downstream consumers are synthetic-document generators
 corpora they feed into ([HeOCR/HeOCRsynth][heocrsynth],
 [HeOCR/HeOCR][heocr]).

-[upstream]: https://github.com/HeOCR/public-domain-hand-written-hebrew-scans
+[upstream]: https://github.com/HeOCR/hash
 [gen]: https://github.com/HeOCR/hletterscriptgen
 [syngen]: https://github.com/HeOCR/hocrsyngen
 [heocrsynth]: https://github.com/HeOCR/HeOCRsynth
4 changes: 2 additions & 2 deletions datapackage.json
@@ -5,7 +5,7 @@
 "title": "Shay Palachy-Affek"
 }
 ],
-"description": "Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/public-domain-hand-written-hebrew-scans, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL).",
+"description": "Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/hash, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL).",
 "homepage": "https://github.com/HeOCR/hletterscript",
 "keywords": [
 "Hebrew",
@@ -113,7 +113,7 @@
 }
 },
 "title": "Hebrew Handwritten Per-Letter Image Dataset",
-"upstream_repo": "https://github.com/HeOCR/public-domain-hand-written-hebrew-scans",
+"upstream_repo": "https://github.com/HeOCR/hash",
 "version": "0.0.0-rc",
 "version_released_date": "2026-05-12"
 }
6 changes: 3 additions & 3 deletions docs/dataset_structure.md
@@ -10,7 +10,7 @@ document or scan written by that writer.

 The corpus is the *downstream* product of two upstream things:

-- [HeOCR/public-domain-hand-written-hebrew-scans] is the canonical source
+- [HeOCR/hash] is the canonical source
 of page-level scans. Every per-letter image entry in this repo cites
 the upstream scan (`source_id`, `entry_id`, `sha256`) it was cut from.
 - [HeOCR/hletterscriptgen] is the framework that turns those page scans
@@ -21,7 +21,7 @@ The intended downstream consumers are synthetic-document generators
 ([HeOCR/hocrsyngen]) and the synthetic / real Hebrew handwriting datasets
 they feed into ([HeOCR/HeOCRsynth], [HeOCR/HeOCR]).

-[HeOCR/public-domain-hand-written-hebrew-scans]: https://github.com/HeOCR/public-domain-hand-written-hebrew-scans
+[HeOCR/hash]: https://github.com/HeOCR/hash
 [HeOCR/hletterscriptgen]: https://github.com/HeOCR/hletterscriptgen
 [HeOCR/hocrsyngen]: https://github.com/HeOCR/hocrsyngen
 [HeOCR/HeOCRsynth]: https://github.com/HeOCR/HeOCRsynth
@@ -155,7 +155,7 @@ operational form of this rule.

 Every per-letter image is a **crop / derivative** of an upstream scan whose
 rights have already been recorded in
-`public-domain-hand-written-hebrew-scans/data/index/entries.jsonl`.
+`hash/data/index/entries.jsonl`.
 Repository policy:

 - **Repository-authored metadata** in this repo is dedicated to the public
4 changes: 2 additions & 2 deletions schemas/entry.schema.json
@@ -2,7 +2,7 @@
 "$schema": "https://json-schema.org/draft/2020-12/schema",
 "$id": "https://github.com/HeOCR/hletterscript/schemas/entry.schema.json",
 "title": "Handwritten Hebrew Per-Letter Image Entry",
-"description": "One row per cropped per-letter image. Each entry is a derivative of a specific upstream scan in HeOCR/public-domain-hand-written-hebrew-scans.",
+"description": "One row per cropped per-letter image. Each entry is a derivative of a specific upstream scan in HeOCR/hash.",
 "type": "object",
 "required": [
 "entry_id",
@@ -175,7 +175,7 @@
 },
 "upstream": {
 "type": "object",
-"description": "Reference back to the scan in HeOCR/public-domain-hand-written-hebrew-scans this crop was extracted from. The upstream repository URL is recorded once in scripts/release_recipe.json (`upstream_repo`); it is not duplicated on every entry.",
+"description": "Reference back to the scan in HeOCR/hash this crop was extracted from. The upstream repository URL is recorded once in scripts/release_recipe.json (`upstream_repo`); it is not duplicated on every entry.",
 "required": ["source_id", "entry_id", "sha256", "commit", "bbox"],
 "additionalProperties": false,
 "properties": {
2 changes: 1 addition & 1 deletion scripts/generate_release_artifacts.py
@@ -234,7 +234,7 @@ def _notice_stanza(
 for the full compound-licensing policy.

 Per-letter image crops are derivatives of upstream scans in \
-[HeOCR/public-domain-hand-written-hebrew-scans]({upstream_repo_url}) and \
+[HeOCR/hash]({upstream_repo_url}) and \
 carry per-entry rights inherited from the source page. The entries \
 listed below carry a license that requires attribution (currently \
 {license_set}). Anyone redistributing or reusing these crops must keep \
4 changes: 2 additions & 2 deletions scripts/release_recipe.json
@@ -4,10 +4,10 @@
 "version": "0.0.0-rc",
 "version_released_date": "2026-05-12",
 "initial_release_date": "2026-05-12T00:00:00Z",
-"description": "Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/public-domain-hand-written-hebrew-scans, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL).",
+"description": "Per-letter image crops of handwritten Hebrew letters, grouped into sets by writer. Each crop is a derivative of a permissively-licensed upstream scan in HeOCR/hash, with per-image rights inherited and attribution recorded. The index is line-oriented JSON (JSONL).",
 "homepage": "https://github.com/HeOCR/hletterscript",
 "repository_code": "https://github.com/HeOCR/hletterscript",
-"upstream_repo": "https://github.com/HeOCR/public-domain-hand-written-hebrew-scans",
+"upstream_repo": "https://github.com/HeOCR/hash",
 "authors": [
 {"name": "Shay Palachy-Affek", "role": "maintainer"}
 ],
4 changes: 2 additions & 2 deletions scripts/review_crops.py
@@ -113,7 +113,7 @@ def _build_html(entries: list[dict], upstream_root: Path | None) -> str:
 <div class="scan-block">
 <h2>Upstream scan: <code>{upstream_entry_id}</code></h2>
 <p class="warn">Upstream scan not found locally. Run with
-<code>--upstream-path /path/to/public-domain-hand-written-hebrew-scans</code>
+<code>--upstream-path /path/to/hash</code>
 to display it.</p>
 </div>
 """
@@ -391,7 +391,7 @@ def do_POST(self):
 def main() -> None:
 ap = argparse.ArgumentParser(description="Serve a crop-review page locally.")
 ap.add_argument("--upstream-path", metavar="PATH",
-help="Path to a clone of HeOCR/public-domain-hand-written-hebrew-scans")
+help="Path to a clone of HeOCR/hash")
 ap.add_argument("--output", metavar="FILE",
 help="Write the HTML to this file instead of serving it")
 ap.add_argument("--port", type=int, default=8765,
4 changes: 2 additions & 2 deletions scripts/validate_indexes.py
@@ -392,7 +392,7 @@ def _load_upstream_entries(upstream_root: Path) -> dict[str, dict[str, Any]]:
 raise SystemExit(_err(
 upstream_entries_path, None, None,
 "upstream entries.jsonl not found; --upstream-path must point at a clone of "
-"public-domain-hand-written-hebrew-scans",
+"hash",
 ))
 by_id: dict[str, dict[str, Any]] = {}
 with upstream_entries_path.open("r", encoding="utf-8") as handle:
@@ -493,7 +493,7 @@ def main() -> None:
 type=Path,
 default=None,
 help=(
-"Path to a local clone of HeOCR/public-domain-hand-written-hebrew-scans. "
+"Path to a local clone of HeOCR/hash. "
 "When set, the validator additionally cross-checks each entry's "
 "upstream.sha256 against the upstream file record and verifies "
 "upstream.bbox fits inside the upstream scan dimensions."
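
Reviewer note: for readers unfamiliar with the validator, the cross-check that the `--upstream-path` help text describes can be pictured roughly as below. This is a minimal sketch, not the code in `scripts/validate_indexes.py`. The function name `cross_check_entry`, the upstream record keys `sha256`, `width`, and `height`, and the `[x, y, width, height]` layout of `bbox` are all assumptions made for illustration; only the `upstream.entry_id`, `upstream.sha256`, and `upstream.bbox` field names and the id-keyed `by_id` mapping appear in this diff.

```python
from __future__ import annotations

from typing import Any


def cross_check_entry(
    entry: dict[str, Any],
    upstream_by_id: dict[str, dict[str, Any]],
) -> list[str]:
    """Return a list of problems for one entry; an empty list means it passed."""
    ref = entry["upstream"]
    record = upstream_by_id.get(ref["entry_id"])
    if record is None:
        return [f"upstream entry {ref['entry_id']!r} not found"]

    problems: list[str] = []

    # The recorded hash must match the upstream file record (key name assumed).
    if ref["sha256"] != record["sha256"]:
        problems.append("upstream.sha256 does not match the upstream record")

    # Assumed bbox layout: [x, y, width, height] in pixels.
    x, y, w, h = ref["bbox"]
    inside = (
        x >= 0 and y >= 0 and w > 0 and h > 0
        and x + w <= record["width"]      # hypothetical upstream key
        and y + h <= record["height"]     # hypothetical upstream key
    )
    if not inside:
        problems.append("upstream.bbox falls outside the upstream scan")

    return problems
```

A caller would build `upstream_by_id` from the upstream `entries.jsonl` (as `_load_upstream_entries` does, per its signature in this diff) and run the check once per local entry, treating any non-empty result as a validation failure.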