hletterscript

Created by Shay Palachy Affek.

Per-writer sets of cropped handwritten Hebrew letter images, extracted from rights-clean page scans for synthetic handwriting generation and OCR/HTR research.

Sample grid of handwritten Hebrew letter crops

At a Glance

| Field | Value |
| --- | --- |
| Corpus shape | Writer-grouped Hebrew letter crops |
| Current seed corpus | 48 crops from 2 verified writers |
| Covered letter forms | 25 Hebrew letter-form slugs |
| Current writers | Chaim Nachman Bialik, Rachel Bluwstein |
| Canonical crop index | data/index/entries.jsonl |
| Canonical writer index | data/index/writers.jsonl |
| Image storage | data/letters/<writer_id>/<letter_name>/ via Git LFS |
| Metadata license | CC0 1.0 |
| Per-image rights | Inherited per crop from the upstream scan |
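
The image storage layout listed above can be walked directly to see how many crops each writer has per letter form. The following is a minimal sketch, not a script shipped with this repository: the function name `count_crops` is hypothetical, and it assumes only that every file sits exactly at `<writer_id>/<letter_name>/<file>` below the root it is given.

```python
from collections import Counter
from pathlib import Path


def count_crops(letters_root):
    """Count files per (writer_id, letter_name) below a letters root.

    Assumes the layout <letters_root>/<writer_id>/<letter_name>/<file>.
    """
    root = Path(letters_root)
    counts = Counter()
    # The pattern matches entries exactly three levels below the root.
    for path in root.glob("*/*/*"):
        if path.is_file():
            writer_id, letter_name = path.relative_to(root).parts[:2]
            counts[(writer_id, letter_name)] += 1
    return counts
```

Calling `count_crops("data/letters")` from a checkout returns a `Counter` keyed by `(writer_id, letter_name)` pairs. If Git LFS objects have not been pulled, the pointer files are still counted, since the sketch looks only at file presence.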

What This Repository Contains

hletterscript is a data repository for individual Hebrew letter crops, grouped by the person who wrote them. Each entry links the crop back to the source scan, bounding box, extraction method, file checksum, quality flags, and inherited rights statement.
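
Because each entry records a file checksum, a consumer can confirm that a crop on disk matches its index entry. The sketch below is illustrative only: the field names `path` and `sha256`, and the choice of SHA-256 as the digest, are assumptions made for the example and are not confirmed by this README. Check the repository schema for the real field names and algorithm.

```python
import hashlib
import json
from pathlib import Path


def load_jsonl(path):
    """Parse one JSON object per non-empty line of a JSONL file."""
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def checksum_matches(entry, repo_root, path_key="path", checksum_key="sha256"):
    """Compare a crop file's digest with the value stored in its entry.

    path_key and checksum_key are hypothetical field names.
    """
    crop_path = Path(repo_root) / entry[path_key]
    return sha256_of(crop_path) == entry[checksum_key]
```

With the real field names substituted, `load_jsonl("data/index/entries.jsonl")` followed by `checksum_matches(entry, ".")` for each entry gives a quick integrity pass over a local checkout.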

The corpus is intentionally small and strict at this stage: it is a validated seed dataset for the surrounding HeOCR tooling, not yet a complete Hebrew handwriting corpus.

Pipeline Position

flowchart LR
    HASH["HeOCR/hash<br/>rights-clean page scans"] --> GEN["HeOCR/hletterscriptgen<br/>crop extraction"]
    GEN --> THIS["HeOCR/hletterscript<br/>per-writer letter sets"]
    THIS --> SYN["HeOCR/hocrsyngen<br/>synthetic documents"]
    SYN --> OCR["HeOCR / HeOCRsynth<br/>OCR and HTR corpora"]

Related projects:

Dataset Layout

Licensing Model

This repository uses a compound licensing model:

  • Repository-authored metadata, schemas, scripts, docs, and generated metadata exports are dedicated to the public domain under CC0 1.0.
  • Each crop carries its own inherited rights block from the upstream scan. Current seed crops are public-domain compatible, but consumers should read the per-entry rights metadata rather than assume a uniform image license.
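
One way to act on the advice to read per-entry rights metadata is to tally the distinct rights blocks before using the images. This is a sketch under stated assumptions: the key name `rights` is hypothetical, and the function simply groups entries by whatever value is stored under that key.

```python
import json
from collections import Counter


def tally_rights(jsonl_lines, rights_key="rights"):
    """Count how many entries share each distinct rights block.

    rights_key is a hypothetical field name; entries without it are
    grouped under the JSON value null.
    """
    tally = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        # Serialize with sorted keys so equal blocks compare equal
        # regardless of key order in the source file.
        block = json.dumps(entry.get(rights_key), sort_keys=True, ensure_ascii=False)
        tally[block] += 1
    return tally
```

Passing an open handle on `data/index/entries.jsonl` works, since a text file iterates line by line. More than one key in the result means the crops do not share a single rights statement and need to be handled per entry.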

See LICENSE for the repository metadata license and LICENSE.md for the full per-image rights policy.

Requirements

  • Python 3.11 or newer. CI currently validates 3.11, 3.12, and 3.13.
  • Git LFS for the image bytes under data/letters/**.

After cloning:

git lfs install
git lfs pull
python3 -m pip install -r requirements-dev.txt
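
If `git lfs pull` was skipped, the files under data/letters/ are small text pointers rather than images. Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`, so a quick check is possible. The helper names below are hypothetical and this is only a sketch, not part of the repository's scripts.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unfetched_images(letters_root):
    """List files below letters_root that are still LFS pointers."""
    return sorted(
        p for p in Path(letters_root).rglob("*")
        if p.is_file() and is_lfs_pointer(p)
    )
```

An empty list from `unfetched_images("data/letters")` suggests the image bytes are present locally.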

Validate Locally

python3 scripts/validate_indexes.py
python3 scripts/generate_release_artifacts.py --check
python3 -m pytest

For the full CI-style upstream cross-check, place a checkout of HeOCR/hash at .upstream and run:

python3 scripts/validate_indexes.py --upstream-path .upstream

Current Status

v0.0.0-rc is a validated seed corpus with 48 indexed letter crops from 2 verified writers. The repository has schema validation, deterministic release-artifact generation, CI, and licensing policy in place.

Future ingestion work will expand writer coverage and fill in missing Hebrew letter forms, using crops produced by HeOCR/hletterscriptgen from the upstream HASH scans.

Contributing

Contributors adding or reviewing crop entries should start with AGENTS.md. It captures ingest rules, naming conventions, rights constraints, and the pre-PR checklist for this data repository.

Credits

Created by Shay Palachy Affek.
