A simple, agent-friendly dataset of scanned handwritten Hebrew-script documents — notes, letters, notebook pages, drafts, forms, and similar material — from the 18th century onward. The focus is everyday cursive Hebrew handwriting (כתב יד, not דפוס); Yiddish documents written in the same Hebrew round script are also in scope.
Created by Shay Palachy Affek.
The target corpus is limited to scans that can be redistributed and transformed for downstream uses, including substantial remixing and machine-learning datasets. The index therefore keeps rights evidence at both source and scan level instead of assuming that a collection-level label applies to every page.
These samples are drawn from scan records whose rights are recorded as public
domain, Public Domain Mark, or Israel public-domain terms in
data/index/entries.jsonl. Always consult the
per-scan record before redistributing or transforming a specific image.
| Field | Current value |
|---|---|
| Ingested scans | 198 |
| Verified sources | 48 |
| Candidate leads still under research | 15 |
| Provenance-only rejected source records | 46 |
| Corpus size on disk | ~283.00 MiB |
| Repository-authored metadata license | CC0 1.0 |
| Per-scan rights policy | LICENSE.md |
| Canonical scan index | data/index/entries.jsonl |
- `docs/sources/` contains raw research notes and source leads.
- `docs/dataset_structure.md` defines the repository layout and ingestion model.
- `data/index/sources.jsonl` is the source-level catalog, one JSON object per institution, collection, item, dataset, or source lead.
- `data/index/entries.jsonl` is the scan-level catalog, one JSON object per individual page, note, letter, or other scanned unit.
- `schemas/source.schema.json` and `schemas/entry.schema.json` define the machine-readable record contracts.
- `scripts/validate_indexes.py` validates JSONL records against the schemas and checks source/entry referential integrity.
- `LICENSE.md` documents the compound licensing policy for metadata and scans.
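To illustrate what a source/entry referential-integrity check involves, here is a minimal sketch. It is not the logic of `scripts/validate_indexes.py`, which is not shown here. The field names `id` and `source_id` are hypothetical placeholders; the real names are defined by the schemas under `schemas/`.

```python
def find_dangling_entries(sources, entries, source_key="id", ref_key="source_id"):
    """Return the entries whose source reference matches no known source.

    `source_key` and `ref_key` are hypothetical field names used only for
    illustration; check the JSON schemas for the real ones.
    """
    known_ids = {source[source_key] for source in sources}
    # An entry with a missing reference is treated as dangling too.
    return [entry for entry in entries if entry.get(ref_key) not in known_ids]
```

An empty result means every entry points at a record present in the source catalog.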
The canonical editable indexes are newline-delimited JSON (.jsonl).
JSONL is deliberately used instead of CSV because these records need nested rights evidence, multiple URLs, per-field provenance, quality measurements, and acquisition state. The source of truth stays line-oriented, diffable, streamable JSON.
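Because each line is a complete JSON object, the indexes can be read one record at a time without loading the whole file. A minimal sketch using only the standard library follows; the helper name `iter_jsonl` is illustrative and not part of the repository.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines such as a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
```

For example, `sum(1 for _ in iter_jsonl("data/index/entries.jsonl"))` counts the scan records while holding only one in memory at a time.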
For spreadsheet / pandas / data-engineering workflows the repo ships flat
derived views of the indexes under exports/:
- `exports/entries.csv`: one row per scan, with `files[role=="original"]` flattened into `file_*` columns and `creator_count` / `file_count` summary columns.
- `exports/sources.csv`: one row per source record.
- `exports/creators.csv`: one row per `(entry_id, creator)` pair (use this when you need creator names, death years, or authority URLs without doing positional gymnastics inside a single cell).
- `dist/entries.parquet`: same shape as `entries.csv` with preserved types (nullable booleans, int64 file sizes). Produced under `dist/` as an uncommitted build artefact.
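Since `exports/creators.csv` holds one row per `(entry_id, creator)` pair, the creators of each scan can be collected by grouping on the entry identifier. The sketch below uses only the standard library. It assumes the identifier column is literally named `entry_id`, which is inferred from the description above rather than confirmed; the function names are illustrative.

```python
import csv
from collections import defaultdict
from typing import Iterable


def creators_by_entry(rows: Iterable[dict], key: str = "entry_id") -> dict:
    """Group creator rows by entry identifier (column name is an assumption)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


def load_creators(path: str = "exports/creators.csv") -> dict:
    """Read the creators export and return {entry_id: [row, ...]}."""
    with open(path, newline="", encoding="utf-8") as handle:
        return creators_by_entry(csv.DictReader(handle))
```

Each value is the list of full CSV rows for that entry, so whatever creator columns the export carries remain available without parsing a packed cell.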
Regenerate the exports and rebuild the release tarball with:

    python3 -m pip install -r requirements-dev.txt
    python3 scripts/validate_indexes.py
    python3 -m pytest
    make exports
    make release

The corpus currently contains 198 ingested scans drawn from 48 verified sources, totalling ~283.00 MiB on disk. The source-level index also tracks 15 candidate leads still being researched and 46 source records kept for provenance after being rejected as out of scope.
License breakdown across the 198 entries:
- 144 `LicenseRef-Public-Domain-Israel` (Public Domain (Israel; life + 70))
- 44 `PDM-1.0` (Public Domain Mark 1.0)
- 9 `CC-BY-SA-3.0` (Creative Commons Attribution-ShareAlike 3.0 Unported)
- 1 `LicenseRef-Public-Domain-Ukraine` (Public Domain (Ukraine; life + 70))
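A breakdown like the one above can be recomputed from the scan-level records with a simple tally. The sketch below assumes each record exposes its license identifier under a top-level `license` key. That key is a hypothetical placeholder: the real records carry nested rights evidence, so the accessor would need adjusting to the entry schema.

```python
from collections import Counter
from typing import Iterable


def license_breakdown(entries: Iterable[dict], license_key: str = "license") -> Counter:
    """Count entries per license identifier.

    `license_key` is a hypothetical field name; records without it are
    tallied under "UNKNOWN" so they stay visible instead of being dropped.
    """
    return Counter(entry.get(license_key, "UNKNOWN") for entry in entries)
```

Calling `license_breakdown(records).most_common()` lists the identifiers from most to least frequent, the same ordering used in the list above.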
The repository uses a compound licensing model: repository-authored metadata
is dedicated to the public domain under CC0 1.0 (see LICENSE),
while per-scan rights are recorded individually in each entry. See
LICENSE.md for the full policy, including the CC BY-SA
ShareAlike caveat and the rules for remix-friendly release bundles.
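When assembling a bundle, it can help to separate the CC BY-SA scans from the rest before deciding how to release them. The following is a rough sketch only: `LICENSE.md` remains the authoritative policy, and the top-level `license` key is a hypothetical field name, not confirmed by the entry schema.

```python
from typing import Iterable, List, Tuple

# License identifiers that carry a ShareAlike condition in this corpus.
SHARE_ALIKE_IDS = {"CC-BY-SA-3.0"}


def split_by_share_alike(
    entries: Iterable[dict], license_key: str = "license"
) -> Tuple[List[dict], List[dict]]:
    """Return (other_entries, share_alike_entries).

    `license_key` is a hypothetical field name used for illustration.
    """
    other, share_alike = [], []
    for entry in entries:
        if entry.get(license_key) in SHARE_ALIKE_IDS:
            share_alike.append(entry)
        else:
            other.append(entry)
    return other, share_alike
```

Keeping the two groups apart makes it easier to apply the ShareAlike rules to only the scans they cover.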
- `data/index/entries.jsonl` is the source of truth for the scan-level corpus: one JSON object per scan, with rights evidence, file checksums, and provenance.
- `data/index/sources.jsonl` catalogs the upstream sources, including candidate leads and rejected records.
- `schemas/entry.schema.json` and `schemas/source.schema.json` define the record contracts; `scripts/validate_indexes.py` enforces them in CI.
- Contributors adding new scans should start with `AGENTS.md` for ingest rules, scope, and the pre-PR checklist.
