Packaging annotated datasets into final training data. The purpose is to have a repeatable process and well-packaged data, so that no data management is needed in the training step.
The packer consists of the following tools:
- oellm-package-data - Take source files and apply decontamination, PII-masking, and sampling. Each file in the source directory gets a corresponding file with the processed data. The tool is idempotent: if it fails, run it again and it will pick up where it left off. This is the main tool.
- oellm-package-merge - Run this after oellm-package-data. It reduces the number of files to simplify tokenization and training.
- oellm-collect-metrics - Collect and summarize metrics from a collection directory.
- oellm-propella-structure - Structure Propella data based on source data structure. For each record in the source files, if its ID exists in the Propella data, it is written to the output. This arranges Propella data in the same order as source data.
- oellm-propella-merge - If the Propella data is too big to fit in memory, oellm-propella-structure can run on individual Propella parquet files. Then use this tool to merge the results from all Propella parquet files.
Both tools read a file metadata.yaml containing metadata about the structure and processing of the data.
You must have uv installed; see the uv homepage.
Check out this repo and run:
uv sync --extra dev
uv run pre-commit install
Tip: run uv sync every time you pull from the repository.
The packager takes source files and applies decontamination, PII-masking, and sampling. Each file
in the source directory gets a corresponding file in output_dir with the processed data. The tool is idempotent:
if it fails, run it again and it will pick up where it left off.
To package the data in tests/resources/integration/non_partitioned run:
uv run oellm-package-data --input_dir tests/resources/integration/flat_release --output_dir tmp
The program checks whether each output file exists; if it does, that data is not regenerated.
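The skip-if-exists behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `package`, `process_file`, and the `.tmp` suffix are hypothetical names, and the uppercase transform merely stands in for decontamination, PII-masking, and sampling.

```python
import os
from pathlib import Path


def process_file(src: Path, dst: Path) -> None:
    """Hypothetical stand-in for the real processing step."""
    dst.write_text(src.read_text().upper())


def package(input_dir: Path, output_dir: Path) -> list[Path]:
    """Process every file under input_dir, skipping outputs that already exist.

    Returns the output files written during this call.
    """
    written = []
    for src in sorted(p for p in input_dir.rglob("*") if p.is_file()):
        dst = output_dir / src.relative_to(input_dir)
        if dst.exists():
            continue  # finished in an earlier run, nothing to redo
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first, so a crash never leaves a
        # half-written file that a rerun would mistake for finished work.
        tmp = dst.with_name(dst.name + ".tmp")
        process_file(src, tmp)
        os.replace(tmp, dst)
        written.append(dst)
    return written
```

Calling `package` a second time on the same directories writes nothing, which is what makes a rerun after a failure safe.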
It can also run via Slurm:
sbatch --array=0-10 ./package.sh input-dir output-dir
When using Slurm, data sharding is handled automatically across the task array.
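One common way to shard files across a Slurm task array is round-robin assignment by task index. The sketch below shows that idea under stated assumptions: the tool's real sharding scheme may differ, `shard` and `shard_from_env` are hypothetical names, and the array is assumed to start at 0 as in the examples here. `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables that Slurm sets for array jobs.

```python
import os


def shard(files: list[str], task_id: int, task_count: int) -> list[str]:
    """Return the files one array task should handle (round-robin)."""
    return [f for i, f in enumerate(sorted(files)) if i % task_count == task_id]


def shard_from_env(files: list[str]) -> list[str]:
    # Outside Slurm the defaults make this behave as a single task
    # that handles every file.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))
    return shard(files, task_id, task_count)
```

Because the file list is sorted before assignment, every task computes the same partition independently and no file is handled twice.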
Here is a full example running on Lumi:
sbatch --array=0-49 ./package.sh \
/scratch/project_465002530/training/collection/baby/nemotron-cc-opus-1.1 \
/scratch/project_465002530/training/collection/baby/nemotron-cc-opus-1.1/release_raw
The merger reduces the number of files but still keeps semantics in paths, such as language or quality.
The merger uses the metadata.yaml in the provided collection directory. As input it uses the release_raw subdirectory, and it
writes the merged files to the release subdirectory.
Run the merger after oellm-package-data.
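To illustrate how files can be merged while the path semantics survive, here is a rough sketch that only merges files sharing the same directory. The function name `plan_merge`, the `files_per_output` parameter, and the `merged-00000.jsonl` naming are all hypothetical; the real merger takes its layout from metadata.yaml.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_merge(files: list[str], files_per_output: int) -> dict[str, list[str]]:
    """Map each merged output path to the input files it would contain.

    Files are only merged with other files from the same directory, so path
    components such as language or quality survive in the output path.
    """
    by_dir = defaultdict(list)
    for f in sorted(files):
        by_dir[str(PurePosixPath(f).parent)].append(f)

    plan = {}
    for directory, members in by_dir.items():
        for n, start in enumerate(range(0, len(members), files_per_output)):
            out = str(PurePosixPath(directory) / f"merged-{n:05d}.jsonl")
            plan[out] = members[start:start + files_per_output]
    return plan
```

With paths relative to release_raw, three files under `en/high` and a batch size of two would become two merged files under `en/high`, and files from other directories are never mixed in.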
To run locally:
uv run oellm-package-merge --collection-dir ${COLLECTION_DIR} --workers 1
It can also run via Slurm:
sbatch --array=0-9 ./merge.sh \
/scratch/project_465002530/training/collection/baby/nemotron-cc-opus-1.1
The propella-structure tool filters source records based on IDs found in Propella parquet files. It reads
source files from collection_dir/source, looks up each record's ID in the propella directory, and
writes matching records to collection_dir/propella-4b. This is useful for structuring Propella data according to
source data.
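The ID lookup can be pictured as an in-memory index over the Propella records that is probed once per source record. The sketch below makes two assumptions: that each record carries its ID in a field called `id` (a hypothetical name), and that the Propella record is the one emitted for each match.

```python
from typing import Iterable, Iterator


def structure(source_records: Iterable[dict],
              propella_records: Iterable[dict]) -> Iterator[dict]:
    """Yield Propella records in the order their IDs appear in the source."""
    # The whole Propella side is held in memory, keyed by record ID.
    by_id = {rec["id"]: rec for rec in propella_records}
    for src in source_records:
        match = by_id.get(src["id"])
        if match is not None:
            yield match  # source IDs without a Propella record are skipped
```

Iterating the source side and probing the index is what puts the output in source order, regardless of how the Propella files were ordered.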
The Propella data is read into memory and requires about 50 times the size of the Propella parquet file. If
your Propella data does not fit into memory, run the tool once per parquet file and then merge the results with
oellm-propella-merge. When pointing to an individual parquet file, the output directory gets an extra
directory level named after the parquet file. This can be removed after running oellm-propella-merge.
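The 50x rule of thumb makes it easy to check up front whether a single run is feasible. The helper names below are hypothetical and the factor is only the approximation given above, so treat the result as a rough guide.

```python
MEMORY_FACTOR = 50  # rule of thumb: about 50x the parquet file size


def estimated_memory_bytes(parquet_size_bytes: int) -> int:
    """Approximate memory needed to hold Propella data of the given size."""
    return parquet_size_bytes * MEMORY_FACTOR


def fits_in_memory(parquet_sizes: list[int], available_bytes: int) -> bool:
    """True if loading all given parquet files at once stays within budget."""
    return estimated_memory_bytes(sum(parquet_sizes)) <= available_bytes
```

For example, a 2 GiB parquet file would need roughly 100 GiB, so a 2 GiB file and a 3 GiB file together would not fit on a node with 128 GiB and would be better processed one file per run.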
To run:
uv run oellm-propella-structure --collection-dir ${COLLECTION_DIR} --propella ${PROPELLA_DIR} --part ${part}
To run oellm-propella-merge:
uv run oellm-propella-merge --collection-dir ${COLLECTION_DIR} --part ${part}
Before checking in, run tests, linting, and formatting:
uv run pre-commit