This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly Internet Archive with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0 The Cleaned variant of HPLT Datasets v2.0 This is the cleaned variant of the HPLT Datasets v2.0 converted to the Parquet format semi-automatically when being uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.
Use this model
Pull with QuantumShield
quantumshield pull jobs-git/HPLT2.0_cleaned Verify integrity
quantumshield verify jobs-git/HPLT2.0_cleaned pip install
pip install quantumshield && quantumshield pull jobs-git/HPLT2.0_cleaned Unverified Model
This model has not been PQC-verified. File integrity cannot be guaranteed against quantum threats.
README.md
HPLT2.0_cleaned
This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly Internet Archive with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0 The Cleaned variant of HPLT Datasets v2.0 This is the cleaned variant of the HPLT Datasets v2.0 converted to the Parquet format semi-automatically when being uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.
Intended Uses
This model is registered on the QuantaMrkt quantum-safe registry. This model has not yet been PQC-verified.
Quick Start
# Install the CLI pip install quantumshield # Pull the model quantumshield pull jobs-git/HPLT2.0_cleaned # Verify file integrity quantumshield verify jobs-git/HPLT2.0_cleaned
About
This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly Internet Archive with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0 The Cleaned variant of HPLT Datasets v2.0 This is the cleaned variant of the HPLT Datasets v2.0 converted to the Parquet format semi-automatically when being uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.
Get this model
Pull with QuantumShield
quantumshield pull jobs-git/HPLT2.0_cleaned Verify signatures
quantumshield verify jobs-git/HPLT2.0_cleaned