Model Hub

Browse PQC-verified AI models, datasets, and tools

jobs-git/Zyda-2 HF Unverified

Zyda-2 Zyda-2 is a 5 trillion token language modeling dataset created by collecting open and high quality datasets and combining them and cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/Zyda-2.

Task_categories:text-GenerationLanguage:enSize_categories:n>1T
jobs-git/HPLT2.0_cleaned HF Unverified

This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly Internet Archive with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0 The Cleaned variant of HPLT Datasets v2.0 This is the cleaned variant of the HPLT Datasets v2.0 converted to the Parquet format semi-automatically when being uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.

Task_categories:fill-MaskTask_categories:text-GenerationTask_ids:language-ModelingMultilinguality:multilingualLanguage:aceLanguage:af
Showing 2 of 2 items (page 1 of 1)
Prev Next