Datasets

Training datasets with quantum-safe provenance

LAION/LAION-Aesthetics V2 5+ HF PQC Verified

Subset of LAION-5B filtered for aesthetic quality. 600M image-text pairs scored >5.0 by an aesthetic predictor. Standard for image generation training.

Dataset | Image-Text | 600M pairs | Aesthetics | CRITICAL

mozilla/Common Voice 17 HF PQC Verified

Multilingual speech dataset with 30K+ hours across 120+ languages. Crowdsourced and validated. De facto standard for ASR training.

Dataset | Speech | Multilingual | 30K hours | CRITICAL

allenai/Dolma HF PQC Verified

Open corpus of 3T tokens for language model pretraining. Sourced from web content, academic papers, code, books, and encyclopedic material.

Dataset | Pretraining | English | 3T tokens | CRITICAL

tatsu-lab/Stanford Alpaca HF PQC Verified

Instruction-following dataset of 52K examples generated with text-davinci-003. Foundational instruction-tuning dataset.

Dataset | Instruction | 52K examples | CRITICAL

bigcode/The Stack v2 HF PQC Verified

Largest open code dataset. 67.5TB of permissively licensed source code across 600+ programming languages from Software Heritage.

Dataset | Code | 600+ Languages | 67.5TB | CRITICAL

OpenAssistant/OpenAssistant Conversations v2 HF PQC Verified

Human-generated, human-annotated conversation trees. 91K messages across 35+ languages. RLHF training data.

Dataset | Conversation | RLHF | 91K messages | CRITICAL

HuggingFaceFW/FineWeb HF PQC Verified

15T-token dataset of cleaned English web data, deduplicated and filtered from CommonCrawl. Outperforms C4 and RefinedWeb for LLM pretraining.

Dataset | Pretraining | English | 15T tokens | CRITICAL

wikimedia/Wikipedia (Nov 2023) HF PQC Verified

Complete Wikipedia dump across all languages. Standard pretraining data source. Structured articles with metadata.

Dataset | Text | Multilingual | Knowledge | CRITICAL

Showing 8 of 8 datasets (page 1 of 1)