Datasets
Widely used open datasets for model training
LAION-Aesthetics: Subset of LAION-5B filtered for aesthetic quality. 600M image-text pairs scored >5.0 by an aesthetic predictor. A standard choice for image-generation training.
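The score-threshold filtering described above can be sketched in a few lines. This is illustrative only: the record fields (`url`, `caption`, `aesthetic_score`) are assumptions about a precomputed-score layout, not the actual LAION release schema.

```python
# Hypothetical sketch of aesthetic-score filtering, mirroring the >5.0
# threshold used for this subset. Field names are assumptions.

def filter_by_aesthetic(pairs, threshold=5.0):
    """Keep only image-text pairs whose predictor score exceeds the threshold."""
    return [p for p in pairs if p["aesthetic_score"] > threshold]

pairs = [
    {"url": "a.jpg", "caption": "a painting", "aesthetic_score": 6.2},
    {"url": "b.jpg", "caption": "a screenshot", "aesthetic_score": 3.1},
]
kept = filter_by_aesthetic(pairs)
print(len(kept))  # 1
```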
Common Voice: Multilingual speech dataset with 30K+ hours across 120+ languages. Crowdsourced and validated. A de facto standard for ASR training.
Dolma: Open corpus of 3T tokens for language model pretraining. Sourced from web pages, academic papers, code, encyclopedic content, and books.
Alpaca: Instruction-following dataset of 52K examples generated with text-davinci-003. A foundational instruction-tuning dataset.
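Examples in this style are typically rendered into a fixed prompt template before fine-tuning. The sketch below follows the commonly used Alpaca-style format, with and without an optional input field; treat the exact preamble wording as illustrative rather than canonical.

```python
# Sketch of an Alpaca-style instruction-tuning prompt template.
# The preamble text approximates the released format; verify against
# the actual repo before training on it.

def format_prompt(instruction, input_text=""):
    """Render one (instruction, optional input) example into a prompt string."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

prompt = format_prompt("Name three primary colors.")
```

The two-branch template matters in practice: roughly a third of the 52K examples carry an input field, and models trained on a single merged template tend to conflate instruction and context.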
The Stack v2: Largest open code dataset. 67.5TB of permissively licensed source code across 600+ programming languages, sourced from Software Heritage.
OpenAssistant Conversations (OASST1): Human-generated, human-annotated conversation trees. 91K messages across 35+ languages. RLHF training data.
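Because these conversations are stored as trees (each prompt can have multiple candidate replies), training pipelines usually linearize them into root-to-leaf threads first. A minimal sketch, assuming a nested-dict export with `role`, `text`, and `replies` fields (the real export format may differ):

```python
# Sketch: walking a conversation tree into linear prompter/assistant
# threads. The node schema ("role", "text", "replies") is an assumption
# about the export format, not the documented one.

def linearize(node, prefix=()):
    """Yield every root-to-leaf message path as a list of (role, text) pairs."""
    path = prefix + ((node["role"], node["text"]),)
    if not node.get("replies"):
        yield list(path)
    else:
        for child in node["replies"]:
            yield from linearize(child, path)

tree = {
    "role": "prompter", "text": "Hi!",
    "replies": [
        {"role": "assistant", "text": "Hello!", "replies": []},
        {"role": "assistant", "text": "Hey there.", "replies": []},
    ],
}
threads = list(linearize(tree))
print(len(threads))  # 2
```

Sibling replies become separate threads, which is what makes tree-structured data useful for preference ranking: two leaves sharing a prefix differ only in the final assistant turn.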
FineWeb: 15T-token dataset of cleaned English web data, deduplicated and filtered from CommonCrawl. Outperforms C4 and RefinedWeb for LLM pretraining.
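The deduplication step mentioned above can be illustrated with an exact-hash pass over document text. This is a simplification: production web-scale pipelines typically rely on fuzzy deduplication (e.g. MinHash) rather than exact hashing, and the normalization shown here is an assumption.

```python
import hashlib

# Minimal sketch of exact deduplication over web documents, in the spirit
# of cleaning CommonCrawl-derived corpora. Real pipelines usually add
# fuzzy (MinHash-style) dedup on top of this.

def dedupe(docs):
    """Drop later documents whose normalized text hashes identically."""
    seen, unique = set(), []
    for text in docs:
        digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = ["Hello world", "hello world  ", "Another page"]
print(len(dedupe(docs)))  # 2
```

Hashing the normalized text keeps memory bounded by one digest per unique document rather than the full corpus text.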
Wikipedia: Complete dump across all language editions. Standard pretraining data source. Structured articles with metadata.