Datasets

Training datasets with quantum-safe provenance

LAION/LAION-Aesthetics V2 5+ HF PQC Verified

Subset of LAION-5B filtered for aesthetic quality: 600M image-text pairs scored above 5.0 by an aesthetic predictor. Widely used for training image generation models.

Dataset | Image-Text | 600M pairs | Aesthetics | CRITICAL
mozilla/Common Voice 17 HF PQC Verified

Multilingual speech dataset with 30K+ hours across 120+ languages. Crowdsourced and validated. De facto standard for ASR training.

Dataset | Speech | Multilingual | 30K hours | CRITICAL
allenai/Dolma HF PQC Verified

Open corpus of 3T tokens for language model pretraining. Sourced from web, academic papers, code, encyclopedic, and book content.

Dataset | Pretraining | English | 3T tokens | CRITICAL
tatsu-lab/Stanford Alpaca HF PQC Verified

Instruction-following dataset of 52K examples generated from text-davinci-003. Foundational instruction tuning dataset.

Dataset | Instruction | 52K examples | CRITICAL
bigcode/The Stack v2 HF PQC Verified

Largest open code dataset. 67.5TB of permissively licensed source code across 600+ programming languages from Software Heritage.

Dataset | Code | 600+ Languages | 67.5TB | CRITICAL
Benjy/typed_digital_signatures HF PQC Verified

Typed Digital Signatures Dataset. This comprehensive dataset contains synthetic digital signatures rendered across 30 different Google Fonts, specifically selected for their handwriting and signature-style characteristics. Each font contributes unique stylistic elements, making this dataset ideal for robust signature analysis and font recognition tasks. Dataset Overview: Total Fonts: 30 different Google Fonts; Images per Font: 3,000 signatures; Total Dataset Size: ~90,000… See the full description on the dataset page: https://huggingface.co/datasets/Benjy/typed_digital_signatures.

Task_categories:image-Classification | Task_categories:zero-Shot-Image-Classification | Task_categories:image-Feature-Extraction | Language:en | Size_categories:10K<n<100K | Modality:image
KakologArchives/KakologArchives HF PQC Verified

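Niconico Jikkyo Past Log Archive. The Niconico Jikkyo Past Log Archive is a dataset that collects every past-log comment posted to Niconico Jikkyo (ニコニコ実況) from the launch of the service to the present. In December 2020, Niconico Jikkyo was relaunched as an official channel within Niconico Live (ニコニコ生放送). As a result, the old system, which had been in operation since November 2009, was discontinued (effectively the end of the service). Support for consumer devices such as torne and BRAVIA ended across the board, and roughly 11 years of past logs, full of viewers' unfiltered reactions from the time, were set to be lost with it. In response, residents of the DTV board on 5ch led a plan to archive 11 years of past logs from every channel before the old Niconico Jikkyo shut down. After some twists and turns, Nekopanda retrieved the complete past logs of all channels for those roughly 11 years, including radio and BS, so the logs were saved from vanishing into the digital sea. However, since the old API has been retired, past logs via API… See the full description on the dataset page: https://huggingface.co/datasets/KakologArchives/KakologArchives.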

Task_categories:text-Classification | Language:ja
OpenAssistant/OpenAssistant Conversations v2 HF PQC Verified

Human-generated, human-annotated conversation trees. 91K messages across 35+ languages. RLHF training data.

Dataset | Conversation | RLHF | 91K messages | CRITICAL
Salesforce/wikitext HF PQC Verified

Dataset Card for "wikitext". Dataset Summary: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.

Task_categories:text-Generation | Task_categories:fill-Mask | Task_ids:language-Modeling | Task_ids:masked-Language-Modeling | Annotations_creators:no-Annotation | Language_creators:crowdsourced
m-a-p/FineFineWeb HF PQC Verified

FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus. arXiv: Coming Soon. Project Page: Coming Soon. Blog: Coming Soon. Data Statistics:
Domain (#tokens/#samples) | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count
aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539
agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022
artistic…
See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.

Task_categories:text-Classification | Task_categories:text-Generation | Language:en | Size_categories:1B<n<10B | Modality:tabular | Modality:text
LLM360/TxT360 HF PQC Verified

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend. Changelog (Version: Details): v1.1: Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended. Details of v1.1 Additions: TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.

Task_categories:text-Generation | Language:en | Size_categories:n>1T
codeparrot/github-code HF Unverified

The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.

Task_categories:text-Generation | Task_ids:language-Modeling | Language_creators:crowdsourced | Language_creators:expert-Generated | Multilinguality:multilingual | Language:code
fineinstructions/fineinstructions_nemotron HF Unverified

✨ Note: For all FineInstructions resources, please visit: https://huggingface.co/fineinstructions. This dataset contains ~1B+ synthetic instruction-answer pairs (~300B tokens) created using the FineInstructions pipeline. The FineInstructions pipeline was run over the raw pre-training documents in the Nemotron-CC pre-training corpus (a subset of high-quality documents from CommonCrawl). See our paper for more details. Each .parquet file in the data folder has a corresponding judge-*.json file that… See the full description on the dataset page: https://huggingface.co/datasets/fineinstructions/fineinstructions_nemotron.

Language:en | Size_categories:1B<n<10B | Format:parquet | Modality:tabular | Modality:text | Library:datasets
allenai/c4 HF PQC Verified

C4. Dataset Summary: A colossal, cleaned version of Common Crawl's web crawl corpus. Based on the Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's C4 dataset. We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, these are the sizes of the variants: en: 305GB; en.noclean: 2.3TB; en.noblocklist: 380GB; realnewslike: 15GB; multilingual (mC4): 9.7TB (108 subsets, one per… See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.

Task_categories:text-Generation | Task_categories:fill-Mask | Task_ids:language-Modeling | Task_ids:masked-Language-Modeling | Annotations_creators:no-Annotation | Language_creators:found
genrobot2025/10Kh-RealOmin-OpenData HF Unverified

Boasting over 13,000 hours of cumulative data and 5 million+ clips, it ranks as the largest open-source embodied intelligence dataset in the industry. Update Notes: Stage 3 data upload completed. 13,000+ hours of pure dual-hand data with frame-level alignment latency < 1ms. Full high-precision trajectory reconstruction, breaking the limit of superficial open source, fully ready-to-use. 3,000+ contributors and 10,000+ real household scenarios with exceptional diversity. Comprehensive… See the full description on the dataset page: https://huggingface.co/datasets/genrobot2025/10Kh-RealOmin-OpenData.

Task_categories:robotics | Task_categories:reinforcement-Learning | Language:en | Language:zh | Size_categories:n>1T | Modality:video
openai/gsm8k HF PQC Verified

Dataset Card for GSM8K. Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

Benchmark:official | Benchmark:eval-Yaml | Task_categories:text-Generation | Annotations_creators:crowdsourced | Language_creators:crowdsourced | Multilinguality:monolingual
NTU-NLP-sg/xCodeEval HF PQC Verified

The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems or assist developers in writing programs can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way on only one or two specific tasks, in a few languages, at a partial granularity (e.g., function) level, and in many cases without proper training data. Even more concerning is that in most cases the evaluation of generated code has been done in terms of mere lexical overlap rather than actual execution, whereas semantic similarity (or equivalence) of two code segments depends only on their "execution similarity", i.e., being able to get the same output for a given input.

Task_categories:translation | Task_categories:token-Classification | Task_categories:text-Retrieval | Task_categories:text-Generation | Task_categories:text-Classification | Task_categories:feature-Extraction
mteb/results HF Unverified

Size_categories:1M<n<10M | Format:parquet | Format:optimized-Parquet | Modality:text | Library:datasets | Library:dask
ad1t7a/10Kh-RealOmin-OpenData HF Unverified

Boasting over 10,000 hours of cumulative data and 1 million+ clips, it ranks as the largest open-source embodied intelligence dataset in the industry. Compared with other datasets, it has the following advantages. Ample Data Volume & Strong Generalization: Each skill is supported by sufficient data, collected from over 3,000 households and nearly 10,000 distinct fine-grained targets. It avoids simple repetitions and ensures robust generalization. Authentic Scenarios & Focused… See the full description on the dataset page: https://huggingface.co/datasets/ad1t7a/10Kh-RealOmin-OpenData.

Task_categories:robotics | Task_categories:reinforcement-Learning | Language:en | Language:zh | Size_categories:n>1T | Modality:video
mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M HF Unverified

🚀 LLaVA-One-Vision-1.5-Mid-Training-85M Dataset is being uploaded 🚀 Upload Status: All Completed: ImageNet-21k, LAIONCN, DataComp-1B, Zero250M, COYO700M, SA-1B, MINT, Obelics. 📜 Cite: If you find LLaVA-One-Vision-1.5-Mid-Training-85M useful in your research, please consider citing the following related papers: @misc{an2025llavaonevision15fullyopenframework, title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training}… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M.

Size_categories:10M<n<100M | Format:parquet | Modality:image | Modality:text | Library:datasets | Library:dask
Showing 20 of 178 datasets (page 1 of 9)