Datasets

Training datasets with quantum-safe provenance

LAION/LAION-Aesthetics V2 5+ HF PQC Verified

Subset of LAION-5B filtered for aesthetic quality: 600M image-text pairs scored above 5.0 by an aesthetic predictor. A standard for image-generation training.

Dataset · Image-Text · 600M pairs · Aesthetics · CRITICAL
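The >5.0 aesthetic cutoff amounts to a simple per-record filter. A minimal sketch, with a hypothetical record schema (the `url`/`caption`/`score` field names are illustrative, not LAION's actual column names):

```python
# Hypothetical records: field names are illustrative, not LAION's actual schema.
records = [
    {"url": "https://example.com/a.jpg", "caption": "sunset over mountains", "score": 6.2},
    {"url": "https://example.com/b.jpg", "caption": "blurry receipt photo", "score": 3.1},
    {"url": "https://example.com/c.jpg", "caption": "oil painting of a ship", "score": 5.4},
]

def filter_aesthetic(rows, threshold=5.0):
    """Keep only image-text pairs whose aesthetic score exceeds the threshold."""
    return [r for r in rows if r["score"] > threshold]

kept = filter_aesthetic(records)
print([r["caption"] for r in kept])  # the two records scoring above 5.0
```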
mozilla/Common Voice 17 HF PQC Verified

Multilingual speech dataset with 30K+ hours across 120+ languages. Crowdsourced and validated. De facto standard for ASR training.

Dataset · Speech · Multilingual · 30K hours · CRITICAL
allenai/Dolma HF PQC Verified

Open corpus of 3T tokens for language model pretraining, sourced from the web, academic papers, code, encyclopedias, and books.

Dataset · Pretraining · English · 3T tokens · CRITICAL
tatsu-lab/Stanford Alpaca HF PQC Verified

Instruction-following dataset of 52K examples generated with text-davinci-003. A foundational instruction-tuning dataset.

Dataset · Instruction · 52K examples · CRITICAL
bigcode/The Stack v2 HF PQC Verified

Largest open code dataset. 67.5TB of permissively licensed source code across 600+ programming languages from Software Heritage.

Dataset · Code · 600+ Languages · 67.5TB · CRITICAL
OpenAssistant/OpenAssistant Conversations v2 HF PQC Verified

Human-generated, human-annotated conversation trees. 91K messages across 35+ languages. RLHF training data.

Dataset · Conversation · RLHF · 91K messages · CRITICAL
KakologArchives/KakologArchives HF PQC Verified

Niconico Jikkyo Comment Log Archive. This dataset collects every archived comment from the Niconico Jikkyo (live TV commentary) service, from its launch to the present. In December 2020, Niconico Jikkyo was relaunched as an official channel within Niconico Live. With that, the old system in operation since November 2009 was shut down (effectively ending the service); support on consumer devices such as torne and BRAVIA TVs ended across the board, and roughly 11 years of logs capturing viewers' unfiltered reactions were about to be lost with it. Members of 5ch's DTV board therefore organized a plan to archive the full 11 years of logs for every channel before the old service closed. After various twists and turns, Nekopanda managed to capture complete logs for all channels, including radio and BS broadcasts, so those 11 years of history were not lost to the digital sea. However, because the old API has been retired, the archived logs… See the full description on the dataset page: https://huggingface.co/datasets/KakologArchives/KakologArchives.

Task categories: text-classification · Language: ja
Salesforce/wikitext HF PQC Verified

Dataset Card for "wikitext" Dataset Summary The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.

Task categories: text-generation, fill-mask · Task IDs: language-modeling, masked-language-modeling · Annotations creators: no-annotation · Language creators: crowdsourced
LLM360/TxT360 HF PQC Verified

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend. Changelog: v1.1 added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended. TxT360_BestOfWeb is a filtered version of the TxT360 dataset, created using the ProX document filtering model; the model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.

Task categories: text-generation · Language: en · Size categories: n>1T
fineinstructions/fineinstructions_nemotron HF Unverified

✨ Note: for all FineInstructions resources, visit https://huggingface.co/fineinstructions. This dataset contains ~1B+ synthetic instruction-answer pairs (~300B tokens) created using the FineInstructions pipeline, which was run over the raw pre-training documents in the Nemotron-CC pre-training corpus (a subset of high-quality documents from CommonCrawl). See our paper for more details. Each .parquet file in the data folder has a corresponding judge-*.json file that… See the full description on the dataset page: https://huggingface.co/datasets/fineinstructions/fineinstructions_nemotron.

Language: en · Size categories: 1B<n<10B · Format: parquet · Modalities: tabular, text · Library: datasets
openai/gsm8k HF PQC Verified

Dataset Card for GSM8K. Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

Benchmark: official, eval-yaml · Task categories: text-generation · Annotations creators: crowdsourced · Language creators: crowdsourced · Multilinguality: monolingual
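GSM8K solutions end with a final line of the form `#### <answer>`, which is how evaluations score model output. A minimal extraction sketch (the example problem text is illustrative):

```python
import re

def extract_final_answer(solution: str) -> str:
    """Pull the final answer from a GSM8K-style solution, which ends '#### <answer>'."""
    m = re.search(r"####\s*([-0-9.,]+)", solution)
    if m is None:
        raise ValueError("no '####' final-answer marker found")
    # GSM8K answers may contain thousands separators; strip them for comparison.
    return m.group(1).replace(",", "")

example = (
    "Natalia sold 48 clips in April, and half as many in May.\n"
    "48 / 2 = 24 clips in May. 48 + 24 = 72 clips altogether.\n"
    "#### 72"
)
print(extract_final_answer(example))  # 72
```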
allenai/c4 HF PQC Verified

C4 Dataset Summary: a colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset. Five variants of the data are provided: en (305GB), en.noclean (2.3TB), en.noblocklist (380GB), realnewslike (15GB), and multilingual mC4 (9.7TB; 108 subsets, one per… See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.

Task categories: text-generation, fill-mask · Task IDs: language-modeling, masked-language-modeling · Annotations creators: no-annotation · Language creators: found
NTU-NLP-sg/xCodeEval HF PQC Verified

The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems, or assist developers in writing programs, can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way: on only one or two specific tasks, in a few languages, at a partial granularity (e.g., function level), and in many cases without proper training data. Even more concerning, in most cases the evaluation of generated code has been done in terms of mere lexical overlap rather than actual execution, whereas semantic similarity (or equivalence) of two code segments depends only on their "execution similarity", i.e., being able to produce the same output for a given input.

Task_categories:translationTask_categories:token-ClassificationTask_categories:text-RetrievalTask_categories:text-GenerationTask_categories:text-ClassificationTask_categories:feature-Extraction
m-a-p/FineFineWeb HF PQC Verified

FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus. arXiv, project page, and blog: coming soon. The data statistics report per-domain token and sample counts across three filtering iterations, e.g. aerospace: 6.34B total tokens (5.77B + 261.63M + 309.33M) over 10,399,539 samples; agronomy: 14.26B total tokens over 19,114,022 samples; artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.

Task categories: text-classification, text-generation · Language: en · Size categories: 1B<n<10B · Modalities: tabular, text
allenai/objaverse HF Unverified

Objaverse is a massive dataset of 800K+ annotated 3D objects. More documentation is coming soon; in the meantime, see our paper and website for additional details. License: the dataset as a whole is licensed under the ODC-By v1.0 license. Individual objects in Objaverse are all licensed as Creative Commons distributable objects and may be under the following licenses: CC-BY 4.0 (721K objects), CC-BY-NC 4.0 (25K objects), CC-BY-NC-SA 4.0 (52K… See the full description on the dataset page: https://huggingface.co/datasets/allenai/objaverse.

Language: en
HuggingFaceFW/fineweb-edu HF PQC Verified

📚 FineWeb-Edu: 1.3 trillion tokens of the finest educational data the 🌐 web has to offer. Paper: https://arxiv.org/abs/2406.17557. What is it? FineWeb-Edu consists of 1.3T tokens (this version) and a 5.4T-token variant (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. To enhance FineWeb's quality, we developed an educational-quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

Task categories: text-generation · Language: en · Size categories: 1B<n<10B · Format: parquet · Modalities: tabular, text
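The classifier-based filtering behind FineWeb-Edu can be sketched as a threshold over per-document scores. This is an offline illustration: the documents are invented, and the integer `score` field and the minimum-score cutoff of 3 are assumptions about the dataset's scoring scheme that should be checked against the dataset card:

```python
# Illustrative documents; the integer 'score' field and the cutoff of 3 are
# assumptions modeled on FineWeb-Edu's described classifier, not its real data.
docs = [
    {"text": "Photosynthesis converts light energy into chemical energy...", "score": 4},
    {"text": "BUY CHEAP WATCHES NOW!!!", "score": 0},
    {"text": "The French Revolution began in 1789...", "score": 3},
    {"text": "lol random forum chatter", "score": 1},
]

def keep_educational(rows, min_score=3):
    """FineWeb-Edu-style filter: retain pages the classifier rates >= min_score."""
    return [d for d in rows if d["score"] >= min_score]

print(len(keep_educational(docs)))  # 2
```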
jat-project/jat-dataset HF Unverified

JAT Dataset. Dataset Description: the Jack of All Trades (JAT) dataset combines a wide range of individual datasets, including demonstrations from expert RL agents, image-caption pairs, textual data, and more. The JAT dataset is part of the JAT project, which aims to build a multimodal generalist agent. Paper: https://huggingface.co/papers/2402.09844. Usage: >>> from datasets import load_dataset >>> dataset = load_dataset("jat-project/jat-dataset"… See the full description on the dataset page: https://huggingface.co/datasets/jat-project/jat-dataset.

Task categories: reinforcement-learning, text-generation, question-answering · Annotations creators: found, machine-generated · Source datasets: conceptual-captions
cais/mmlu HF Unverified

Dataset Card for MMLU Dataset Summary Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

Task categories: question-answering · Task IDs: multiple-choice-qa · Annotations creators: no-annotation · Language creators: expert-generated · Multilinguality: monolingual · Source datasets: original
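Scoring MMLU comes down to exact-match accuracy over the chosen options. A minimal sketch; the toy predictions are invented, and representing gold answers as indices 0-3 (A-D) is an assumption about the dataset's answer encoding worth verifying against the card:

```python
def mmlu_accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly (exact match)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy run: gold answers encoded as option indices 0-3 (A-D) -- an assumption
# about the dataset's answer column, not verified here.
gold = [0, 2, 1, 3]
preds = [0, 2, 2, 3]
print(mmlu_accuracy(preds, gold))  # 0.75
```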
HuggingFaceFW/FineWeb HF PQC Verified

15T token dataset of cleaned English web data. Deduplicated and filtered from CommonCrawl, outperforms C4 and RefinedWeb for LLM pretraining.

Dataset · Pretraining · English · 15T tokens · CRITICAL
HuggingFaceFW/finephrase HF PQC Verified

Dataset Card for HuggingFaceFW/finephrase. Dataset Summary: synthetic data generated by DataTrove. Model: HuggingFaceTB/SmolLM2-1.7B-Instruct (main). Source dataset: HuggingFaceFW/fineweb-edu, config sample-350BT, split train. Generation config: temperature=1.0, top_p=1.0, top_k=50, max_tokens=2048, model_max_context=8192. Speculative decoding: {"method":"suffix","num_speculative_tokens":32}. System prompt: none. Input column: text. Prompt families: faq prompt Rewrite the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finephrase.

Task categories: text-generation · Task IDs: language-modeling · Annotations creators: machine-generated · Language creators: found · Source datasets: HuggingFaceFW/fineweb-edu/sample-350BT · Language: en
Showing 20 of 126 datasets (page 1 of 7)