Datasets

Training datasets with quantum-safe provenance

stanfordnlp/imdb HF Unverified

Dataset Card for "imdb". Dataset Summary: Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. … See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.

Task_categories:text-Classification · Task_ids:sentiment-Classification · Annotations_creators:expert-Generated · Language_creators:expert-Generated · Multilinguality:monolingual · Source_datasets:original
jhu-clsp/ettin-pretraining-data HF Unverified

Ettin Pre-training Data. Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite. This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format ready for use with Composer and the ModernBERT training repository. 📊 Data Composition (columns: Data Source, Tokens (B), Percentage, Description): DCLM, 837.2, 49.1%, High-quality web crawl data; CC Head, 356.6… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data.

Task_categories:text-Generation · Task_categories:fill-Mask · Task_categories:text-Classification · Language:en · Pretraining · Language-Modeling
uoft-cs/cifar10 HF Unverified

Dataset Card for CIFAR-10. Dataset Summary: The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.

Task_categories:image-Classification · Annotations_creators:crowdsourced · Language_creators:found · Multilinguality:monolingual · Source_datasets:extended|other-80-Million-Tiny-Images · Language:en
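The CIFAR-10 counts quoted in the summary above can be cross-checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and the numbers are taken from the summary text, not read from the dataset itself.

```python
# Figures quoted in the dataset summary (names are illustrative, not an API).
NUM_CLASSES = 10
IMAGES_PER_CLASS = 6000
TRAIN_BATCHES = 5
IMAGES_PER_BATCH = 10000
TEST_IMAGES_PER_CLASS = 1000

total_images = NUM_CLASSES * IMAGES_PER_CLASS        # all images in the dataset
train_images = TRAIN_BATCHES * IMAGES_PER_BATCH      # five training batches
test_images = NUM_CLASSES * TEST_IMAGES_PER_CLASS    # one balanced test batch

# Whatever is not in the test batch is left for training, per class.
train_images_per_class = IMAGES_PER_CLASS - TEST_IMAGES_PER_CLASS

# The train/test split should account for every image exactly once.
assert train_images + test_images == total_images
assert train_images_per_class * NUM_CLASSES == train_images
```

The per-class training count holds only across all five training batches together; the summary notes that individual batches are not guaranteed to be balanced.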
isaacus/open-australian-legal-corpus HF Unverified

Open Australian Legal Corpus ⚖️ The Open Australian Legal Corpus by Isaacus, a foundational legal AI research company, is the first and only multijurisdictional open corpus of Australian legislative and judicial documents. Comprising 229,122 texts totalling over 60 million lines and 1.4 billion tokens, the Corpus includes every in-force statute and regulation in the Commonwealth, New South Wales, Queensland, Western Australia, South Australia, Tasmania and Norfolk Island, in… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-corpus.

Task_categories:text-Generation · Task_categories:fill-Mask · Task_categories:text-Retrieval · Task_ids:language-Modeling · Task_ids:masked-Language-Modeling · Task_ids:document-Retrieval
wikimedia/Wikipedia (Nov 2023) HF PQC Verified

Complete Wikipedia dump across all languages. Standard pretraining data source. Structured articles with metadata.

Dataset · Text · Multilingual · Knowledge · CRITICAL
nebius/SWE-rebench-V2-PRs HF Unverified

SWE-rebench-V2-PRs. Dataset Summary: SWE-rebench-V2-PRs is a large-scale dataset of real-world GitHub pull requests collected across multiple programming languages, intended for training and evaluating code-generation and software-engineering agents. The dataset contains 126,300 samples covering Go, Python, JavaScript, TypeScript, Rust, Java, C, C++, Julia, Elixir, Kotlin, PHP, Scala, Clojure, Dart, OCaml, and other languages. For log parser functions, base Dockerfiles, and… See the full description on the dataset page: https://huggingface.co/datasets/nebius/SWE-rebench-V2-PRs.

Task_categories:text-Generation · Language:en · Size_categories:100K<n<1M · Format:parquet · Modality:text · Library:datasets
epfml/FineWeb-HQ HF Unverified

FineWeb-HQ. Dataset Summary: FineWeb-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb. FineWeb-HQ was created by selecting the top 10% of FineWeb documents based on a deep learning classifier trained to identify structured and knowledge-rich samples. This classifier uses XLM-RoBERTa embeddings to score documents. To validate our approach, we pretrained 1B-parameter LLMs with a Llama-like architecture across multiple languages and… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb-HQ.

Task_categories:text-Generation · Language:en · Size_categories:1B<n<10B · Format:parquet · Modality:tabular · Modality:text
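The "top 10% by classifier score" selection described for FineWeb-HQ can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the function name `select_top_percent` and the scores are hypothetical, and the real classifier (built on XLM-RoBERTa embeddings) is replaced here by precomputed numbers.

```python
def select_top_percent(scored_docs, percent=10):
    """Return the ids of documents whose scores fall in the top `percent` percent.

    `scored_docs` is a list of (doc_id, score) pairs; higher scores are better.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    # Integer ceiling division, so any non-empty input keeps at least one document.
    keep = -((-len(scored_docs) * percent) // 100)
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:keep]]


# Hypothetical scores standing in for classifier outputs.
scored = [(f"doc{i}", i / 100) for i in range(100)]
selected = select_top_percent(scored)
```

Sorting the full list is fine for a toy example; at corpus scale one would more likely compute a score threshold once and filter in a streaming pass.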
mlfoundations/MINT-1T-PDF-CC-2024-18 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2024-18.

Task_categories:image-To-Text · Task_categories:text-Generation · Language:en · Size_categories:100B<n<1T · Multimodal
allenai/MADLAD-400 HF PQC Verified

MADLAD-400. Dataset and Introduction: MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. It uses all snapshots of Common Crawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

Task_categories:text-Generation · Size_categories:n>1T
jobs-git/Zyda-2 HF Unverified

Zyda-2. Zyda-2 is a 5 trillion token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/Zyda-2.

Task_categories:text-Generation · Language:en · Size_categories:n>1T
TIGER-Lab/MMLU-Pro HF Unverified

MMLU-Pro Dataset. MMLU-Pro is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines. 🚀 What's New [2026.03.11]: Added more cutting-edge frontier models to the leaderboard, including the Claude-4.6 series, Seed2.0 series, Qwen3.5 series, and Gemini-3.1-Pro, among… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.

Benchmark:official · Task_categories:question-Answering · Language:en · Size_categories:10K<n<100K · Format:parquet · Modality:tabular
mlfoundations/MINT-1T-PDF-CC-2023-23 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-23.

Task_categories:image-To-Text · Task_categories:text-Generation · Language:en · Size_categories:1M<n<10M · Format:webdataset · Modality:image
aps/super_glue HF Unverified

Dataset Card for "super_glue". Dataset Summary: SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Dataset Structure, Data Instances, axb: Size of downloaded dataset files: 0.03 MB. Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.

Task_categories:text-Classification · Task_categories:token-Classification · Task_categories:question-Answering · Task_ids:natural-Language-Inference · Task_ids:word-Sense-Disambiguation · Task_ids:coreference-Resolution
rajpurkar/squad HF Unverified

Dataset Card for SQuAD. Dataset Summary: Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles. Supported Tasks and Leaderboards: Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.

Task_categories:question-Answering · Task_ids:extractive-Qa · Annotations_creators:crowdsourced · Language_creators:crowdsourced · Language_creators:found · Multilinguality:monolingual
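The SQuAD summary says each answer is a span of the reading passage. A minimal sketch of what that means in practice follows; the record below is invented for illustration, and the field names (`context`, `answers`, `text`, `answer_start`) are an assumption about the record layout rather than something stated on this page.

```python
def extract_span(context, answer_start, answer_text):
    """Slice the answer out of the passage and check it matches the label."""
    end = answer_start + len(answer_text)
    span = context[answer_start:end]
    if span != answer_text:
        raise ValueError("answer text does not match the passage at this offset")
    return span


# Hypothetical example record; not taken from the dataset.
context = "The museum opened in 1901 and moved to its current building in 1975."
example = {
    "context": context,
    "question": "When did the museum move to its current building?",
    "answers": {"text": ["1975"], "answer_start": [context.index("1975")]},
}

# Recover the labelled answer purely from its character offset in the passage.
recovered = extract_span(
    example["context"],
    example["answers"]["answer_start"][0],
    example["answers"]["text"][0],
)
```

Under this layout, an unanswerable question would simply carry empty `text` and `answer_start` lists, so callers should check for that before indexing.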
jobs-git/HPLT2.0_cleaned HF Unverified

This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly Internet Archive with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0. The Cleaned variant of HPLT Datasets v2.0: This is the cleaned variant of the HPLT Datasets v2.0, converted to the Parquet format semi-automatically when being uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.

Task_categories:fill-Mask · Task_categories:text-Generation · Task_ids:language-Modeling · Multilinguality:multilingual · Language:ace · Language:af
allenai/openbookqa HF Unverified

Dataset Card for OpenBookQA. Dataset Summary: OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/openbookqa.

Task_categories:question-Answering · Task_ids:open-Domain-Qa · Annotations_creators:crowdsourced · Annotations_creators:expert-Generated · Language_creators:expert-Generated · Multilinguality:monolingual
anisoleai/fineweb-tokenized HF Unverified

FineWeb Tokenized. > 4 trillion tokens of the pre-tokenized data the 🌐 web has to offer. What is it? This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset (currently in progress; tokenization of the ~15 trillion token corpus is ongoing). The data is being pre-processed and tokenized using the AnisoleAI BPE tokenizer (52,022 vocabulary size) and packed into compact uint16 Parquet shards. By distributing the pre-tokenized corpus, we eliminate… See the full description on the dataset page: https://huggingface.co/datasets/anisoleai/fineweb-tokenized.

Task_categories:text-Generation · Language:en · Size_categories:n>1T · Modality:tabular · Modality:text · Tabular
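The description above mentions packing token ids into uint16. That works because a 52,022-entry vocabulary fits below the 65,536 values a 16-bit unsigned integer can hold. The sketch below illustrates only that idea with the standard library; the function names are hypothetical, the little-endian byte order is an assumption, and it does not reproduce the dataset's actual Parquet shard layout.

```python
import struct

VOCAB_SIZE = 52_022          # vocabulary size quoted in the description
UINT16_MAX = 2**16 - 1       # largest value a uint16 can hold (65535)


def pack_token_ids(token_ids):
    """Pack token ids into a bytes object using 2 bytes per id (little-endian)."""
    for token_id in token_ids:
        if not 0 <= token_id < VOCAB_SIZE:
            raise ValueError(f"token id {token_id} is outside the vocabulary")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def unpack_token_ids(buffer):
    """Inverse of pack_token_ids: read back 2-byte little-endian ids."""
    count = len(buffer) // 2
    return list(struct.unpack(f"<{count}H", buffer))


ids = [0, 1, 52_021]
packed = pack_token_ids(ids)
restored = unpack_token_ids(packed)
```

Two bytes per token is half the footprint of a 32-bit id, which is the kind of saving the "compact" wording in the description points at.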
nvidia/Nemotron-CC-v2 HF Unverified

Nemotron-Pre-Training-Dataset-v1 Release. Data Overview: This pretraining dataset, for generative AI model training, preserves high-value math and code while enriching it with diverse multilingual Q&A, fueling the next generation of intelligent, globally-capable models. This dataset supports NVIDIA Nemotron Nano 2, a family of large language models (LLMs) that consists of the NVIDIA-Nemotron-Nano-9B-v2, NVIDIA-Nemotron-Nano-9B-v2-Base, and NVIDIA-Nemotron-Nano-12B-v2-Base… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.

Task_categories:text-Generation · Size_categories:1B<n<10B · Format:parquet · Modality:text · Library:datasets · Library:dask
allenai/dolma3_mix-6T-1025-7B HF Unverified

⚠️ WARNING: This dataset is intended ONLY for reproducing Olmo 3 7B ⚠️ For all other training use cases, including training from scratch, please utilize our primary dolma 3 data mix: https://huggingface.co/datasets/allenai/dolma3_mix-6T. Note: Some olmOCR science PDFs in the current dataset have been redacted following the training of Olmo 3 7B. These texts are indicated with [REMOVED] in the text field. This will affect reproducibility of Olmo 3 7B. For this reason, please use our… See the full description on the dataset page: https://huggingface.co/datasets/allenai/dolma3_mix-6T-1025-7B.

Task_categories:text-Generation · Language:en
zekaiwang/trex_dataset HF Unverified

T-Rex Dataset. A large-scale, tactile-reactive bimanual manipulation dataset, collected via teleoperation on a Dexmate Vega-1 robot with two Sharpa Wave dexterous hands. Stored as a LeRobotDataset v3.0. One episode from each of 20 motor primitives (head-camera view, cropped to the workspace), each with a different object. Teleoperation setup: Manus gloves + VIVE… See the full description on the dataset page: https://huggingface.co/datasets/zekaiwang/trex_dataset.

Task_categories:robotics · Language:en · Size_categories:1M<n<10M · Format:parquet · Modality:tabular · Modality:text
Showing 20 of 178 datasets (page 4 of 9)