Datasets

Training datasets with quantum-safe provenance


Zyda-2 Zyda-2 is a 5 trillion token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/Zyphra/Zyda-2.

Task_categories:text-Generation, Language:en, Size_categories:n>1T

139K 98

Updated 2026-06-29 Source available

mlfoundations/MINT-1T-PDF-CC-2023-14 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-14.

Task_categories:image-To-Text, Task_categories:text-Generation, Language:en, Size_categories:1M<n<10M, Format:webdataset, Modality:image

135K 6

Updated 2026-04-30 Source available

ylecun/mnist HF Unverified

Dataset Card for MNIST Dataset Summary The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

Task_categories:image-Classification, Task_ids:multi-Class-Image-Classification, Annotations_creators:expert-Generated, Language_creators:found, Multilinguality:monolingual, Source_datasets:extended|other-Nist

134K 252

Updated 2026-06-29 Source available

wyu1/Leopard-Instruct HF Unverified

Leopard-Instruct Paper | Github | Models-LLaVA | Models-Idefics2 Summaries Leopard-Instruct is a large instruction-tuning dataset comprising 925K instances, of which 739K are specifically designed for text-rich, multi-image scenarios. It has been used to train Leopard-LLaVA [checkpoint] and Leopard-Idefics2 [checkpoint]. Loading dataset To load the dataset without automatically downloading and processing the images (please run the following code with datasets==2.18.0)… See the full description on the dataset page: https://huggingface.co/datasets/wyu1/Leopard-Instruct.

Language:en, Size_categories:1M<n<10M, Format:parquet, Modality:image, Modality:text, Library:datasets

134K 65

Updated 2026-05-08 Source available

liwu/MNBVC HF Unverified

MNBVC: Massive Never-ending BT Vast Chinese corpus

Task_categories:text-Generation, Task_categories:fill-Mask, Task_ids:language-Modeling, Task_ids:masked-Language-Modeling, Annotations_creators:other, Language_creators:other

128K 634

Updated 2026-06-29 Source available

HuggingFaceH4/MATH-500 HF Unverified

Dataset Card for MATH-500 This dataset contains a subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits

Task_categories:text-Generation, Language:en, Size_categories:n<1K, Format:json, Modality:text, Library:datasets

127K 317

Updated 2026-06-29 Source available

OpenSQZ/AutoMathText-V2 HF Unverified

🚀 AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset   🎉 AutoMathText-v2 has surpassed 1.5 million downloads! We'd love to know how you're using it. Please take 1 minute to fill out our use case survey. Your feedback will directly shape the future roadmap of this dataset.👉 Share your use case here 📊 AutoMathText-V2 consists of 2.46 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and… See the full description on the dataset page: https://huggingface.co/datasets/OpenSQZ/AutoMathText-V2.

Task_categories:text-Generation, Task_categories:question-Answering, Language:en, Language:zh, Size_categories:100M<n<1B, Modality:tabular

125K 78

Updated 2026-06-29 Source available

chenxran/uspto_full HF Unverified

Dataset Card for "uspto_full" More Information needed

Size_categories:1M<n<10M, Format:parquet, Modality:text, Library:datasets, Library:pandas, Library:mlcroissant

125K 2

Updated 2026-05-08 Source available

HKUSTAudio/Audio-FLAN-Dataset HF Unverified

Audio-FLAN Dataset (Paper) (the FULL audio files and jsonl files are still updating) An Instruction-Tuning Dataset for Unified Audio Understanding and Generation Across Speech, Music, and Sound. 1. Dataset Structure The Audio-FLAN-Dataset has the following directory structure:

Audio-FLAN-Dataset/
├── audio_files/
│   ├── audio/
│   │   └── 177_TAU_Urban_Acoustic_Scenes_2022/
│   │   └── 179_Audioset_for_Audio_Inpainting/
│   │   └── ...
│   ├── music/
│   │   └──…

See the full description on the dataset page: https://huggingface.co/datasets/HKUSTAudio/Audio-FLAN-Dataset.

Task_categories:text-To-Speech, Task_categories:text-To-Audio, Task_categories:automatic-Speech-Recognition, Language:en, Language:zh, Size_categories:10M<n<100M

123K 42

Updated 2026-04-23 Source available

HuggingFaceFW/fineweb-2 HF Unverified

🥂 FineWeb2 A sparkling update with 1000s of languages What is it? This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

Task_categories:text-Generation, Language:aai, Language:aak, Language:aau, Language:aaz, Language:aba

121K 792

Updated 2026-05-08 Source available

Beijing-AISI/panda-bench HF Unverified

PandaBench PandaBench is a comprehensive benchmark for evaluating Large Language Model (LLM) safety, focusing on jailbreak attacks, defense mechanisms, and evaluation methodologies. The PandaGuard framework architecture illustrating the end-to-end pipeline for LLM safety evaluation. The system connects three key components: Attackers, Defenders, and Judges. Dataset Description This repository contains the benchmark results from extensive evaluations of various LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Beijing-AISI/panda-bench.

Task_categories:text-Generation, Language:en, Size_categories:100K<n<1M, Format:csv, Modality:tabular, Modality:text

121K 0

Updated 2026-05-08 Source available

phanerozoic/hi-21cm-survey HF Unverified

21cm Hydrogen Line Sky Survey Continuous 1-second-cadence power spectra of the 21cm neutral hydrogen (HI) line at 1420.405 MHz, recorded 24/7 from a fixed omnidirectional observer on the US East Coast. What this is A radio telescope pointed at the whole sky, recording one spectrum per second, indefinitely. The Earth's rotation scans the beam across the galactic plane daily, producing a natural drift scan. Every row is a self-timestamped power spectrum spanning 2… See the full description on the dataset page: https://huggingface.co/datasets/phanerozoic/hi-21cm-survey.

Task_categories:time-Series-Forecasting, Size_categories:1M<n<10M, Format:parquet, Modality:tabular, Modality:text, Library:datasets

121K 3

Updated 2026-06-29 Source available
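
The phanerozoic/hi-21cm-survey entry above names the hydrogen line by both its wavelength (21 cm) and its frequency (1420.405 MHz). As a small worked example, the two are related by wavelength = c / frequency. The sketch below only illustrates that arithmetic; the function name is hypothetical and is not part of the dataset or its tooling.

```python
# Worked example: convert the HI line frequency quoted in the card
# (1420.405 MHz) to a wavelength in centimetres.
# `wavelength_cm` is a hypothetical helper, not part of the dataset.

SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by SI definition
HI_LINE_FREQUENCY_HZ = 1420.405e6     # value quoted in the dataset card


def wavelength_cm(frequency_hz: float) -> float:
    """Return the vacuum wavelength in centimetres for a frequency in Hz."""
    return SPEED_OF_LIGHT_M_PER_S / frequency_hz * 100.0


if __name__ == "__main__":
    # Prints a value close to 21.1 cm, which is where the "21cm" name comes from.
    print(f"{wavelength_cm(HI_LINE_FREQUENCY_HZ):.3f} cm")
```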

google/IFEval HF Unverified

Dataset Card for IFEval Dataset Summary This dataset contains the prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models. It contains around 500 "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times" which can be verified by heuristics. To load the dataset, run: from datasets import load_dataset ifeval = load_dataset("google/IFEval") Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/IFEval.

Task_categories:text-Generation, Language:en, Size_categories:n<1K, Format:json, Modality:text, Library:datasets

121K 148

Updated 2026-05-08 Source available
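
The google/IFEval entry above says its instructions "can be verified by heuristics". A minimal sketch of what such checks could look like for the two instructions quoted in the card follows. The function names are hypothetical, and this is not the benchmark's official checker: counting words by whitespace and matching the keyword as a case-insensitive whole word are assumptions made for illustration.

```python
import re

# Hypothetical heuristic checkers for the two example instructions quoted
# in the IFEval card. These are illustrative, not the official implementation.


def has_more_than_n_words(response: str, min_words: int) -> bool:
    """Check "write in more than N words" by counting whitespace-separated tokens."""
    return len(response.split()) > min_words


def mentions_keyword_at_least(response: str, keyword: str, min_count: int) -> bool:
    """Check "mention the keyword at least N times" using whole-word matches."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    matches = re.findall(pattern, response, flags=re.IGNORECASE)
    return len(matches) >= min_count


if __name__ == "__main__":
    sample = "AI helps. Modern AI systems learn. People debate AI daily."
    print(has_more_than_n_words(sample, 400))        # short text, so False
    print(mentions_keyword_at_least(sample, "AI", 3))  # three whole-word hits, so True
```

Whole-word matching is used so that the letters "ai" inside a word such as "daily" do not count as a mention.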

fancyzhx/ag_news HF Unverified

Dataset Card for "ag_news" Dataset Summary AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/ag_news.

Task_categories:text-Classification, Task_ids:topic-Classification, Annotations_creators:found, Language_creators:found, Multilinguality:monolingual, Source_datasets:original

121K 190

Updated 2026-06-29 Source available

FLARE-MedFM/FLARE-Task4-CT-FM HF Unverified

MICCAI FLARE25 Task 4: Foundation Models for 3D CT and MRI Scans (Homepage) This is the official dataset for CT image foundation model development. We provide 10,000+ CT scans for model pretraining. Downstream tasks include: abdominal disease classification, abdominal lesion segmentation, abdominal organ segmentation, and lung lesion segmentation. Dataset (table columns: Dataset Name, Task, Metric, Source, License) Abdominal Disease Classification multi-label… See the full description on the dataset page: https://huggingface.co/datasets/FLARE-MedFM/FLARE-Task4-CT-FM.

Task_categories:image-Classification, Task_categories:image-Segmentation, Language:en, Medical

121K 2

Updated 2026-06-29 Source available

legacy-datasets/wikipedia HF Unverified

Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

Task_categories:text-Generation, Task_categories:fill-Mask, Task_ids:language-Modeling, Task_ids:masked-Language-Modeling, Annotations_creators:no-Annotation, Language_creators:crowdsourced

120K 645

Updated 2026-06-28 Source available

airtrain-ai/fineweb-edu-fortified HF Unverified

Fineweb-Edu-Fortified The composition of fineweb-edu-fortified, produced by automatically clustering a 500k row sample in Airtrain What is it? Fineweb-Edu-Fortified is a dataset derived from Fineweb-Edu by applying exact-match deduplication across the whole dataset and producing an embedding for each row. The number of times the text from each row appears is also included as a count column. The embeddings were produced using TaylorAI/bge-micro Fineweb and… See the full description on the dataset page: https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified.

Task_categories:text-Generation, Language:en, Size_categories:100M<n<1B, Format:parquet, Modality:tabular, Modality:text

120K 65

Updated 2026-06-27 Source available
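
The airtrain-ai/fineweb-edu-fortified entry above describes exact-match deduplication with a count column recording how often each text appeared. A minimal in-memory sketch of that idea follows. It is an illustration under stated assumptions, not the dataset authors' pipeline: the function name and the `text` field name are hypothetical, and a dataset of that size would need a distributed or hash-based approach rather than a single in-memory counter.

```python
from collections import Counter
from typing import Dict, Iterable, List, Union

# Hypothetical sketch of exact-match deduplication with a "count" column.
# Not the pipeline used to build the dataset.


def dedup_with_counts(texts: Iterable[str]) -> List[Dict[str, Union[str, int]]]:
    """Keep one row per distinct text and record how many times it appeared.

    Rows come back in first-seen order, because Counter is a dict subclass
    and dicts preserve insertion order (Python 3.7+).
    """
    counts = Counter(texts)
    return [{"text": text, "count": n} for text, n in counts.items()]


if __name__ == "__main__":
    rows = ["the cat sat", "hello world", "the cat sat"]
    for row in dedup_with_counts(rows):
        print(row)
```

Exact matching treats texts that differ by even one character as distinct, which is why it is cheap but does not catch near-duplicates.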

Helsinki-NLP/fineweb-edu-translated HF PQC Verified

Helsinki-NLP/fineweb-edu-translated fineweb-edu-translated is a collection of automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models. The data covers 36,704,000 documents with over 28 billion space-separated tokens of English data translated into 36 languages. The total dataset includes over 960 billion tokens, and the translated documents are aligned across all languages. More information about how the data has been produced can… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/fineweb-edu-translated.

Task_categories:translation, Task_categories:text-Generation, Language:bos, Language:bul, Language:cat, Language:ces

119K 16

Updated 2026-06-26 Source available

HuggingFaceFW/finetranslations HF Unverified

💬 FineTranslations The world's knowledge in 1+1T tokens of parallel text What is it? This dataset contains over 1 trillion tokens of parallel text in English and 500+ languages. It was obtained by translating data from 🥂 FineWeb2 into English using Gemma3 27B. We relied on datatrove's inference runner to deploy a synthetic data pipeline at scale. Its checkpointing and VLLM lifecycle management features allowed us to use leftover compute from the HF cluster… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finetranslations.

Task_categories:text-Generation, Task_categories:translation, Language:abk, Language:abq, Language:abs, Language:acm

119K 286

Updated 2026-05-04 Source available

Muennighoff/multi_eurlex HF Unverified

MultiEURLEX comprises 65k EU laws in 23 official EU languages (some low-ish resource). Each EU law has been annotated with EUROVOC concepts (labels) by the Publication Office of the EU. As with the English EURLEX, the goal is to predict the relevant EUROVOC concepts (labels); this is a multi-label classification task (given the text, predict multiple labels).

Size_categories:10M<n<100M, Modality:text, Library:datasets, Library:mlcroissant

118K 6

Updated 2026-05-08 Source available

Showing 20 of 178 datasets (page 5 of 9)
