Datasets | QuantaMrkt

mlfoundations/MINT-1T-HTML HF Unverified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-HTML.

Task_categories:image-To-Text | Task_categories:text-Generation | Language:en | Size_categories:100M<n<1B | Format:parquet | Modality:text

139K 94

Updated 2026-05-08 Source available

mlfoundations/MINT-1T-PDF-CC-2023-14 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-14.

Task_categories:image-To-Text | Task_categories:text-Generation | Language:en | Size_categories:1M<n<10M | Format:webdataset | Modality:image

135K 6

Updated 2026-04-30 Source available

uoft-cs/cifar10 HF Unverified

Dataset Card for CIFAR-10 Dataset Summary The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.

Task_categories:image-Classification | Annotations_creators:crowdsourced | Language_creators:found | Multilinguality:monolingual | Source_datasets:extended|other-80-Million-Tiny-Images | Language:en

135K 104

Updated 2026-05-08 Source available

locuslab/TOFU HF Unverified

TOFU: Task of Fictitious Unlearning 🍢 The TOFU dataset serves as a benchmark for evaluating unlearning performance of large language models on realistic tasks. The dataset comprises question-answer pairs based on autobiographies of 200 different authors that do not exist and are completely fictitiously generated by the GPT-4 model. The goal of the task is to unlearn a fine-tuned model on various fractions of the forget set. Quick Links Website: The landing page for TOFU… See the full description on the dataset page: https://huggingface.co/datasets/locuslab/TOFU.

Task_categories:question-Answering | Task_ids:closed-Domain-Qa | Annotations_creators:machine-Generated | Language_creators:machine-Generated | Multilinguality:monolingual | Source_datasets:original

134K 51

Updated 2026-05-08 Source available

wyu1/Leopard-Instruct HF Unverified

Leopard-Instruct Paper | Github | Models-LLaVA | Models-Idefics2 Summaries Leopard-Instruct is a large instruction-tuning dataset comprising 925K instances, 739K of which are specifically designed for text-rich, multi-image scenarios. It has been used to train Leopard-LLaVA [checkpoint] and Leopard-Idefics2 [checkpoint]. Loading dataset: to load the dataset without automatically downloading and processing the images (please run the following code with datasets==2.18.0)… See the full description on the dataset page: https://huggingface.co/datasets/wyu1/Leopard-Instruct.

Language:en | Size_categories:1M<n<10M | Format:parquet | Modality:image | Modality:text | Library:datasets

134K 65

Updated 2026-05-08 Source available

allenai/openbookqa HF Unverified

Dataset Card for OpenBookQA Dataset Summary OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/openbookqa.

Task_categories:question-Answering | Task_ids:open-Domain-Qa | Annotations_creators:crowdsourced | Annotations_creators:expert-Generated | Language_creators:expert-Generated | Multilinguality:monolingual

129K 129

Updated 2026-05-08 Source available

Idavidrein/gpqa HF Unverified

Dataset Card for GPQA GPQA is a multiple-choice, Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions out of their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending >30m with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.

Benchmark:official | Benchmark:eval-Yaml | Task_categories:question-Answering | Task_categories:text-Generation | Language:en | Size_categories:1K<n<10K

128K 428

Updated 2026-05-08 Source available

chenxran/uspto_full HF Unverified

Dataset Card for "uspto_full" More Information needed

Size_categories:1M<n<10M | Format:parquet | Modality:text | Library:datasets | Library:pandas | Library:mlcroissant

125K 2

Updated 2026-05-08 Source available

HKUSTAudio/Audio-FLAN-Dataset HF Unverified

Audio-FLAN Dataset (Paper) (the FULL audio files and jsonl files are still updating) An Instruction-Tuning Dataset for Unified Audio Understanding and Generation Across Speech, Music, and Sound. 1. Dataset Structure The Audio-FLAN-Dataset has the following directory structure:

Audio-FLAN-Dataset/
├── audio_files/
│   ├── audio/
│   │   └── 177_TAU_Urban_Acoustic_Scenes_2022/
│   │   └── 179_Audioset_for_Audio_Inpainting/
│   │   └── ...
│   ├── music/
│   │   └──…

See the full description on the dataset page: https://huggingface.co/datasets/HKUSTAudio/Audio-FLAN-Dataset.

Task_categories:text-To-Speech | Task_categories:text-To-Audio | Task_categories:automatic-Speech-Recognition | Language:en | Language:zh | Size_categories:10M<n<100M

123K 42

Updated 2026-04-23 Source available

HuggingFaceFW/fineweb-2 HF Unverified

🥂 FineWeb2 A sparkling update with 1000s of languages What is it? This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

Task_categories:text-Generation | Language:aai | Language:aak | Language:aau | Language:aaz | Language:aba

121K 792

Updated 2026-05-08 Source available

Beijing-AISI/panda-bench HF Unverified

PandaBench PandaBench is a comprehensive benchmark for evaluating Large Language Model (LLM) safety, focusing on jailbreak attacks, defense mechanisms, and evaluation methodologies. The PandaGuard framework architecture illustrating the end-to-end pipeline for LLM safety evaluation. The system connects three key components: Attackers, Defenders, and Judges. Dataset Description This repository contains the benchmark results from extensive evaluations of various LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Beijing-AISI/panda-bench.

Task_categories:text-Generation | Language:en | Size_categories:100K<n<1M | Format:csv | Modality:tabular | Modality:text

121K 0

Updated 2026-05-08 Source available

google/IFEval HF Unverified

Dataset Card for IFEval Dataset Summary This dataset contains the prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models. It contains around 500 "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times" which can be verified by heuristics. To load the dataset, run:

    from datasets import load_dataset
    ifeval = load_dataset("google/IFEval")

Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/IFEval.

Task_categories:text-Generation | Language:en | Size_categories:n<1K | Format:json | Modality:text | Library:datasets

121K 148

Updated 2026-05-08 Source available
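
The google/IFEval entry above says its "verifiable instructions" can be checked by heuristics. As a rough illustration of what such a check might look like, here is a minimal Python sketch covering the two example instructions quoted in the description. The function names and the sample response are hypothetical and are not taken from the IFEval code or dataset; the benchmark's own checkers may define word counts and keyword matches differently.

```python
import re


def has_more_than_n_words(response: str, n: int) -> bool:
    """Hypothetical check for an instruction like 'write in more than n words'.

    Words are approximated by splitting on whitespace.
    """
    return len(response.split()) > n


def mentions_keyword_at_least(response: str, keyword: str, times: int) -> bool:
    """Hypothetical check for 'mention the keyword at least `times` times'.

    Counts case-insensitive whole-word matches only.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= times


# Made-up response text, used only to exercise the two checks.
response = "AI is everywhere. Many teams use AI daily, and AI tools keep improving."

print(mentions_keyword_at_least(response, "AI", 3))
print(has_more_than_n_words(response, 400))
```

Both checks are deterministic string operations, which is what makes this style of instruction verifiable without a human or model judge.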

ibrahimhamamci/CT-RATE HF Unverified

The CT-RATE Team organizes the VLM3D Challenge VLM3D 2026 (2nd Edition) → Challenge Finals at MICCAI 2026 VLM3D 2025 (1st Edition) → Challenge Finals at MICCAI 2025 • Workshop at ICCV 2025 The CT-RATE Team is developing the MR-RATE Dataset A large-scale brain MRI dataset with paired radiology reports for training 3D vision-language models. GitHub | Dataset | Metadata Dashboard Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography… See the full description on the dataset page: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE.

Task_categories:image-To-Text | Task_categories:text-To-Image | Task_categories:image-Classification | Task_categories:question-Answering | Task_categories:visual-Question-Answering | Task_categories:zero-Shot-Classification

121K 242

Updated 2026-05-08 Source available

HuggingFaceFW/finetranslations HF Unverified

💬 FineTranslations The world's knowledge in 1+1T tokens of parallel text What is it? This dataset contains over 1 trillion tokens of parallel text in English and 500+ languages. It was obtained by translating data from 🥂 FineWeb2 into English using Gemma3 27B. We relied on datatrove's inference runner to deploy a synthetic data pipeline at scale. Its checkpointing and VLLM lifecycle management features allowed us to use leftover compute from the HF cluster… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finetranslations.

Task_categories:text-Generation | Task_categories:translation | Language:abk | Language:abq | Language:abs | Language:acm

119K 286

Updated 2026-05-04 Source available

jobs-git/Zyda-2 HF Unverified

Zyda-2 Zyda-2 is a 5 trillion token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/Zyda-2.

Task_categories:text-Generation | Language:en | Size_categories:n>1T

119K 1

Updated 2026-05-06 Source available

Muennighoff/multi_eurlex HF Unverified

MultiEURLEX comprises 65k EU laws in 23 official EU languages (some low-ish resource). Each EU law has been annotated with EUROVOC concepts (labels) by the Publications Office of the EU. As with the English EURLEX, the goal is to predict the relevant EUROVOC concepts (labels); this is a multi-label classification task (given the text, predict multiple labels).

Size_categories:10M<n<100M | Modality:text | Library:datasets | Library:mlcroissant

118K 6

Updated 2026-05-08 Source available

fancyzhx/ag_news HF Unverified

Dataset Card for "ag_news" Dataset Summary AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/ag_news.

Task_categories:text-Classification | Task_ids:topic-Classification | Annotations_creators:found | Language_creators:found | Multilinguality:monolingual | Source_datasets:original

113K 189

Updated 2026-05-07 Source available

MMMU/MMMU HF Unverified

MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) 🌐 Homepage | 🏆 Leaderboard | 🤗 Dataset | 🤗 Paper | 📖 arXiv | GitHub 🔔News 🛠️[2026-04-21]: Fixed option issue in test_Psychology_15. ‼️[2026-02-12]: We have released the answers for the test set! You can now evaluate your models on the test set locally! 🎉 🛠️[2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25;… See the full description on the dataset page: https://huggingface.co/datasets/MMMU/MMMU.

Task_categories:question-Answering | Task_categories:visual-Question-Answering | Task_categories:multiple-Choice | Language:en | Size_categories:10K<n<100K | Format:parquet

112K 325

Updated 2026-05-08 Source available

ServiceNow/GroundCUA HF Unverified

GroundCUA: Grounding Computer Use Agents on Human Demonstrations 🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models GroundCUA Dataset GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow/GroundCUA.

Task_categories:image-To-Text | Language:en | Size_categories:1M<n<10M | Modality:image | Computer_use | Agents

106K 34

Updated 2026-05-08 Source available

ILSVRC/imagenet-1k HF Unverified

Dataset Card for ImageNet Dataset Summary ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). ImageNet aims to provide on average 1000 images to illustrate each synset. Images of each concept are… See the full description on the dataset page: https://huggingface.co/datasets/ILSVRC/imagenet-1k.

Task_categories:image-Classification | Task_ids:multi-Class-Image-Classification | Annotations_creators:crowdsourced | Language_creators:crowdsourced | Multilinguality:monolingual | Source_datasets:original

103K 791

Updated 2026-05-08 Source available