Datasets

Training datasets with quantum-safe provenance

MedQA-Darija-MultiLingual The largest open trilingual medical Q&A dataset with directly playable speech audio for English, French, and Moroccan Darija. A research dataset for the BRAIN HEALTH initiative, designed for multilingual medical NLP, low-resource speech recognition, healthcare chatbots, and clinical education tools targeting Morocco and the broader Maghreb region. The dataset is currently in its scientific validation phase. After programmatic validation (Stage 1 LOF outlier… See the full description on the dataset page: https://huggingface.co/datasets/Williamsanderson/MedQA-Darija-MultiLingual.

Task_categories:question-Answering, Task_categories:automatic-Speech-Recognition, Task_categories:text-To-Speech, Language:ar, Language:fr, Language:en

112K 4

Updated 2026-06-29 Source available

MMMU/MMMU HF Unverified

MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) 🌐 Homepage | 🏆 Leaderboard | 🤗 Dataset | 🤗 Paper | 📖 arXiv | GitHub 🔔News 🛠️[2026-04-21]: Fixed option issue in test_Psychology_15. ‼️[2026-02-12]: We have released the answers for the test set! You can now evaluate your models on the test set locally! 🎉 🛠️[2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25;… See the full description on the dataset page: https://huggingface.co/datasets/MMMU/MMMU.

Task_categories:question-Answering, Task_categories:visual-Question-Answering, Task_categories:multiple-Choice, Language:en, Size_categories:10K<n<100K, Format:parquet

112K 325

Updated 2026-05-08 Source available

HuggingFaceM4/FineVision HF Unverified

Fine Vision FineVision is a massive collection of datasets with 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens, designed for training state-of-the-art open Vision-Language-Models. More detail can be found in the blog post: https://huggingface.co/spaces/HuggingFaceM4/FineVision Load the data from datasets import load_dataset, get_dataset_config_names # Get all subset names and load the first one available_subsets =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/FineVision.

Size_categories:10M<n<100M, Format:parquet, Modality:image, Modality:text, Library:datasets, Library:dask

108K 498

Updated 2026-06-29 Source available
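
The FineVision excerpt above is cut off partway through its loading snippet. A minimal sketch of the step it describes (list the subset names, then load the first one) might look like the following. The helper names `pick_first_subset` and `load_first_subset`, the `"train"` split, and the choice of streaming mode are assumptions made for illustration; they are not taken from the dataset card.

```python
from typing import Iterable, Optional

# Repository ID as shown in the listing entry above.
REPO_ID = "HuggingFaceM4/FineVision"


def pick_first_subset(names: Iterable[str]) -> Optional[str]:
    """Return the first subset name in the given order, or None if there are none."""
    return next(iter(names), None)


def load_first_subset(repo_id: str = REPO_ID):
    """Stream the first subset of `repo_id` (hypothetical helper).

    Needs network access and the Hugging Face `datasets` package.
    """
    # Imported lazily so pick_first_subset stays usable without the library.
    from datasets import get_dataset_config_names, load_dataset

    subset = pick_first_subset(get_dataset_config_names(repo_id))
    if subset is None:
        raise ValueError(f"{repo_id} exposes no subsets")
    # Assumption: the subset has a "train" split. Streaming avoids
    # downloading the whole subset before the first sample is read.
    return subset, load_dataset(repo_id, name=subset, split="train", streaming=True)
```

Calling `load_first_subset()` returns the chosen subset name together with an iterable dataset; the dataset page remains the authoritative source for the actual subset and split names.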

angie-chen55/python-github-code HF Unverified

Size_categories:1M<n<10M, Format:parquet, Modality:text, Library:datasets, Library:dask, Library:polars

107K 36

Updated 2026-06-29 Source available

sebastiandizon/genius-song-lyrics HF Unverified

Size_categories:1M<n<10M, Format:csv, Modality:tabular, Modality:text, Library:datasets, Library:pandas

106K 33

Updated 2026-06-27 Source available

ServiceNow/GroundCUA HF Unverified

GroundCUA: Grounding Computer Use Agents on Human Demonstrations 🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models GroundCUA Dataset GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow/GroundCUA.

Task_categories:image-To-Text, Language:en, Size_categories:1M<n<10M, Modality:image, Computer_use, Agents

106K 34

Updated 2026-05-08 Source available

ILSVRC/imagenet-1k HF Unverified

Dataset Card for ImageNet Dataset Summary ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). ImageNet aims to provide on average 1000 images to illustrate each synset. Images of each concept are… See the full description on the dataset page: https://huggingface.co/datasets/ILSVRC/imagenet-1k.

Task_categories:image-Classification, Task_ids:multi-Class-Image-Classification, Annotations_creators:crowdsourced, Language_creators:crowdsourced, Multilinguality:monolingual, Source_datasets:original

106K 844

Updated 2026-06-29 Source available

google-research-datasets/paws HF Unverified

Dataset Card for PAWS: Paraphrase Adversaries from Word Scrambling Dataset Summary PAWS: Paraphrase Adversaries from Word Scrambling This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset. For further… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/paws.

Task_categories:text-Classification, Task_ids:semantic-Similarity-Classification, Task_ids:semantic-Similarity-Scoring, Task_ids:text-Scoring, Task_ids:multi-Input-Text-Classification, Annotations_creators:expert-Generated

99K 40

Updated 2026-06-29 Source available

ibrahimhamamci/CT-RATE HF Unverified

The CT-RATE Team organizes the VLM3D Challenge VLM3D 2026 (2nd Edition) → Challenge Finals at MICCAI 2026 VLM3D 2025 (1st Edition) → Challenge Finals at MICCAI 2025 • Workshop at ICCV 2025 The CT-RATE Team is developing the MR-RATE Dataset A large-scale brain MRI dataset with paired radiology reports for training 3D vision-language models. GitHub | Dataset | Metadata Dashboard Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography… See the full description on the dataset page: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE.

Task_categories:image-To-Text, Task_categories:text-To-Image, Task_categories:image-Classification, Task_categories:question-Answering, Task_categories:visual-Question-Answering, Task_categories:zero-Shot-Classification

98K 261

Updated 2026-06-29 Source available

mlfoundations/MINT-1T-PDF-CC-2023-50 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-50.

Task_categories:image-To-Text, Task_categories:text-Generation, Language:en, Size_categories:1M<n<10M, Format:webdataset, Modality:image

98K 13

Updated 2026-05-03 Source available

open-index/open-github HF Unverified

OpenGitHub What is it? This dataset contains every public event on GitHub: every push, pull request, issue, star, fork, code review, release, and discussion across all public repositories. GitHub is the world's largest software development platform, home to over 200 million repositories and the daily work of tens of millions of developers, from individual open-source contributors to the engineering teams behind the most widely used software on earth. The archive currently… See the full description on the dataset page: https://huggingface.co/datasets/open-index/open-github.

Task_categories:text-Generation, Task_categories:text-Classification, Task_categories:feature-Extraction, Language:en, Language:mul, Size_categories:100K<n<1M

95K 9

Updated 2026-06-29 Source available

Idavidrein/gpqa HF Unverified

Dataset Card for GPQA GPQA is a multiple-choice, Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions out of their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending >30m with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.

Benchmark:official, Benchmark:eval-Yaml, Task_categories:question-Answering, Task_categories:text-Generation, Language:en, Size_categories:1K<n<10K

94K 471

Updated 2026-06-29 Source available

fixie-ai/common_voice_17_0 HF Unverified

Size_categories:10M<n<100M, Format:parquet, Modality:audio, Modality:text, Library:datasets, Library:dask

92K 17

Updated 2026-06-29 Source available

AlgorithmicResearchGroup/arxiv_s2orc_parsed HF Unverified

Dataset Card for "ArtifactAI/arxiv_s2orc_parsed" Dataset Description https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv_s2orc_parsed Dataset Summary AlgorithmicResearchGroup/arxiv_s2orc_parsed is a subset of the AllenAI S2ORC dataset, a general-purpose corpus for NLP and text mining research over scientific papers. The dataset is filtered strictly for ArXiv papers, including the full text for each paper. GitHub links have been extracted from each… See the full description on the dataset page: https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv_s2orc_parsed.

Task_categories:text-Generation, Task_categories:zero-Shot-Classification, Language:en, Size_categories:1M<n<10M, Format:parquet, Modality:text

92K 27

Updated 2026-06-27 Source available

Forithmus/MR-RATE HF Unverified

MR-RATE: A Vision-Language Foundation Model and Dataset for Magnetic Resonance Imaging Welcome to the official page for MR-RATE, a pioneering vision-language model and 3D medical imaging dataset that pairs textual reports with brain and spine MRI volumes. Following the approach of CT-RATE, the first 3D medical imaging dataset to pair images with textual reports, MR-RATE offers brain and spine MRI volumes matched with… See the full description on the dataset page: https://huggingface.co/datasets/Forithmus/MR-RATE.

92K 85

Updated 2026-06-29 Source available

bluuebunny/arxiv_metadata_by_year HF Unverified

Dataset Card for Dataset Name This dataset card aims to be a base template for new datasets. It has been generated using this raw template. Dataset Details Dataset Description Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed] Dataset Sources [optional] Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/bluuebunny/arxiv_metadata_by_year.

Language:en, Size_categories:1M<n<10M, Format:parquet, Modality:text, Library:datasets, Library:dask

91K 9

Updated 2026-06-29 Source available

mvp-lab/LLaVA-OneVision-1.5-Instruct-Data HF Unverified

LLaVA-OneVision-1.5 Instruction Data Paper | Code 📌 Introduction This dataset, LLaVA-OneVision-1.5-Instruct, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data.

Task_categories:image-Text-To-Text, Language:en, Size_categories:10M<n<100M, Modality:image, Modality:text, Multimodal

90K 71

Updated 2026-05-08 Source available

echodict/KakologArchives_duplicate HF Unverified

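Niconico Jikkyo Past Log Archive The Niconico Jikkyo Past Log Archive is a dataset that collects every past-log comment from the launch of the Niconico Jikkyo service to the present. In December 2020, Niconico Jikkyo was relaunched as an official channel within Niconico Live. As a result, the old system, in operation since November 2009, was discontinued (effectively ending the service). With support for consumer devices such as torne and BRAVIA being dropped across the board, roughly 11 years of past logs, full of the unfiltered voices of the time, were set to be lost as well. In response, users of the DTV board on 5ch led a project to archive 11 years of past logs for all channels before the old Niconico Jikkyo shut down. After some twists and turns, Nekopanda retrieved a complete set of roughly 11 years of past logs for all channels, including radio and BS, which kept those logs from vanishing into the digital sea. However, since the old API has been discontinued, the past logs … API… See the full description on the dataset page: https://huggingface.co/datasets/echodict/KakologArchives_duplicate.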

Task_categories:text-Classification, Language:ja

88K 0

Updated 2026-06-29 Source available

labofsahil/pypi-packages-metadata-dataset HF Unverified

Size_categories:10M<n<100M, Modality:text

87K 0

Updated 2026-06-29 Source available

Showing 20 of 178 datasets (page 6 of 9)