Datasets
Training datasets with quantum-safe provenance
MedQA-Darija-MultiLingual The largest open trilingual medical Q&A dataset with directly playable speech audio for English, French, and Moroccan Darija. A research dataset for the BRAIN HEALTH initiative, designed for multilingual medical NLP, low-resource speech recognition, healthcare chatbots, and clinical education tools targeting Morocco and the broader Maghreb region. The dataset is currently in a scientific validation phase. After programmatic validation (Stage 1 LOF outlier… See the full description on the dataset page: https://huggingface.co/datasets/Williamsanderson/MedQA-Darija-MultiLingual.
MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) 🔔News 🛠️[2026-04-21]: Fixed option issue in test_Psychology_15. ‼️[2026-02-12]: We have released the answers for the test set! You can now evaluate your models on the test set locally! 🎉 🛠️[2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25;… See the full description on the dataset page: https://huggingface.co/datasets/MMMU/MMMU.
Fine Vision FineVision is a massive collection of datasets with 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens, designed for training state-of-the-art open Vision-Language-Models. More detail can be found in the blog post: https://huggingface.co/spaces/HuggingFaceM4/FineVision Load the data from datasets import load_dataset, get_dataset_config_names # Get all subset names and load the first one available_subsets =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/FineVision.
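The loading snippet in the FineVision entry is cut off after `available_subsets =`, so its continuation is not known from this listing. The following is only a minimal sketch of the pattern its comment describes (list the subset names, then load the first one), assuming the Hugging Face `datasets` library. The helper name `load_first_subset`, the `"train"` split, and the use of streaming mode are illustrative assumptions, not part of the original card.

```python
from typing import Any, Callable, List, Optional


def load_first_subset(
    repo_id: str,
    list_configs: Optional[Callable[[str], List[str]]] = None,
    load: Optional[Callable[..., Any]] = None,
    split: str = "train",  # assumption: the subset exposes a "train" split
) -> Any:
    """Load the first available subset (config) of a Hub dataset.

    Hypothetical helper: the two callables can be injected for testing;
    by default they come from the Hugging Face `datasets` library.
    """
    if list_configs is None or load is None:
        # Imported lazily so the helper can be exercised without the library.
        from datasets import get_dataset_config_names, load_dataset

        list_configs = list_configs or get_dataset_config_names
        load = load or load_dataset

    # Get all subset names, as in the truncated snippet from the card.
    available_subsets = list_configs(repo_id)
    if not available_subsets:
        raise ValueError(f"No subsets found for {repo_id!r}")

    # Assumption: stream the data instead of downloading the subset up front.
    return load(repo_id, name=available_subsets[0], split=split, streaming=True)
```

With the library installed and network access available, a call such as `load_first_subset("HuggingFaceM4/FineVision")` would return a streaming dataset for the first listed subset. Streaming is chosen here because the collection is described as very large; drop `streaming=True` if a local copy is preferred.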
GroundCUA: Grounding Computer Use Agents on Human Demonstrations GroundCUA Dataset GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow/GroundCUA.
Dataset Card for ImageNet Dataset Summary ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). ImageNet aims to provide on average 1000 images to illustrate each synset. Images of each concept are… See the full description on the dataset page: https://huggingface.co/datasets/ILSVRC/imagenet-1k.
Dataset Card for PAWS: Paraphrase Adversaries from Word Scrambling Dataset Summary This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset. For further… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/paws.
The CT-RATE Team organizes the VLM3D Challenge: VLM3D 2026 (2nd Edition), with Challenge Finals at MICCAI 2026, and VLM3D 2025 (1st Edition), with Challenge Finals at MICCAI 2025 and a workshop at ICCV 2025. The CT-RATE Team is developing the MR-RATE Dataset, a large-scale brain MRI dataset with paired radiology reports for training 3D vision-language models. Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography… See the full description on the dataset page: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE.
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-50.
OpenGitHub What is it? This dataset contains every public event on GitHub: every push, pull request, issue, star, fork, code review, release, and discussion across all public repositories. GitHub is the world's largest software development platform, home to over 200 million repositories and the daily work of tens of millions of developers, from individual open-source contributors to the engineering teams behind the most widely used software on earth. The archive currently… See the full description on the dataset page: https://huggingface.co/datasets/open-index/open-github.
Dataset Card for GPQA GPQA is a multiple-choice Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions outside their own domain (e.g., a physicist answering a chemistry question), these experts reach only 34% accuracy, despite spending more than 30 minutes with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.
Dataset Card for "ArtifactAI/arxiv_s2orc_parsed" Dataset Description https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv_s2orc_parsed Dataset Summary AlgorithmicResearchGroup/arxiv_s2orc_parsed is a subset of the AllenAI S2ORC dataset, a general-purpose corpus for NLP and text mining research over scientific papers. The dataset is filtered strictly for ArXiv papers and includes the full text of each paper. GitHub links have been extracted from each… See the full description on the dataset page: https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv_s2orc_parsed.
MR-RATE: A Vision-Language Foundation Model and Dataset for Magnetic Resonance Imaging Welcome to the official page for MR-RATE, a pioneering vision-language model and 3D medical imaging dataset that pairs textual reports with brain and spine MRI volumes. Following the approach of CT-RATE, the first 3D medical imaging dataset to pair images with textual reports, MR-RATE offers brain and spine MRI volumes matched with… See the full description on the dataset page: https://huggingface.co/datasets/Forithmus/MR-RATE.
Dataset Card for Dataset Name This dataset card is still the unfilled base template for new datasets. Under Dataset Details, the Curated by, Funded by, Shared by, Language(s) (NLP), and License fields all read [More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/bluuebunny/arxiv_metadata_by_year.
LLaVA-OneVision-1.5 Instruction Data 📌 Introduction This dataset, LLaVA-OneVision-1.5-Instruct, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data.
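ニコニコ実況 過去ログアーカイブ (Niconico Jikkyo Past Log Archive) The Niconico Jikkyo Past Log Archive is a dataset that collects every past-log comment from the launch of Niconico Jikkyo to the present. In December 2020, Niconico Jikkyo was relaunched as an official channel within Niconico Live. As a result, the old system, in operation since November 2009, was discontinued (effectively ending the service). Support for consumer devices such as torne and BRAVIA ended across the board, and roughly 11 years of past logs, full of viewers' unfiltered reactions from the time, were set to be lost with it. In response, members of the DTV board on 5ch led a plan to archive 11 years of past logs for all channels before the old Niconico Jikkyo shut down. After some twists and turns, Nekopanda retrieved a complete set of roughly 11 years of past logs for all channels, including radio and BS, which kept those logs from vanishing into the digital sea. However, because the old API has been retired, past logs via the API… See the full description on the dataset page: https://huggingface.co/datasets/echodict/KakologArchives_duplicate.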