Datasets

Training datasets with quantum-safe provenance

truthfulqa/truthful_qa HF Unverified

Dataset Card for truthful_qa. Dataset Summary: TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts.… See the full description on the dataset page: https://huggingface.co/datasets/truthfulqa/truthful_qa.

Task_categories:multiple-Choice, Task_categories:text-Generation, Task_categories:question-Answering, Task_ids:multiple-Choice-Qa, Task_ids:language-Modeling, Task_ids:open-Domain-Qa
HuggingFaceM4/FineVision HF Unverified

Fine Vision FineVision is a massive collection of datasets with 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens, designed for training state-of-the-art open Vision-Language-Models. More detail can be found in the blog post: https://huggingface.co/spaces/HuggingFaceM4/FineVision Load the data from datasets import load_dataset, get_dataset_config_names # Get all subset names and load the first one available_subsets =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/FineVision.

Size_categories:10M<n<100M, Format:parquet, Modality:image, Modality:text, Library:datasets, Library:dask
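The FineVision description mentions listing the subset names and loading the first one. The following is a minimal sketch of that idea, not the dataset card's own snippet (which is truncated above). It assumes the Hugging Face `datasets` library is installed; the helper names `first_subset_name` and `load_first_subset`, the `split="train"` default, and the use of streaming are illustrative assumptions.

```python
from typing import Sequence


def first_subset_name(subset_names: Sequence[str]) -> str:
    """Return the first subset name as listed, or raise if the list is empty."""
    if not subset_names:
        raise ValueError("no subsets were listed for this dataset")
    return subset_names[0]


def load_first_subset(repo_id: str = "HuggingFaceM4/FineVision", split: str = "train"):
    """Hypothetical helper: load the first listed subset of a Hub dataset.

    The split name and streaming mode are assumptions, not values taken
    from the dataset card.
    """
    # Imported lazily so first_subset_name stays usable without the library.
    from datasets import get_dataset_config_names, load_dataset

    names = get_dataset_config_names(repo_id)
    # streaming=True avoids downloading the whole subset up front.
    return load_dataset(repo_id, name=first_subset_name(names), split=split, streaming=True)
```

Calling `load_first_subset()` requires network access to the Hugging Face Hub; the split that actually exists for a given subset should be checked on the dataset page.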
google-research-datasets/paws HF Unverified

Dataset Card for PAWS: Paraphrase Adversaries from Word Scrambling. Dataset Summary: This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset. For further… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/paws.

Task_categories:text-Classification, Task_ids:semantic-Similarity-Classification, Task_ids:semantic-Similarity-Scoring, Task_ids:text-Scoring, Task_ids:multi-Input-Text-Classification, Annotations_creators:expert-Generated
mlfoundations/MINT-1T-PDF-CC-2023-50 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-50.

Task_categories:image-To-Text, Task_categories:text-Generation, Language:en, Size_categories:1M<n<10M, Format:webdataset, Modality:image
roneneldan/TinyStories HF Unverified

Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.

Task_categories:text-Generation, Language:en, Size_categories:1M<n<10M, Format:parquet, Modality:text, Library:datasets
zai-org/LongBench HF Unverified

LongBench is a comprehensive multilingual, multi-task benchmark that aims to measure how well pre-trained language models understand long text. The dataset consists of twenty different tasks, covering key long-text application scenarios such as multi-document QA, single-document QA, summarization, few-shot learning, synthetic tasks, and code completion.

Task_categories:question-Answering, Task_categories:text-Generation, Task_categories:summarization, Task_categories:text-Classification, Language:en, Language:zh
ylecun/mnist HF Unverified

Dataset Card for MNIST. Dataset Summary: The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

Task_categories:image-Classification, Task_ids:multi-Class-Image-Classification, Annotations_creators:expert-Generated, Language_creators:found, Multilinguality:monolingual, Source_datasets:extended|other-Nist
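The split arithmetic in the MNIST summary can be checked with a short sketch. The figures below are the ones quoted in the summary, which refers to the 10,000-image split both as "validation" and as "test". The helper `images_per_class` is a hypothetical name, and the per-class counts assume the perfectly even split the summary describes; the actual files may not be exactly balanced.

```python
# Figures quoted from the MNIST summary.
TOTAL_IMAGES = 70_000
TRAIN_IMAGES = 60_000
TEST_IMAGES = 10_000  # the summary also calls this split "validation"
NUM_CLASSES = 10


def images_per_class(image_count: int, num_classes: int = NUM_CLASSES) -> int:
    """Return the per-class count, ASSUMING a perfectly even split across classes."""
    if image_count % num_classes != 0:
        raise ValueError("image_count is not evenly divisible by num_classes")
    return image_count // num_classes


# The two splits together should account for every image.
assert TRAIN_IMAGES + TEST_IMAGES == TOTAL_IMAGES

per_class = {
    "train": images_per_class(TRAIN_IMAGES),
    "test": images_per_class(TEST_IMAGES),
    "total": images_per_class(TOTAL_IMAGES),
}
print(per_class)  # {'train': 6000, 'test': 1000, 'total': 7000}
```

These values match the 6,000 train and 1,000 test images per class stated in the summary.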
hotpotqa/hotpot_qa HF Unverified

Dataset Card for "hotpot_qa". Dataset Summary: HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason… See the full description on the dataset page: https://huggingface.co/datasets/hotpotqa/hotpot_qa.

Task_categories:question-Answering, Annotations_creators:crowdsourced, Language_creators:found, Multilinguality:monolingual, Source_datasets:original, Language:en
mvp-lab/LLaVA-OneVision-1.5-Instruct-Data HF Unverified

LLaVA-OneVision-1.5 Instruction Data. Introduction: This dataset, LLaVA-OneVision-1.5-Instruct, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data.

Task_categories:image-Text-To-Text, Language:en, Size_categories:10M<n<100M, Modality:image, Modality:text, Multimodal
bio-nlp-umass/MedThinkVQA HF Unverified

MedThinkVQA is an expert-annotated benchmark for multi-image diagnostic reasoning in radiology. Unlike prior medical VQA benchmarks that typically contain at most one image per case, MedThinkVQA requires models to extract evidence from each image, integrate cross-view information, and perform differential-diagnosis reasoning. Links: GitHub: https://github.com/benluwang/MedThinkVQA; Leaderboard: https://benluwang.github.io/MedThinkVQA/; Submission Guide:… See the full description on the dataset page: https://huggingface.co/datasets/bio-nlp-umass/MedThinkVQA.

Task_categories:question-Answering, Task_categories:text-Generation, Language:en, Size_categories:1K<n<10K, Format:parquet, Modality:image
fixie-ai/covost2 HF Unverified

This is a partial copy of the CoVoST2 dataset. The main difference is that the audio data is included in the dataset, which makes usage easier and allows browsing the samples with the HF Dataset Viewer. The drawback of this approach is that all audio samples of the EN_XX subsets are duplicated, which makes the dataset larger. For that reason, not all the data is included: only the validation and test subsets are available, and from the XX_EN subsets only fr, es, and zh-CN are included.

Size_categories:1M<n<10M, Format:parquet, Modality:audio, Modality:text, Library:datasets, Library:dask
nvidia/PhysicalAI-Robotics-GR00T-Teleop-GR1 HF Unverified

Introduction TL;DR: DreamDojo is a generalist robot world model pretrained on 44k hours of human egocentric data, showing unprecedented generalization to diverse objects and environments. Project page: https://dreamdojo-world.github.io/ Paper: https://arxiv.org/abs/2602.06949 Code: https://github.com/NVIDIA/DreamDojo How to Use Check out https://github.com/NVIDIA/DreamDojo Citation @article{gao2026dreamdojo, title={DreamDojo: A Generalist Robot… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-Teleop-GR1.

Size_categories:1M<n<10M, Format:parquet, Modality:tabular, Modality:video, Library:datasets, Library:dask
leosltl/Android-in-the-Wild HF Unverified

Android in the Wild (AITW). This is a mirror of Google's Android in the Wild (AITW) dataset, re-hosted on Hugging Face for easier community access. Original Source: Paper: Android in the Wild: A Large-Scale Dataset for Android Device Control. Original Repository: google-research/google-research/tree/master/android_in_the_wild. Dataset Description: Android in the Wild (AITW) is a large-scale dataset for Android device control. It contains human demonstrations of… See the full description on the dataset page: https://huggingface.co/datasets/leosltl/Android-in-the-Wild.

Task_categories:image-Classification, Task_categories:visual-Question-Answering, Size_categories:100M<n<1B, Android, Mobile, Ui-Automation
CohereLabs/aya_collection HF Unverified

This dataset is uploaded in two places: here and additionally here as 'Aya Collection Language Split.' These datasets are identical in content but differ in structure of upload. This dataset is structured by folders split according to dataset name. The version here instead divides the Aya collection into folders split by language. We recommend you use the language split version if you are only interested in downloading data for a single or smaller set of languages, and this version if you… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/aya_collection.

Task_categories:text-Classification, Task_categories:summarization, Task_categories:translation, Language:ace, Language:afr, Language:amh
japanese-asr/whisper_transcriptions.reazon_speech_all HF Unverified

Size_categories:10M<n<100M, Format:parquet, Modality:audio, Modality:text, Library:datasets, Library:dask
JosephusCheung/GuanacoDataset HF Unverified

Sorry, it's no longer available on Hugging Face. Please reach out to those who have already downloaded it. If you have a copy, please refrain from re-uploading it to Hugging Face. The people here don't deserve it. See also: https://twitter.com/RealJosephus/status/1779913520529707387 GuanacoDataset News: We're heading towards multimodal VQA, with blip2-flan-t5-xxl Alignment to Guannaco 7B LLM. Still under construction: GuanacoVQA weight & GuanacoVQA Dataset Notice: Effective… See the full description on the dataset page: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset.

Task_categories:text-Generation, Task_categories:question-Answering, Language:zh, Language:en, Language:ja, Language:de
allenai/sciq HF Unverified

Dataset Card for "sciq". Dataset Summary: The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided. Supported Tasks and Leaderboards: More Information Needed. Languages: More Information Needed… See the full description on the dataset page: https://huggingface.co/datasets/allenai/sciq.

Task_categories:question-Answering, Task_ids:closed-Domain-Qa, Annotations_creators:no-Annotation, Language_creators:crowdsourced, Multilinguality:monolingual, Source_datasets:original
opendatalab/Sci-Base HF Unverified

Sci-Base: The Largest AI-Ready Scientific Foundation Dataset 🌌 The Sciverse Data Foundation Sciverse is a comprehensive, multi-layered scientific data foundation designed to provide the ultimate data infrastructure for the AI for Science (AI4S) community. As scientific research becomes increasingly data-driven, Sciverse supplies the essential, high-quality data resources required to build robust scientific knowledge systems and accelerate research. Sciverse… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/Sci-Base.

Language:en, Size_categories:1M<n<10M, Format:parquet, Modality:text, Library:datasets, Library:dask
lishaoyong/latex-formulas-80M HF Unverified

For more details, please refer to the TexTeller GitHub repository. IMPORTANT NOTE: The handwritten subset of this dataset was collected entirely from existing open source work, which includes all test sets. If you want to use this subset for your experimental ablation, please filter it yourself based on the latex label of the test set.

Size_categories:10M<n<100M, Format:parquet, Modality:image, Modality:text, Library:datasets, Library:dask
cornell-movie-review-data/rotten_tomatoes HF Unverified

Dataset Card for "rotten_tomatoes". Dataset Summary: Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005. Supported Tasks and Leaderboards: More Information Needed. Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.

Task_categories:text-Classification, Task_ids:sentiment-Classification, Annotations_creators:crowdsourced, Language_creators:crowdsourced, Multilinguality:monolingual, Source_datasets:original
Showing 20 of 126 datasets (page 5 of 7)