Datasets

Training datasets with quantum-safe provenance

mlfoundations/MINT-1T-PDF-CC-2023-06 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-06.

Task_categories:image-To-Text | Task_categories:text-Generation | Language:en | Size_categories:100B<n<1T | Multimodal
Rowan/hellaswag HF Unverified

Dataset Card for "hellaswag" Dataset Summary HellaSwag: Can a Machine Really Finish Your Sentence? is a new dataset for commonsense NLI. A paper was published at ACL2019. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 71.49 MB Size of the generated dataset: 65.32 MB Total amount of disk used: 136.81… See the full description on the dataset page: https://huggingface.co/datasets/Rowan/hellaswag.

Language:en | Size_categories:10K<n<100K | Format:parquet | Modality:text | Library:datasets | Library:pandas
allenai/objaverse HF Unverified

Objaverse Objaverse is a Massive Dataset with 800K+ Annotated 3D Objects. More documentation is coming soon. In the meantime, please see our paper and website for additional details. License The use of the dataset as a whole is licensed under the ODC-By v1.0 license. Individual objects in Objaverse are all licensed as creative commons distributable objects, and may be under the following licenses: CC-BY 4.0 - 721K objects CC-BY-NC 4.0 - 25K objects CC-BY-NC-SA 4.0 - 52K… See the full description on the dataset page: https://huggingface.co/datasets/allenai/objaverse.

Language:en
openai/openai_humaneval HF Unverified

Dataset Card for OpenAI HumanEval Dataset Summary The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure not to be included in the training set of code generation models. Supported Tasks and Leaderboards Languages The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

Annotations_creators:expert-Generated | Language_creators:expert-Generated | Multilinguality:monolingual | Source_datasets:original | Language:en | Size_categories:n<1K
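
The HumanEval description above says each problem bundles a function signature, docstring, body, and unit tests. A minimal sketch of how such a record might be checked follows. The toy problem, the dictionary keys, and the `passes` helper are assumptions made for illustration and are not taken from the dataset; `exec` should only ever be run on trusted or sandboxed code.

```python
# Toy record shaped like the description: signature + docstring, body, unit tests.
# The problem and the key names are invented for illustration only.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "body": "    return a + b\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes(problem, body):
    """Return True if prompt + body passes the problem's unit tests."""
    namespace = {}
    try:
        # Define the candidate function, then the test harness.
        exec(problem["prompt"] + body, namespace)
        exec(problem["test"], namespace)
        # Run the unit tests against the candidate function.
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        return False
    return True


print(passes(problem, problem["body"]))      # reference body
print(passes(problem, "    return a - b\n"))  # deliberately wrong body
```

A generated completion would be passed in place of the reference body, so the same check scores both.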
Kazimir-ai/text-to-image-prompts HF Unverified

The dataset of the most popular text-to-image prompts. Dataset Details Dataset Description Curated by: kazimir.ai Funded by [optional]: [More Information Needed] Shared by [optional]: https://kazimir.ai License: apache-2.0 Dataset Sources [optional] Repository: [More Information Needed] Paper [optional]: [More Information Needed] Demo [optional]: [More Information Needed] Uses Free to use. Dataset Structure CSV file… See the full description on the dataset page: https://huggingface.co/datasets/Kazimir-ai/text-to-image-prompts.

Language:en | Size_categories:10K<n<100K | Format:csv | Modality:text | Library:datasets | Library:pandas
labelmaker/arkit_labelmaker HF Unverified

ARKit Labelmaker: A New Scale for Indoor 3D Scene Understanding [arxiv] [website] [checkpoints] [code] We complement ARKitScenes dataset with dense semantic annotations that are automatically generated at scale. This produces the first large-scale, real-world 3D dataset with dense semantic annotations. Training on this auto-generated data, we push forward the state-of-the-art performance on ScanNet and ScanNet200 with prevalent 3D semantic segmentation models.

Task_categories:image-Segmentation | Language:en | Size_categories:1K<n<10K | Doi:10.57967/hf/2389 | 3D semantic segmentation | Indoor 3D scene dataset
WINGNUS/ACL-OCL HF Unverified

Dataset Card for ACL Anthology Corpus This repository provides full-text and metadata to the ACL anthology collection (80k articles/posters as of September 2022) also including .pdf files and grobid extractions of the pdfs. How is this different from what ACL anthology provides and what already exists? We provide pdfs, full-text, references and other details extracted by grobid from the PDFs while ACL Anthology only provides abstracts. There exists a similar corpus… See the full description on the dataset page: https://huggingface.co/datasets/WINGNUS/ACL-OCL.

Task_categories:token-Classification | Language_creators:found | Multilinguality:monolingual | Source_datasets:original | Language:en | Size_categories:10K<n<100K
stanford-vision-lab/gpic HF Unverified

GPIC: A Giant Permissive Image Corpus for Visual Generation Keshigeyan Chandrasegaran*1, Kyle Sargent*1, Suchir Agarwal1, Michael Jang1, Michael Poli1,2, Juan Carlos Niebles1,4, Justin Johnson3, Jiajun Wu1, Li Fei-Fei1 1 Stanford University 2 Radical Numerics 3 University of Michigan 4 Salesforce… See the full description on the dataset page: https://huggingface.co/datasets/stanford-vision-lab/gpic.

Language:en
jsulz/FIFA23 HF Unverified

About this dataset Context The datasets provided include the players data for the Career Mode from FIFA 15 to FIFA 23. The data allows multiple comparisons for the same players across the last 9 versions of the video game. Some ideas of possible analysis: Historical comparison between Messi and Ronaldo (what skill attributes changed the most during time - compared to real-life stats); Ideal budget to create a competitive team (at the level of top n teams in Europe) and… See the full description on the dataset page: https://huggingface.co/datasets/jsulz/FIFA23.

Task_categories:tabular-Classification | Task_categories:tabular-Regression | Language:en | Size_categories:10M<n<100M | Modality:tabular | Tabular
HuggingFaceFW/finepdfs_lang_classification HF Unverified

Size_categories:1M<n<10M | Format:parquet | Modality:tabular | Library:datasets | Library:pandas | Library:mlcroissant
Salesforce/GiftEvalPretrain HF Unverified

GIFT-Eval Pre-training Datasets Pretraining dataset aligned with GIFT-Eval that has 71 univariate and 17 multivariate datasets, spanning seven domains and 13 frequencies, totaling 4.5 million time series and 230 billion data points. Notably this collection of data has no leakage issue with the train/test split and can be used to pretrain foundation models that can be fairly evaluated on GIFT-Eval. 📄 Paper 🖥️ Code 📔 Blog Post 🏎️ Leader Board Ethical Considerations… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/GiftEvalPretrain.

Task_categories:time-Series-Forecasting | Size_categories:1M<n<10M | Modality:timeseries | Timeseries | Forecasting | Benchmark
nvidia/SAGE-10k HF Unverified

SAGE-10k SAGE-10k is a large-scale interactive indoor scene dataset featuring realistic layouts, generated by the agentic-driven pipeline introduced in "SAGE: Scalable Agentic 3D Scene Generation for Embodied AI". The dataset contains 10,000 diverse scenes spanning 50 room types and styles, along with 565K uniquely generated 3D objects. 🔑 Key Features SAGE-10k integrates a wide variety of scenes, and particularly, preserves small items for… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/SAGE-10k.

Task_categories:text-To-3d | Language:en | Size_categories:10K<n<100K | Scene-Generation | Interactive-Scenes | Embodied-AI
InternRobotics/OmniWorld HF Unverified

[ICLR 2026] OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling 🎉NEWS [2026.3.21] 🔥 OmniWorld-Game with Metric Scale is now released! Check out our latest model Pi3X (an enhanced version of Pi3), which leverages this data to achieve better performance! [2026.1.26] 🎉 OmniWorld was accepted by ICLR 2026! [2026.1.7] Update OmniWorld-Game, release RH20T-Robot, RH20T-Human, Ego-Exo4D, EgoDex, Epic-Kitchens. [2025.11.11] The OmniWorld is… See the full description on the dataset page: https://huggingface.co/datasets/InternRobotics/OmniWorld.

Task_categories:text-To-Video | Task_categories:image-To-Video | Task_categories:image-To-3d | Task_categories:robotics | Task_categories:other | Language:en
allenai/winogrande HF Unverified

Dataset Card for "winogrande" Dataset Summary WinoGrande is a new collection of 44k problems, inspired by Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and robustness against the dataset-specific bias. Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. Supported Tasks and Leaderboards More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.

Language:en | Size_categories:10K<n<100K | Format:parquet | Modality:text | Library:datasets | Library:pandas
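
The WinoGrande description above frames each problem as a fill-in-a-blank sentence with two candidate options. A minimal sketch of that format follows. The record, its key names, and the helper functions are hypothetical and written for illustration; the sentence is not taken from the dataset.

```python
# Hypothetical record: key names and sentence are illustrative assumptions.
example = {
    "sentence": "The trophy did not fit in the suitcase because the _ was too big.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "1",  # "1" or "2", naming the correct option
}


def fill_blank(sentence, option):
    """Substitute one candidate option for the single '_' placeholder."""
    return sentence.replace("_", option, 1)


def is_correct(example, predicted):
    """Compare a predicted option label ("1" or "2") with the gold answer."""
    return predicted == example["answer"]


# Build both candidate sentences; a model would score each and pick one.
for key in ("option1", "option2"):
    print(fill_blank(example["sentence"], example[key]))
```

Accuracy over a set of such records is then the fraction for which `is_correct` returns `True`.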
OpenGVLab/GUI-Odyssey HF Unverified

Dataset Card for GUI Odyssey News⭐️ A new and improved version of the GUIOdyssey dataset has been released! 🎉🎉 👉 Please use the latest version and refer to the updated README for the most up-to-date information. We highly recommend using the new version for all training and evaluation! Repository: https://github.com/OpenGVLab/GUI-Odyssey Latest Version of Dataset: hflqf88888/GUIOdyssey Paper: https://arxiv.org/pdf/2406.08451 Introduction GUI Odyssey is… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/GUI-Odyssey.

Language:en | Size_categories:1K<n<10K | Format:json | Modality:image | Modality:tabular | Modality:text
PleIAs/common_corpus HF Unverified

Common Corpus Full paper - ICLR 2026 oral Common Corpus is the largest open and permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

Language:en | Language:fr | Language:de | Language:zh | Language:it | Language:es
artur-muratov/multilingual-speech-commands-15lang HF Unverified

Multilingual Speech Commands Dataset (15 Languages, Augmented) This dataset contains augmented speech command samples in 15 languages, derived from multiple public datasets. Only commands that overlap with the Google Speech Commands (GSC) vocabulary are included, making the dataset suitable for multilingual keyword spotting tasks aligned with GSC-style classification. Audio samples have been augmented using standard audio techniques to improve model robustness (e.g., time-shifting… See the full description on the dataset page: https://huggingface.co/datasets/artur-muratov/multilingual-speech-commands-15lang.

Language:en | Language:ru | Language:kk | Language:tt | Language:ar | Language:tr
jasperai/monet HF Unverified

Dataset Card for MONET MONET (Massive, Open, Non-redundant and Enriched Text-to-image dataset) is a large-scale, curated image-text dataset designed for training text-to-image (T2I) systems. It contains 104.9 million high-quality image-text pairs distilled from 2.9 billion raw pairs across nine heterogeneous open sources (6 real and 3 synthetic) through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with… See the full description on the dataset page: https://huggingface.co/datasets/jasperai/monet.

Task_categories:text-To-Image | Task_categories:image-Feature-Extraction | Task_categories:zero-Shot-Image-Classification | Language:en | Size_categories:100M<n<1B | Multimodal
anon8231489123/ShareGPT_Vicuna_unfiltered HF Unverified

Further cleaning done. Please look through the dataset and ensure that I didn't miss anything. Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c Two choices: Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.

Language:en
google-research-datasets/mbpp HF Unverified

Dataset Card for Mostly Basic Python Problems (mbpp) Dataset Summary The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

Annotations_creators:crowdsourced | Annotations_creators:expert-Generated | Language_creators:crowdsourced | Language_creators:expert-Generated | Multilinguality:monolingual | Source_datasets:original
Showing 20 of 178 datasets (page 3 of 9)