Datasets

Training datasets with quantum-safe provenance

jat-project/jat-dataset HF Unverified

JAT Dataset. Dataset Description: The Jack of All Trades (JAT) dataset combines a wide range of individual datasets. It includes demonstrations by expert RL agents, image and caption pairs, textual data, and more. The JAT dataset is part of the JAT project, which aims to build a multimodal generalist agent. Paper: https://huggingface.co/papers/2402.09844
Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("jat-project/jat-dataset"… See the full description on the dataset page: https://huggingface.co/datasets/jat-project/jat-dataset.

Task_categories:reinforcement-Learning, Task_categories:text-Generation, Task_categories:question-Answering, Annotations_creators:found, Annotations_creators:machine-Generated, Source_datasets:conceptual-Captions
cais/mmlu HF Unverified

Dataset Card for MMLU. Dataset Summary: Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

Task_categories:question-Answering, Task_ids:multiple-Choice-Qa, Annotations_creators:no-Annotation, Language_creators:expert-Generated, Multilinguality:monolingual, Source_datasets:original
nyu-mll/glue HF PQC Verified

Dataset Card for GLUE. Dataset Summary: GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards: The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax: A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.

Task_categories:text-Classification, Task_ids:acceptability-Classification, Task_ids:natural-Language-Inference, Task_ids:semantic-Similarity-Scoring, Task_ids:sentiment-Classification, Task_ids:text-Scoring
allenai/ai2_arc HF Unverified

Dataset Card for "ai2_arc". Dataset Summary: A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ai2_arc.

Task_categories:question-Answering, Task_ids:open-Domain-Qa, Task_ids:multiple-Choice-Qa, Annotations_creators:found, Language_creators:found, Multilinguality:monolingual
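
The Challenge/Easy partition rule in the ai2_arc description can be sketched as follows. This is only an illustration, not the authors' code: the `Question` record and its two boolean fields are hypothetical stand-ins for the outcomes of the two baseline algorithms, and it assumes every question that is not in the Challenge Set goes to the Easy Set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    # Hypothetical record; the real dataset schema is not shown in this listing.
    qid: str
    retrieval_correct: bool      # did the retrieval-based baseline answer correctly?
    cooccurrence_correct: bool   # did the word co-occurrence baseline answer correctly?


def partition(questions):
    """Split questions into (challenge, easy) using the rule from the description."""
    challenge, easy = [], []
    for q in questions:
        # Challenge Set: answered incorrectly by BOTH baselines.
        if not q.retrieval_correct and not q.cooccurrence_correct:
            challenge.append(q)
        else:
            # Assumption: everything else belongs to the Easy Set.
            easy.append(q)
    return challenge, easy


sample = [
    Question("q1", False, False),
    Question("q2", True, False),
    Question("q3", False, True),
    Question("q4", True, True),
]
challenge, easy = partition(sample)
print([q.qid for q in challenge])  # ['q1']
print([q.qid for q in easy])       # ['q2', 'q3', 'q4']
```

A question that only one baseline gets wrong (q2, q3) stays in the Easy Set, since the rule requires both baselines to fail.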
XDOF/ABC-130k HF Unverified

ABC-130k. ABC-130k is the largest open-source robot teleoperation dataset. It contains bimanual manipulation trajectories collected on two-arm YAM stations. Episodes are distributed as MCAP files, with subtask annotations kept as separate artifacts so they can be revised or extended independently of the underlying episode data. For details on the accompanying paper, see abc.bot. Please see the GitHub repo here for code to train and deploy with this dataset. Dataset… See the full description on the dataset page: https://huggingface.co/datasets/XDOF/ABC-130k.

Task_categories:robotics, Language:en, Size_categories:n>1T, Robotics, Manipulation, Imitation-Learning
HuggingFaceFW/finephrase HF PQC Verified

Dataset Card for HuggingFaceFW/finephrase. Dataset Summary: Synthetic data generated by DataTrove.
Model: HuggingFaceTB/SmolLM2-1.7B-Instruct (main)
Source dataset: HuggingFaceFW/fineweb-edu, config sample-350BT, split train
Generation config: temperature=1.0, top_p=1.0, top_k=50, max_tokens=2048, model_max_context=8192
Speculative decoding: {"method":"suffix","num_speculative_tokens":32}
System prompt: None
Input column: text
Prompt families: faq prompt Rewrite the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finephrase.

Task_categories:text-Generation, Task_ids:language-Modeling, Annotations_creators:machine-Generated, Language_creators:found, Source_datasets:HuggingFaceFW/fineweb-Edu/sample-350BT, Language:en
robbyant/mdm_depth HF Unverified

LingBot-Depth Dataset. Self-curated RGB-D dataset for training LingBot-Depth, a masked depth modeling approach (arxiv:2601.17895). Each sample contains an RGB image, raw sensor depth, and ground truth depth.
Total size: 2.71 TB
Depth scale: millimeters (mm), stored as 16-bit PNG
License: CC BY-NC-SA 4.0
Sub-datasets (Name | Description | Samples):
RobbyReal | Real-world indoor scenes captured with multiple RGB-D cameras | 1,400,000
RobbyVla | Real-world data collected… See the full description on the dataset page: https://huggingface.co/datasets/robbyant/mdm_depth.

Task_categories:depth-Estimation, Language:en, Modality:3d, 3D, 3d, Depth
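
The LingBot-Depth entry above states that depth is stored in millimeters as 16-bit PNG. A minimal sketch of the unit conversion this implies is shown below. The function name is hypothetical, the input is a plain list of already-decoded pixel values (PNG decoding is left out), and how the dataset marks missing depth is not stated in the listing, so no invalid-value handling is assumed.

```python
def depth_mm_to_m(pixels_mm):
    """Convert decoded 16-bit depth values in millimeters to meters.

    Hypothetical helper: `pixels_mm` is a flat list of integers already read
    from a 16-bit PNG by an image library of your choice.
    """
    meters = []
    for value in pixels_mm:
        # A 16-bit sample can only hold values from 0 to 65535.
        if not 0 <= value <= 0xFFFF:
            raise ValueError(f"not a 16-bit depth value: {value}")
        meters.append(value / 1000.0)
    return meters


converted = depth_mm_to_m([0, 500, 1250, 65535])
print(converted)
```

With a millimeter scale, a 16-bit sample can represent at most 65,535 mm, that is about 65.5 m.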
mlfoundations/MINT-1T-PDF-CC-2023-40 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-40.

Task_categories:image-To-Text, Task_categories:text-Generation, Language:en, Size_categories:100B<n<1T, Multimodal
cadene/droid HF Unverified

This dataset was created using LeRobot. DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset. One of the biggest open-source datasets for robotics, with 27,044,326 frames, 92,223 episodes, and 31,308 unique task descriptions in natural language. Ported from TensorFlow Datasets format (2TB) to LeRobotDataset format (400GB) with help from IPEC-COMMUNITY. Visualization: LeRobot. Homepage: Droid. Paper: Arxiv. License: apache-2.0. Dataset Structure: meta/info.json: {… See the full description on the dataset page: https://huggingface.co/datasets/cadene/droid.

Task_categories:robotics, Language:en, Size_categories:10M<n<100M, Modality:video, LeRobot, Openx
HuggingFaceFW/fineweb-edu HF PQC Verified

📚 FineWeb-Edu: 1.3 trillion tokens of the finest educational data the 🌐 web has to offer. Paper: https://arxiv.org/abs/2406.17557 What is it? The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

Task_categories:text-Generation, Language:en, Size_categories:1B<n<10B, Format:parquet, Modality:tabular, Modality:text
SWE-bench/SWE-bench_Multilingual HF Unverified

Language:en, Size_categories:n<1K, Format:parquet, Modality:text, Library:datasets, Library:pandas
PsiBotAI/SynData HF Unverified

SynData (Chinese documentation available). Demo: if the video cannot be displayed in your environment, open it directly: assets/syndata-demo.mp4. 1. Overview: SynData is a next-generation large-scale real-world multimodal dataset newly released by PsiBot. It comprehensively covers key dimensions including vision, language, and action, and provides highly realistic, high-density, and highly usable human data as a solid foundation for embodied intelligence training. Powered by… See the full description on the dataset page: https://huggingface.co/datasets/PsiBotAI/SynData.

Language:en, Size_categories:100K<n<1M, Format:parquet, Modality:3d, Modality:tabular, Modality:text
permutans/arxiv-papers-by-subject HF Unverified

arXiv Papers by Subject. A reorganised version of the nick007x/arxiv-papers dataset, partitioned by subject code, year, and month for efficient selective access. Dataset Description: This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset. Motivation: The original nick007x/arxiv-papers… See the full description on the dataset page: https://huggingface.co/datasets/permutans/arxiv-papers-by-subject.

Task_categories:text-Generation, Task_categories:feature-Extraction, Source_datasets:nick007x/arxiv-Papers, Language:en, Size_categories:1M<n<10M, Arxiv
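
The selective access described for arxiv-papers-by-subject (partitioning by subject code, year, and month) can be sketched as a small pattern builder. The `<subject>/<year>/<month>/` layout with zero-padded months and the function name `partition_patterns` are assumptions made for illustration; the actual directory names should be checked on the dataset page.

```python
from itertools import product


def partition_patterns(subjects, years, months=range(1, 13)):
    """Build glob patterns for the chosen subject/year/month partitions.

    ASSUMPTION: a '<subject>/<year>/<month>/' layout with zero-padded months.
    The real directory structure may differ.
    """
    return [
        f"{subject}/{year}/{month:02d}/*"
        for subject, year, month in product(subjects, years, months)
    ]


patterns = partition_patterns(["cs.LG", "math.CO"], [2023], months=[1, 2])
print(patterns)
# ['cs.LG/2023/01/*', 'cs.LG/2023/02/*', 'math.CO/2023/01/*', 'math.CO/2023/02/*']
```

Patterns like these could then be handed to a downloader that supports path filters, for example the `allow_patterns` argument of `snapshot_download` in `huggingface_hub`, so that only the selected partitions are fetched.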
jat-project/jat-dataset-tokenized HF Unverified

Dataset Card for "jat-dataset-tokenized". More information needed.

Size_categories:10M<n<100M, Format:parquet, Modality:timeseries, Library:datasets, Library:dask, Library:mlcroissant
mlfoundations/MINT-1T-HTML HF Unverified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-HTML.

Task_categories:image-To-Text, Task_categories:text-Generation, Language:en, Size_categories:100M<n<1B, Format:parquet, Modality:text
vyokky/GUI-360 HF Unverified

GUI-360°: A Comprehensive Dataset and Benchmark for Computer-Using Agents. Paper | Code. GUI-360° is a large-scale, comprehensive dataset and benchmark suite designed to advance Computer-Using Agents (CUAs).
🎯 Key Features:
🔢 1.2M+ executed action steps across thousands of trajectories
💼 Popular Windows office applications (Word, Excel, PowerPoint)
📸 Full-resolution screenshots with accessibility metadata
🎨 Multi-modal trajectories with reasoning traces
✅ Both… See the full description on the dataset page: https://huggingface.co/datasets/vyokky/GUI-360.

Task_categories:image-Text-To-Text, Size_categories:1M<n<10M
HuggingFaceM4/the_cauldron HF Unverified

Dataset Card for The Cauldron. Dataset description: The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2. Load the dataset: To load the dataset, install the library datasets with pip install datasets. Then run from datasets import load_dataset; ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d") to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.

Size_categories:1M<n<10M, Format:parquet, Modality:image, Modality:text, Library:datasets, Library:dask
abisee/cnn_dailymail HF Unverified

Dataset Card for CNN Dailymail Dataset. Dataset Summary: The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering. Supported Tasks and Leaderboards: 'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.

Task_categories:summarization, Task_ids:news-Articles-Summarization, Annotations_creators:no-Annotation, Language_creators:found, Multilinguality:monolingual, Source_datasets:original
CohereLabs/xP3x HF Unverified

Dataset Card for xP3x. Dataset Summary: xP3x (Crosslingual Public Pool of Prompts eXtended) is a collection of prompts & datasets across 277 languages & 16 NLP tasks. It contains all of xP3 + much more! It is used for training future contenders of mT0 & BLOOMZ at project Aya @Cohere Labs 🧡. Creation: The dataset can be recreated using instructions available here together with the file in this repository named xp3x_create.py. We provide this version to save processing… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/xP3x.

Task_categories:other, Annotations_creators:expert-Generated, Annotations_creators:crowdsourced, Multilinguality:multilingual, Language:af, Language:ar
HuggingFaceFW/FineWeb HF PQC Verified

15T token dataset of cleaned English web data. Deduplicated and filtered from CommonCrawl, outperforms C4 and RefinedWeb for LLM pretraining.

Dataset, Pretraining, English, 15T tokens CRITICAL
Showing 20 of 178 datasets (page 2 of 9)