Datasets

Training datasets with quantum-safe provenance

Skylion007/openwebtext HF Unverified

Dataset card for "openwebtext". Dataset summary: An open-source replication of the WebText dataset from OpenAI, which was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. Supported tasks and leaderboards: more information needed. Languages: more information needed. Dataset structure, data instances (plain_text): Size of downloaded dataset files: 13.51 GB. Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Skylion007/openwebtext.

task_categories:text-generation, task_categories:fill-mask, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:found
SwayStar123/preprocessed_commoncatalog-cc-by HF Unverified

I also separately provide just the prompts in prompts.json; the keys are the image_id and the values are the generated captions. Captions generated by moondream: vikhyatk/moondream2. Latents generated by SDXL VAE: madebyollin/sdxl-vae-fp16-fix. Embeddings generated by SigLIP: hf-hub:timm/ViT-SO400M-14-SigLIP-384. Original dataset: common-canvas/commoncatalog-cc-by. Latents are f32 and embeddings are f16 bytes. Compute cost: approximately 16x3090 for 3 days.

language:en, size_categories:10M<n<100M, format:parquet, modality:text, library:datasets, library:dask
X779/Danbooruwildcards HF Unverified

This is a set of wildcards for danbooru tags. Artist: Prompts for random artist styles, covering approximately 0.6M different artists. Please select the appropriate version of the collection, ranging from 128 to 5000, based on the model's capabilities. The full version is not recommended, as it includes too many artists with only one image on danbooru or other websites; almost no model can generate a style that corresponds to these artists. Characters: "Characters" is a set of wildcards… See the full description on the dataset page: https://huggingface.co/datasets/X779/Danbooruwildcards.

task_categories:text-generation, language:en, size_categories:10M<n<100M, format:text, modality:text, library:datasets
amphion/Emilia-Dataset HF Unverified

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation. This is the official repository for the Emilia dataset and the source code for the Emilia-Pipe speech data preprocessing pipeline. News, 2025/02/26: The Emilia-Large dataset, featuring over 200,000 hours of data, is now available. Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the brand-new 114k-hour Emilia-YODAS… See the full description on the dataset page: https://huggingface.co/datasets/amphion/Emilia-Dataset.

task_categories:text-to-speech, task_categories:automatic-speech-recognition, language:zh, language:en, language:ja, language:fr
nyu-mll/blimp HF Unverified

Dataset card for "blimp". Dataset summary: BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars. Supported tasks and leaderboards: more information needed. Languages: More Information… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/blimp.

task_categories:text-classification, task_ids:acceptability-classification, annotations_creators:crowdsourced, language_creators:machine-generated, multilinguality:monolingual, source_datasets:original
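
Minimal pairs like those in BLiMP are commonly scored by checking whether a model prefers the acceptable sentence over the unacceptable one. The sketch below illustrates that idea only; it is not the dataset authors' evaluation code. The example sentences and the `toy_score` function are made up for illustration, and in practice `score` would be something like a language model's log-probability for the sentence.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the
    acceptable sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    # Hypothetical stand-in for a model score: it only knows
    # that "cats is" is an agreement error.
    return -1.0 if "cats is" in sentence else 0.0


# Made-up pairs, not taken from BLiMP.
toy_pairs = [
    ("The cats are asleep.", "The cats is asleep."),
    ("The dog barks.", "The dog bark."),
]

accuracy = minimal_pair_accuracy(toy_pairs, toy_score)
print(accuracy)  # 0.5: the toy scorer ties on the second pair
```

Ties count as errors here, which is a design choice for the sketch rather than something the dataset card specifies.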
CERN/ColliderML-Release-1 HF Unverified

ColliderML: Dataset Release 1. Dataset description: This dataset contains simulated high-energy physics collision events generated using the Open Data Detector (ODD) geometry within the Key4hep and ACTS (A Common Tracking Software) frameworks, representing a generic collider detector similar to those at the HL-LHC. Dataset summary: Collision energy: 14 TeV (proton-proton). Detector: Open Data Detector (ODD). Simulation: DD4hep + Geant4 + ACTS. Format: Apache Parquet… See the full description on the dataset page: https://huggingface.co/datasets/CERN/ColliderML-Release-1.

task_categories:other, size_categories:10M<n<100M, format:parquet, modality:timeseries, library:datasets, library:dask
echodict/KakologArchives_duplicate HF Unverified

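Niconico Jikkyo Past Log Archive. The Niconico Jikkyo Past Log Archive is a dataset that collects all past log comments from Niconico Jikkyo, from the launch of the service to the present. In December 2020, Niconico Jikkyo was relaunched as an official channel within Niconico Live. As a result, the old system, in operation since November 2009, was discontinued (effectively the end of the service). With support for home appliances such as torne and BRAVIA ending across the board, roughly 11 years of past logs, full of the unfiltered voices of the time, were set to be lost as well. In response, residents of the DTV board on 5ch led a plan to archive 11 years of past logs for all channels before the old Niconico Jikkyo shut down. After some twists and turns, Nekopanda retrieved the complete past logs for all channels, including radio and BS, covering roughly 11 years, so the logs were saved from vanishing into the digital sea. However, because the old API has been discontinued, past logs through the API… See the full description on the dataset page: https://huggingface.co/datasets/echodict/KakologArchives_duplicate.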

task_categories:text-classification, language:ja
ceval/ceval-exam HF Unverified

C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended to be used for hyperparameter tuning. And the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.

task_categories:text-classification, task_categories:multiple-choice, task_categories:question-answering, language:zh, size_categories:10K<n<100K, format:parquet
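
The C-Eval description says each subject's dev split holds five exemplars for few-shot evaluation. A minimal sketch of how such exemplars might be assembled into a multiple-choice prompt follows. The field names (`question`, `A` to `D`, `answer`), the prompt layout, and the sample questions are assumptions for illustration, not the official C-Eval format or data.

```python
def format_example(example: dict, with_answer: bool = True) -> str:
    """Render one multiple-choice item as plain text."""
    lines = [example["question"]]
    for label in "ABCD":
        lines.append(f"{label}. {example[label]}")
    # Leave the answer slot open for the question being evaluated.
    lines.append(f"Answer: {example['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(dev_examples: list, target: dict, k: int = 5) -> str:
    """Prepend up to k solved dev exemplars to the unsolved target item."""
    shots = [format_example(ex) for ex in dev_examples[:k]]
    return "\n\n".join(shots + [format_example(target, with_answer=False)])


# Made-up items in the assumed schema.
dev = [
    {"question": "2 + 2 = ?", "A": "3", "B": "4", "C": "5", "D": "6", "answer": "B"},
    {"question": "3 * 3 = ?", "A": "9", "B": "6", "C": "12", "D": "3", "answer": "A"},
]
target = {"question": "10 - 7 = ?", "A": "1", "B": "2", "C": "3", "D": "4"}

prompt = build_few_shot_prompt(dev, target)
print(prompt)
```

The model's continuation after the final "Answer:" would then be compared with the gold label.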
Horama/animal-200 HF Unverified

Horama/animal-200. Raw wildlife image collection covering 199 species (mammals, birds, reptiles), scraped from multiple web sources. Images are organized by species folder and can be used as-is for image classification (species identification) or as input for downstream annotation pipelines (object detection, etc.). For animal detection, see the Horama/animal-200-detection dataset. Sources: Images were collected from three web sources using dedicated scrapers: Source… See the full description on the dataset page: https://huggingface.co/datasets/Horama/animal-200.

task_categories:image-classification, task_categories:object-detection, language:en, language:fr, size_categories:10K<n<100K, modality:image
Benjy/typed_digital_signatures HF Unverified

Typed Digital Signatures Dataset. This comprehensive dataset contains synthetic digital signatures rendered across 30 different Google Fonts, specifically selected for their handwriting and signature-style characteristics. Each font contributes unique stylistic elements, making this dataset ideal for robust signature analysis and font recognition tasks. Dataset overview: Total fonts: 30 different Google Fonts. Images per font: 3,000 signatures. Total dataset size: ~90,000… See the full description on the dataset page: https://huggingface.co/datasets/Benjy/typed_digital_signatures.

task_categories:image-classification, task_categories:zero-shot-image-classification, task_categories:image-feature-extraction, language:en, size_categories:10K<n<100K, modality:image
rezzzq/RSCD-1million HF Unverified

RSCD: Road Surface Condition Dataset. Dataset description: The Road Surface Condition Dataset (RSCD) is a large-scale image dataset containing over 1 million images for road surface condition classification. This dataset is designed for training computer vision models to identify and classify various road surface types, moisture conditions, and damage severity levels. Dataset summary: Total images: ~1,028,000. Image format: JPG. Use cases: Road condition… See the full description on the dataset page: https://huggingface.co/datasets/rezzzq/RSCD-1million.

task_categories:image-classification, size_categories:1M<n<10M, road-condition, surface-classification, computer-vision, autonomous-driving
a2015003713/military-aircraft-detection-dataset HF Unverified

Military Aircraft Detection Dataset. Military aircraft detection dataset in COCO and YOLO format. This dataset is synchronized from the original Kaggle dataset: https://www.kaggle.com/datasets/a2015003713/militaryaircraftdetectiondataset

task_categories:object-detection, task_categories:image-classification, task_categories:image-feature-extraction, size_categories:10K<n<100K, format:text, modality:image
vsevolodpl/REPID HF Unverified

REPID: Rendering Evaluation of Photographic Image Dataset. REPID (officially introduced as the Rendering Evaluation of Photographic Image Dataset) is a large-scale benchmark designed for Image Rendering Quality Assessment (IRQA). Unlike traditional Image Quality Assessment (IQA), which focuses on technical degradations like noise or blur, REPID aims to model subjective human aesthetic preferences for different rendering styles of the same scene. Dataset Overview… See the full description on the dataset page: https://huggingface.co/datasets/vsevolodpl/REPID.

task_categories:image-classification, task_categories:image-to-image, task_categories:other, size_categories:10K<n<100K, format:csv, modality:image
ufldl-stanford/svhn HF Unverified

Dataset card for Street View House Numbers. Dataset summary: SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real-world problem… See the full description on the dataset page: https://huggingface.co/datasets/ufldl-stanford/svhn.

task_categories:image-classification, task_categories:object-detection, annotations_creators:machine-generated, annotations_creators:expert-generated, language_creators:machine-generated, multilinguality:monolingual
codraja2006/tomato-leaves-dataset HF Unverified

Tomato Leaves Dataset. Overview: This dataset contains images of tomato leaves categorized into different classes based on the type of disease or health condition. The dataset is divided into training, validation, and test sets, with a ratio of 8:1:1. The classes include various diseases as well as healthy leaves. The dataset includes both augmented and non-augmented images. Dataset structure: The dataset is organized into three main splits: train, validation, test… See the full description on the dataset page: https://huggingface.co/datasets/codraja2006/tomato-leaves-dataset.

task_categories:feature-extraction, task_categories:image-classification, language:en, size_categories:n<1K, modality:image, tomato
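
The tomato leaves entry mentions an 8:1:1 train/validation/test ratio. The dataset ships already split, so the following is only a sketch of how a split at that ratio could be produced; the file names, the seed, and the choice to send any remainder to the test set are assumptions made for illustration.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/validation/test at 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # any remainder lands in test
    return train, val, test


# Hypothetical file names standing in for leaf images.
paths = [f"leaf_{i:03d}.jpg" for i in range(100)]
train, val, test = split_8_1_1(paths)
print(len(train), len(val), len(test))  # 80 10 10
```

For a classification dataset, splitting per class (stratifying) would keep the class balance similar across the three sets.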
ChengyouJia/agentic-critic-dataset HF Unverified

Agentic Critic Dataset. High-quality AIGC images with rich metadata for aesthetic evaluation. Metadata fields: each entry in metadata.jsonl contains prompt (positive prompt), negative_prompt (negative prompt), model (model name and hash), sampler (sampling method), steps (generation steps), cfg_scale (CFG scale), seed (random seed), stats (engagement metrics), and image_path (relative path to image). Usage: from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/ChengyouJia/agentic-critic-dataset.

task_categories:image-classification, task_categories:text-to-image, size_categories:n<1K, aigc, civitai, aesthetic
Ahnuf/Military_Aircraft_Detection_Classification_Image_Dataset HF Unverified

Military Aircraft Detection & Classification Dataset: 88 Classes with Advanced Background Suppression. Overview: This dataset is a professionally curated resource for training high-performance object detection and image classification models such as YOLOv11. It contains 88 distinct military aircraft classes and is explicitly designed for real-world deployment, where false positives from civilian aircraft, birds, and small drones are common. To address this, the… See the full description on the dataset page: https://huggingface.co/datasets/Ahnuf/Military_Aircraft_Detection_Classification_Image_Dataset.

task_categories:object-detection, task_categories:image-classification, military, aircraft, aerospace, yolo
Forithmus/MR-RATE HF Unverified

MR-RATE: A Vision-Language Foundation Model and Dataset for Magnetic Resonance Imaging. Welcome to the official page for MR-RATE, a pioneering vision-language model and 3D medical imaging dataset that pairs textual reports with brain and spine MRI volumes. Following the approach of CT-RATE, the first 3D medical imaging dataset to pair images with textual reports, MR-RATE offers brain and spine MRI volumes matched with… See the full description on the dataset page: https://huggingface.co/datasets/Forithmus/MR-RATE.

task_categories:image-to-text, task_categories:text-to-image, task_categories:image-classification, task_categories:question-answering, task_categories:visual-question-answering, task_categories:zero-shot-classification
zalando-datasets/fashion_mnist HF Unverified

Dataset card for FashionMNIST. Dataset summary: Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing… See the full description on the dataset page: https://huggingface.co/datasets/zalando-datasets/fashion_mnist.

task_categories:image-classification, task_ids:multi-class-image-classification, annotations_creators:expert-generated, language_creators:found, multilinguality:monolingual, source_datasets:original
deepguess/tornet-temporal HF Unverified

TorNet-Temporal: Temporal Dual-Pol NEXRAD Radar for Tornado Detection. A large-scale dataset of storm-centered NEXRAD WSR-88D radar sequences for tornado detection and prediction, featuring 24-channel dual-polarimetric data across variable-length temporal sequences. Dataset summary: 24,862 storm events from NEXRAD Level-II radar archives (2013-2022); 8-22 consecutive radar scans per event (~4-5 min cadence, ~45-90 min total; median 13 frames); 24 channels: 6 dual-pol radar… See the full description on the dataset page: https://huggingface.co/datasets/deepguess/tornet-temporal.

task_categories:image-classification, task_categories:video-classification, size_categories:10K<n<100K, weather, radar, tornado