Datasets
Training datasets with quantum-safe provenance
I also separately provide just the prompts in prompts.json: the keys are the image_id, and the values are the generated captions. Captions generated by moondream: vikhyatk/moondream2. Latents generated by SDXL VAE: madebyollin/sdxl-vae-fp16-fix. Embeddings generated by SigLIP: hf-hub:timm/ViT-SO400M-14-SigLIP-384. Original dataset: common-canvas/commoncatalog-cc-by. Latents are f32 bytes and embeddings are f16 bytes. Compute cost: approximately 16x3090 for 3 days.
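As a rough illustration of what "f32 bytes" and "f16 bytes" could mean in practice, here is a minimal sketch that decodes raw buffers with the standard library. The little-endian byte order, the flat (unshaped) layout, and the sample key `example_image_id` are assumptions made for this example; the dataset page does not state them here.

```python
import json
import struct


def decode_f32(buf: bytes) -> list[float]:
    """Interpret raw bytes as little-endian float32 values (assumed layout)."""
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}f", buf))


def decode_f16(buf: bytes) -> list[float]:
    """Interpret raw bytes as little-endian float16 values (assumed layout)."""
    count = len(buf) // 2
    return list(struct.unpack(f"<{count}e", buf))


# Synthetic stand-ins for one record; real records come from the dataset files.
latent_bytes = struct.pack("<4f", 0.5, -1.0, 2.0, 0.25)
embedding_bytes = struct.pack("<3e", 1.0, -0.5, 0.125)

# prompts.json maps image_id -> caption; this key and caption are hypothetical.
prompts = json.loads('{"example_image_id": "a hypothetical caption"}')

latents = decode_f32(latent_bytes)
embedding = decode_f16(embedding_bytes)
caption = prompts["example_image_id"]
```

The decoded values come back as flat lists; reshaping them into a latent tensor would need the shape information from the dataset page.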
This is a set of wildcards for danbooru tags. Artist: Prompts for random artist styles, covering approximately 0.6M different artists. Please select the appropriate version of the collection, ranging from 128 to 5000, based on the model's capabilities. The full version is not recommended, as it includes too many artists with only one image on danbooru or other websites; almost no model can generate a style that corresponds to these artists. Characters: "Characters" is a set of wildcards… See the full description on the dataset page: https://huggingface.co/datasets/X779/Danbooruwildcards.
This is the Wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from Wikipedia along with their DPR embeddings. The Wikipedia articles were split into multiple disjoint text blocks of 100 words, and each block is one passage.
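The passage construction described above can be sketched as follows. This is only an approximation under the assumption that "words" means whitespace-separated tokens; the actual DPR preprocessing may tokenize and clean text differently, and `split_into_passages` is a name invented for this example.

```python
def split_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split text into disjoint, consecutive blocks of at most N words."""
    words = text.split()  # assumption: whitespace tokenization
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# A synthetic 250-word "article" yields two full passages and one shorter tail.
article = " ".join(f"word{i}" for i in range(250))
passages = split_into_passages(article)
passage_lengths = [len(p.split()) for p in passages]
```

Because the blocks are disjoint, joining the passages back together with spaces reproduces the original word sequence.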
Dataset Card for "blimp" Dataset Summary BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars. Supported Tasks and Leaderboards More Information Needed Languages More Information… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/blimp.
Dataset Card for "rotten_tomatoes" Dataset Summary Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005. Supported Tasks and Leaderboards More Information Needed Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
ColliderML: Dataset Release 1 Dataset Description This dataset contains simulated high-energy physics collision events generated using the Open Data Detector (ODD) geometry within the Key4hep and ACTS (A Common Tracking Software) frameworks, representing a generic collider detector similar to those at the HL-LHC. Dataset Summary Collision Energy: 14 TeV (proton-proton) Detector: Open Data Detector (ODD) Simulation: DD4hep + Geant4 + ACTS Format: Apache Parquet… See the full description on the dataset page: https://huggingface.co/datasets/CERN/ColliderML-Release-1.
DartLab Data Structured company data from DART & EDGAR disclosure filings. DART electronic disclosures + EDGAR filings data: 2,700 Korean companies / 970 US companies. What is this? Pre-collected Parquet files from DartLab, a Python library that turns DART (Korea) and EDGAR (US) disclosure filings into one structured company map. This is corporate disclosure data collected from Korea's DART electronic disclosure system and the US SEC EDGAR. This dataset is the data layer behind DartLab. When you run dartlab.Company("005930"), the library automatically downloads the… See the full description on the dataset page: https://huggingface.co/datasets/eddmpython/dartlab-data.
HaGRID Gesture Recognition Subset Dataset Description A curated subset of the HaGRID (Hand Gesture Recognition Image Dataset) containing 24 gesture classes for training gesture recognition models. Dataset Summary Total Images: 19,200; Gesture Classes: 24; Samples per Class: 800; Image Format: JPEG; Average Image Size: ~302 KB. Splits (Split: Images, Percentage) Train: 14,592, 76%; Val: 1,728, 9%; Test: 2,880, 15%. Gesture Classes call… See the full description on the dataset page: https://huggingface.co/datasets/schwein69/hagrid-subset.
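The split figures quoted for the HaGRID subset are internally consistent, which a few lines of arithmetic can confirm. The dictionary keys below are labels chosen for this check, not identifiers taken from the dataset.

```python
# Image counts per split, as quoted in the dataset summary.
splits = {"train": 14592, "val": 1728, "test": 2880}

total_images = sum(splits.values())
expected_total = 24 * 800  # gesture classes x samples per class

# Share of each split as a percentage of all images.
percentages = {name: 100 * count / total_images for name, count in splits.items()}
```

The total matches 24 classes times 800 samples, and the shares come out to the stated 76%, 9%, and 15%.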
LongBench is a comprehensive benchmark for multilingual and multi-task purposes, with the goal to fully measure and evaluate the ability of pre-trained language models to understand long text. This dataset consists of twenty different tasks, covering key long-text application scenarios such as multi-document QA, single-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
Dataset Card for MMLA Ol Pejeta Conservancy Dataset Details This is a dataset containing annotated video frames of Plains zebras collected at the Ol Pejeta Conservancy (OPC) in Kenya using the semi-autonomous WildWing system. The dataset is intended for use in training and evaluating computer vision models for animal detection and classification from drone imagery. It includes frames from various sessions, with annotations indicating the presence of zebras in the… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/mmla_opc.
Dataset Card for mmla-mpala Dataset Details This is a dataset containing annotated video frames of giraffes, Grevy's zebras, and Plains zebras collected at the Mpala Research Center in Kenya. The dataset is intended for use in training and evaluating computer vision models for animal detection and classification from drone imagery. The annotations indicate the presence of animals in the images in YOLO format. The dataset is designed to facilitate research in wildlife… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/mmla_mpala.
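For readers unfamiliar with YOLO-format annotations, here is a small sketch of reading one label line. It assumes the common convention of `class x_center y_center width height` with coordinates normalized to the image size; the class ID, the sample line, and the 1920x1080 frame size are made up for illustration, and the dataset's own class mapping should be taken from its dataset page.

```python
from dataclasses import dataclass


@dataclass
class Box:
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_yolo_line(line: str, image_width: int, image_height: int) -> Box:
    """Convert one normalized YOLO label line into a pixel-space box."""
    class_id, x_center, y_center, width, height = line.split()
    xc = float(x_center) * image_width
    yc = float(y_center) * image_height
    w = float(width) * image_width
    h = float(height) * image_height
    return Box(int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


# Hypothetical label: class 0, centered, a quarter of the width, half the height.
box = parse_yolo_line("0 0.5 0.5 0.25 0.5", image_width=1920, image_height=1080)
```

Converting to corner coordinates like this is often convenient for drawing boxes or computing overlap between detections and annotations.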
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended to be used for hyperparameter tuning. And the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
Horama/animal-200 Raw wildlife image collection covering 199 species (mammals, birds, reptiles), scraped from multiple web sources. Images are organized by species folder and can be used as-is for image classification (species identification) or as input for downstream annotation pipelines (object detection, etc.). For animal detection, see Horama/animal-200-detection dataset. Sources Images were collected from three web sources using dedicated scrapers: Source… See the full description on the dataset page: https://huggingface.co/datasets/Horama/animal-200.
MR-RATE: A Vision-Language Foundation Model and Dataset for Magnetic Resonance Imaging This is the MR-RATE-coreg repository, part of the MR-RATE dataset release. It contains co-registered MRI volumes in which all imaging volumes within each study have been spatially aligned to a common T1-weighted reference frame. For full dataset details, native-space MRI volumes, radiology reports, metadata, and data splits, please… See the full description on the dataset page: https://huggingface.co/datasets/Forithmus/MR-RATE-coreg.
2nd International Chinese Word Segmentation Bakeoff - Data Release Release 1, 2005-11-18 Introduction This directory contains the training, test, and gold-standard data used in the 2nd International Chinese Word Segmentation Bakeoff. Also included is the script used to score the results submitted by the bakeoff participants and the simple segmenter used to generate the baseline and topline data. File List gold/ Contains the gold standard… See the full description on the dataset page: https://huggingface.co/datasets/zeroMN/hanlp_date-zh.
RSCD: Road Surface Condition Dataset Dataset Description The Road Surface Condition Dataset (RSCD) is a large-scale image dataset containing over 1 million images for road surface condition classification. This dataset is designed for training computer vision models to identify and classify various road surface types, moisture conditions, and damage severity levels. Dataset Summary Total Images: ~1,028,000 images Image Format: JPG Use Cases: Road condition… See the full description on the dataset page: https://huggingface.co/datasets/rezzzq/RSCD-1million.
Military Aircraft Detection Dataset Military aircraft detection dataset in COCO and YOLO format. This dataset is synchronized from the original Kaggle dataset: https://www.kaggle.com/datasets/a2015003713/militaryaircraftdetectiondataset
REPID: Rendering Evaluation of Photographic Image Dataset REPID (officially introduced as the Rendering Evaluation of Photographic Image Dataset) is a large-scale benchmark designed for Image Rendering Quality Assessment (IRQA). Unlike traditional Image Quality Assessment (IQA) which focuses on technical degradations like noise or blur, REPID aims to model subjective human aesthetic preferences for different rendering styles of the same scene. Dataset Overview… See the full description on the dataset page: https://huggingface.co/datasets/vsevolodpl/REPID.
Dataset Card for Street View House Numbers Dataset Summary SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem… See the full description on the dataset page: https://huggingface.co/datasets/ufldl-stanford/svhn.