Model Hub

Browse PQC-verified AI models, datasets, and tools

leosltl/Android-in-the-Wild HF Unverified

Android in the Wild (AITW): This is a mirror of Google's Android in the Wild (AITW) dataset, re-hosted on Hugging Face for easier community access. Original source paper: Android in the Wild: A Large-Scale Dataset for Android Device Control. Original repository: google-research/google-research/tree/master/android_in_the_wild. Dataset description: Android in the Wild (AITW) is a large-scale dataset for Android device control. It contains human demonstrations of… See the full description on the dataset page: https://huggingface.co/datasets/leosltl/Android-in-the-Wild.

Task_categories:image-Classification, Task_categories:visual-Question-Answering, Size_categories:100M<n<1B, Android, Mobile, Ui-Automation
CohereLabs/aya_collection HF Unverified

This dataset is uploaded in two places: here and additionally here as 'Aya Collection Language Split.' These datasets are identical in content but differ in structure of upload. This dataset is structured by folders split according to dataset name. The version here instead divides the Aya collection into folders split by language. We recommend you use the language split version if you are only interested in downloading data for a single or smaller set of languages, and this version if you… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/aya_collection.

Task_categories:text-Classification, Task_categories:summarization, Task_categories:translation, Language:ace, Language:afr, Language:amh
japanese-asr/whisper_transcriptions.reazon_speech_all HF Unverified

Size_categories:10M<n<100M, Format:parquet, Modality:audio, Modality:text, Library:datasets, Library:dask
JosephusCheung/GuanacoDataset HF Unverified

Sorry, it's no longer available on Hugging Face. Please reach out to those who have already downloaded it. If you have a copy, please refrain from re-uploading it to Hugging Face. The people here don't deserve it. See also: https://twitter.com/RealJosephus/status/1779913520529707387. GuanacoDataset News: We're heading towards multimodal VQA, with blip2-flan-t5-xxl alignment to the Guanaco 7B LLM. Still under construction: GuanacoVQA weight & GuanacoVQA Dataset. Notice: Effective… See the full description on the dataset page: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset.

Task_categories:text-Generation, Task_categories:question-Answering, Language:zh, Language:en, Language:ja, Language:de
nvidia/Alpamayo-1.5-10B HF Unverified

Robotics, Safetensors, Alpamayo1_5, Base_model:nvidia/Cosmos-Reason2-8B, Base_model:finetune:nvidia/Cosmos-Reason2-8B, English | HIGH
allenai/sciq HF Unverified

Dataset Card for "sciq". Dataset Summary: The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided. Supported Tasks and Leaderboards: More Information Needed. Languages: More Information Needed… See the full description on the dataset page: https://huggingface.co/datasets/allenai/sciq.

Task_categories:question-Answering, Task_ids:closed-Domain-Qa, Annotations_creators:no-Annotation, Language_creators:crowdsourced, Multilinguality:monolingual, Source_datasets:original
opendatalab/Sci-Base HF Unverified

Sci-Base: The Largest AI-Ready Scientific Foundation Dataset 🌌. The Sciverse Data Foundation: Sciverse is a comprehensive, multi-layered scientific data foundation designed to provide the ultimate data infrastructure for the AI for Science (AI4S) community. As scientific research becomes increasingly data-driven, Sciverse supplies the essential, high-quality data resources required to build robust scientific knowledge systems and accelerate research. Sciverse… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/Sci-Base.

Language:en, Size_categories:1M<n<10M, Format:parquet, Modality:text, Library:datasets, Library:dask
lishaoyong/latex-formulas-80M HF Unverified

For more details, please refer to the TexTeller GitHub repository. IMPORTANT NOTE: The handwritten subset of this dataset was collected entirely from existing open-source work, which includes all test sets. If you want to use this subset for your experimental ablation, please filter it yourself based on the LaTeX labels of the test set.

Size_categories:10M<n<100M, Format:parquet, Modality:image, Modality:text, Library:datasets, Library:dask
KingTechnician/videomae-small-finetuned-kinetics-xd-violence-binary HF PQC Verified

Video-Classification, Transformers, Safetensors, Videomae, Generated_from_trainer, Base_model:MCG-NJU/videomae-Small-Finetuned-Kinetics | MEDIUM
cornell-movie-review-data/rotten_tomatoes HF Unverified

Dataset Card for "rotten_tomatoes". Dataset Summary: Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales," Proceedings of the ACL, 2005. Supported Tasks and Leaderboards: More Information Needed. Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.

Task_categories:text-Classification, Task_ids:sentiment-Classification, Annotations_creators:crowdsourced, Language_creators:crowdsourced, Multilinguality:monolingual, Source_datasets:original
Skylion007/openwebtext HF Unverified

Dataset Card for "openwebtext". Dataset Summary: An open-source replication of the WebText dataset from OpenAI, which was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. Supported Tasks and Leaderboards: More Information Needed. Languages: More Information Needed. Dataset Structure, Data Instances, plain_text: Size of downloaded dataset files: 13.51 GB. Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Skylion007/openwebtext.

Task_categories:text-Generation, Task_categories:fill-Mask, Task_ids:language-Modeling, Task_ids:masked-Language-Modeling, Annotations_creators:no-Annotation, Language_creators:found
Wan-AI/Wan2.1-T2V-14B HF Unverified

Text-To-Video, Diffusers, Safetensors, T2v, Video generation, English | HIGH
SwayStar123/preprocessed_commoncatalog-cc-by HF Unverified

I also separately provide just the prompts in prompts.json; the keys are the image_id and the values are the generated captions. Captions generated by moondream: vikhyatk/moondream2. Latents generated by SDXL VAE: madebyollin/sdxl-vae-fp16-fix. Embeddings generated by SigLIP: hf-hub:timm/ViT-SO400M-14-SigLIP-384. Original dataset: common-canvas/commoncatalog-cc-by. Latents are f32 and embeddings are f16 bytes. Compute cost: approximately 16x3090 for 3 days.

Language:en, Size_categories:10M<n<100M, Format:parquet, Modality:text, Library:datasets, Library:dask
X779/Danbooruwildcards HF Unverified

This is a set of wildcards for danbooru tags. Artist: Prompts for random artist styles, covering approximately 0.6M different artists. Please select the appropriate version of the collection, ranging from 128 to 5000, based on the model's capabilities. The full version is not recommended, as it includes too many artists with only one image on danbooru or other websites; almost no model can generate a style that corresponds to these artists. Characters: "Characters" is a set of wildcards… See the full description on the dataset page: https://huggingface.co/datasets/X779/Danbooruwildcards.

Task_categories:text-Generation, Language:en, Size_categories:10M<n<100M, Format:text, Modality:text, Library:datasets
MCG-NJU/videomae-base HF Unverified

Video-Classification, Transformers, PyTorch, Safetensors, Videomae, Pretraining | MEDIUM
amphion/Emilia-Dataset HF Unverified

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation. This is the official repository 👑 for the Emilia dataset and the source code for the Emilia-Pipe speech data preprocessing pipeline. News 🔥 2025/02/26: The Emilia-Large dataset, featuring over 200,000 hours of data, is now available! Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the brand-new 114k-hour Emilia-YODAS… See the full description on the dataset page: https://huggingface.co/datasets/amphion/Emilia-Dataset.

Task_categories:text-To-Speech, Task_categories:automatic-Speech-Recognition, Language:zh, Language:en, Language:ja, Language:fr
TianheWu/VisualQuality-R1-7B HF Unverified

Reinforcement-Learning, Safetensors, Qwen2_5_vl, IQA, Reasoning, VLM | HIGH
openvla/openvla-7b-finetuned-libero-object HF Unverified

Image-Text-to-Text, Transformers, Safetensors, Openvla, Feature Extraction, Robotics | HIGH
nyu-mll/blimp HF Unverified

Dataset Card for "blimp". Dataset Summary: BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars. Supported Tasks and Leaderboards: More Information Needed. Languages: More Information… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/blimp.

Task_categories:text-Classification, Task_ids:acceptability-Classification, Annotations_creators:crowdsourced, Language_creators:machine-Generated, Multilinguality:monolingual, Source_datasets:original
SAP/sap-rpt-1-oss HF Unverified

Tabular-Classification, Sap-Rpt-1-Oss, Tabular, Foundation-Model, Deep-Learning, In-Context | MEDIUM