Model Hub
Browse PQC-verified AI models, datasets, and tools
Android in the Wild (AITW) This is a mirror of Google's Android in the Wild (AITW) dataset, re-hosted on Hugging Face for easier community access. Original Source Paper: Android in the Wild: A Large-Scale Dataset for Android Device Control Original Repository: google-research/google-research/tree/master/android_in_the_wild Dataset Description Android in the Wild (AITW) is a large-scale dataset for Android device control. It contains human demonstrations of… See the full description on the dataset page: https://huggingface.co/datasets/leosltl/Android-in-the-Wild.
This dataset is uploaded in two places: here and additionally here as 'Aya Collection Language Split.' These datasets are identical in content but differ in structure of upload. This dataset is structured by folders split according to dataset name. The version here instead divides the Aya collection into folders split by language. We recommend you use the language split version if you are only interested in downloading data for a single or smaller set of languages, and this version if you… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/aya_collection.
Sorry, it's no longer available on Hugging Face. Please reach out to those who have already downloaded it. If you have a copy, please refrain from re-uploading it to Hugging Face. The people here don't deserve it. See also: https://twitter.com/RealJosephus/status/1779913520529707387 GuanacoDataset News: We're heading towards multimodal VQA, with blip2-flan-t5-xxl Alignment to Guannaco 7B LLM. Still under construction: GuanacoVQA weight & GuanacoVQA Dataset Notice: Effective… See the full description on the dataset page: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset.
Dataset Card for "sciq" Dataset Summary The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed… See the full description on the dataset page: https://huggingface.co/datasets/allenai/sciq.
Sci-Base: The Largest AI-Ready Scientific Foundation Dataset 🌌 The Sciverse Data Foundation Sciverse is a comprehensive, multi-layered scientific data foundation designed to provide the ultimate data infrastructure for the AI for Science (AI4S) community. As scientific research becomes increasingly data-driven, Sciverse supplies the essential, high-quality data resources required to build robust scientific knowledge systems and accelerate research. Sciverse… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/Sci-Base.
For more details, please refer to the TexTeller GitHub repository. Important note: the handwritten subset of this dataset was collected entirely from existing open source work, which includes all test sets. If you want to use this subset for your experimental ablation, please filter it yourself based on the LaTeX labels of the test set.
Dataset Card for "rotten_tomatoes" Dataset Summary Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales," Proceedings of the ACL, 2005. Supported Tasks and Leaderboards More Information Needed Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
Dataset Card for "openwebtext" Dataset Summary An open-source replication of the WebText dataset from OpenAI, that was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances plain_text Size of downloaded dataset files: 13.51 GB Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Skylion007/openwebtext.
I also separately provide just the prompts in prompts.json: the keys are the image_id values and the values are the generated captions. Captions generated by moondream: vikhyatk/moondream2 Latents generated by SDXL VAE: madebyollin/sdxl-vae-fp16-fix Embeddings generated by SigLIP: hf-hub:timm/ViT-SO400M-14-SigLIP-384 Original dataset: common-canvas/commoncatalog-cc-by Latents are f32 and embeddings are f16 bytes. Compute cost: approximately 16x3090 for 3 days.
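As a rough illustration of the layout described in the entry above, the following sketch parses a prompts.json-style mapping and decodes raw f16 and f32 byte buffers using only the Python standard library. The sample image_id, caption, and numeric values are invented for the example, and the little-endian byte order is an assumption; the entry does not state the byte order, the tensor shapes, or how the latent and embedding files are named.

```python
import json
import struct


def load_prompts(json_text):
    """Parse prompts.json content: a JSON object mapping image_id -> caption."""
    prompts = json.loads(json_text)
    if not isinstance(prompts, dict):
        raise ValueError("expected a JSON object keyed by image_id")
    return prompts


def decode_floats(raw, fmt):
    """Decode raw bytes into a list of Python floats.

    fmt is a struct code: "e" for f16 (2 bytes), "f" for f32 (4 bytes).
    Little-endian order is an assumption, not something the entry states.
    """
    size = struct.calcsize("<" + fmt)
    if len(raw) % size != 0:
        raise ValueError("buffer length is not a multiple of the element size")
    count = len(raw) // size
    return list(struct.unpack("<%d%s" % (count, fmt), raw))


# Toy stand-ins for the real files (hypothetical values).
sample_json = '{"img_001": "a red bicycle leaning against a wall"}'
prompts = load_prompts(sample_json)

embedding_bytes = struct.pack("<3e", 0.5, -1.0, 2.0)  # f16, as for embeddings
latent_bytes = struct.pack("<2f", 0.25, 1.5)          # f32, as for latents

embedding = decode_floats(embedding_bytes, "e")
latent = decode_floats(latent_bytes, "f")
```

For real data, the flat lists would still need to be reshaped to whatever dimensions the SDXL VAE and SigLIP outputs use, which this sketch does not attempt.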
This is a set of wildcards for danbooru tags. Artist: Prompts for random artist styles, covering approximately 0.6M different artists. Please select the appropriate version of the collection, ranging from 128 to 5000, based on the model's capabilities. The full version is not recommended, as it includes too many artists with only one image on danbooru or other websites; almost no model can generate a style that corresponds to these artists. Characters: "Characters" is a set of wildcards… See the full description on the dataset page: https://huggingface.co/datasets/X779/Danbooruwildcards.
Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation This is the official repository 👑 for the Emilia dataset and the source code for the Emilia-Pipe speech data preprocessing pipeline. News 🔥 2025/02/26: The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!! Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the brand-new 114k-hour Emilia-YODAS… See the full description on the dataset page: https://huggingface.co/datasets/amphion/Emilia-Dataset.
Dataset Card for "blimp" Dataset Summary BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars. Supported Tasks and Leaderboards More Information Needed Languages More Information… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/blimp.
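A minimal sketch of how a minimal-pair challenge set like this is typically scored: a model is counted correct on a pair when it assigns a higher score to the acceptable sentence than to the unacceptable one. The example pairs and the `toy_score` function are hypothetical stand-ins; in practice the score would be a language model's log-probability for the sentence.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (good, bad) pairs where score(good) > score(bad)."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical scorer: subtracts one point per known agreement error.
# A real evaluation would use a language model's sentence log-probability.
BAD_BIGRAMS = {("cats", "sleeps"), ("dog", "bark")}


def toy_score(sentence):
    tokens = sentence.lower().rstrip(".").split()
    bigrams = zip(tokens, tokens[1:])
    return -sum(1 for bigram in bigrams if bigram in BAD_BIGRAMS)


# Invented example pairs: (acceptable sentence, unacceptable sentence).
pairs = [
    ("The cats sleep.", "The cats sleeps."),
    ("The dog barks.", "The dog bark."),
    ("These books are old.", "These book are old."),  # toy scorer ties here
]

accuracy = minimal_pair_accuracy(pairs, toy_score)
```

Ties count as incorrect in this sketch, so the toy scorer gets two of the three pairs right.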