Model Hub

Browse PQC-verified AI models, datasets, and tools

This dataset is uploaded in two places: here and additionally here as 'Aya Collection Language Split.' These datasets are identical in content but differ in how the upload is structured. This dataset is structured by folders split according to dataset name. The version here instead divides the Aya collection into folders split by language. We recommend you use the language split version if you are only interested in downloading data for a single language or a small set of languages, and this version if you… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/aya_collection.

Task_categories:text-Classification, Task_categories:summarization, Task_categories:translation, Language:ace, Language:afr, Language:amh

82K 235

Updated 2026-05-07 Source available

locuslab/TOFU HF Unverified

TOFU: Task of Fictitious Unlearning 🍢 The TOFU dataset serves as a benchmark for evaluating unlearning performance of large language models on realistic tasks. The dataset comprises question-answer pairs based on autobiographies of 200 different authors that do not exist and are completely fictitiously generated by the GPT-4 model. The goal of the task is to unlearn a fine-tuned model on various fractions of the forget set. Quick Links Website: The landing page for TOFU… See the full description on the dataset page: https://huggingface.co/datasets/locuslab/TOFU.

Task_categories:question-Answering, Task_ids:closed-Domain-Qa, Annotations_creators:machine-Generated, Language_creators:machine-Generated, Multilinguality:monolingual, Source_datasets:original

81K 55

Updated 2026-06-30 Source available

amphion/Emilia-Dataset HF Unverified

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation This is the official repository 👑 for the Emilia dataset and the source code for the Emilia-Pipe speech data preprocessing pipeline. News 🔥 2025/02/26: The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!! Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the brand-new 114k-hour Emilia-YODAS… See the full description on the dataset page: https://huggingface.co/datasets/amphion/Emilia-Dataset.

Task_categories:text-To-Speech, Task_categories:automatic-Speech-Recognition, Language:zh, Language:en, Language:ja, Language:fr

81K 460

Updated 2026-06-30 Source available

JosephusCheung/GuanacoDataset HF Unverified

Sorry, it's no longer available on Hugging Face. Please reach out to those who have already downloaded it. If you have a copy, please refrain from re-uploading it to Hugging Face. The people here don't deserve it. See also: https://twitter.com/RealJosephus/status/1779913520529707387 GuanacoDataset News: We're heading towards multimodal VQA, with blip2-flan-t5-xxl Alignment to Guannaco 7B LLM. Still under construction: GuanacoVQA weight & GuanacoVQA Dataset Notice: Effective… See the full description on the dataset page: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset.

Task_categories:text-Generation, Task_categories:question-Answering, Language:zh, Language:en, Language:ja, Language:de

81K 516

Updated 2026-05-08 Source available

allenai/sciq HF Unverified

Dataset Card for "sciq" Dataset Summary The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed… See the full description on the dataset page: https://huggingface.co/datasets/allenai/sciq.

Task_categories:question-Answering, Task_ids:closed-Domain-Qa, Annotations_creators:no-Annotation, Language_creators:crowdsourced, Multilinguality:monolingual, Source_datasets:original

81K 142

Updated 2026-06-29 Source available

hotpotqa/hotpot_qa HF Unverified

Dataset Card for "hotpot_qa" Dataset Summary HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason… See the full description on the dataset page: https://huggingface.co/datasets/hotpotqa/hotpot_qa.

Task_categories:question-Answering, Annotations_creators:crowdsourced, Language_creators:found, Multilinguality:monolingual, Source_datasets:original, Language:en

80K 307

Updated 2026-06-30 Source available

truthfulqa/truthful_qa HF Unverified

Dataset Card for truthful_qa Dataset Summary TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts.… See the full description on the dataset page: https://huggingface.co/datasets/truthfulqa/truthful_qa.

Task_categories:multiple-Choice, Task_categories:text-Generation, Task_categories:question-Answering, Task_ids:multiple-Choice-Qa, Task_ids:language-Modeling, Task_ids:open-Domain-Qa

80K 286

Updated 2026-06-26 Source available

baber/uspto_raw HF Unverified

Size_categories:10M<n<100M, Format:parquet, Modality:text, Library:datasets, Library:dask, Library:mlcroissant

80K 0

Updated 2026-06-27 Source available

opendatalab/Sci-Base HF Unverified

Sci-Base: The Largest AI-Ready Scientific Foundation Dataset 🌌 The Sciverse Data Foundation Sciverse is a comprehensive, multi-layered scientific data foundation designed to provide the ultimate data infrastructure for the AI for Science (AI4S) community. As scientific research becomes increasingly data-driven, Sciverse supplies the essential, high-quality data resources required to build robust scientific knowledge systems and accelerate research. Sciverse… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/Sci-Base.

Language:en, Size_categories:1M<n<10M, Format:parquet, Modality:text, Library:datasets, Library:dask

79K 28

Updated 2026-05-05 Source available

roneneldan/TinyStories HF Unverified

Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.

Task_categories:text-Generation, Language:en, Size_categories:1M<n<10M, Format:parquet, Modality:text, Library:datasets

79K 1,042

Updated 2026-06-29 Source available

lishaoyong/latex-formulas-80M HF Unverified

For more details, please refer to the TexTeller GitHub repository. IMPORTANT NOTE!!! The handwritten subset of this dataset was collected entirely from existing open source work, which includes all test sets. If you want to use this subset for your experimental ablation, please filter it yourself based on the latex label of the test set.

Size_categories:10M<n<100M, Format:parquet, Modality:image, Modality:text, Library:datasets, Library:dask

79K 0

Updated 2026-05-08 Source available

KingTechnician/videomae-small-finetuned-kinetics-xd-violence-binary HF PQC Verified

Video-Classification, Transformers, Safetensors, Videomae, Generated_from_trainer, Base_model:MCG-NJU/videomae-Small-Finetuned-Kinetics MEDIUM

78K 0

Updated 2026-05-07

Skylion007/openwebtext HF Unverified

Dataset Card for "openwebtext" Dataset Summary An open-source replication of the WebText dataset from OpenAI that was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances plain_text Size of downloaded dataset files: 13.51 GB Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Skylion007/openwebtext.

Task_categories:text-Generation, Task_categories:fill-Mask, Task_ids:language-Modeling, Task_ids:masked-Language-Modeling, Annotations_creators:no-Annotation, Language_creators:found

73K 505

Updated 2026-04-26 Source available

Wan-AI/Wan2.1-T2V-14B HF Unverified

Text-To-Video, Diffusers, Safetensors, T2v, Video generation, English HIGH

72K 1,492

Updated 2026-04-30

microsoft/xclip-base-patch32 HF Unverified

Video-Classification, Transformers, PyTorch, Safetensors, Xclip, Vision HIGH

72K 114

Updated 2026-06-30

SwayStar123/preprocessed_commoncatalog-cc-by HF Unverified

I also separately provide just the prompts in prompts.json; the keys are the image_id, and the values are the generated captions. Captions generated by moondream: vikhyatk/moondream2. Latents generated by SDXL VAE: madebyollin/sdxl-vae-fp16-fix. Embeddings generated by SigLIP: hf-hub:timm/ViT-SO400M-14-SigLIP-384. Original dataset: common-canvas/commoncatalog-cc-by. Latents are f32 and embeddings are f16 bytes. Compute cost: approximately 16x3090 for 3 days.

Language:en, Size_categories:10M<n<100M, Format:parquet, Modality:text, Library:datasets, Library:dask

71K 2

Updated 2026-05-03 Source available

X779/Danbooruwildcards HF Unverified

This is a set of wildcards for danbooru tags. Artist: Prompts for random artist styles, covering approximately 0.6M different artists. Please select the appropriate version of the collection, ranging from 128 to 5000, based on the model's capabilities. The full version is not recommended, as it includes too many artists with only one image on danbooru or other websites. Almost no model can generate a style that corresponds to these artists. Characters: "Characters" is a set of wildcards… See the full description on the dataset page: https://huggingface.co/datasets/X779/Danbooruwildcards.

Task_categories:text-Generation, Language:en, Size_categories:10M<n<100M, Format:text, Modality:text, Library:datasets

70K 11

Updated 2026-05-08 Source available

facebook/wiki_dpr HF Unverified

This is the Wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from Wikipedia along with their DPR embeddings. The Wikipedia articles were split into multiple disjoint text blocks of 100 words, which serve as the passages.

Task_categories:fill-Mask, Task_categories:text-Generation, Task_ids:language-Modeling, Task_ids:masked-Language-Modeling, Annotations_creators:no-Annotation, Language_creators:crowdsourced

70K 45

Updated 2026-06-28 Source available
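
The facebook/wiki_dpr description above mentions splitting articles into disjoint 100-word blocks. A minimal sketch of that kind of chunking follows. It assumes plain whitespace tokenization, and the function name split_into_passages is hypothetical; the dataset's actual preprocessing may differ.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive, non-overlapping blocks of words.

    Assumption: words are separated by whitespace. The final passage
    may be shorter than words_per_passage.
    """
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()
    # Step through the word list in strides so that no word is shared
    # between two passages (the blocks are disjoint).
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


if __name__ == "__main__":
    article = " ".join(f"w{i}" for i in range(250))
    for index, passage in enumerate(split_into_passages(article)):
        print(index, len(passage.split()))
```

With a 250-word input, this yields two full passages and one shorter trailing passage.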

openvla/openvla-7b-finetuned-libero-object HF Unverified

Image-Text-to-Text, Transformers, Safetensors, Openvla, Feature Extraction, Robotics HIGH

67K 1

Updated 2026-05-07

nyu-mll/blimp HF Unverified

Dataset Card for "blimp" Dataset Summary BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars. Supported Tasks and Leaderboards More Information Needed Languages More Information… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/blimp.

Task_categories:text-Classification, Task_ids:acceptability-Classification, Annotations_creators:crowdsourced, Language_creators:machine-Generated, Multilinguality:monolingual, Source_datasets:original

66K 38

Updated 2026-05-07 Source available

Showing 20 of 665 items (page 28 of 34)
