Model Hub

Browse PQC-verified AI models, datasets, and tools

Dataset Card for The Cauldron. Dataset description: The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2. Load the dataset: To load the dataset, install the datasets library with pip install datasets. Then, from datasets import load_dataset; ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d") to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.

Size_categories:1M<n<10M, Format:parquet, Modality:image, Modality:text, Library:datasets, Library:dask

277K 547

Updated 2026-06-30 Source available

speechbrain/emotion-recognition-wav2vec2-IEMOCAP HF Unverified

Audio-Classification, Speechbrain, Emotion, Recognition, Wav2vec2, PyTorch MEDIUM

277K 188

Updated 2026-06-30

google-t5/t5-3b HF PQC Verified

Translation, Transformers, PyTorch, Tf, Safetensors, T5 HIGH

276K 53

Updated 2026-06-30

cross-encoder/nli-MiniLM2-L6-H768 HF Unverified

Zero-Shot Classification, Sentence-Transformers, PyTorch, ONNX, Safetensors, Openvino HIGH

273K 14

Updated 2026-05-08

John6666/diving-illustrious-real-asian-v50-sdxl HF PQC Verified

Text-to-Image, Diffusers, Safetensors, Stable-Diffusion, Stable-Diffusion-Xl, Realistic HIGH

270K 0

Updated 2026-05-08

MCG-NJU/videomae-base HF Unverified

Video-Classification, Transformers, PyTorch, Safetensors, Videomae, Pretraining MEDIUM

269K 55

Updated 2026-06-30

abisee/cnn_dailymail HF Unverified

Dataset Card for CNN Dailymail Dataset. Dataset Summary: The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering. Supported Tasks and Leaderboards: 'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.

Task_categories:summarization, Task_ids:news-Articles-Summarization, Annotations_creators:no-Annotation, Language_creators:found, Multilinguality:monolingual, Source_datasets:original

268K 345

Updated 2026-06-30 Source available

PekingU/rtdetr_r50vd_coco_o365 HF Unverified

Object-Detection, Transformers, Safetensors, Rt_detr, Vision, English MEDIUM

268K 17

Updated 2026-06-30

google/pegasus-xsum HF Unverified

Summarization, Transformers, PyTorch, Tf, JAX, Pegasus HIGH

264K 222

Updated 2026-06-30

CohereLabs/xP3x HF Unverified

Dataset Card for xP3x. Dataset Summary: xP3x (Crosslingual Public Pool of Prompts eXtended) is a collection of prompts & datasets across 277 languages & 16 NLP tasks. It contains all of xP3 + much more! It is used for training future contenders of mT0 & BLOOMZ at project Aya @Cohere Labs 🧡 Creation: The dataset can be recreated using instructions available here together with the file in this repository named xp3x_create.py. We provide this version to save processing… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/xP3x.

Task_categories:other, Annotations_creators:expert-Generated, Annotations_creators:crowdsourced, Multilinguality:multilingual, Language:af, Language:ar

263K 95

Updated 2026-06-30 Source available

microsoft/VibeVoice-1.5B HF Unverified

Text-To-Speech, Transformers, Safetensors, Vibevoice, Text Generation, Podcast HIGH

259K 2,365

Updated 2026-05-08

HuggingFaceFW/FineWeb HF PQC Verified

15T token dataset of cleaned English web data. Deduplicated and filtered from CommonCrawl, outperforms C4 and RefinedWeb for LLM pretraining.

Dataset, Pretraining, English, 15T tokens CRITICAL

258K 2,908

Updated 2026-06-30 Source available

anisoleai/fineweb-tokenized HF Unverified

FineWeb Tokenized: > 4 trillion tokens of the pre-tokenized data the 🌐 web has to offer. What is it? This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset (currently in progress; tokenization of the ~15 trillion token corpus is ongoing). The data is being pre-processed and tokenized using the AnisoleAI BPE tokenizer (52,022 vocabulary size) and packed into compact uint16 Parquet shards. By distributing the pre-tokenized corpus, we eliminate… See the full description on the dataset page: https://huggingface.co/datasets/anisoleai/fineweb-tokenized.

Task_categories:text-Generation, Language:en, Size_categories:n>1T, Modality:tabular, Modality:text, Tabular

256K 2

Updated 2026-06-30 Source available

autogluon/mitra-regressor HF Unverified

Tabular-Regression, Safetensors MEDIUM

255K 31

Updated 2026-06-30

allenai/objaverse HF Unverified

Objaverse. Objaverse is a Massive Dataset with 800K+ Annotated 3D Objects. More documentation is coming soon. In the meantime, please see our paper and website for additional details. License: The use of the dataset as a whole is licensed under the ODC-By v1.0 license. Individual objects in Objaverse are all licensed as Creative Commons distributable objects, and may be under the following licenses: CC-BY 4.0 - 721K objects; CC-BY-NC 4.0 - 25K objects; CC-BY-NC-SA 4.0 - 52K… See the full description on the dataset page: https://huggingface.co/datasets/allenai/objaverse.

Language:en

253K 453

Updated 2026-06-30 Source available

timpal0l/mdeberta-v3-base-squad2 HF Unverified

Question Answering, Transformers, PyTorch, Safetensors, Deberta-V2, Deberta HIGH

253K 259

Updated 2026-06-30

Intel/dpt-hybrid-midas HF PQC Verified

Depth-Estimation, Transformers, PyTorch, Dpt, Vision, Model-Index MEDIUM

247K 111

Updated 2026-06-30

mlfoundations/MINT-1T-PDF-CC-2023-06 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-06.

Task_categories:image-To-Text, Task_categories:text-Generation, Language:en, Size_categories:100B<n<1T, Multimodal

245K 10

Updated 2026-04-30 Source available

Showing 20 of 665 items (page 19 of 34)
