Model Hub

Browse PQC-verified AI models, datasets, and tools

aps/super_glue HF Unverified

Dataset Card for "super_glue". Dataset Summary: SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Supported Tasks and Leaderboards: More Information Needed. Languages: More Information Needed. Dataset Structure, Data Instances (axb): Size of downloaded dataset files: 0.03 MB. Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.

Task_categories:text-Classification · Task_categories:token-Classification · Task_categories:question-Answering · Task_ids:natural-Language-Inference · Task_ids:word-Sense-Disambiguation · Task_ids:coreference-Resolution
TIGER-Lab/MMLU-Pro HF Unverified

MMLU-Pro Dataset: The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines. What's New: [2026.03.11] Added more cutting-edge frontier models to the leaderboard, including the Claude-4.6 series, Seed2.0 series, Qwen3.5 series, and Gemini-3.1-Pro, among… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.

Benchmark:official · Task_categories:question-Answering · Language:en · Size_categories:10K<n<100K · Format:parquet · Modality:tabular
facebook/vjepa2-vitl-fpc64-256 HF Unverified

Video-Classification · Transformers · Safetensors · Vjepa2 · Feature Extraction · Video | HIGH
jobs-git/HPLT2.0_cleaned HF Unverified

This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The data comes mostly from the Internet Archive, with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0. The Cleaned variant of HPLT Datasets v2.0: This is the cleaned variant of the HPLT Datasets v2.0, converted to the Parquet format semi-automatically when being uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.

Task_categories:fill-Mask · Task_categories:text-Generation · Task_ids:language-Modeling · Multilinguality:multilingual · Language:ace · Language:af
allenai/dolma3_mix-6T-1025-7B HF Unverified

⚠️ WARNING: This dataset is intended ONLY for reproducing Olmo 3 7B ⚠️ For all other training use cases, including training from scratch, please utilize our primary dolma 3 data mix: https://huggingface.co/datasets/allenai/dolma3_mix-6T. Note: Some olmOCR science PDFs in the current dataset have been redacted following the training of Olmo 3 7B. These texts are indicated with [REMOVED] in the text field. This will affect reproducibility of Olmo 3 7B. For this reason, please use our… See the full description on the dataset page: https://huggingface.co/datasets/allenai/dolma3_mix-6T-1025-7B.

Task_categories:text-Generation · Language:en
monologg/koelectra-small-v2-distilled-korquad-384 HF Unverified

Question Answering · Transformers · PyTorch · Tflite · Safetensors · Electra | MEDIUM
mlfoundations/MINT-1T-PDF-CC-2023-23 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-23.

Task_categories:image-To-Text · Task_categories:text-Generation · Language:en · Size_categories:1M<n<10M · Format:webdataset · Modality:image
allenai/openbookqa HF Unverified

Dataset Card for OpenBookQA. Dataset Summary: OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/openbookqa.

Task_categories:question-Answering · Task_ids:open-Domain-Qa · Annotations_creators:crowdsourced · Annotations_creators:expert-Generated · Language_creators:expert-Generated · Multilinguality:monolingual
deepset/tinyroberta-squad2 HF Unverified

Question Answering · Transformers · PyTorch · Safetensors · Roberta · Model-Index | MEDIUM
facebook/vjepa2-vith-fpc64-256 HF Unverified

Video-Classification · Transformers · Safetensors · Vjepa2 · Feature Extraction · Video | HIGH
depth-anything/DA3NESTED-GIANT-LARGE-1.1 HF Unverified

Depth-Estimation · Depth-Anything-3 · Safetensors · Computer-Vision · Monocular-Depth · Multi-View-Geometry | HIGH
nvidia/Nemotron-CC-v2 HF Unverified

Nemotron-Pre-Training-Dataset-v1 Release. Data Overview: This pretraining dataset, for generative AI model training, preserves high-value math and code while enriching it with diverse multilingual Q&A, fueling the next generation of intelligent, globally-capable models. This dataset supports NVIDIA Nemotron Nano 2, a family of large language models (LLMs) that consists of the NVIDIA-Nemotron-Nano-9B-v2, NVIDIA-Nemotron-Nano-9B-v2-Base, and NVIDIA-Nemotron-Nano-12B-v2-Base… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.

Task_categories:text-Generation · Size_categories:1B<n<10B · Format:parquet · Modality:text · Library:datasets · Library:dask
Intel/zoedepth-nyu-kitti HF PQC Verified

Depth-Estimation · Transformers · Safetensors · Zoedepth · Vision | HIGH
philschmid/bart-large-cnn-samsum HF Unverified

Summarization · Transformers · PyTorch · Bart · Text2text-Generation · Sagemaker | HIGH
PekingU/rtdetr_r18vd_coco_o365 HF PQC Verified

Object-Detection · Transformers · Safetensors · Rt_detr · Vision · English | MEDIUM
typeform/distilbert-base-uncased-mnli HF Unverified

Zero-Shot Classification · Transformers · PyTorch · Tf · Safetensors · Distilbert | MEDIUM
abhishtagatya/hubert-base-960h-itw-deepfake HF Unverified

Audio-Classification · Transformers · Tensorboard · Safetensors · Hubert · Deepfake | MEDIUM
zekaiwang/trex_dataset HF Unverified

T-Rex Dataset: A large-scale, tactile-reactive bimanual manipulation dataset, collected via teleoperation on a Dexmate Vega-1 robot with two Sharpa Wave dexterous hands. Stored as a LeRobotDataset v3.0. Preview: one episode from each of 20 motor primitives (head-camera view, cropped to the workspace), each with a different object. Teleoperation setup: Manus gloves + VIVE… See the full description on the dataset page: https://huggingface.co/datasets/zekaiwang/trex_dataset.

Task_categories:robotics · Language:en · Size_categories:1M<n<10M · Format:parquet · Modality:tabular · Modality:text
Zyphra/Zyda-2 HF Unverified

Zyda-2: Zyda-2 is a 5-trillion-token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/Zyphra/Zyda-2.

Task_categories:text-Generation · Language:en · Size_categories:n>1T
ylecun/mnist HF Unverified

Dataset Card for MNIST. Dataset Summary: The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

Task_categories:image-Classification · Task_ids:multi-Class-Image-Classification · Annotations_creators:expert-Generated · Language_creators:found · Multilinguality:monolingual · Source_datasets:extended|other-Nist
Showing 20 of 665 items (page 24 of 34)