Model Hub

Browse PQC-verified AI models, datasets, and tools

Dataset Card for Mostly Basic Python Problems (mbpp). Dataset Summary: The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

Annotations_creators:crowdsourced, Annotations_creators:expert-Generated, Language_creators:crowdsourced, Language_creators:expert-Generated, Multilinguality:monolingual, Source_datasets:original

186K 230

Updated 2026-05-08 Source available

cross-encoder/nli-deberta-v3-xsmall HF Unverified

Zero-Shot Classification, Sentence-Transformers, PyTorch, ONNX, Safetensors, Deberta-V2 HIGH

183K 7

Updated 2026-06-30

jhu-clsp/ettin-pretraining-data HF Unverified

Ettin Pre-training Data. Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite. This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format ready for use with Composer and the ModernBERT training repository. 📊 Data Composition (Data Source | Tokens (B) | Percentage | Description): DCLM | 837.2 | 49.1% | High-quality web crawl data; CC Head | 356.6… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data.

Task_categories:text-Generation, Task_categories:fill-Mask, Task_categories:text-Classification, Language:en, Pretraining, Language-Modeling

183K 9

Updated 2026-06-30 Source available

uoft-cs/cifar10 HF Unverified

Dataset Card for CIFAR-10. Dataset Summary: The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.

Task_categories:image-Classification, Annotations_creators:crowdsourced, Language_creators:found, Multilinguality:monolingual, Source_datasets:extended|other-80-Million-Tiny-Images, Language:en

181K 108

Updated 2026-06-30 Source available

cagliostrolab/animagine-xl-3.1 HF PQC Verified

Text-to-Image, Diffusers, Safetensors, Stable-Diffusion, Stable-Diffusion-Xl, Base_model:cagliostrolab/animagine-Xl-3.0 CRITICAL

181K 714

Updated 2026-04-20

wikimedia/Wikipedia (Nov 2023) HF PQC Verified

Complete Wikipedia dump across all languages. Standard pretraining data source. Structured articles with metadata.

Dataset, Text, Multilingual, Knowledge CRITICAL

179K 1,257

Updated 2026-06-30 Source available

stanfordnlp/imdb HF Unverified

Dataset Card for "imdb". Dataset Summary: Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Supported Tasks and Leaderboards: More Information Needed. Languages: More Information Needed. Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.

Task_categories:text-Classification, Task_ids:sentiment-Classification, Annotations_creators:expert-Generated, Language_creators:expert-Generated, Multilinguality:monolingual, Source_datasets:original

179K 388

Updated 2026-06-30 Source available

isaacus/open-australian-legal-corpus HF Unverified

Open Australian Legal Corpus ⚖️ The Open Australian Legal Corpus by Isaacus, a foundational legal AI research company, is the first and only multijurisdictional open corpus of Australian legislative and judicial documents. Comprising 229,122 texts totalling over 60 million lines and 1.4 billion tokens, the Corpus includes every in-force statute and regulation in the Commonwealth, New South Wales, Queensland, Western Australia, South Australia, Tasmania and Norfolk Island, in… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-corpus.

Task_categories:text-Generation, Task_categories:fill-Mask, Task_categories:text-Retrieval, Task_ids:language-Modeling, Task_ids:masked-Language-Modeling, Task_ids:document-Retrieval

176K 92

Updated 2026-06-27 Source available

depth-anything/DA3-LARGE HF Unverified

Depth-Estimation, Depth-Anything-3, Safetensors, Computer-Vision, Monocular-Depth, Multi-View-Geometry HIGH

175K 16

Updated 2026-06-30

nebius/SWE-rebench-V2-PRs HF Unverified

SWE-rebench-V2-PRs. Dataset Summary: SWE-rebench-V2-PRs is a large-scale dataset of real-world GitHub pull requests collected across multiple programming languages, intended for training and evaluating code-generation and software-engineering agents. The dataset contains 126,300 samples covering Go, Python, JavaScript, TypeScript, Rust, Java, C, C++, Julia, Elixir, Kotlin, PHP, Scala, Clojure, Dart, OCaml, and other languages. For log parser functions, base Dockerfiles, and… See the full description on the dataset page: https://huggingface.co/datasets/nebius/SWE-rebench-V2-PRs.

Task_categories:text-Generation, Language:en, Size_categories:100K<n<1M, Format:parquet, Modality:text, Library:datasets

172K 14

Updated 2026-05-08 Source available

airtrain-ai/fineweb-edu-fortified HF Unverified

Fineweb-Edu-Fortified. [Figure: The composition of fineweb-edu-fortified, produced by automatically clustering a 500k row sample in Airtrain.] What is it? Fineweb-Edu-Fortified is a dataset derived from Fineweb-Edu by applying exact-match deduplication across the whole dataset and producing an embedding for each row. The number of times the text from each row appears is also included as a count column. The embeddings were produced using TaylorAI/bge-micro Fineweb and… See the full description on the dataset page: https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified.

Task_categories:text-Generation, Language:en, Size_categories:100M<n<1B, Format:parquet, Modality:tabular, Modality:text

169K 65

Updated 2026-06-30 Source available

autogluon/mitra-classifier HF PQC Verified

Tabular-Classification, Safetensors MEDIUM

168K 39

Updated 2026-06-30

epfml/FineWeb-HQ HF Unverified

FineWeb-HQ. Dataset Summary: FineWeb-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb. FineWeb-HQ was created by selecting the top 10% of FineWeb documents based on a deep learning classifier trained to identify structured and knowledge-rich samples. This classifier uses XLM-RoBERTa embeddings to score documents. To validate our approach, we pretrained 1B-parameter LLMs with a Llama-like architecture across multiple languages and… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb-HQ.

Task_categories:text-Generation, Language:en, Size_categories:1B<n<10B, Format:parquet, Modality:tabular, Modality:text

167K 7

Updated 2026-04-21 Source available

Wan-AI/Wan2.1-T2V-1.3B-Diffusers HF Unverified

Text-To-Video, Diffusers, Safetensors, Video, Video-Generation, Diffusers:WanPipeline HIGH

167K 127

Updated 2026-06-30

mlfoundations/MINT-1T-PDF-CC-2024-18 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2024-18.

Task_categories:image-To-Text, Task_categories:text-Generation, Language:en, Size_categories:100B<n<1T, Multimodal

166K 30

Updated 2026-05-02 Source available

allenai/MADLAD-400 HF PQC Verified

MADLAD-400. Dataset and Introduction: MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

Task_categories:text-Generation, Size_categories:n>1T

166K 168

Updated 2026-04-23 Source available

jobs-git/Zyda-2 HF Unverified

Zyda-2. Zyda-2 is a 5 trillion token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/Zyda-2.

Task_categories:text-Generation, Language:en, Size_categories:n>1T

164K 1

Updated 2026-06-30 Source available

Abiray/Sulphur-2-base-GGUF HF Unverified

Text-To-Video, GGUF, Quantized, Base_model:SulphurAI/Sulphur-2-Base, Base_model:quantized:SulphurAI/Sulphur-2-Base CRITICAL

162K 71

Updated 2026-06-30

rajpurkar/squad HF Unverified

Dataset Card for SQuAD. Dataset Summary: Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles. Supported Tasks and Leaderboards: Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.

Task_categories:question-Answering, Task_ids:extractive-Qa, Annotations_creators:crowdsourced, Language_creators:crowdsourced, Language_creators:found, Multilinguality:monolingual

158K 368

Updated 2026-06-30 Source available
