Model Hub

Browse PQC-verified AI models, datasets, and tools


FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus. arXiv: Coming Soon. Project Page: Coming Soon. Blog: Coming Soon.
Data Statistics:
Domain (#tokens/#samples) | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count
aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539
agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022
artistic…
See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.

Task_categories:text-Classification, Task_categories:text-Generation, Language:en, Size_categories:1B<n<10B, Modality:tabular, Modality:text

1.2M 152

Updated 2026-06-30 Source available

LLM360/TxT360 HF PQC Verified

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend. Changelog, v1.1: Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended. Details of v1.1 additions: TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.

Task_categories:text-Generation, Language:en, Size_categories:n>1T

1.2M 254

Updated 2026-05-08 Source available

StanfordAIMI/stanford-deidentifier-base HF PQC Verified

Token Classification, Transformers, PyTorch, BERT, Sequence-Tagger-Model, Pubmedbert MEDIUM

1.2M 81

Updated 2026-06-30

theainerd/Wav2Vec2-large-xlsr-hindi HF PQC Verified

Speech Recognition, Transformers, PyTorch, Safetensors, Wav2vec2, Base_model:facebook/wav2vec2-Large-Xlsr-53 HIGH

1.2M 12

Updated 2026-04-20

codeparrot/github-code HF Unverified

The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.

Task_categories:text-Generation, Task_ids:language-Modeling, Language_creators:crowdsourced, Language_creators:expert-Generated, Multilinguality:multilingual, Language:code

1.2M 367

Updated 2026-06-30 Source available

Salesforce/SFR-Embedding-2_R HF PQC Verified

State-of-the-art text embedding model. Top of MTEB leaderboard with strong retrieval and clustering.

Transformer, Embeddings, 7B, Retrieval HIGH

1.2M 890

Updated 2026-03-26

timm/tf_efficientnetv2_s.in21k_ft_in1k HF Unverified

Image-Classification, Timm, PyTorch, Safetensors, Transformers MEDIUM

1.2M 2

Updated 2026-06-30

fineinstructions/fineinstructions_nemotron HF Unverified

✨ Note: For all FineInstructions resources please visit: https://huggingface.co/fineinstructions This dataset is ~1B+ synthetic instruction-answer pairs or ~300B tokens created using the FineInstructions pipeline. The FineInstructions pipeline was run over the raw pre-training documents in the Nemotron-CC pre-training corpus (a subset of high-quality documents from CommonCrawl). See our paper for more details. Each .parquet file in the data folder has a corresponding judge-*.json file that… See the full description on the dataset page: https://huggingface.co/datasets/fineinstructions/fineinstructions_nemotron.

Language:en, Size_categories:1B<n<10B, Format:parquet, Modality:tabular, Modality:text, Library:datasets

1.2M 9

Updated 2026-05-08 Source available

Helsinki-NLP/opus-mt-nl-en HF Unverified

Translation, Transformers, PyTorch, Tf, Rust, Marian HIGH

1.2M 10

Updated 2026-06-30

lucas-leme/FinBERT-PT-BR HF PQC Verified

Text Classification, Transformers, PyTorch, BERT, Pt MEDIUM

1.2M 29

Updated 2026-04-28

Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice HF Unverified

Text-To-Speech, Safetensors, Qwen3_tts, Tts, Qwen, Audio HIGH

1.2M 163

Updated 2026-06-30

Systran/faster-whisper-tiny.en HF PQC Verified

Speech Recognition, Ctranslate2, Audio, English MEDIUM

1.2M 9

Updated 2026-05-08

mistralai/Voxtral-Mini-4B-Realtime-2602 HF Unverified

Speech Recognition, Vllm, Safetensors, Voxtral_realtime, Mistral-Common, Base_model:mistralai/Ministral-3-3B-Base-2512 HIGH

1.2M 840

Updated 2026-05-08

facebook/mms-1b-all HF PQC Verified

Speech Recognition, Transformers, PyTorch, Safetensors, Wav2vec2, Mms HIGH

1.1M 198

Updated 2026-05-08

nlptown/bert-base-multilingual-uncased-sentiment HF PQC Verified

Text Classification, Transformers, PyTorch, Tf, JAX, Safetensors HIGH

1.1M 473

Updated 2026-04-23

black-forest-labs/FLUX.1-dev HF PQC Verified

State-of-the-art text-to-image model with exceptional prompt adherence and image quality. 12B parameter rectified flow transformer.

Diffusion, Image Generation, 12B HIGH

1.1M 13,409

Updated 2026-06-30

kresnik/wav2vec2-large-xlsr-korean HF PQC Verified

Speech Recognition, Transformers, PyTorch, Safetensors, Wav2vec2, Speech HIGH

1.1M 55

Updated 2026-04-25

allenai/c4 HF PQC Verified

C4. Dataset Summary: A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's C4 dataset. We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, these are the sizes of the variants: en: 305GB; en.noclean: 2.3TB; en.noblocklist: 380GB; realnewslike: 15GB; multilingual (mC4): 9.7TB (108 subsets, one per… See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.

Task_categories:text-Generation, Task_categories:fill-Mask, Task_ids:language-Modeling, Task_ids:masked-Language-Modeling, Annotations_creators:no-Annotation, Language_creators:found

1.1M 602

Updated 2026-06-30 Source available

facebook/detr-resnet-50 HF PQC Verified

Object-Detection, Transformers, PyTorch, Safetensors, Detr, Vision MEDIUM

1.1M 957

Updated 2026-06-30

jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn HF PQC Verified

Speech Recognition, Transformers, PyTorch, JAX, Wav2vec2, Audio HIGH

1.1M 133

Updated 2026-05-08

Showing 20 of 663 items (page 12 of 34)
