Model Hub

Browse PQC-verified AI models, datasets, and tools

m-a-p/FineFineWeb HF PQC Verified

FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus. arXiv: Coming Soon. Project Page: Coming Soon. Blog: Coming Soon. Data Statistics (tokens and samples per domain, for iterations 1, 2, 3, and total): aerospace: tokens 5.77B, 261.63M, 309.33M, total 6.34B; samples 9100000, 688505, 611034, total 10399539. agronomy: tokens 13.08B, 947.41M, 229.04M, total 14.26B; samples 15752828, 2711790, 649404, total 19114022. artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.

Task_categories:text-Classification, Task_categories:text-Generation, Language:en, Size_categories:1B<n<10B, Modality:tabular, Modality:text
LLM360/TxT360 HF PQC Verified

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend. Changelog, v1.1: Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended. Details of v1.1 additions: TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.

Task_categories:text-Generation, Language:en, Size_categories:n>1T
StanfordAIMI/stanford-deidentifier-base HF PQC Verified

Token Classification, Transformers, PyTorch, BERT, Sequence-Tagger-Model, Pubmedbert MEDIUM
theainerd/Wav2Vec2-large-xlsr-hindi HF PQC Verified

Speech Recognition, Transformers, PyTorch, Safetensors, Wav2vec2, Base_model:facebook/wav2vec2-Large-Xlsr-53 HIGH
codeparrot/github-code HF Unverified

The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.

Task_categories:text-Generation, Task_ids:language-Modeling, Language_creators:crowdsourced, Language_creators:expert-Generated, Multilinguality:multilingual, Language:code
Salesforce/SFR-Embedding-2_R HF PQC Verified

State-of-the-art text embedding model. Top of MTEB leaderboard with strong retrieval and clustering.

Transformer, Embeddings, 7B, Retrieval HIGH
timm/tf_efficientnetv2_s.in21k_ft_in1k HF Unverified

Image-Classification, Timm, PyTorch, Safetensors, Transformers MEDIUM
fineinstructions/fineinstructions_nemotron HF Unverified

✨ Note: For all FineInstructions resources please visit: https://huggingface.co/fineinstructions This dataset is ~1B+ synthetic instruction-answer pairs or ~300B tokens created using the FineInstructions pipeline. The FineInstructions pipeline was run over the raw pre-training documents in the Nemotron-CC pre-training corpus (a subset of high-quality documents from CommonCrawl). See our paper for more details. Each .parquet file in the data folder has a corresponding judge-*.json file that… See the full description on the dataset page: https://huggingface.co/datasets/fineinstructions/fineinstructions_nemotron.

Language:en, Size_categories:1B<n<10B, Format:parquet, Modality:tabular, Modality:text, Library:datasets
Helsinki-NLP/opus-mt-nl-en HF Unverified

Translation, Transformers, PyTorch, Tf, Rust, Marian HIGH
lucas-leme/FinBERT-PT-BR HF PQC Verified

Text Classification, Transformers, PyTorch, BERT, Pt MEDIUM
Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice HF Unverified

Text-To-Speech, Safetensors, Qwen3_tts, Tts, Qwen, Audio HIGH
Systran/faster-whisper-tiny.en HF PQC Verified

Speech Recognition, Ctranslate2, Audio, English MEDIUM
mistralai/Voxtral-Mini-4B-Realtime-2602 HF Unverified

Speech Recognition, Vllm, Safetensors, Voxtral_realtime, Mistral-Common, Base_model:mistralai/Ministral-3-3B-Base-2512 HIGH
facebook/mms-1b-all HF PQC Verified

Speech Recognition, Transformers, PyTorch, Safetensors, Wav2vec2, Mms HIGH
nlptown/bert-base-multilingual-uncased-sentiment HF PQC Verified

Text Classification, Transformers, PyTorch, Tf, JAX, Safetensors HIGH
black-forest-labs/FLUX.1-dev HF PQC Verified

State-of-the-art text-to-image model with exceptional prompt adherence and image quality. 12B parameter rectified flow transformer.

Diffusion, Image Generation, 12B HIGH
kresnik/wav2vec2-large-xlsr-korean HF PQC Verified

Speech Recognition, Transformers, PyTorch, Safetensors, Wav2vec2, Speech HIGH
allenai/c4 HF PQC Verified

C4. Dataset Summary: A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset. We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, these are the sizes of the variants: en: 305GB; en.noclean: 2.3TB; en.noblocklist: 380GB; realnewslike: 15GB; multilingual (mC4): 9.7TB (108 subsets, one per… See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.

Task_categories:text-Generation, Task_categories:fill-Mask, Task_ids:language-Modeling, Task_ids:masked-Language-Modeling, Annotations_creators:no-Annotation, Language_creators:found
facebook/detr-resnet-50 HF PQC Verified

Object-Detection, Transformers, PyTorch, Safetensors, Detr, Vision MEDIUM
jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn HF PQC Verified

Speech Recognition, Transformers, PyTorch, JAX, Wav2vec2, Audio HIGH
Showing 20 of 663 items (page 12 of 34)