Model Hub

Browse PQC-verified AI models, datasets, and tools


FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus

arXiv: Coming Soon
Project Page: Coming Soon
Blog: Coming Soon

Data Statistics

Domain (#tokens/#samples)  Iteration 1 Tokens  Iteration 2 Tokens  Iteration 3 Tokens  Total Tokens  Iteration 1 Count  Iteration 2 Count  Iteration 3 Count  Total Count
aerospace                  5.77B               261.63M             309.33M             6.34B         9100000            688505             611034             10399539
agronomy                   13.08B              947.41M             229.04M             14.26B        15752828           2711790            649404             19114022
artistic…

See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.

task_categories:text-classification, task_categories:text-generation, language:en, size_categories:1B<n<10B, modality:tabular, modality:text

1.2M 152

Updated 2026-06-29 Source available

HuggingFaceFW/fineweb-edu HF PQC Verified

📚 FineWeb-Edu: 1.3 trillion tokens of the finest educational data the 🌐 web has to offer. Paper: https://arxiv.org/abs/2406.17557 What is it? The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

task_categories:text-generation, language:en, size_categories:1B<n<10B, format:parquet, modality:tabular, modality:text

382K 1,165

Updated 2026-06-29 Source available

HuggingFaceFW/FineWeb HF PQC Verified

15T token dataset of cleaned English web data. Deduplicated and filtered from CommonCrawl, outperforms C4 and RefinedWeb for LLM pretraining.

Dataset, Pretraining, English, 15T tokens, CRITICAL

253K 2,907

Updated 2026-06-29 Source available

epfml/FineWeb-HQ HF Unverified

FineWeb-HQ Dataset Summary FineWeb-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb. FineWeb-HQ was created by selecting the top 10% of FineWeb documents based on a deep learning classifier trained to identify structured and knowledge-rich samples. This classifier uses XLM-RoBERTa embeddings to score documents. To validate our approach, we pretrained 1B-parameter LLM models with a Llama-like architecture across multiple languages and… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb-HQ.

task_categories:text-generation, language:en, size_categories:1B<n<10B, format:parquet, modality:tabular, modality:text

167K 7

Updated 2026-04-21 Source available
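
The FineWeb-HQ description above says the subset was built by keeping the top 10% of documents ranked by a classifier score. The following is a minimal sketch of that selection step only, assuming the scores already exist. The function name `select_top_fraction` and the `(doc_id, score)` layout are hypothetical and not taken from the dataset card, and the sketch says nothing about how the XLM-RoBERTa-based classifier itself is trained or applied.

```python
def select_top_fraction(scored_docs, fraction=0.10):
    """Return the highest-scoring `fraction` of (doc_id, score) pairs.

    Hypothetical helper for illustration; not the authors' pipeline.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one document so very small inputs are not emptied out.
    keep = max(1, round(len(scored_docs) * fraction))
    # Rank by score, highest first, then cut at the requested fraction.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]


# Toy input: 20 documents with made-up scores 0.00, 0.05, ..., 0.95.
docs = [(f"doc{i}", i / 20) for i in range(20)]
top = select_top_fraction(docs)
top_ids = [doc_id for doc_id, _ in top]
```

With 20 toy documents, a 10% cut keeps the two highest-scoring entries. On a web-scale corpus a full sort would likely be replaced by a score threshold estimated from a sample, which is a design choice outside what the description states.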

anisoleai/fineweb-tokenized HF Unverified

FineWeb Tokenized > 4 trillion tokens of the pre-tokenized data the 🌐 web has to offer What is it? This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset (currently in progress; tokenization of the ~15 trillion token corpus is ongoing). The data is being pre-processed and tokenized using the AnisoleAI BPE tokenizer (52,022 vocabulary size) and packed into compact uint16 Parquet shards. By distributing the pre-tokenized corpus, we eliminate… See the full description on the dataset page: https://huggingface.co/datasets/anisoleai/fineweb-tokenized.

task_categories:text-generation, language:en, size_categories:n>1T, modality:tabular, modality:text, Tabular

151K 2

Updated 2026-06-29 Source available
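
The fineweb-tokenized description above pairs a 52,022-entry vocabulary with uint16 storage. That works because every token ID from 0 to 52,021 fits below the uint16 limit of 65,535, so each token costs two bytes. The sketch below illustrates only that width arithmetic using the standard library. The helper names `pack_tokens` and `unpack_tokens` are hypothetical, the little-endian layout is an assumption, and the actual shards are Parquet files whose exact layout is not described here.

```python
import struct

VOCAB_SIZE = 52_022      # vocabulary size stated in the dataset description
UINT16_MAX = 2**16 - 1   # 65,535, the largest value a uint16 can hold


def pack_tokens(token_ids):
    """Pack token IDs as little-endian uint16 values, 2 bytes per token.

    Hypothetical helper; the byte order is an assumption for illustration.
    """
    if any(t < 0 or t >= VOCAB_SIZE for t in token_ids):
        raise ValueError("token ID outside the vocabulary range")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def unpack_tokens(buf):
    """Inverse of pack_tokens: read back a list of token IDs."""
    return list(struct.unpack(f"<{len(buf) // 2}H", buf))


ids = [0, 17, 52_021]
packed = pack_tokens(ids)
restored = unpack_tokens(packed)
```

Storing the same IDs as 32-bit integers would double the size, which is presumably why a vocabulary kept under 65,536 entries is attractive for a multi-trillion-token corpus.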

HuggingFaceFW/fineweb-2 HF Unverified

🥂 FineWeb2 A sparkling update with 1000s of languages What is it? This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

task_categories:text-generation, language:aai, language:aak, language:aau, language:aaz, language:aba

121K 792

Updated 2026-05-08 Source available

fancyzhx/ag_news HF Unverified

Dataset Card for "ag_news" Dataset Summary AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/ag_news.

task_categories:text-classification, task_ids:topic-classification, annotations_creators:found, language_creators:found, multilinguality:monolingual, source_datasets:original

121K 190

Updated 2026-06-29 Source available

airtrain-ai/fineweb-edu-fortified HF Unverified

Fineweb-Edu-Fortified The composition of fineweb-edu-fortified, produced by automatically clustering a 500k row sample in Airtrain What is it? Fineweb-Edu-Fortified is a dataset derived from Fineweb-Edu by applying exact-match deduplication across the whole dataset and producing an embedding for each row. The number of times the text from each row appears is also included as a count column. The embeddings were produced using TaylorAI/bge-micro Fineweb and… See the full description on the dataset page: https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified.

task_categories:text-generation, language:en, size_categories:100M<n<1B, format:parquet, modality:tabular, modality:text

120K 65

Updated 2026-06-27 Source available
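
The fineweb-edu-fortified description above mentions exact-match deduplication with a count column recording how often each text appeared. A minimal in-memory sketch of that idea follows. The function name `dedup_with_counts` and the `text`/`count` keys are hypothetical, and a pipeline at the scale of the real dataset would presumably compare hashes in a distributed job rather than hold raw strings in one dictionary. The embedding step is not covered.

```python
from collections import Counter


def dedup_with_counts(texts):
    """Collapse exact duplicates and record how often each text occurred.

    Hypothetical helper for illustration; not the dataset's actual pipeline.
    """
    # Counter keeps keys in first-seen order, so output order is stable.
    counts = Counter(texts)
    return [{"text": text, "count": n} for text, n in counts.items()]


rows = dedup_with_counts(["a", "b", "a", "c", "a"])
```

Here `rows` holds one entry per distinct text. Exact matching treats texts that differ by a single character as different rows, so near-duplicates survive this step.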

Helsinki-NLP/fineweb-edu-translated HF PQC Verified

Helsinki-NLP/fineweb-edu-translated fineweb-edu-translated is a collection of automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models. The data covers 36,704,000 documents with over 28 billion space-separated tokens of English data translated into 36 languages. The total data set includes over 960 billion tokens, and the translated documents are aligned across all languages. More information about how the data has been produced can… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/fineweb-edu-translated.

task_categories:translation, task_categories:text-generation, language:bos, language:bul, language:cat, language:ces

119K 16

Updated 2026-06-26 Source available

Showing 9 of 9 items (page 1 of 1)
