Model Hub

Browse PQC-verified AI models, datasets, and tools

FineWeb Tokenized

> 4 trillion tokens of the pre-tokenized data the 🌐 web has to offer

What is it?

This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset. It is a work in progress: tokenization of the ~15-trillion-token corpus is ongoing. The data is pre-processed and tokenized with the AnisoleAI BPE tokenizer (vocabulary size 52,022) and packed into compact uint16 Parquet shards. By distributing the pre-tokenized corpus, we eliminate…

See the full description on the dataset page: https://huggingface.co/datasets/anisoleai/fineweb-tokenized.
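The description pairs a 52,022-entry vocabulary with uint16 storage, which works because every token ID fits below the uint16 limit of 65,535. The following is a minimal sketch of that packing idea using only the Python standard library. It is not the dataset's actual shard layout: the function names and example token IDs are hypothetical, and the real Parquet schema is not stated in this listing.

```python
from array import array

VOCAB_SIZE = 52_022      # vocabulary size stated in the dataset description
UINT16_MAX = 2**16 - 1   # 65,535, the largest value a uint16 can hold


def pack_tokens(token_ids):
    """Pack token IDs into a byte string using 2 bytes per token."""
    if any(t < 0 or t >= VOCAB_SIZE for t in token_ids):
        raise ValueError("token ID outside the vocabulary range")
    packed = array("H", token_ids)  # "H" = unsigned short (native byte order)
    assert packed.itemsize == 2     # holds on mainstream platforms
    return packed.tobytes()


def unpack_tokens(data):
    """Recover the list of token IDs from a packed byte string."""
    unpacked = array("H")
    unpacked.frombytes(data)
    return unpacked.tolist()


# Hypothetical token IDs, for illustration only.
example_ids = [0, 17, 4096, 52_021]
blob = pack_tokens(example_ids)
print(len(blob), unpack_tokens(blob))
```

Under these assumptions each token costs 2 bytes, half of what a 32-bit integer column would need, which is presumably the motivation for the compact uint16 shards.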

Task categories: text-generation | Language: en | Size categories: n>1T | Modality: tabular, text

151K 2

Updated 2026-06-29 Source available

