FineWeb Tokenized

> 4 trillion tokens of the pre-tokenized data the web has to offer

What is it?

This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset. It is a work in progress: tokenization of the ~15-trillion-token corpus is still ongoing. The data is pre-processed and tokenized with the AnisoleAI BPE tokenizer (vocabulary size 52,022) and packed into compact uint16 Parquet shards. By distributing the pre-tokenized corpus, we eliminate…

See the full description on the dataset page: https://huggingface.co/datasets/anisoleai/fineweb-tokenized.
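The uint16 packing mentioned in the description works because a vocabulary of 52,022 entries fits below the uint16 limit of 65,535, so each token ID takes two bytes. The sketch below illustrates that idea with the Python standard library only. It is not the dataset's actual pipeline and does not read or write Parquet. The helper names `pack_uint16` and `unpack_uint16` are hypothetical, and zero-based token IDs are an assumption.

```python
from array import array

VOCAB_SIZE = 52_022      # vocabulary size stated in the dataset description
UINT16_MAX = 2**16 - 1   # largest value a uint16 can hold (65,535)


def pack_uint16(token_ids):
    """Pack token IDs into a bytes buffer, two bytes per token (hypothetical helper)."""
    ids = list(token_ids)
    for t in ids:
        # Assumption: token IDs are zero-based, so valid IDs are 0..VOCAB_SIZE-1.
        if not 0 <= t < VOCAB_SIZE:
            raise ValueError(f"token id {t} is outside the vocabulary")
    packed = array("H", ids)  # "H" = unsigned short
    if packed.itemsize != 2:
        # Unsigned short is 2 bytes on mainstream platforms; fail loudly otherwise.
        raise RuntimeError("unsigned short is not 16 bits on this platform")
    return packed.tobytes()   # native byte order


def unpack_uint16(buf):
    """Recover the list of token IDs from a buffer produced by pack_uint16."""
    out = array("H")
    out.frombytes(buf)
    return out.tolist()


if __name__ == "__main__":
    tokens = [0, 17, 52_021]
    buf = pack_uint16(tokens)
    print(len(buf), unpack_uint16(buf))
```

Compared with storing each ID as a 32-bit integer, this halves the size of the token stream, which is presumably why a 16-bit layout was chosen for the shards.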