FineWeb Tokenized
> 4 trillion tokens of the pre-tokenized data the web has to offer

What is it?
This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset. It is a work in progress: tokenization of the ~15-trillion-token corpus is ongoing. The data is pre-processed and tokenized with the AnisoleAI BPE tokenizer (vocabulary size 52,022) and packed into compact uint16 Parquet shards. By distributing the pre-tokenized corpus, we eliminate…

See the full description on the dataset page: https://huggingface.co/datasets/anisoleai/fineweb-tokenized.
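The description pairs a 52,022-entry vocabulary with uint16 storage. A minimal sketch of why that pairing works, using only the Python standard library: every token ID below 65,536 fits in two bytes, so a packed token costs 2 bytes instead of the 4 or 8 used by wider integer types. The helper names (`fits_uint16`, `pack_tokens`, `unpack_tokens`) and the sample IDs are hypothetical and chosen for illustration; the actual shard layout and Parquet column names are not stated here and are not assumed.

```python
from array import array

VOCAB_SIZE = 52_022      # vocabulary size quoted in the dataset description
UINT16_MAX = 2**16 - 1   # largest value an unsigned 16-bit integer can hold


def fits_uint16(vocab_size: int) -> bool:
    """Return True if every token ID in [0, vocab_size) fits in a uint16."""
    return vocab_size - 1 <= UINT16_MAX


def pack_tokens(token_ids):
    """Pack token IDs into a native-endian uint16 byte buffer.

    array("H") raises OverflowError for IDs outside the unsigned 16-bit
    range, so an out-of-range ID is rejected rather than silently wrapped.
    """
    return array("H", token_ids).tobytes()


def unpack_tokens(buf: bytes):
    """Inverse of pack_tokens: decode a uint16 byte buffer into a list of IDs."""
    out = array("H")
    out.frombytes(buf)
    return out.tolist()


# Hypothetical token IDs, including the largest ID the vocabulary allows.
sample = [0, 17, VOCAB_SIZE - 1]
buf = pack_tokens(sample)
restored = unpack_tokens(buf)
```

With this sketch, `len(buf)` is twice the number of tokens, and `restored` equals `sample`. A tokenizer with more than 65,536 entries would need a wider integer type.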
Use this model
Pull with QuantumShield
quantumshield pull anisoleai/fineweb-tokenized
Verify integrity
quantumshield verify anisoleai/fineweb-tokenized
pip install
pip install quantumshield && quantumshield pull anisoleai/fineweb-tokenized
Unverified Model
This model has not been PQC-verified. File integrity cannot be guaranteed against quantum threats.
Intended Uses
This model is registered on the QuantaMrkt quantum-safe registry. This model has not yet been PQC-verified.
Quick Start
# Install the CLI
pip install quantumshield
# Pull the model
quantumshield pull anisoleai/fineweb-tokenized
# Verify file integrity
quantumshield verify anisoleai/fineweb-tokenized