anisoleai / fineweb-tokenized

Unverified HuggingFace

FineWeb Tokenized

> 4 trillion tokens of the pre-tokenized data the 🌐 web has to offer

What is it?

This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset. It is currently in progress: tokenization of the ~15 trillion token corpus is ongoing. The data is pre-processed and tokenized with the AnisoleAI BPE tokenizer (vocabulary size 52,022) and packed into compact uint16 Parquet shards. By distributing the pre-tokenized corpus, we eliminate…

See the full description on the dataset page: https://huggingface.co/datasets/anisoleai/fineweb-tokenized.
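The description pairs a 52,022-entry vocabulary with uint16 storage. That fits because the largest unsigned 16-bit value is 65,535. The sketch below illustrates the arithmetic with Python's standard `array` module and does not read the actual shards. It assumes token IDs run from 0 to 52,021, which this page does not state. The helper names `pack_tokens` and `unpack_tokens` are hypothetical, and the real Parquet schema is not shown here.

```python
from array import array

VOCAB_SIZE = 52_022        # vocabulary size quoted in the dataset description
UINT16_MAX = 2**16 - 1     # 65,535, the largest value a uint16 can hold

# Assumption: token IDs are 0 .. VOCAB_SIZE - 1, so every ID fits in 16 bits.
assert VOCAB_SIZE - 1 <= UINT16_MAX


def pack_tokens(token_ids):
    """Pack token IDs into a uint16 buffer (hypothetical helper)."""
    for t in token_ids:
        if not 0 <= t < VOCAB_SIZE:
            raise ValueError(f"token id {t} is outside the vocabulary")
    buf = array("H", token_ids)  # "H" is an unsigned short
    # Unsigned short is 2 bytes on common platforms; fail loudly otherwise.
    assert buf.itemsize == 2
    return buf


def unpack_tokens(raw_bytes):
    """Recover the token IDs from raw uint16 bytes (hypothetical helper)."""
    buf = array("H")
    buf.frombytes(raw_bytes)
    return list(buf)


tokens = [0, 17, 52_021]
packed = pack_tokens(tokens)
raw = packed.tobytes()

print(len(raw))                      # bytes used: 2 per token
print(unpack_tokens(raw) == tokens)  # round trip preserves the IDs
```

At 2 bytes per token, uint16 needs half the space of a 32-bit integer encoding, which is presumably why the shards are described as compact.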

Tags: task_categories:text-generation, language:en, size_categories:n>1T, modality:tabular, modality:text


Unverified Model

This model has not been PQC-verified. File integrity cannot be guaranteed against quantum threats.

README.md

fineweb-tokenized

Intended Uses

This model is registered on the QuantaMrkt quantum-safe registry. This model has not yet been PQC-verified.

Quick Start

# Install the CLI
pip install quantumshield

# Pull the model
quantumshield pull anisoleai/fineweb-tokenized

# Verify file integrity
quantumshield verify anisoleai/fineweb-tokenized

About

Created 2026-06-23

Downloads 150,625

Likes 2

Get this model

View on HuggingFace

Pull with QuantumShield

quantumshield pull anisoleai/fineweb-tokenized

Verify signatures

quantumshield verify anisoleai/fineweb-tokenized

Links

huggingface.co/anisoleai/fineweb-tokenized
Transparency Log
API Endpoint

Signers

did:quantamrkt:regis...hield-v1

