{"id":592,"slug":"anisoleai--fineweb-tokenized","name":"fineweb-tokenized","author":"anisoleai","description":"FineWeb Tokenized\n\n> 4 trillion tokens of the pre-tokenized data the 🌐 web has to offer\n\nWhat is it?\n\nThis is a pre-tokenized version of the HuggingFaceFW/fineweb dataset (currently in progress; tokenization of the ~15-trillion-token corpus is ongoing). The data is being pre-processed and tokenized using the AnisoleAI BPE tokenizer (52,022 vocabulary size) and packed into compact uint16 Parquet shards.\nBy distributing the pre-tokenized corpus, we eliminate… See the full description on the dataset page: https://huggingface.co/datasets/anisoleai/fineweb-tokenized.","tags":"[\"Task_categories:text-Generation\",\"Language:en\",\"Size_categories:n>1T\",\"Modality:tabular\",\"Modality:text\",\"Tabular\"]","license":null,"framework":null,"parameters":null,"downloads":150625,"likes":2,"verified":0,"created_at":"2026-06-23 18:23:36","updated_at":"2026-06-29 14:23:35","source_url":"https://huggingface.co/datasets/anisoleai/fineweb-tokenized","source_platform":"huggingface","hf_repo_id":"anisoleai/fineweb-tokenized","ollama_name":"","category":"dataset","latest_version":"v1.0.0","version_count":1,"signature_count":1,"risk_level":null,"risk_score":null,"versions":[{"id":591,"model_id":592,"version":"v1.0.0","manifest_hash":"2c6c7f1c28f4c1f99dc4cd04366665b0dbb1e76c1ba9ba609ebcf810997b4c01","file_count":0,"total_size":0,"r2_manifest_key":"manifests/datasets/anisoleai--fineweb-tokenized/v1.0.0.json","created_at":"2026-06-23 18:23:36"}],"files":[],"signatures":[{"id":1125,"version_id":591,"signer_did":"did:quantamrkt:registry:shield-v1","algorithm":"ML-DSA-65","signature_hex":"b8b3fb4609ccd80008f4b8f715591a6d40d6ea4e8fa7c8aaab0c59c8047abf5b","attestation_type":"registry","signed_at":"2026-06-23 18:23:36"}],"hndl":null}