allenai / c4

C4 Dataset Summary

A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset.

We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, these are the sizes of the variants:

en: 305GB
en.noclean: 2.3TB
en.noblocklist: 380GB
realnewslike: 15GB
multilingual (mC4): 9.7TB (108 subsets, one per…

See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.
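Because the variants differ in size by two orders of magnitude, it can help to pick one by disk budget and then stream it rather than download it in full. The following is a minimal sketch, not part of the original card: the helper names are hypothetical, 1 TB is taken as 1000 GB, and the multilingual variant is left out because its size is split across per-language subsets. The streaming helper assumes the Hugging Face `datasets` package is installed and that the variant names are accepted as configuration names by `load_dataset`.

```python
# Sizes of the English variants as listed in the summary, in GB.
# Assumption: 2.3TB is treated as 2300 GB.
VARIANT_SIZES_GB = {
    "en": 305,
    "en.noclean": 2300,
    "en.noblocklist": 380,
    "realnewslike": 15,
}


def variants_within(budget_gb):
    """Return the variant names whose listed size fits the budget, smallest first."""
    fitting = [
        (size, name)
        for name, size in VARIANT_SIZES_GB.items()
        if size <= budget_gb
    ]
    return [name for size, name in sorted(fitting)]


def stream_variant(name, split="train"):
    """Return a lazy iterable over one variant without downloading it in full.

    Assumes the `datasets` package is installed and network access is available.
    The import is kept local so the size helper above works without it.
    """
    from datasets import load_dataset

    return load_dataset("allenai/c4", name, split=split, streaming=True)
```

For example, `variants_within(400)` selects the three variants that fit in 400 GB, and `next(iter(stream_variant("realnewslike")))` would fetch a single record from the smallest one.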

PQC-Verified with ML-DSA-87

This dataset carries a FIPS 204 ML-DSA-87 (Dilithium5) signature from the platform signing authority. The signature chain includes 2 verifications. Last verified 2026-05-08.

ML-DSA-87 Signer: did:web:quantamrkt.com:chain:authority

README.md

c4

Intended Uses

This dataset is registered on the QuantaMrkt quantum-safe registry. All files have been cryptographically verified using post-quantum signatures.

Quick Start

# Install the CLI
pip install quantumshield

# Pull the dataset
quantumshield pull allenai/c4

# Verify file integrity
quantumshield verify allenai/c4
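What `quantumshield verify` does internally is not described here. As a rough illustration of the general idea behind a file integrity check, the sketch below compares SHA-256 digests of local files against a list of expected values, using only Python's standard library. It is a stand-in technique, not ML-DSA-87 signature verification, and the manifest format (relative path mapped to hex digest) and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards are never read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest does not match.

    `manifest` maps relative path -> expected SHA-256 hex digest
    (a hypothetical format chosen for this sketch).
    """
    failures = []
    for rel_path, expected in manifest.items():
        target = Path(root) / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list from `check_manifest` means every listed file is present and matches its expected digest. A digest match only shows the file is unchanged relative to the manifest; trusting the manifest itself is what a signature, such as the ML-DSA-87 one mentioned on this page, is for.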

About

Created 2026-04-20
Downloads 773,251
Likes 566

Signers

did:quantamrkt:regis...hield-v1
did:web:quantamrkt.c...uthority