A

allenai / MADLAD-400

PQC Verified HuggingFace

MADLAD-400 Dataset and Introduction MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

168 165,545 2

PQC-Verified with ML-DSA-87

This model has a real FIPS 204 ML-DSA-87 (Dilithium5) signature from the platform signing authority. Signature chain includes 2 verification(s). Last verified 2026-04-23.

ML-DSA-87 Signer: did:web:quantamrkt.com:chain:authority View public key

README.md

MADLAD-400

MADLAD-400 Dataset and Introduction MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

Intended Uses

This model is registered on the QuantaMrkt quantum-safe registry. All files have been cryptographically verified using post-quantum signatures.

Quick Start

# Install the CLI
pip install quantumshield

# Pull the model
quantumshield pull allenai/MADLAD-400

# Verify file integrity
quantumshield verify allenai/MADLAD-400

About

MADLAD-400 Dataset and Introduction MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

Created 2026-04-20
Downloads 165,545
Likes 168

Get this model

View on HuggingFace

Pull with QuantumShield

quantumshield pull allenai/MADLAD-400

Verify signatures

quantumshield verify allenai/MADLAD-400

Signers

V1
did:quantamrkt:regis...hield-v1
TY
did:web:quantamrkt.c...uthority