P

PleIAs / common_corpus

Unverified HuggingFace

Common Corpus Full paper - ICLR 2026 oral Common Corpus is the largest open and permissible licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

398 198,993 1

Unverified Model

This model has not been PQC-verified. File integrity cannot be guaranteed against quantum threats.

README.md

common_corpus

Common Corpus Full paper - ICLR 2026 oral Common Corpus is the largest open and permissible licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

Intended Uses

This model is registered on the QuantaMrkt quantum-safe registry. This model has not yet been PQC-verified.

Quick Start

# Install the CLI
pip install quantumshield

# Pull the model
quantumshield pull PleIAs/common_corpus

# Verify file integrity
quantumshield verify PleIAs/common_corpus

About

Common Corpus Full paper - ICLR 2026 oral Common Corpus is the largest open and permissible licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

Created 2026-04-30
Downloads 198,993
Likes 398

Get this model

View on HuggingFace

Pull with QuantumShield

quantumshield pull PleIAs/common_corpus

Verify signatures

quantumshield verify PleIAs/common_corpus

Signers

V1
did:quantamrkt:regis...hield-v1