PleIAs / common_corpus

Unverified HuggingFace

Common Corpus

Full paper: ICLR 2026 oral

Common Corpus is the largest open and permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners. It differs from existing open datasets in that it is:

Truly Open: contains only data that…

See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

Languages: en, fr, de, zh, it, es


Unverified Model

This model has not been PQC-verified. File integrity cannot be guaranteed against quantum threats.

README.md

common_corpus

Intended Uses

This model is registered on the QuantaMrkt quantum-safe registry. This model has not yet been PQC-verified.

Quick Start

# Install the CLI
pip install quantumshield

# Pull the model
quantumshield pull PleIAs/common_corpus

# Verify file integrity
quantumshield verify PleIAs/common_corpus
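
This page does not describe what `quantumshield verify` checks internally. As a rough illustration of file-integrity checking in general, the sketch below compares SHA-256 digests of local files against a manifest using only the Python standard library. It is not the QuantumShield implementation and does not provide post-quantum signature verification. The helper names `sha256_of_file` and `verify_against_manifest`, and the manifest shape (a mapping from relative path to expected hex digest), are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large dataset shards never load fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(root, manifest):
    """Check files under `root` against a {relative_path: expected_hex} mapping.

    Returns the relative paths that are missing or whose digest differs,
    in the manifest's iteration order. An empty list means every file matched.
    """
    failures = []
    for relative_path, expected in manifest.items():
        target = Path(root) / relative_path
        if not target.is_file() or sha256_of_file(target) != expected.lower():
            failures.append(relative_path)
    return failures
```

A caller would build the manifest from a trusted source (hypothetical here) and treat any non-empty return value as a failed check.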

About

Created 2026-04-30

Downloads 198,993

Likes 398

Get this model

View on HuggingFace

Pull with QuantumShield

quantumshield pull PleIAs/common_corpus

Verify signatures

quantumshield verify PleIAs/common_corpus

Links

huggingface.co/PleIAs/common_corpus

Transparency Log

API Endpoint

Signers

did:quantamrkt:regis...hield-v1
