{"id":500,"slug":"pleias--common_corpus","name":"common_corpus","author":"PleIAs","description":"Common Corpus\n\nFull paper - ICLR 2026 oral\n\nCommon Corpus is the largest open and permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners.\nCommon Corpus differs from existing open datasets in that it is:\n\nTruly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.","tags":"[\"Language:en\",\"Language:fr\",\"Language:de\",\"Language:zh\",\"Language:it\",\"Language:es\"]","license":null,"framework":null,"parameters":null,"downloads":198993,"likes":398,"verified":0,"created_at":"2026-04-30 14:24:41","updated_at":"2026-05-08 14:17:37","source_url":"https://huggingface.co/datasets/PleIAs/common_corpus","source_platform":"huggingface","hf_repo_id":"PleIAs/common_corpus","ollama_name":"","category":"dataset","latest_version":"v1.0.0","version_count":1,"signature_count":1,"risk_level":null,"risk_score":null,"versions":[{"id":499,"model_id":500,"version":"v1.0.0","manifest_hash":"688555c3bf868ed347c85c5f5795589ed07ab014024b6dd8d920f62579a22d88","file_count":0,"total_size":0,"r2_manifest_key":"manifests/datasets/pleias--common_corpus/v1.0.0.json","created_at":"2026-04-30 14:24:41"}],"files":[],"signatures":[{"id":1024,"version_id":499,"signer_did":"did:quantamrkt:registry:shield-v1","algorithm":"ML-DSA-65","signature_hex":"f9fccf1753e92bb47350743b74b3062900df1ca253f132263d1d8f2816635845","attestation_type":"registry","signed_at":"2026-04-30 14:24:41"}],"hndl":null}