{"id":603,"slug":"jhu-clsp--ettin-pretraining-data","name":"ettin-pretraining-data","author":"jhu-clsp","description":"Ettin Pre-training Data\n\nPhase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite.\n\nThis dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format ready for use with Composer and the ModernBERT training repository.\n\n📊 Data Composition\n\nData Source | Tokens (B) | Percentage | Description\nDCLM | 837.2 | 49.1% | High-quality web crawl data\nCC Head | 356.6… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data.","tags":"[\"Task_categories:text-Generation\",\"Task_categories:fill-Mask\",\"Task_categories:text-Classification\",\"Language:en\",\"Pretraining\",\"Language-Modeling\"]","license":null,"framework":null,"parameters":null,"downloads":178072,"likes":9,"verified":0,"created_at":"2026-06-23 19:23:32","updated_at":"2026-06-29 15:23:29","source_url":"https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data","source_platform":"huggingface","hf_repo_id":"jhu-clsp/ettin-pretraining-data","ollama_name":"","category":"dataset","latest_version":"v1.0.0","version_count":1,"signature_count":1,"risk_level":null,"risk_score":null,"versions":[{"id":602,"model_id":603,"version":"v1.0.0","manifest_hash":"9c60518c61989debbac1e05cf496633989eae0742840275ba1b0d29178907ee8","file_count":0,"total_size":0,"r2_manifest_key":"manifests/datasets/jhu-clsp--ettin-pretraining-data/v1.0.0.json","created_at":"2026-06-23 19:23:32"}],"files":[],"signatures":[{"id":1150,"version_id":602,"signer_did":"did:quantamrkt:registry:shield-v1","algorithm":"ML-DSA-65","signature_hex":"b916bec84f49df57b75a5071169aadaa561b78fc6eef7b1e8fe544d9470089e3","attestation_type":"registry","signed_at":"2026-06-23 19:23:32"}],"hndl":null}