Ettin Pre-training Data

Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite.

This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format ready for use with Composer and the ModernBERT training repository.

📊 Data Composition

Data Source | Tokens (B) | Percentage | Description
DCLM | 837.2 | 49.1% | High-quality web crawl data
CC Head | 356.6…

See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data.
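
As a quick sanity check on the figures quoted above, the DCLM row can be used to back out the total size of the mixture. This is a minimal sketch that assumes the percentage column is each source's share of the total token count; the constant and function names are hypothetical and chosen only for illustration.

```python
# Figures quoted in the composition table (tokens in billions).
DCLM_TOKENS_B = 837.2
DCLM_SHARE = 0.491  # 49.1%, assumed to be a share of the total token count


def implied_total_tokens_b(tokens_b: float, share: float) -> float:
    """Return the total token count (in billions) implied by one source's count and share."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return tokens_b / share


total_b = implied_total_tokens_b(DCLM_TOKENS_B, DCLM_SHARE)
total_t = total_b / 1000  # billions -> trillions
print(f"Implied total: {total_b:.1f}B tokens (~{total_t:.1f}T)")
```

Under that assumption the implied total is roughly 1.7T tokens, which agrees with the headline figure in the description.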