Ettin Pre-training Data

Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite.

This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format, ready for use with Composer and the ModernBERT training repository.

📊 Data Composition

Data Source    Tokens (B)    Percentage    Description
DCLM           837.2         49.1%         High-quality web crawl data
CC Head        356.6…

See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data.
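As a quick sanity check on the composition figures above, the DCLM row can be used to back out the size of the whole mixture. The sketch below uses only the two numbers quoted in the description (837.2B tokens, 49.1%); the variable names are illustrative and not part of any Ettin tooling.

```python
# Figures quoted in the dataset description.
dclm_tokens_b = 837.2   # DCLM tokens, in billions
dclm_share = 0.491      # DCLM share of the mixture (49.1%)

# If 837.2B tokens make up 49.1% of the mixture, the implied total is:
implied_total_b = dclm_tokens_b / dclm_share
implied_total_t = implied_total_b / 1000  # billions -> trillions

print(f"Implied total: {implied_total_b:.0f}B tokens (~{implied_total_t:.1f}T)")
```

The implied total of roughly 1,705B tokens agrees with the stated 1.7T-token mixture size.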
Use this model
Pull with QuantumShield
quantumshield pull jhu-clsp/ettin-pretraining-data
Verify integrity
quantumshield verify jhu-clsp/ettin-pretraining-data
pip install
pip install quantumshield && quantumshield pull jhu-clsp/ettin-pretraining-data
Unverified Model
This dataset has not been PQC-verified (PQC: post-quantum cryptography). File integrity cannot be guaranteed against quantum threats.
Intended Uses
This dataset is registered on the QuantaMrkt quantum-safe registry. It has not yet been PQC-verified.
Quick Start
# Install the CLI
pip install quantumshield
# Pull the model
quantumshield pull jhu-clsp/ettin-pretraining-data
# Verify file integrity
quantumshield verify jhu-clsp/ettin-pretraining-data
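For readers who want an independent spot check of downloaded files, the following is a minimal sketch of a generic checksum comparison. It is not how `quantumshield verify` works (this page does not describe its mechanism), and it uses plain SHA-256 digests, not post-quantum signatures. The manifest format (relative path mapped to hex digest) and the function names `sha256_of` and `find_mismatches` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose digest differs.

    `manifest` is a hypothetical mapping of relative path -> expected hex digest.
    """
    bad = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return sorted(bad)
```

An empty result from `find_mismatches` means every file listed in the manifest is present and matches its expected digest; it says nothing about files that the manifest does not list.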