jhu-clsp / ettin-pretraining-data

Unverified HuggingFace

Ettin Pre-training Data

Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite.

This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is provided in MDS format ready for use with Composer and the ModernBERT training repository.

📊 Data Composition

| Data Source | Tokens (B) | Percentage | Description |
| DCLM | 837.2 | 49.1% | High-quality web crawl data |
| CC Head | 356.6… | | |

See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data.
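As a quick consistency check on the figures quoted above, the DCLM row (837.2B tokens, 49.1% of the mixture) can be used to back out the implied total and compare it with the stated 1.7T. This is only a sketch of the arithmetic; the variable names are illustrative, and the remaining rows of the table are truncated in this listing, so they are not used.

```python
# Figures quoted in the description above.
dclm_tokens_b = 837.2   # DCLM tokens, in billions
dclm_share = 0.491      # DCLM share of the mixture (49.1%)

# If 837.2B tokens make up 49.1% of the mixture, the whole mixture is:
implied_total_b = dclm_tokens_b / dclm_share

print(f"Implied mixture size: {implied_total_b:.0f}B tokens "
      f"(~{implied_total_b / 1000:.1f}T)")
```

The result, roughly 1705B tokens, agrees with the "1.7T tokens" headline figure.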

Unverified Model

This dataset has not been PQC-verified (PQC: post-quantum cryptography). Its file integrity cannot be guaranteed against quantum threats.

README.md

ettin-pretraining-data

Intended Uses

This dataset is registered on the QuantaMrkt quantum-safe registry but has not yet been PQC-verified.

Quick Start

# Install the CLI
pip install quantumshield

# Pull the model
quantumshield pull jhu-clsp/ettin-pretraining-data

# Verify file integrity
quantumshield verify jhu-clsp/ettin-pretraining-data
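Since the data is described as being in MDS format, a downloaded copy can be inspected from Python. The sketch below assumes the files are already on local disk (by whatever means) and that each MDS directory carries the `index.json` file written by the MosaicML Streaming library. The directory layout of this particular dataset is not given in this listing, so the helper simply searches for it. The function names `find_mds_dirs` and `open_mds` are hypothetical, and `open_mds` needs the optional `mosaicml-streaming` package.

```python
from pathlib import Path


def find_mds_dirs(root):
    """Return every directory under `root` that holds an MDS `index.json`."""
    return sorted(p.parent for p in Path(root).rglob("index.json"))


def open_mds(local_dir):
    """Open one local MDS directory for reading.

    Assumption: `pip install mosaicml-streaming` has been run.
    """
    from streaming import StreamingDataset  # lazy import: optional dependency

    return StreamingDataset(local=str(local_dir), shuffle=False, batch_size=1)


if __name__ == "__main__":
    import sys

    # Usage: python inspect_mds.py /path/to/downloaded/data
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for mds_dir in find_mds_dirs(root):
        print(mds_dir)
```

Listing the directories first is a cheap way to confirm the download contains MDS shards before handing a path to a Composer training run.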

About

Created 2026-06-23
Downloads 178,072
Likes 9

Signers

V1
did:quantamrkt:regis...hield-v1