🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-06.
Use this model
Pull with QuantumShield
quantumshield pull mlfoundations/MINT-1T-PDF-CC-2023-06
Verify integrity
quantumshield verify mlfoundations/MINT-1T-PDF-CC-2023-06
pip install
pip install quantumshield && quantumshield pull mlfoundations/MINT-1T-PDF-CC-2023-06
PQC-Verified with ML-DSA-87
This model carries a FIPS 204 ML-DSA-87 (Dilithium5) signature from the platform signing authority. The signature chain includes 2 verifications. Last verified: 2026-04-30.
README.md
MINT-1T-PDF-CC-2023-06
Intended Uses
This model is registered on the QuantaMrkt quantum-safe registry. All files have been cryptographically verified using post-quantum signatures.
Quick Start
# Install the CLI
pip install quantumshield
# Pull the model
quantumshield pull mlfoundations/MINT-1T-PDF-CC-2023-06
# Verify file integrity
quantumshield verify mlfoundations/MINT-1T-PDF-CC-2023-06
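The three Quick Start steps can be chained in a small wrapper script so that a failed step stops the run. This is a minimal sketch, not part of the QuantumShield tooling: it uses only the commands shown on this page, and it assumes the CLI reports failure with a nonzero exit status, which the page does not state. `DRY_RUN` and `run` are hypothetical names introduced for this example; by default the script only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch: run the Quick Start steps in order and stop at the first failure.
# Assumption: each command exits nonzero on failure (not confirmed by this page).
set -eu

REPO="mlfoundations/MINT-1T-PDF-CC-2023-06"

# DRY_RUN is a hypothetical switch for this example, not a quantumshield option.
# 1 (default) prints each command; 0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run pip install quantumshield        # install the CLI
run quantumshield pull "$REPO"       # pull the files
run quantumshield verify "$REPO"     # verify file integrity
```

Saved under any file name, the script runs for real when invoked with `DRY_RUN=0` set in the environment.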