🥂 FineWeb2
A sparkling update with 1000s of languages
What is it?
This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2…
See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.
Use this dataset
Pull with QuantumShield
quantumshield pull HuggingFaceFW/fineweb-2
Verify integrity
quantumshield verify HuggingFaceFW/fineweb-2
pip install
pip install quantumshield && quantumshield pull HuggingFaceFW/fineweb-2
Unverified Dataset
This dataset has not been PQC-verified. File integrity cannot be guaranteed against quantum threats.
Intended Uses
This dataset is registered on the QuantaMrkt quantum-safe registry. It has not yet been PQC-verified.
Quick Start
# Install the CLI
pip install quantumshield
# Pull the dataset
quantumshield pull HuggingFaceFW/fineweb-2
# Verify file integrity
quantumshield verify HuggingFaceFW/fineweb-2
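The verify step above checks downloaded files, but this page does not describe how QuantumShield performs that check. As a rough illustration only, the sketch below shows a generic checksum-based integrity check in Python using SHA-256 from the standard library. It is not QuantumShield's implementation and involves no post-quantum signatures. The function names `sha256_of_file` and `verify_files`, and the manifest format (relative path mapped to an expected hex digest), are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest differs.

    `manifest` is a hypothetical mapping of relative path -> expected hex
    digest; it is an assumption of this sketch, not a QuantumShield format.
    """
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of_file(target) != expected.lower():
            failures.append(relative_path)
    return failures
```

An empty list from `verify_files` means every listed file matched its expected digest. A manifest entry whose file is absent is reported as a failure, on the assumption that a missing file should not pass verification.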