Model Hub
Browse PQC-verified AI models, datasets, and tools
SAGE-10k: SAGE-10k is a large-scale interactive indoor scene dataset featuring realistic layouts, generated by the agent-driven pipeline introduced in "SAGE: Scalable Agentic 3D Scene Generation for Embodied AI". The dataset contains 10,000 diverse scenes spanning 50 room types and styles, along with 565K uniquely generated 3D objects. 🔑 Key Features: SAGE-10k integrates a wide variety of scenes and, in particular, preserves small items for… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/SAGE-10k.
Dataset Card for The Cauldron. Dataset description: The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2. Load the dataset: to load the dataset, install the datasets library with pip install datasets, then run: from datasets import load_dataset; ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d") to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.
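The loading steps quoted in The Cauldron card can be sketched in Python as follows. This is a minimal sketch, not the dataset authors' code: the repo id "HuggingFaceM4/the_cauldron", the "ai2d" config, and the dataset page URL pattern come from the card text, while the helper names dataset_page_url and load_cauldron_subset are hypothetical and introduced only for illustration. It assumes the datasets library is installed (pip install datasets) and that network access is available when the loader is actually called.

```python
def dataset_page_url(repo_id: str) -> str:
    """Build the dataset page URL for a repo id such as 'org/name'.

    Hypothetical helper; it follows the URL pattern shown in the cards.
    """
    return f"https://huggingface.co/datasets/{repo_id}"


def load_cauldron_subset(config: str = "ai2d"):
    """Download and load one subset (config) of The Cauldron.

    Hypothetical wrapper around the call quoted in the dataset card.
    """
    # Imported lazily so that defining this function does not require
    # the `datasets` package; it is only needed when the function runs.
    from datasets import load_dataset

    return load_dataset("HuggingFaceM4/the_cauldron", config)


# Example usage (requires network access and the `datasets` package):
#   ds = load_cauldron_subset("ai2d")
#   print(ds)
print(dataset_page_url("HuggingFaceM4/the_cauldron"))
```

Keeping the import inside the function is a design choice for this sketch only: it lets the URL helper be used in environments where datasets is not installed.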
[ICLR 2026] OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling. 🎉 News: [2026.3.21] 🔥 OmniWorld-Game with Metric Scale is now released! Check out our latest model Pi3X (an enhanced version of Pi3), which leverages this data to achieve better performance! [2026.1.26] 🎉 OmniWorld was accepted by ICLR 2026! [2026.1.7] Updated OmniWorld-Game; released RH20T-Robot, RH20T-Human, Ego-Exo4D, EgoDex, and Epic-Kitchens. [2025.11.11] The OmniWorld is… See the full description on the dataset page: https://huggingface.co/datasets/InternRobotics/OmniWorld.
Complete Wikipedia dump across all languages. Standard pretraining data source. Structured articles with metadata.
Dataset Card for GUI Odyssey. News ⭐️: A new and improved version of the GUIOdyssey dataset has been released! 🎉🎉 👉 Please use the latest version and refer to the updated README for the most up-to-date information. We highly recommend using the new version for all training and evaluation! Repository: https://github.com/OpenGVLab/GUI-Odyssey ; Latest version of dataset: hflqf88888/GUIOdyssey ; Paper: https://arxiv.org/pdf/2406.08451 ; Introduction: GUI Odyssey is… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/GUI-Odyssey.
Common Corpus. Full paper - ICLR 2026 oral. Common Corpus is the largest open and permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
Dataset Card for "super_glue". Dataset Summary: SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Supported Tasks and Leaderboards: More Information Needed. Languages: More Information Needed. Dataset Structure, Data Instances, axb: Size of downloaded dataset files: 0.03 MB; Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.