Model Hub
Browse PQC-verified AI models, datasets, and tools
The CT-RATE Team organizes the VLM3D Challenge: VLM3D 2026 (2nd Edition), with Challenge Finals at MICCAI 2026; and VLM3D 2025 (1st Edition), with Challenge Finals at MICCAI 2025 and a Workshop at ICCV 2025. The CT-RATE Team is developing the MR-RATE Dataset, a large-scale brain MRI dataset with paired radiology reports for training 3D vision-language models. Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography… See the full description on the dataset page: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE.
OpenGitHub What is it? This dataset contains every public event on GitHub: every push, pull request, issue, star, fork, code review, release, and discussion across all public repositories. GitHub is the world's largest software development platform, home to over 200 million repositories and the daily work of tens of millions of developers, from individual open-source contributors to the engineering teams behind the most widely used software on earth. The archive currently… See the full description on the dataset page: https://huggingface.co/datasets/open-index/open-github.
Dataset Card for GPQA GPQA is a multiple-choice, Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions out of their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending >30m with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.
Dataset Card for "ArtifactAI/arxiv_s2orc_parsed" Dataset Description https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv_s2orc_parsed Dataset Summary AlgorithmicResearchGroup/arxiv_s2orc_parsed is a subset of the AllenAI S2ORC dataset, a general-purpose corpus for NLP and text mining research over scientific papers. The dataset is filtered strictly for ArXiv papers and includes the full text of each paper. GitHub links have been extracted from each… See the full description on the dataset page: https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv_s2orc_parsed.
Dataset Card for Dataset Name. This dataset card aims to be a base template for new datasets. It has been generated using this raw template. Dataset Details, Dataset Description: Curated by: [More Information Needed]; Funded by [optional]: [More Information Needed]; Shared by [optional]: [More Information Needed]; Language(s) (NLP): [More Information Needed]; License: [More Information Needed]. Dataset Sources [optional]: Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/bluuebunny/arxiv_metadata_by_year.
LLaVA-OneVision-1.5 Instruction Data Paper | Code 📌 Introduction This dataset, LLaVA-OneVision-1.5-Instruct, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data.
MR-RATE: A Vision-Language Foundation Model and Dataset for Magnetic Resonance Imaging Welcome to the official page for MR-RATE, a pioneering vision-language model and 3D medical imaging dataset that pairs textual reports with brain and spine MRI volumes. Following the approach of CT-RATE, the first 3D medical imaging dataset to pair images with textual reports, MR-RATE offers brain and spine MRI volumes matched with… See the full description on the dataset page: https://huggingface.co/datasets/Forithmus/MR-RATE.
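Niconico Jikkyo Past Log Archive. The Niconico Jikkyo Past Log Archive is a dataset that collects every past log comment from Niconico Jikkyo, from the launch of the service to the present. In December 2020, Niconico Jikkyo was relaunched as an official channel within Niconico Live. As a result, the old system, in operation since November 2009, was discontinued (effectively ending the service). Support for consumer devices such as torne and BRAVIA ended across the board, and roughly 11 years of past logs, full of viewers' live reactions from the time, were set to be lost along with it. In response, users of the DTV board on 5ch took the lead in launching a plan to archive 11 years of past logs for all channels before the old Niconico Jikkyo shut down. After some twists and turns, Nekopanda retrieved a complete set of roughly 11 years of past logs for all channels, including radio and BS, so the logs were saved from vanishing into the digital sea. However, because the old API has been retired, past logs via the API… See the full description on the dataset page: https://huggingface.co/datasets/echodict/KakologArchives_duplicate.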
MedThinkVQA MedThinkVQA is an expert-annotated benchmark for multi-image diagnostic reasoning in radiology. Unlike prior medical VQA benchmarks that typically contain at most one image per case, MedThinkVQA requires models to extract evidence from each image, integrate cross-view information, and perform differential-diagnosis reasoning. Links GitHub: https://github.com/benluwang/MedThinkVQA Leaderboard: https://benluwang.github.io/MedThinkVQA/ Submission Guide:… See the full description on the dataset page: https://huggingface.co/datasets/bio-nlp-umass/MedThinkVQA.
This is a partial copy of the CoVoST2 dataset. The main difference is that the audio data is included in the dataset, which makes it easier to use and lets you browse the samples with the HF Dataset Viewer. The drawback is that all audio samples of the EN_XX subsets are duplicated, which makes the dataset larger. For that reason, not all the data is included: only the validation and test subsets are available, and from the XX_EN subsets, only fr, es, and zh-CN are included.
Introduction TL;DR: DreamDojo is a generalist robot world model pretrained on 44k hours of human egocentric data, showing unprecedented generalization to diverse objects and environments. Project page: https://dreamdojo-world.github.io/ Paper: https://arxiv.org/abs/2602.06949 Code: https://github.com/NVIDIA/DreamDojo How to Use Check out https://github.com/NVIDIA/DreamDojo Citation @article{gao2026dreamdojo, title={DreamDojo: A Generalist Robot… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-Teleop-GR1.
natgillin/translations-raw: Frozen, canonical raw bitext consolidated from upstream alvations/mtdata-raw* snapshots (since deleted). This is the read-only source of truth for downstream quality-filtering pipelines; do not delete or modify it. Contents: 31,663 parquet files (1566.8 GB); 49 language pairs under data/<src-tgt>/; a 5-column schema. Schema: each parquet has 5 columns (listed as column, type, description): source, string… See the full description on the dataset page: https://huggingface.co/datasets/natgillin/translations-raw.
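The `data/<src-tgt>/` layout mentioned in the card above can be illustrated with a small path-parsing sketch. This is not code from the dataset card: the helper name `language_pair` and the file name `part-00000.parquet` are hypothetical, and the sketch assumes each language code contains no hyphen of its own, so the directory name splits cleanly at its first hyphen.

```python
from pathlib import PurePosixPath
from typing import Tuple


def language_pair(parquet_path: str) -> Tuple[str, str]:
    """Return (src, tgt) for a path shaped like data/<src-tgt>/<file>.

    Hypothetical helper: assumes hyphen-free language codes, so the
    pair directory is split at its first hyphen.
    """
    parts = PurePosixPath(parquet_path).parts
    # Expect at least: "data", "<src-tgt>", "<file>"
    if len(parts) < 3 or parts[0] != "data":
        raise ValueError(f"not under data/<src-tgt>/: {parquet_path!r}")
    src, sep, tgt = parts[1].partition("-")
    if not sep or not src or not tgt:
        raise ValueError(f"directory is not a <src-tgt> pair: {parts[1]!r}")
    return src, tgt


# Example with a made-up file name inside an assumed en-de directory.
pair = language_pair("data/en-de/part-00000.parquet")
```

Grouping files by the returned pair is one way a downstream pipeline could process each language pair separately without touching the read-only source files.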
OpenMathInstruct-2 OpenMathInstruct-2 is a math instruction tuning dataset with 14M problem-solution pairs generated using the Llama3.1-405B-Instruct model. The training set problems of GSM8K and MATH are used for constructing the dataset in the following ways: Solution augmentation: Generating chain-of-thought solutions for training set problems in GSM8K and MATH. Problem-Solution augmentation: Generating new problems, followed by solutions for these new problems.… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenMathInstruct-2.
Android in the Wild (AITW) This is a mirror of Google's Android in the Wild (AITW) dataset, re-hosted on Hugging Face for easier community access. Original Source Paper: Android in the Wild: A Large-Scale Dataset for Android Device Control Original Repository: google-research/google-research/tree/master/android_in_the_wild Dataset Description Android in the Wild (AITW) is a large-scale dataset for Android device control. It contains human demonstrations of… See the full description on the dataset page: https://huggingface.co/datasets/leosltl/Android-in-the-Wild.
Leopard-Instruct Paper | Github | Models-LLaVA | Models-Idefics2 Summaries Leopard-Instruct is a large instruction-tuning dataset comprising 925K instances, of which 739K are specifically designed for text-rich, multi-image scenarios. It has been used to train Leopard-LLaVA [checkpoint] and Leopard-Idefics2 [checkpoint]. Loading dataset: to load the dataset without automatically downloading and processing the images (please run the following code with datasets==2.18.0)… See the full description on the dataset page: https://huggingface.co/datasets/wyu1/Leopard-Instruct.
⚠️ Important: If you have already submitted an access request but have not completed the required DocuSign agreement, your request will remain pending. Please complete signing and we will grant access once verified. Interactive Intelligence from Human Xperience Xperience-10M Dataset Summary Xperience-10M is a large-scale egocentric multimodal dataset of human experience for embodied AI, robotics, world models, and spatial… See the full description on the dataset page: https://huggingface.co/datasets/ropedia-ai/xperience-10m.