Model Hub

Browse PQC-verified AI models, datasets, and tools

monologg/koelectra-small-v2-distilled-korquad-384 HF Unverified

Question Answering · Transformers · PyTorch · TFLite · Safetensors · ELECTRA · MEDIUM
mlfoundations/MINT-1T-PDF-CC-2023-23 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-23.

task_categories: image-to-text · task_categories: text-generation · language: en · size_categories: 1M<n<10M · format: webdataset · modality: image
Wan-AI/Wan2.1-T2V-1.3B-Diffusers HF Unverified

Text-to-Video · Diffusers · Safetensors · Video · Video-Generation · diffusers:WanPipeline · HIGH
open-index/open-github HF Unverified

OpenGitHub. What is it? This dataset contains every public event on GitHub: every push, pull request, issue, star, fork, code review, release, and discussion across all public repositories. GitHub is the world's largest software development platform, home to over 200 million repositories and the daily work of tens of millions of developers, from individual open-source contributors to the engineering teams behind the most widely used software on Earth. The archive currently… See the full description on the dataset page: https://huggingface.co/datasets/open-index/open-github.

task_categories: text-generation · task_categories: text-classification · task_categories: feature-extraction · language: en · language: mul · size_categories: 100K<n<1M
TIGER-Lab/MMLU-Pro HF Unverified

MMLU-Pro Dataset. The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset, tailored to more rigorously benchmark large language models' capabilities. It contains 12K complex questions across various disciplines. | GitHub | 🏆 Leaderboard | 📖 Paper | 🚀 What's New: [2026.03.11] Added more cutting-edge frontier models to the leaderboard, including the Claude-4.6 series, Seed2.0 series, Qwen3.5 series, and Gemini-3.1-Pro, among… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.

benchmark: official · task_categories: question-answering · language: en · size_categories: 10K<n<100K · format: parquet · modality: tabular
nvidia/Nemotron-CC-v2 HF Unverified

Nemotron-Pre-Training-Dataset-v1 Release. Data Overview: This pretraining dataset, for generative AI model training, preserves high-value math and code while enriching it with diverse multilingual Q&A, fueling the next generation of intelligent, globally capable models. This dataset supports NVIDIA Nemotron Nano 2, a family of large language models (LLMs) that consists of the NVIDIA-Nemotron-Nano-9B-v2, NVIDIA-Nemotron-Nano-9B-v2-Base, and NVIDIA-Nemotron-Nano-12B-v2-Base… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.

task_categories: text-generation · size_categories: 1B<n<10B · format: parquet · modality: text · library: datasets · library: dask
zeroMN/hanlp_date-zh HF Unverified

2nd International Chinese Word Segmentation Bakeoff - Data Release, Release 1, 2005-11-18. Introduction: This directory contains the training, test, and gold-standard data used in the 2nd International Chinese Word Segmentation Bakeoff. Also included are the script used to score the results submitted by the bakeoff participants and the simple segmenter used to generate the baseline and topline data. File List: gold/ contains the gold standard… See the full description on the dataset page: https://huggingface.co/datasets/zeroMN/hanlp_date-zh.

task_categories: text-classification · language: zh · size_categories: 100M<n<1B · Code
PekingU/rtdetr_r18vd_coco_o365 HF PQC Verified

Object Detection · Transformers · Safetensors · rt_detr · Vision · English · MEDIUM
deepset/tinyroberta-squad2 HF Unverified

Question Answering · Transformers · PyTorch · Safetensors · RoBERTa · Model-Index · MEDIUM
rajpurkar/squad HF Unverified

Dataset Card for SQuAD. Dataset Summary: The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles. Supported Tasks and Leaderboards: Question Answering… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.

task_categories: question-answering · task_ids: extractive-qa · annotations_creators: crowdsourced · language_creators: crowdsourced · language_creators: found · multilinguality: monolingual
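The extractive-QA formulation above (answers are spans of the context passage) is what the SQuAD-tuned models listed on this page serve. A minimal sketch using the 🤗 `transformers` pipeline API; the model choice is illustrative, and the import is deferred so nothing is downloaded until the pipeline is actually built:

```python
def make_qa(model_id="deepset/tinyroberta-squad2"):
    """Build an extractive question-answering pipeline.

    Answers are spans copied out of the supplied context, matching the
    SQuAD formulation described above. Requires `pip install transformers`.
    """
    from transformers import pipeline  # deferred: triggers a model download
    return pipeline("question-answering", model=model_id)

# Usage (downloads the model weights on first call):
# qa = make_qa()
# qa(question="Who wrote the questions in SQuAD?",
#    context="SQuAD consists of questions posed by crowdworkers "
#            "on a set of Wikipedia articles.")
```

The returned dict includes the answer text plus `start`/`end` character offsets into the context, which is exactly the span structure the dataset annotates.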
abhishtagatya/hubert-base-960h-itw-deepfake HF Unverified

Audio Classification · Transformers · TensorBoard · Safetensors · HuBERT · Deepfake · MEDIUM
mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M HF Unverified

🚀 The LLaVA-OneVision-1.5-Mid-Training-85M dataset is being uploaded. 🚀 Upload Status: all completed: ImageNet-21k, LAIONCN, DataComp-1B, Zero250M, COYO700M, SA-1B, MINT, Obelics. 📜 Cite: If you find LLaVA-OneVision-1.5-Mid-Training-85M useful in your research, please consider citing the following related papers: @misc{an2025llavaonevision15fullyopenframework, title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training}… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M.

size_categories: 10M<n<100M · format: parquet · modality: image · modality: text · library: datasets · library: dask
mlfoundations/MINT-1T-HTML HF Unverified

🍃 MINT-1T: same description as the MINT-1T entry above (an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images). See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-HTML.

task_categories: image-to-text · task_categories: text-generation · language: en · size_categories: 100M<n<1B · format: parquet · modality: text
mlfoundations/MINT-1T-PDF-CC-2023-14 HF PQC Verified

🍃 MINT-1T: same description as the MINT-1T entry above (an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images). See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-14.

task_categories: image-to-text · task_categories: text-generation · language: en · size_categories: 1M<n<10M · format: webdataset · modality: image
uoft-cs/cifar10 HF Unverified

Dataset Card for CIFAR-10. Dataset Summary: The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.

task_categories: image-classification · annotations_creators: crowdsourced · language_creators: found · multilinguality: monolingual · source_datasets: extended|other-80-million-tiny-images · language: en
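The counts quoted in the CIFAR-10 card above are internally consistent; a quick arithmetic check (pure Python, nothing downloaded):

```python
# Sanity-check the CIFAR-10 counts stated in the dataset card above.
classes = 10
images_per_class = 6000
total = classes * images_per_class      # 60000 images in all
train, test = 50000, 10000
assert train + test == total            # the train/test split covers the dataset

train_batches = 5
per_batch = train // train_batches      # 10000 images per training batch
assert per_batch == 10000               # same size as the single test batch

# The test batch holds exactly 1000 randomly selected images from each class.
assert test // classes == 1000
```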
locuslab/TOFU HF Unverified

TOFU: Task of Fictitious Unlearning 🍢 The TOFU dataset serves as a benchmark for evaluating the unlearning performance of large language models on realistic tasks. The dataset comprises question-answer pairs based on autobiographies of 200 authors who do not exist, generated entirely by the GPT-4 model. The goal of the task is to unlearn a fine-tuned model on various fractions of the forget set. Quick Links: Website: the landing page for TOFU… See the full description on the dataset page: https://huggingface.co/datasets/locuslab/TOFU.

task_categories: question-answering · task_ids: closed-domain-qa · annotations_creators: machine-generated · language_creators: machine-generated · multilinguality: monolingual · source_datasets: original
wyu1/Leopard-Instruct HF Unverified

Leopard-Instruct. Paper | GitHub | Models-LLaVA | Models-Idefics2. Summary: Leopard-Instruct is a large instruction-tuning dataset comprising 925K instances, 739K of which are specifically designed for text-rich, multi-image scenarios. It has been used to train Leopard-LLaVA [checkpoint] and Leopard-Idefics2 [checkpoint]. Loading the dataset: to load the dataset without automatically downloading and processing the images, please run the following code with datasets==2.18.0… See the full description on the dataset page: https://huggingface.co/datasets/wyu1/Leopard-Instruct.

language: en · size_categories: 1M<n<10M · format: parquet · modality: image · modality: text · library: datasets
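The Leopard-Instruct card's loading snippet is truncated in the excerpt above. A hypothetical sketch with the 🤗 `datasets` library (the card asks for datasets==2.18.0); the split name and streaming behaviour are assumptions, not taken from the card:

```python
REPO_ID = "wyu1/Leopard-Instruct"

def load_leopard(split="train", streaming=True):
    """Stream Leopard-Instruct so its ~925K instances (and their images)
    are fetched lazily instead of being downloaded up front.

    Requires `pip install datasets==2.18.0`; the import is deferred so
    merely defining this function needs no network access.
    """
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=split, streaming=streaming)

# Usage (fetches the first record lazily over the network):
# example = next(iter(load_leopard()))
# print(sorted(example.keys()))
```

Dropping `streaming=True` gives random access at the cost of downloading the full parquet shards locally.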
deepset/xlm-roberta-large-squad2 HF Unverified

Question Answering · Transformers · PyTorch · Safetensors · XLM-RoBERTa · Multilingual · HIGH
allenai/openbookqa HF Unverified

Dataset Card for OpenBookQA. Dataset Summary: OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/openbookqa.

task_categories: question-answering · task_ids: open-domain-qa · annotations_creators: crowdsourced · annotations_creators: expert-generated · language_creators: expert-generated · multilinguality: monolingual
Idavidrein/gpqa HF Unverified

Dataset Card for GPQA. GPQA is a multiple-choice Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions outside their own domain (e.g., a physicist answering a chemistry question), these experts achieve only 34% accuracy, despite spending more than 30 minutes with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.

benchmark: official · benchmark: eval-yaml · task_categories: question-answering · task_categories: text-generation · language: en · size_categories: 1K<n<10K
Showing 20 of 531 items (page 20 of 27)