Model Hub

Browse PQC-verified AI models, datasets, and tools

deepset/bert-large-uncased-whole-word-masking-squad2 HF Unverified

Question Answering, Transformers, PyTorch, Tf, JAX, Safetensors | HIGH
nvidia/segformer-b1-finetuned-ade-512-512 HF Unverified

Image-Segmentation, Transformers, PyTorch, Tf, Segformer, Vision | MEDIUM
John6666/amanatsu-illustrious-v11-sdxl HF PQC Verified

Text-to-Image, Diffusers, Safetensors, Stable-Diffusion, Stable-Diffusion-Xl, Anime | HIGH
facebook/detr-resnet-50 HF PQC Verified

Object-Detection, Transformers, PyTorch, Safetensors, Detr, Vision | MEDIUM
typeform/distilbert-base-uncased-mnli HF Unverified

Zero-Shot Classification, Transformers, PyTorch, Tf, Safetensors, Distilbert | MEDIUM
WINGNUS/ACL-OCL HF Unverified

Dataset Card for ACL Anthology Corpus. This repository provides full text and metadata for the ACL Anthology collection (80k articles/posters as of September 2022), along with .pdf files and GROBID extractions of the PDFs. How is this different from what ACL Anthology provides and what already exists? We provide PDFs, full text, references, and other details extracted by GROBID from the PDFs, while ACL Anthology only provides abstracts. There exists a similar corpus… See the full description on the dataset page: https://huggingface.co/datasets/WINGNUS/ACL-OCL.

Task_categories:token-Classification, Language_creators:found, Multilinguality:monolingual, Source_datasets:original, Language:en, Size_categories:10K<n<100K
xingyang1/Distill-Any-Depth-Large-hf HF Unverified

Depth-Estimation, Transformers, Safetensors, Depth_anything, Distill-Any-Depth, Vision | HIGH
John6666/obsession-illustriousxl-v10-sdxl HF PQC Verified

Text-to-Image, Diffusers, Safetensors, Stable-Diffusion, Stable-Diffusion-Xl, Anime | HIGH
google-research-datasets/mbpp HF Unverified

Dataset Card for Mostly Basic Python Problems (mbpp). Dataset Summary: The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

Annotations_creators:crowdsourced, Annotations_creators:expert-Generated, Language_creators:crowdsourced, Language_creators:expert-Generated, Multilinguality:monolingual, Source_datasets:original
timpal0l/mdeberta-v3-base-squad2 HF Unverified

Question Answering, Transformers, PyTorch, Safetensors, Deberta-V2, Deberta | HIGH
cross-encoder/nli-deberta-v3-base HF Unverified

Zero-Shot Classification, Sentence-Transformers, PyTorch, ONNX, Safetensors, Deberta-V2 | HIGH
cagliostrolab/animagine-xl-3.1 HF PQC Verified

Text-to-Image, Diffusers, Safetensors, Stable-Diffusion, Stable-Diffusion-Xl, Base_model:cagliostrolab/animagine-Xl-3.0 | CRITICAL
stanfordnlp/imdb HF Unverified

Dataset Card for "imdb". Dataset Summary: Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Supported Tasks and Leaderboards: More Information Needed. Languages: More Information Needed. Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.

Task_categories:text-Classification, Task_ids:sentiment-Classification, Annotations_creators:expert-Generated, Language_creators:expert-Generated, Multilinguality:monolingual, Source_datasets:original
ropedia-ai/xperience-10m HF PQC Verified

⚠️ Important: If you have already submitted an access request but have not completed the required DocuSign agreement, your request will remain pending. Please complete signing and we will grant access once verified. Interactive Intelligence from Human Xperience. Xperience-10M Dataset Summary: Xperience-10M is a large-scale egocentric multimodal dataset of human experience for embodied AI, robotics, world models, and spatial… See the full description on the dataset page: https://huggingface.co/datasets/ropedia-ai/xperience-10m.

Task_categories:video-Classification, Task_categories:image-To-Text, Task_categories:depth-Estimation, Task_categories:robotics, Language:en, Size_categories:1M<n<10M
HuggingFaceH4/MATH-500 HF Unverified

Dataset Card for MATH-500. This dataset contains a subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits

Task_categories:text-Generation, Language:en, Size_categories:n<1K, Format:json, Modality:text, Library:datasets
nebius/SWE-rebench-V2-PRs HF Unverified

SWE-rebench-V2-PRs. Dataset Summary: SWE-rebench-V2-PRs is a large-scale dataset of real-world GitHub pull requests collected across multiple programming languages, intended for training and evaluating code-generation and software-engineering agents. The dataset contains 126,300 samples covering Go, Python, JavaScript, TypeScript, Rust, Java, C, C++, Julia, Elixir, Kotlin, PHP, Scala, Clojure, Dart, OCaml, and other languages. For log parser functions, base Dockerfiles, and… See the full description on the dataset page: https://huggingface.co/datasets/nebius/SWE-rebench-V2-PRs.

Task_categories:text-Generation, Language:en, Size_categories:100K<n<1M, Format:parquet, Modality:text, Library:datasets
anon8231489123/ShareGPT_Vicuna_unfiltered HF Unverified

Further cleaning done. Please look through the dataset and ensure that I didn't miss anything. Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c Two choices: Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.

Language:en
epfml/FineWeb-HQ HF Unverified

FineWeb-HQ. Dataset Summary: FineWeb-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb. FineWeb-HQ was created by selecting the top 10% of FineWeb documents based on a deep learning classifier trained to identify structured and knowledge-rich samples. This classifier uses XLM-RoBERTa embeddings to score documents. To validate our approach, we pretrained 1B-parameter LLM models with a Llama-like architecture across multiple languages and… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb-HQ.

Task_categories:text-Generation, Language:en, Size_categories:1B<n<10B, Format:parquet, Modality:tabular, Modality:text
mlfoundations/MINT-1T-PDF-CC-2024-18 HF PQC Verified

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2024-18.

Task_categories:image-To-Text, Task_categories:text-Generation, Language:en, Size_categories:100B<n<1T, Multimodal
allenai/MADLAD-400 HF PQC Verified

MADLAD-400. Dataset and Introduction: MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

Task_categories:text-Generation, Size_categories:n>1T
Showing 20 of 531 items (page 19 of 27)