---
license: apache-2.0
tags:
- object-detection
- vision
datasets:
- coco
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
---

# YOLOS (small-sized) model

YOLOS model fine-tuned on COCO 2017 object detection (118k annotated images). It was introduced in the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Fang et al. and first released in [this repository](https://github.com/hustvl/YOLOS).

Disclaimer: The team releasing YOLOS did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

YOLOS is a Vision Transformer (ViT) trained using the DETR loss. Despite its simplicity, a base-sized YOLOS model is able to achieve 42 AP on COCO validation 2017 (similar to DETR and more complex frameworks such as Faster R-CNN).

The model is trained using a "bipartite matching loss": the predicted class and bounding box of each of the N = 100 object queries are compared to the ground-truth annotations, padded up to the same length N (so if an image only contains 4 objects, the remaining 96 annotations simply have "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU loss (for the bounding boxes) are used to optimize the parameters of the model.
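
To make the matching step concrete, here is a small, self-contained sketch (not the actual YOLOS training code) that pairs predictions with padded ground-truth targets using the Hungarian algorithm from SciPy. For brevity, the cost below combines only class probability and L1 box distance; the real matching cost also includes the generalized IoU term, and all shapes and values are made up for illustration.

```python
import torch
from scipy.optimize import linear_sum_assignment

N, num_classes = 100, 91               # object queries / detection tokens, COCO label ids

pred_logits = torch.randn(N, num_classes + 1)  # class logits per query (+1 for "no object")
pred_boxes = torch.rand(N, 4)                  # normalized (cx, cy, w, h) per query

# an image with 4 ground-truth objects; the other N - 4 slots are "no object" padding
gt_labels = torch.tensor([17, 17, 63, 75])
gt_boxes = torch.rand(4, 4)

prob = pred_logits.softmax(-1)
cost_class = -prob[:, gt_labels]                    # (N, 4): low class probability -> high cost
cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, 4): L1 distance between boxes

# Hungarian matching: optimal one-to-one assignment of queries to ground-truth objects
query_idx, gt_idx = linear_sum_assignment((cost_bbox + cost_class).numpy())
# queries not in query_idx are supervised to predict the "no object" class
```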

## Intended uses & limitations

You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=hustvl/yolos) to look for all available YOLOS models.

### How to use

Here is how to use this model:

```python
from transformers import YolosFeatureExtractor, YolosForObjectDetection
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = YolosFeatureExtractor.from_pretrained('hustvl/yolos-small')
model = YolosForObjectDetection.from_pretrained('hustvl/yolos-small')

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)

# model predicts bounding boxes and corresponding COCO classes
logits = outputs.logits
bboxes = outputs.pred_boxes
```
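
The logits and boxes above are raw outputs: per-query class scores (the last index corresponds to the "no object" class) and boxes in normalized `(center_x, center_y, width, height)` format. Below is a minimal post-processing sketch that keeps confident detections and converts the boxes to pixel coordinates; the 0.9 threshold is an arbitrary choice for illustration. Depending on your version of Transformers, the feature extractor (image processor) may also offer a `post_process_object_detection` helper that does this conversion for you.

```python
import torch

# drop the last ("no object") class and keep queries whose best class is confident
probas = outputs.logits.softmax(-1)[0, :, :-1]
scores, labels = probas.max(-1)
keep = scores > 0.9  # illustrative threshold

# convert boxes from normalized (cx, cy, w, h) to pixel (x_min, y_min, x_max, y_max)
img_w, img_h = image.size
cx, cy, w, h = outputs.pred_boxes[0, keep].unbind(-1)
boxes = torch.stack(
    [(cx - 0.5 * w) * img_w, (cy - 0.5 * h) * img_h,
     (cx + 0.5 * w) * img_w, (cy + 0.5 * h) * img_h],
    dim=-1,
)

for score, label, box in zip(scores[keep], labels[keep], boxes):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```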

Currently, both the feature extractor and model support PyTorch.

## Training data

The YOLOS model was pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet2012) and fine-tuned on [COCO 2017 object detection](https://cocodataset.org/#download), a dataset consisting of 118k/5k annotated images for training/validation respectively.

### Training

The model was pre-trained for 200 epochs on ImageNet-1k and fine-tuned for 150 epochs on COCO.

## Evaluation results

This model achieves an AP (average precision) of **36.1** on COCO 2017 validation. For more details regarding evaluation results, we refer to Table 1 of the original paper.

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2106-00666,
  author     = {Yuxin Fang and
                Bencheng Liao and
                Xinggang Wang and
                Jiemin Fang and
                Jiyang Qi and
                Rui Wu and
                Jianwei Niu and
                Wenyu Liu},
  title      = {You Only Look at One Sequence: Rethinking Transformer in Vision through
                Object Detection},
  journal    = {CoRR},
  volume     = {abs/2106.00666},
  year       = {2021},
  url        = {https://arxiv.org/abs/2106.00666},
  eprinttype = {arXiv},
  eprint     = {2106.00666},
  timestamp  = {Fri, 29 Apr 2022 19:49:16 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2106-00666.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```