---
library_name: transformers
license: apache-2.0
language:
- en
pipeline_tag: object-detection
tags:
- object-detection
- vision
datasets:
- coco
widget:
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
---

# Model Card for RT-DETR

## Table of Contents

1. [Model Details](#model-details)
2. [Model Sources](#model-sources)
3. [How to Get Started with the Model](#how-to-get-started-with-the-model)
4. [Training Details](#training-details)
5. [Evaluation](#evaluation)
6. [Model Architecture and Objective](#model-architecture-and-objective)
7. [Citation](#citation)

## Model Details

![RT-DETR performance comparison](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/rt_detr_overview.png)

> The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy.
> However, we observe that the speed and accuracy of YOLOs are negatively affected by the NMS.
> Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS.
> Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS.
> In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma.
> We build RT-DETR in two steps, drawing on the advanced DETR:
> first we focus on maintaining accuracy while improving speed, followed by maintaining speed while improving accuracy.
> Specifically, we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed.
> Then, we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder, thereby improving accuracy.
> In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining.
> Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy.
> We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models).
> Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS.
> After pre-training with Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. Project page: [https://zhao-yian.github.io/RTDETR/](https://zhao-yian.github.io/RTDETR/).

This is the model card of a 🤗 [transformers](https://huggingface.co/docs/transformers/index) model that has been pushed to the Hub.

- **Developed by:** Yian Zhao and Sangbum Choi
- **Funded by:** National Key R&D Program of China (No.2022ZD0118201), Natural Science Foundation of China (No.61972217, 32071459, 62176249, 62006133, 62271465), and the Shenzhen Medical Research Funds in China (No.B2302037).
- **Shared by:** Sangbum Choi
- **Model type:** [RT-DETR](https://huggingface.co/docs/transformers/main/en/model_doc/rt_detr)
- **License:** Apache-2.0

### Model Sources

- **HF Docs:** [RT-DETR](https://huggingface.co/docs/transformers/main/en/model_doc/rt_detr)
- **Repository:** https://github.com/lyuwenyu/RT-DETR
- **Paper:** https://arxiv.org/abs/2304.08069
- **Demo:** [RT-DETR Tracking](https://huggingface.co/spaces/merve/RT-DETR-tracking-coco)

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import requests

from PIL import Image
from transformers import RTDetrForObjectDetection, RTDetrImageProcessor

# Load a sample image from the COCO val2017 set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and model weights from the Hub
image_processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r18vd_coco_o365")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r18vd_coco_o365")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Map raw outputs back to the original image coordinates, keeping detections above 0.3
results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```
This should output:
```
sofa: 0.97 [0.14, 0.38, 640.13, 476.21]
cat: 0.96 [343.38, 24.28, 640.14, 371.5]
cat: 0.96 [13.23, 54.18, 318.98, 472.22]
remote: 0.95 [40.11, 73.44, 175.96, 118.48]
remote: 0.92 [333.73, 76.58, 369.97, 186.99]
```
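
The high-level `pipeline` API offers a shorter route to the same predictions; a minimal sketch (the 0.3 threshold mirrors the example above):

```python
from transformers import pipeline

# Object-detection pipeline around the same checkpoint
detector = pipeline("object-detection", model="PekingU/rtdetr_r18vd_coco_o365")

# Each detection is a dict with "label", "score", and "box" keys
for detection in detector("http://images.cocodataset.org/val2017/000000039769.jpg", threshold=0.3):
    print(detection)
```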

## Training Details

### Training Data

The RT-DETR model was trained on [COCO 2017 object detection](https://cocodataset.org/#download), a dataset consisting of 118k annotated images for training and 5k for validation.
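
For reference, once the images and annotation files are downloaded from the link above, the dataset can be loaded with torchvision; a minimal sketch (the local paths are placeholders for wherever you unpacked COCO 2017):

```python
from torchvision.datasets import CocoDetection

# Placeholder paths: point these at your local COCO 2017 download
train_dataset = CocoDetection(
    root="coco/train2017",
    annFile="coco/annotations/instances_train2017.json",
)

image, annotations = train_dataset[0]  # a PIL image and its list of COCO annotation dicts
print(len(train_dataset), len(annotations))
```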

### Training Procedure

We conduct experiments on the COCO and Objects365 datasets: RT-DETR is trained on COCO train2017 and validated on COCO val2017.
We report the standard COCO metrics, including AP (averaged over uniformly sampled IoU thresholds from 0.50 to 0.95 with a step size of 0.05),
AP50, AP75, as well as AP at different scales: AP-S, AP-M, AP-L.
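
As a sketch of how these numbers are produced, the COCO API's `COCOeval` computes AP, AP50, AP75, and the scale-specific variants from a ground-truth file and a predictions file in COCO result format (both file names below are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder files: val2017 ground truth and model predictions in COCO result format
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("rtdetr_predictions.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, AP-S, AP-M, AP-L
```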

### Preprocessing

Images are resized to 640x640 pixels and normalized with `image_mean=[0.485, 0.456, 0.406]` and `image_std=[0.229, 0.224, 0.225]`.
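
The checkpoint's image processor applies these steps automatically; constructing one explicitly would look roughly like the sketch below (argument names follow the generic 🤗 image-processor interface, and the exact flag combination is an assumption rather than the checkpoint's stored config):

```python
from transformers import RTDetrImageProcessor

# Reproduce the preprocessing described above explicitly
image_processor = RTDetrImageProcessor(
    size={"height": 640, "width": 640},
    do_normalize=True,  # assumed here because mean/std normalization is described
    image_mean=[0.485, 0.456, 0.406],
    image_std=[0.229, 0.224, 0.225],
)
```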

### Training Hyperparameters

![RT-DETR training hyperparameters](https://github.com/lyuwenyu/RT-DETR/assets/17582080/7bb08bd9-4c41-48e3-a8bd-b26c81cbb364)

## Evaluation

| Model | #Epochs | #Params (M) | GFLOPs | FPS (bs=1) | AP (val) | AP50 (val) | AP75 (val) | AP-S (val) | AP-M (val) | AP-L (val) |
|---|---|---|---|---|---|---|---|---|---|---|
| RT-DETR-R18 | 72 | 20 | 60.7 | 217 | 46.5 | 63.8 | 50.4 | 28.4 | 49.8 | 63.0 |
| RT-DETR-R34 | 72 | 31 | 91.0 | 172 | 48.5 | 66.2 | 52.3 | 30.2 | 51.9 | 66.2 |
| RT-DETR-R50 | 72 | 42 | 136 | 108 | 53.1 | 71.3 | 57.7 | 34.8 | 58.0 | 70.0 |
| RT-DETR-R101 | 72 | 76 | 259 | 74 | 54.3 | 72.7 | 58.6 | 36.0 | 58.8 | 72.1 |
| RT-DETR-R18 (Objects365 pretrained) | 60 | 20 | 61 | 217 | 49.2 | 66.6 | 53.5 | 33.2 | 52.3 | 64.8 |
| RT-DETR-R50 (Objects365 pretrained) | 24 | 42 | 136 | 108 | 55.3 | 73.4 | 60.1 | 37.9 | 59.9 | 71.8 |
| RT-DETR-R101 (Objects365 pretrained) | 24 | 76 | 259 | 74 | 56.2 | 74.6 | 61.3 | 38.3 | 60.5 | 73.5 |
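
The FPS column reflects the paper's benchmark on a T4 GPU (with TensorRT FP16); timings of the 🤗 implementation will differ. A rough sketch for measuring throughput on your own hardware, not comparable to the table:

```python
import time

import torch
from transformers import RTDetrForObjectDetection

device = "cuda" if torch.cuda.is_available() else "cpu"
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r18vd_coco_o365").to(device).eval()
pixel_values = torch.randn(1, 3, 640, 640, device=device)  # dummy batch-size-1 input

with torch.no_grad():
    for _ in range(10):  # warm-up iterations
        model(pixel_values=pixel_values)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(pixel_values=pixel_values)
    if device == "cuda":
        torch.cuda.synchronize()

print(f"~{100 / (time.perf_counter() - start):.1f} images/s")
```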

## Model Architecture and Objective

![RT-DETR architecture overview](https://huggingface.co/danelcsb/rtdetr_r50vd/resolve/main/architecture.png)

Overview of RT-DETR. We feed the features from the last three stages of the backbone into the encoder. The efficient hybrid
encoder transforms multi-scale features into a sequence of image features through the Attention-based Intra-scale Feature Interaction (AIFI)
and the CNN-based Cross-scale Feature Fusion (CCFF). Then, the uncertainty-minimal query selection selects a fixed number of encoder
features to serve as initial object queries for the decoder. Finally, the decoder with auxiliary prediction heads iteratively optimizes object
queries to generate categories and boxes.
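
These architectural knobs surface in the 🤗 configuration; a small sketch for inspecting them (attribute names as defined by `RTDetrConfig`):

```python
from transformers import RTDetrConfig

config = RTDetrConfig.from_pretrained("PekingU/rtdetr_r18vd_coco_o365")

print(config.num_queries)     # number of encoder features selected as initial object queries
print(config.decoder_layers)  # decoder depth, the speed-tuning knob mentioned in the abstract
```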

## Citation

**BibTeX:**

```bibtex
@misc{lv2023detrs,
      title={DETRs Beat YOLOs on Real-time Object Detection},
      author={Yian Zhao and Wenyu Lv and Shangliang Xu and Jinman Wei and Guanzhong Wang and Qingqing Dang and Yi Liu and Jie Chen},
      year={2023},
      eprint={2304.08069},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Model Card Authors

- [Sangbum Choi](https://huggingface.co/danelcsb)
- [Pavel Iakubovskii](https://huggingface.co/qubvel-hf)