---
language: en
license: mit
tags:
- vision
- video-classification
model-index:
- name: nielsr/xclip-base-patch32
  results:
  - task:
      type: video-classification
    dataset:
      name: Kinetics 400
      type: kinetics-400
    metrics:
    - type: top-1 accuracy
      value: 80.4
    - type: top-5 accuracy
      value: 95.0
---

# X-CLIP (base-sized model)

X-CLIP model (base-sized, patch resolution of 32) trained fully-supervised on [Kinetics-400](https://www.deepmind.com/open-source/kinetics). It was introduced in the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Ni et al. and first released in [this repository](https://github.com/microsoft/VideoX/tree/master/X-CLIP).

This model was trained using 8 frames per video, at a resolution of 224x224.
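
Since the model expects 8 frames per clip, a video usually has to be temporally subsampled before it is fed to the processor. Below is a minimal sketch of uniform frame sampling, assuming the video has already been decoded into an array of frames; the `sample_frame_indices` helper is illustrative and not part of the library.

```python
import numpy as np

def sample_frame_indices(num_frames_to_sample, total_frames):
    """Pick `num_frames_to_sample` indices spread uniformly over the clip."""
    # linspace gives evenly spaced positions; cast them to valid frame indices
    indices = np.linspace(0, total_frames - 1, num=num_frames_to_sample)
    return indices.astype(np.int64)

# e.g. a 120-frame clip sampled down to the 8 frames the model was trained with
indices = sample_frame_indices(num_frames_to_sample=8, total_frames=120)
# video[indices] would then be passed to the processor (see "How to use" below)
```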

Disclaimer: The team releasing X-CLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

X-CLIP is a minimal extension of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs.
| 34 |  |

This allows the model to be used for tasks like zero-shot, few-shot, or fully supervised video classification and video-text retrieval.

## Intended uses & limitations

You can use the raw model to determine how well a piece of text matches a given video. See the [model hub](https://huggingface.co/models?search=microsoft/xclip) to look for fine-tuned versions on a task that interests you.

### How to use

For code examples, we refer to the [documentation](https://huggingface.co/transformers/main/model_doc/xclip.html#).
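
As a quick start, here is a minimal sketch of zero-shot text-video scoring with the Transformers `XCLIPProcessor` and `XCLIPModel` classes. The checkpoint name and the candidate labels are illustrative, and the video is assumed to already be a list of 8 decoded frames (e.g. NumPy arrays of shape height x width x 3).

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

# checkpoint name is an assumption; use the checkpoint that matches this card
model_name = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

# 8 random frames stand in for a real, decoded video clip
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
texts = ["playing sports", "cooking", "playing a musical instrument"]

inputs = processor(text=texts, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# similarity of the video to each candidate text, turned into probabilities
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```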

## Training data

This model was trained on [Kinetics-400](https://www.deepmind.com/open-source/kinetics).

### Preprocessing

The exact details of preprocessing during training can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L247).

The exact details of preprocessing during validation can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L285).

During validation, the shorter edge of each frame is resized, after which center cropping is performed to a fixed-size resolution (e.g. 224x224). Next, frames are normalized across the RGB channels using the ImageNet mean and standard deviation.
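
A rough equivalent of this validation-time pipeline, sketched with torchvision. The resize size, interpolation, and normalization constants are assumptions based on the description above; check the linked `build.py` for the exact values.

```python
from torchvision import transforms

# commonly used ImageNet channel statistics (assumed; see build.py for exact constants)
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

val_transform = transforms.Compose([
    transforms.Resize(224),        # resize the shorter edge of the frame
    transforms.CenterCrop(224),    # crop a fixed 224x224 region from the center
    transforms.ToTensor(),         # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
# applied to each frame independently, then the 8 frames are stacked into a clip
```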

## Evaluation results

This model achieves a top-1 accuracy of 80.4% and a top-5 accuracy of 95.0% on Kinetics-400.
| 62 | |