---
language: en
license: mit
tags:
- vision
- video-classification
model-index:
- name: nielsr/xclip-base-patch32
  results:
  - task:
      type: video-classification
    dataset:
      name: Kinetics 400
      type: kinetics-400
    metrics:
    - type: top-1 accuracy
      value: 80.4
    - type: top-5 accuracy
      value: 95.0
---

# X-CLIP (base-sized model)

X-CLIP model (base-sized, patch resolution of 32) trained fully-supervised on [Kinetics-400](https://www.deepmind.com/open-source/kinetics). It was introduced in the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Ni et al. and first released in [this repository](https://github.com/microsoft/VideoX/tree/master/X-CLIP).

This model was trained using 8 frames per video, at a resolution of 224x224.
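
Since the model expects 8 frames per clip, a video usually has to be temporally subsampled before it is fed to the processor. Below is a minimal sketch of uniform frame sampling, assuming the video has already been decoded into an array of frames; the `sample_frame_indices` helper is illustrative and not part of the library.

```python
import numpy as np

def sample_frame_indices(num_frames_to_sample, total_frames):
    """Pick `num_frames_to_sample` indices spread uniformly over the clip."""
    # linspace gives evenly spaced positions; cast them to valid frame indices
    indices = np.linspace(0, total_frames - 1, num=num_frames_to_sample)
    return indices.astype(np.int64)

# e.g. a 120-frame clip sampled down to the 8 frames the model was trained with
indices = sample_frame_indices(num_frames_to_sample=8, total_frames=120)
# video[indices] would then be passed to the processor (see "How to use" below)
```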

Disclaimer: The team releasing X-CLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

X-CLIP is a minimal extension of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs.
| 34 |  |

This allows the model to be used for tasks like zero-shot, few-shot, or fully supervised video classification and video-text retrieval.

## Intended uses & limitations

You can use the raw model to determine how well a piece of text matches a given video. See the [model hub](https://huggingface.co/models?search=microsoft/xclip) to look for fine-tuned versions on a task that interests you.

### How to use

For code examples, we refer to the [documentation](https://huggingface.co/transformers/main/model_doc/xclip.html#).
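
As a quick start, here is a minimal sketch of zero-shot text-video scoring with the Transformers `XCLIPProcessor` and `XCLIPModel` classes. The checkpoint name and the candidate labels are illustrative, and the video is assumed to already be a list of 8 decoded frames (e.g. NumPy arrays of shape height x width x 3).

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

# checkpoint name is an assumption; use the checkpoint that matches this card
model_name = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

# 8 random frames stand in for a real, decoded video clip
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
texts = ["playing sports", "cooking", "playing a musical instrument"]

inputs = processor(text=texts, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# similarity of the video to each candidate text, turned into probabilities
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```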

## Training data

This model was trained on [Kinetics-400](https://www.deepmind.com/open-source/kinetics).

### Preprocessing

The exact details of preprocessing during training can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L247).

The exact details of preprocessing during validation can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L285).

During validation, the shorter edge of each frame is resized, after which center cropping is performed to a fixed-size resolution (e.g. 224x224). Next, frames are normalized across the RGB channels using the ImageNet mean and standard deviation.
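
A rough equivalent of this validation-time pipeline, sketched with torchvision. The resize size, interpolation, and normalization constants are assumptions based on the description above; check the linked `build.py` for the exact values.

```python
from torchvision import transforms

# commonly used ImageNet channel statistics (assumed; see build.py for exact constants)
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

val_transform = transforms.Compose([
    transforms.Resize(224),        # resize the shorter edge of the frame
    transforms.CenterCrop(224),    # crop a fixed 224x224 region from the center
    transforms.ToTensor(),         # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
# applied to each frame independently, then the 8 frames are stacked into a clip
```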

## Evaluation results

This model achieves a top-1 accuracy of 80.4% and a top-5 accuracy of 95.0% on Kinetics-400.
| 62 | |