---
license: "mit"
tags:
- vision
- video-classification
---

# ViViT (Video Vision Transformer)

ViViT model as introduced in the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Arnab et al. and first released in [this repository](https://github.com/google-research/scenic/tree/main/scenic/projects/vivit).
Disclaimer: The team releasing ViViT did not write a model card for this model, so this model card has been written by the Hugging Face team.
13
14 ## Model description
15
16 ViViT is an extension of the [Vision Transformer (ViT)](https://huggingface.co/docs/transformers/v4.27.0/model_doc/vit) to video.
17
18 We refer to the paper for details.
19
20 ## Intended uses & limitations
21
The model is mostly intended to be fine-tuned on a downstream task, such as video classification. See the [model hub](https://huggingface.co/models?filter=vivit) to look for fine-tuned versions on a task that interests you.

### How to use

For code examples, we refer to the [documentation](https://huggingface.co/transformers/main/model_doc/vivit).

### BibTeX entry and citation info

```bibtex
@misc{arnab2021vivit,
  title={ViViT: A Video Vision Transformer},
  author={Anurag Arnab and Mostafa Dehghani and Georg Heigold and Chen Sun and Mario Lučić and Cordelia Schmid},
  year={2021},
  eprint={2103.15691},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```