---
license: apache-2.0
pipeline_tag: video-classification
tags:
- video
library_name: transformers
---

# V-JEPA 2

V-JEPA 2 is a frontier video understanding model developed by FAIR at Meta. It extends the pretraining objectives of [V-JEPA](https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/) and scales up both data and model size, resulting in state-of-the-art video understanding capabilities.
The code is released [in this repository](https://github.com/facebookresearch/vjepa2).

<img src="https://github.com/user-attachments/assets/914942d8-6a1e-409d-86ff-ff856b7346ab">&nbsp;

## Installation

To run the V-JEPA 2 model, make sure you have the latest version of `transformers` installed:

```bash
pip install -U git+https://github.com/huggingface/transformers
```

## Intended Uses

V-JEPA 2 produces representations of videos (and images) that can be used for video classification, for retrieval, or as a video encoder for VLMs.

```python
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitg-fpc64-256"

model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```

To load a video, sample the number of frames the model expects. This model uses 64 frames.

```python
import numpy as np
import torch
from torchcodec.decoders import VideoDecoder

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
# Take the first 64 frames; a more complex sampling strategy can be defined here.
frame_idx = np.arange(0, 64)
video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    video_embeddings = model.get_vision_features(**video)

print(video_embeddings.shape)
```
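
The comment in the snippet above mentions more complex sampling strategies. One common option is to spread the 64 frames evenly over the whole clip instead of taking the first 64. The following is a minimal sketch of that idea; `sample_frame_indices` is a hypothetical helper, not part of `transformers` or `torchcodec`, and uniform sampling is an assumption here rather than the sampling used to train the model.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 64) -> list[int]:
    """Return `num_samples` frame indices spread evenly over a clip.

    Hypothetical helper for illustration. Indices are non-decreasing and
    repeat when the clip has fewer than `num_samples` frames.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic keeps the first index at 0 and the last at `last`.
    return [i * last // (num_samples - 1) for i in range(num_samples)]


# Example: a 300-frame clip reduced to 64 evenly spaced frames.
frame_idx = sample_frame_indices(total_frames=300, num_samples=64)
print(len(frame_idx), frame_idx[0], frame_idx[-1])  # 64 0 299
```

The resulting list can be passed as `indices=` to `get_frames_at` in place of `np.arange(0, 64)`. The total frame count has to come from the decoder (for example from its stream metadata), which is not shown here.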

To embed an image, repeat it along the time dimension until it has the desired number of frames.

```python
import torch
from transformers.image_utils import load_image

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
pixel_values = processor(image, return_tensors="pt").to(model.device)["pixel_values_videos"]
pixel_values = pixel_values.repeat(1, 16, 1, 1, 1)  # repeat the image 16 times along the frame dimension

with torch.no_grad():
    image_embeddings = model.get_vision_features(pixel_values)

print(image_embeddings.shape)
```
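
For retrieval, token-level embeddings usually need to be reduced to one vector per clip before they can be compared. The sketch below assumes the output of `get_vision_features` has shape `(batch, num_tokens, hidden_dim)` and uses mean pooling followed by cosine similarity. Both are illustrative choices, not a method prescribed by the V-JEPA 2 authors, and `pool_embeddings` and `similarity_matrix` are hypothetical helper names. Random tensors stand in for real embeddings so the snippet runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def pool_embeddings(token_embeddings: torch.Tensor) -> torch.Tensor:
    """Mean-pool (batch, num_tokens, hidden_dim) to unit-norm (batch, hidden_dim)."""
    pooled = token_embeddings.mean(dim=1)
    return F.normalize(pooled, dim=-1)


def similarity_matrix(queries: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every query clip and every gallery clip."""
    return pool_embeddings(queries) @ pool_embeddings(gallery).T


# Stand-in tensors (assumed layout): 2 query clips, 3 gallery clips.
torch.manual_seed(0)
queries = torch.randn(2, 8, 16)
gallery = torch.randn(3, 8, 16)

scores = similarity_matrix(queries, gallery)
print(scores.shape)  # torch.Size([2, 3])
best_match = scores.argmax(dim=-1)  # index of the closest gallery clip per query
```

With real data, `queries` and `gallery` would be replaced by tensors such as `video_embeddings` and `image_embeddings` from the snippets above.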

For more code examples, please refer to the V-JEPA 2 documentation.

## Citation

```
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
  Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
  Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
  Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
  Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
  Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```