---
license: apache-2.0
pipeline_tag: video-classification
tags:
- video
library_name: transformers
---

# V-JEPA 2

V-JEPA 2 is a frontier video understanding model developed by FAIR at Meta. It extends the pretraining objectives of [V-JEPA](https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/) and scales up both data and model size, resulting in state-of-the-art video understanding capabilities.
The code is released [in this repository](https://github.com/facebookresearch/vjepa2).

<img src="https://github.com/user-attachments/assets/914942d8-6a1e-409d-86ff-ff856b7346ab">

## Installation

To run the V-JEPA 2 model, make sure you have the latest version of transformers installed:

```bash
pip install -U git+https://github.com/huggingface/transformers
```

## Intended Uses

V-JEPA 2 is intended to produce representations of any video (or image) for video classification, retrieval, or use as a video encoder for VLMs.

```python
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitg-fpc64-256"

# Load the pretrained encoder and its matching video processor
model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```

To embed a video, sample as many frames as the model expects. This checkpoint uses 64 frames per clip.

```python
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
# Take the first 64 frames; a more elaborate sampling strategy can be defined here
frame_idx = np.arange(0, 64)
video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    video_embeddings = model.get_vision_features(**video)

print(video_embeddings.shape)
```
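The snippet above takes the first 64 frames, which covers only the start of a longer clip. As one possible "more complex sampling strategy", the sketch below spreads the 64 indices evenly over the whole clip. The helper name `sample_frame_indices` and the 300-frame clip length are hypothetical and chosen for illustration; this is not necessarily the sampling used to train the model.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_samples: int = 64) -> np.ndarray:
    """Return `num_samples` frame indices spread evenly over a clip (hypothetical helper)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # linspace includes the first and last frame; rounding yields valid integer indices.
    # Clips shorter than `num_samples` end up with repeated indices.
    positions = np.linspace(0, total_frames - 1, num=num_samples)
    return np.round(positions).astype(np.int64)


# Example with an assumed clip length of 300 frames
frame_idx = sample_frame_indices(total_frames=300, num_samples=64)
print(frame_idx.shape, frame_idx[0], frame_idx[-1])
```

The resulting array can be passed as `indices` to `vr.get_frames_at` in place of `np.arange(0, 64)`, provided the real frame count of the clip is used for `total_frames`.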

To embed an image, repeat it along the frame axis for the desired number of frames.

```python
import torch
from transformers.image_utils import load_image

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
pixel_values = processor(image, return_tensors="pt").to(model.device)["pixel_values_videos"]
pixel_values = pixel_values.repeat(1, 16, 1, 1, 1)  # repeat the image 16 times along the frame axis

with torch.no_grad():
    image_embeddings = model.get_vision_features(pixel_values)

print(image_embeddings.shape)
```
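For retrieval, one of the intended uses listed above, the embeddings need to be reduced to one vector per video and compared. The sketch below mean-pools token embeddings and ranks a small gallery by cosine similarity. It assumes the encoder output has shape `(batch, tokens, dim)` and uses random tensors as stand-ins; mean pooling is a simple baseline chosen for illustration, not necessarily what the authors use, and `pool_and_normalize` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F


def pool_and_normalize(token_embeddings: torch.Tensor) -> torch.Tensor:
    """Mean-pool (batch, tokens, dim) embeddings into unit-length (batch, dim) vectors."""
    pooled = token_embeddings.mean(dim=1)
    return F.normalize(pooled, dim=-1)


# Random stand-ins for encoder outputs; the shapes are illustrative only
torch.manual_seed(0)
gallery_tokens = torch.randn(4, 32, 16)  # 4 gallery videos
# The query is a slightly perturbed copy of gallery video 2
query_tokens = gallery_tokens[2:3] + 0.01 * torch.randn(1, 32, 16)

gallery = pool_and_normalize(gallery_tokens)
query = pool_and_normalize(query_tokens)

# On unit vectors, cosine similarity reduces to a dot product
scores = query @ gallery.T  # shape (1, 4)
best_match = scores.argmax(dim=-1).item()
print(best_match)
```

In practice, `gallery_tokens` and `query_tokens` would be replaced by outputs such as `video_embeddings` or `image_embeddings` from the examples above.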

For more code examples, please refer to the V-JEPA 2 documentation.

### Citation

```
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
  Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
  Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
  Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
  Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
  Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```