---
license: mit
pipeline_tag: video-classification
tags:
- video
library_name: transformers
---

# V-JEPA 2

V-JEPA 2 is a frontier video understanding model developed by FAIR, Meta. It extends the pretraining objectives of [V-JEPA](https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/) and, by scaling up both data and model size, achieves state-of-the-art video understanding performance.
The code is released [in this repository](https://github.com/facebookresearch/vjepa2).

<img src="https://github.com/user-attachments/assets/914942d8-6a1e-409d-86ff-ff856b7346ab">&nbsp;

## Installation

To run the V-JEPA 2 model, make sure you have the latest version of transformers installed:

```bash
pip install -U git+https://github.com/huggingface/transformers
```

## Intended Uses

V-JEPA 2 is intended to produce representations of any video (and image) that can be used for video classification, retrieval, or as a video encoder for VLMs.

```python
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitl-fpc64-256"

model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```
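
Optionally, if a GPU is available, you can load the model in half precision for faster inference. This is a minimal sketch using the standard `torch_dtype` option of `from_pretrained`, not something specific to this checkpoint:

```python
import torch
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitl-fpc64-256"

# Use float16 on GPU; fall back to float32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModel.from_pretrained(hf_repo, torch_dtype=dtype).to(device)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```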

To load a video, sample the number of frames the model expects. For this checkpoint, that is 64.

```python
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64)  # choose which frames to sample; you can define a more complex sampling strategy here
video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    video_embeddings = model.get_vision_features(**video)

print(video_embeddings.shape)
```
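
The output of `get_vision_features` contains one embedding per spatiotemporal patch. If you need a single clip-level vector, for example for retrieval, one simple option (an illustrative choice, not prescribed by the model) is to mean-pool over the patch dimension:

```python
# video_embeddings: (batch_size, num_patches, hidden_dim)
clip_embedding = video_embeddings.mean(dim=1)  # (batch_size, hidden_dim)
print(clip_embedding.shape)
```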

To load an image, simply repeat the image for the desired number of frames.

```python
from transformers.image_utils import load_image

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
pixel_values = processor(image, return_tensors="pt").to(model.device)["pixel_values_videos"]
pixel_values = pixel_values.repeat(1, 16, 1, 1, 1)  # repeat the single image 16 times along the frame dimension

with torch.no_grad():
    image_embeddings = model.get_vision_features(pixel_values)

print(image_embeddings.shape)
```
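
As a small retrieval-style sketch, the video and image embeddings computed above can be mean-pooled and compared with cosine similarity. The pooling and the similarity measure are illustrative choices, not something prescribed by the model:

```python
import torch.nn.functional as F

# Mean-pool the patch embeddings into one vector per input, then L2-normalize
video_vec = F.normalize(video_embeddings.mean(dim=1), dim=-1)
image_vec = F.normalize(image_embeddings.mean(dim=1), dim=-1)

# Cosine similarity between the video clip and the image
similarity = (video_vec @ image_vec.T).item()
print(f"cosine similarity: {similarity:.4f}")
```

For video classification, transformers also provides a `VJEPA2ForVideoClassification` head. The sketch below assumes a fine-tuned classification checkpoint is available; the repository name used here is an assumption for illustration, so substitute the classifier checkpoint you actually want to use:

```python
import torch
import numpy as np
from torchcodec.decoders import VideoDecoder
from transformers import AutoVideoProcessor, VJEPA2ForVideoClassification

# NOTE: assumed checkpoint name for illustration; replace with a real fine-tuned V-JEPA 2 classifier
clf_repo = "facebook/vjepa2-vitl-fpc16-256-ssv2"

clf_model = VJEPA2ForVideoClassification.from_pretrained(clf_repo)
clf_processor = AutoVideoProcessor.from_pretrained(clf_repo)

vr = VideoDecoder("https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4")
frames = vr.get_frames_at(indices=np.arange(0, 16)).data  # T x C x H x W, matching the 16 frames assumed for this checkpoint
inputs = clf_processor(frames, return_tensors="pt").to(clf_model.device)

with torch.no_grad():
    logits = clf_model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(clf_model.config.id2label[predicted_class])
```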

For more code examples, please refer to the V-JEPA 2 documentation.


### Citation

```
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
          Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
          Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
          Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
          Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
          Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```