---
base_model:
- MCG-NJU/videomae-large
pipeline_tag: video-classification
library_name: transformers
---
[VideoMAE model](https://huggingface.co/docs/transformers/model_doc/videomae) (`large`) variant fine-tuned for multi-label video classification (a video can belong to multiple classes simultaneously), applied to camera motion classification on an internal dataset.

The model predicts `18` camera motion classes: `'arc_left', 'arc_right', 'dolly_in', 'dolly_out', 'pan_left', 'pan_right', 'pedestal_down', 'pedestal_up', 'roll_left', 'roll_right', 'static', 'tilt_down', 'tilt_up', 'truck_left', 'truck_right', 'undefined', 'zoom_in', 'zoom_out'` and `3` shot type classes: `'pov', 'shake', 'track'`.
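
As a minimal sketch, the checkpoint and its label set can be inspected with `transformers`; the repo id below is a placeholder, not the actual name of this checkpoint:

```python
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Placeholder repo id -- substitute the actual path/name of this checkpoint.
ckpt = "your-org/videomae-large-camera-motion"

processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

print(model.config.num_frames)          # 16 frames per input clip
print(len(model.config.id2label))       # 21 classes: 18 camera motions + 3 shot types
print(sorted(model.config.id2label.values()))
```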
The model was trained to associate the **entire video** with camera labels, not frame-level motions: [input video] -> one or more labels (multi-label) for the whole video. So, if a camera motion is present throughout all frames of the video, the model should predict that motion; otherwise it should predict `undefined`.
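
Because this is a multi-label setup, logits should be read per class with a sigmoid rather than a softmax. Below is a sketch of whole-clip inference; it assumes `decord` for frame decoding, the placeholder repo id from above, and a 0.5 decision threshold (the threshold is an assumption, not stated in this card):

```python
import numpy as np
import torch
from decord import VideoReader
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

ckpt = "your-org/videomae-large-camera-motion"  # placeholder repo id
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt).eval()

# Sample config.num_frames (16) frames uniformly across the whole clip.
vr = VideoReader("clip.mp4")
idx = np.linspace(0, len(vr) - 1, num=model.config.num_frames).astype(int)
frames = list(vr.get_batch(idx).asnumpy())  # list of HxWxC uint8 frames

inputs = processor(frames, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Independent per-class probabilities; keep every class above the threshold.
probs = torch.sigmoid(logits)
predicted = [model.config.id2label[i] for i, p in enumerate(probs.tolist()) if p > 0.5]
print(predicted)
```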
The model is configured to process `config.num_frames=16` frames per input clip. These frames are sampled uniformly from the input video, regardless of its original duration. For videos longer than **2 seconds**, processing the entire video as a single clip may miss temporal nuances (e.g., varying camera motions), so the recommended workflow for such videos is as follows (see the sketch after the list):

(a) split the video into non-overlapping 2-second segments (or sliding windows with optional overlap);

(b) run inference independently on each segment;

(c) post-process the results.
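
A sketch of this segment-based workflow, under the same assumptions as above (placeholder repo id, `decord` for decoding, an assumed 0.5 threshold); the post-processing step is kept as a simple per-segment report:

```python
import numpy as np
import torch
from decord import VideoReader
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

ckpt = "your-org/videomae-large-camera-motion"  # placeholder repo id
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt).eval()

SEGMENT_SEC = 2.0  # segment length recommended above
THRESHOLD = 0.5    # assumed decision threshold

vr = VideoReader("long_video.mp4")
fps = vr.get_avg_fps()
seg_len = max(int(SEGMENT_SEC * fps), model.config.num_frames)

results = []
for start in range(0, len(vr), seg_len):             # (a) non-overlapping 2-second segments
    end = min(start + seg_len, len(vr))
    if end - start < model.config.num_frames:
        break  # drop a tail too short to sample 16 frames from
    idx = np.linspace(start, end - 1, num=model.config.num_frames).astype(int)
    frames = list(vr.get_batch(idx).asnumpy())

    inputs = processor(frames, return_tensors="pt")   # (b) per-segment inference
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits[0])

    labels = [model.config.id2label[i] for i, p in enumerate(probs.tolist()) if p > THRESHOLD]
    results.append({"start_sec": start / fps, "end_sec": end / fps, "labels": labels})

# (c) post-process: here, just print per-segment labels; merging consecutive
# segments that share the same labels is one possible refinement.
for seg in results:
    print(seg)
```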
Model accuracy on an internal test dataset of 2-second videos is **75%**; ignoring the `'pov', 'shake', 'track'` classes, it is approximately **84%**.

An inference example can be found [here](https://github.com/gen-ai-team/kandinsky-video-tools/blob/main/demo/camera_motion_classifier_demo.ipynb).