---
base_model:
- MCG-NJU/videomae-large
pipeline_tag: video-classification
library_name: transformers
---
A [VideoMAE model](https://huggingface.co/docs/transformers/model_doc/videomae) (`large` variant) fine-tuned for multi-label video classification (a video can belong to multiple classes simultaneously), applied to camera motion classification on an internal dataset.

The model predicts `18` camera motion classes: `'arc_left', 'arc_right', 'dolly_in', 'dolly_out', 'pan_left', 'pan_right', 'pedestal_down', 'pedestal_up', 'roll_left', 'roll_right', 'static', 'tilt_down', 'tilt_up', 'truck_left', 'truck_right', 'undefined', 'zoom_in', 'zoom_out'` and `3` shot type classes: `'pov', 'shake', 'track'`.

The model was trained to associate the **entire video** with camera labels, not frame-level motions: [input video] -> one or more labels (multi-label) for the whole video. So if a camera motion persists throughout the video, the model should predict that motion; otherwise it should predict `undefined`.
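The multi-label decoding step can be sketched as follows. This is a minimal sketch, not the released inference code: `predict_labels` is a hypothetical helper, the label order is assumed to be alphabetical as listed above, and the `0.5` sigmoid threshold is an assumption.

```python
import numpy as np

# Assumed label order: 18 camera motion classes followed by 3 shot type classes.
LABELS = [
    'arc_left', 'arc_right', 'dolly_in', 'dolly_out', 'pan_left', 'pan_right',
    'pedestal_down', 'pedestal_up', 'roll_left', 'roll_right', 'static',
    'tilt_down', 'tilt_up', 'truck_left', 'truck_right', 'undefined',
    'zoom_in', 'zoom_out', 'pov', 'shake', 'track',
]

def predict_labels(logits, threshold=0.5):
    """Multi-label decoding: apply a sigmoid to each logit independently
    (no softmax, since classes are not mutually exclusive) and keep every
    class whose probability clears the threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    return [name for name, p in zip(LABELS, probs) if p >= threshold]

# Dummy logits: strong 'pan_left' and 'track', everything else negative.
logits = [-4.0] * len(LABELS)
logits[LABELS.index('pan_left')] = 3.0
logits[LABELS.index('track')] = 2.0
print(predict_labels(logits))  # ['pan_left', 'track']
```

In practice the logits would come from `VideoMAEForVideoClassification` applied to a 16-frame clip; the point here is only that decoding is per-class sigmoid thresholding rather than a single argmax.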

The model is configured to process `config.num_frames=16` frames per input clip. These frames are sampled uniformly from the input video, regardless of its original duration. For videos longer than **2 seconds**, processing the entire video as a single clip may miss temporal nuances (e.g., varying camera motions), so the recommended workflow for such videos is as follows:

(a) split the video into non-overlapping 2-second segments (or sliding windows with optional overlap).

(b) run inference independently on each segment.

(c) post-process the results.

Model accuracy on an internal test dataset of 2-second videos is **75%**; ignoring the `'pov', 'shake', 'track'` classes, it is **84%**.

An inference example can be found [here](https://github.com/gen-ai-team/kandinsky-video-tools/blob/main/demo/camera_motion_classifier_demo.ipynb).