---
base_model:
- MCG-NJU/videomae-large
pipeline_tag: video-classification
library_name: transformers
---
[VideoMAE model](https://huggingface.co/docs/transformers/model_doc/videomae) (`large`) variant fine-tuned for multi-label video classification (a video can belong to multiple classes simultaneously), applied to camera motion classification on an internal dataset.

The model predicts `18` camera motion classes: `'arc_left', 'arc_right', 'dolly_in', 'dolly_out', 'pan_left', 'pan_right', 'pedestal_down', 'pedestal_up', 'roll_left', 'roll_right', 'static', 'tilt_down', 'tilt_up', 'truck_left', 'truck_right', 'undefined', 'zoom_in', 'zoom_out'` and `3` shot type classes: `'pov', 'shake', 'track'`.
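
As a minimal sketch, the checkpoint and its label set can be inspected with `transformers`; the repo id below is a placeholder, not the actual name of this checkpoint:

```python
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Placeholder repo id -- substitute the actual path/name of this checkpoint.
ckpt = "your-org/videomae-large-camera-motion"

processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

print(model.config.num_frames)          # 16 frames per input clip
print(len(model.config.id2label))       # 21 classes: 18 camera motions + 3 shot types
print(sorted(model.config.id2label.values()))
```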
The model was trained to associate the **entire video** with camera labels, not frame-level motions: [input video] -> one or more labels (multi-label) for the whole video. So, if a camera motion is present throughout all frames of the video, the model should predict that motion; otherwise it should predict `undefined`.
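
Because this is a multi-label setup, logits should be read per class with a sigmoid rather than a softmax. Below is a sketch of whole-clip inference; it assumes `decord` for frame decoding, the placeholder repo id from above, and a 0.5 decision threshold (the threshold is an assumption, not stated in this card):

```python
import numpy as np
import torch
from decord import VideoReader
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

ckpt = "your-org/videomae-large-camera-motion"  # placeholder repo id
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt).eval()

# Sample config.num_frames (16) frames uniformly across the whole clip.
vr = VideoReader("clip.mp4")
idx = np.linspace(0, len(vr) - 1, num=model.config.num_frames).astype(int)
frames = list(vr.get_batch(idx).asnumpy())  # list of HxWxC uint8 frames

inputs = processor(frames, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Independent per-class probabilities; keep every class above the threshold.
probs = torch.sigmoid(logits)
predicted = [model.config.id2label[i] for i, p in enumerate(probs.tolist()) if p > 0.5]
print(predicted)
```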
The model is configured to process `config.num_frames=16` frames per input clip. These frames are sampled uniformly from the input video, regardless of its original duration. For videos longer than **2 seconds**, processing the entire video as a single clip may miss temporal nuances (e.g., varying camera motions), so the recommended workflow for such videos is as follows (see the sketch after the list):

(a) split the video into non-overlapping 2-second segments (or sliding windows with optional overlap);

(b) run inference independently on each segment;

(c) post-process the results.
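
A sketch of this segment-based workflow, under the same assumptions as above (placeholder repo id, `decord` for decoding, an assumed 0.5 threshold); the post-processing step is kept as a simple per-segment report:

```python
import numpy as np
import torch
from decord import VideoReader
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

ckpt = "your-org/videomae-large-camera-motion"  # placeholder repo id
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt).eval()

SEGMENT_SEC = 2.0  # segment length recommended above
THRESHOLD = 0.5    # assumed decision threshold

vr = VideoReader("long_video.mp4")
fps = vr.get_avg_fps()
seg_len = max(int(SEGMENT_SEC * fps), model.config.num_frames)

results = []
for start in range(0, len(vr), seg_len):             # (a) non-overlapping 2-second segments
    end = min(start + seg_len, len(vr))
    if end - start < model.config.num_frames:
        break  # drop a tail too short to sample 16 frames from
    idx = np.linspace(start, end - 1, num=model.config.num_frames).astype(int)
    frames = list(vr.get_batch(idx).asnumpy())

    inputs = processor(frames, return_tensors="pt")   # (b) per-segment inference
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits[0])

    labels = [model.config.id2label[i] for i, p in enumerate(probs.tolist()) if p > THRESHOLD]
    results.append({"start_sec": start / fps, "end_sec": end / fps, "labels": labels})

# (c) post-process: here, just print per-segment labels; merging consecutive
# segments that share the same labels is one possible refinement.
for seg in results:
    print(seg)
```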
Model accuracy on an internal test dataset of 2-second videos is **75%**; ignoring the `'pov', 'shake', 'track'` classes, it is approximately **84%**.

An inference example can be found [here](https://github.com/gen-ai-team/kandinsky-video-tools/blob/main/demo/camera_motion_classifier_demo.ipynb).