---
license: mit
base_model: MCG-NJU/videomae-base
tags:
- video-classification
- crime-detection
- violence-detection
- videomae
- computer-vision
- security
- surveillance
- generated_from_trainer
language:
- en
datasets:
- jinmang2/ucf_crime
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: video-classification
model-index:
- name: videomae-crime-detector-maxdata-v1
  results:
  - task:
      name: Violence Detection
      type: video-classification
    dataset:
      name: UCF Crime Dataset (Subset)
      type: jinmang2/ucf_crime
      args: violence_detection
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.7292
    - name: Precision
      type: precision
      value: 0.7289
    - name: Recall
      type: recall
      value: 0.7292
    - name: F1
      type: f1
      value: 0.7287
---

# Nikeytas/Videomae Crime Detector Maxdata V1

This model is a fine-tuned version of [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on a subset of the UCF Crime dataset, using **event-based binary classification**. It achieves the following results on the validation set:

- **Loss**: 0.8405
- **Accuracy**: 0.7292
- **Precision**: 0.7289
- **Recall**: 0.7292
- **F1 Score**: 0.7287

## 🎯 Model Overview

This VideoMAE model has been fine-tuned for **binary violence detection** in video content. It classifies each video into one of two categories:
- **Violent Crime** (1): Videos containing violent criminal activities
- **Non-Violent Incident** (0): Videos with non-violent or normal activities

The model is based on the **VideoMAE architecture** and was trained on a curated subset of the UCF Crime dataset, with each video's event type mapped to one of the two classes above.
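
As a minimal sketch of how the two output logits map onto these labels: the index-to-label mapping follows the list above, while the `ID2LABEL` dictionary, the `decode` helper, and the example logits are illustrative names and values, not part of the released checkpoint.

```python
import math

# Mapping taken from the class list above; the names are illustrative.
ID2LABEL = {0: "Non-Violent Incident", 1: "Violent Crime"}


def decode(logits):
    """Turn two raw logits into (label, probability) via a softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits, only to show the call
label, prob = decode([0.2, 1.3])
print(label, round(prob, 3))
```

The probability returned here is a softmax score, not a calibrated likelihood, so it is best read as a relative confidence.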

## 📊 Dataset & Training

### Dataset Composition

**Total Videos**: 600
- **Violent Crime Videos**: 300
- **Non-Violent Incident Videos**: 300

**Class Balance**: 50.0% violent crimes

**Event Distribution**:
- **Abuse**: 28 videos
- **Arrest**: 18 videos
- **Arson**: 16 videos
- **Assault**: 62 videos
- **Burglary**: 120 videos
- **Explosion**: 54 videos
- **Fighting**: 48 videos
- **RoadAccidents**: 58 videos
- **Robbery**: 184 videos
- **Shoplifting**: 36 videos
- **Stealing**: 46 videos
- **Vandalism**: 72 videos

**Data Splits**:
- **Training**: 384 videos (64%)
- **Validation**: 96 videos (16%)
- **Test**: 120 videos (20%)
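
The split sizes correspond to a 64/16/20 partition of the 600 videos. The sketch below shows one way a class-stratified split could reproduce these counts. The stratification, the seed, the `stratified_split` helper, and the placeholder file names are assumptions for illustration; the card does not document the actual splitting procedure.

```python
import random


def stratified_split(items_by_class, fractions=(0.64, 0.16, 0.20), seed=0):
    """Split each class separately so every split keeps the class balance."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, items in items_by_class.items():
        items = list(items)
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_val = round(len(items) * fractions[1])
        train += [(x, label) for x in items[:n_train]]
        val += [(x, label) for x in items[n_train:n_train + n_val]]
        test += [(x, label) for x in items[n_train + n_val:]]  # remainder
    return train, val, test


# Placeholder file names standing in for the 300 + 300 videos.
videos = {
    1: [f"violent_{i}.mp4" for i in range(300)],
    0: [f"nonviolent_{i}.mp4" for i in range(300)],
}
train, val, test = stratified_split(videos)
print(len(train), len(val), len(test))  # 384 96 120
```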

## 🎯 Performance

### Performance Metrics

**Validation Performance**:
- **eval_loss**: 0.8405
- **eval_accuracy**: 0.7292
- **eval_precision**: 0.7289
- **eval_recall**: 0.7292
- **eval_f1**: 0.7287
- **eval_runtime**: 11.5149 s
- **eval_samples_per_second**: 8.3370
- **eval_steps_per_second**: 4.1690
- **epoch**: 8.0000

**Test Performance**:
- **eval_loss**: 0.8573
- **eval_accuracy**: 0.6750
- **eval_precision**: 0.6749
- **eval_recall**: 0.6750
- **eval_f1**: 0.6749
- **eval_runtime**: 13.8665 s
- **eval_samples_per_second**: 8.6540
- **eval_steps_per_second**: 4.3270
- **epoch**: 8.0000
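
In both lists recall equals accuracy to four decimals, which is what support-weighted averaging over the classes produces, since weighted recall always reduces to accuracy. The card does not state the averaging mode, so treat this as an inference. The sketch below illustrates the identity with a made-up confusion matrix; the counts are not the model's actual predictions, and `weighted_metrics` is an illustrative helper.

```python
def weighted_metrics(confusion):
    """confusion[t][p] = number of samples with true class t predicted as p."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[c][c] for c in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])                   # true samples of class c
        predicted = sum(row[c] for row in confusion)  # samples predicted as c
        p = confusion[c][c] / predicted if predicted else 0.0
        r = confusion[c][c] / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total                      # weight by class support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical counts: 40 non-violent and 20 violent videos.
acc, prec, rec, f1 = weighted_metrics([[30, 10], [5, 15]])
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```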

**Training Information**:
- **Training Time**: 33.7 minutes
- **Best Accuracy Achieved**: 0.7292 (validation)
- **Model Architecture**: VideoMAE Base (fine-tuned)
- **Fine-tuning Approach**: Event-based binary classification
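
The reported runtime and throughput figures are mutually consistent with the split sizes and with the evaluation batch size of 2 listed under Training Hyperparameters. A quick arithmetic check, using only numbers from this card (the `reports` and `recovered` names are illustrative):

```python
# (runtime in seconds, samples per second, steps per second) as reported
reports = {
    "validation": (11.5149, 8.3370, 4.1690),
    "test": (13.8665, 8.6540, 4.3270),
}

recovered = {}
for split, (runtime, samples_per_s, steps_per_s) in reports.items():
    samples = round(runtime * samples_per_s)  # throughput x time = sample count
    steps = round(runtime * steps_per_s)      # batches processed
    recovered[split] = (samples, steps)
    print(f"{split}: {samples} videos in {steps} batches "
          f"({samples / steps:.0f} videos per batch)")
```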

## 🚀 Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training:
- **Learning Rate**: 5e-05
- **Train Batch Size**: 2
- **Eval Batch Size**: 2
- **Optimizer**: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type**: Linear
- **Training Epochs**: 8
- **Weight Decay**: 0.01
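
With 384 training videos and a batch size of 2, each epoch is 192 optimizer steps, or 1,536 steps over 8 epochs, assuming no gradient accumulation (the card does not list any). The sketch below shows how a linear schedule would decay the learning rate over those steps. Zero warmup is assumed because no warmup setting is reported, and `linear_lr` is an illustrative helper rather than a library function.

```python
import math

TRAIN_VIDEOS = 384
BATCH_SIZE = 2
EPOCHS = 8
BASE_LR = 5e-5

steps_per_epoch = math.ceil(TRAIN_VIDEOS / BATCH_SIZE)  # 192
total_steps = steps_per_epoch * EPOCHS                   # 1536


def linear_lr(step, warmup_steps=0):
    """Optional linear warmup, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - warmup_steps)


for epoch in range(EPOCHS + 1):
    step = epoch * steps_per_epoch
    print(f"epoch {epoch}: step {step:4d}, lr = {linear_lr(step):.2e}")
```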

### Training Results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---------------|-------|------|-----------------|----------|
| N/A | 8.00 | N/A | 0.8405 | 0.7292 |

### Framework Versions

- **Transformers**: 4.30.2+
- **PyTorch**: 2.0.1+
- **Datasets**: Latest
- **Device**: Apple Silicon MPS / CUDA / CPU (Auto-detected)

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch torchvision opencv-python pillow
```

### Basic Usage

```python
import cv2
import numpy as np
import torch
from transformers import AutoModelForVideoClassification, AutoProcessor

MODEL_ID = "Nikeytas/videomae-crime-detector-maxdata-v1"

# Load model and processor
model = AutoModelForVideoClassification.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)
model.eval()  # inference mode (disables dropout)


def classify_video(video_path, num_frames=16):
    """Classify a single video and return (label, confidence)."""
    # Sample num_frames evenly spaced frames across the video
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total_frames - 1, 0), num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes to BGR; the processor expects RGB
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")
    # The model expects exactly num_frames frames; pad by repeating the last one
    while len(frames) < num_frames:
        frames.append(frames[-1])

    # Preprocess the frames and run the model
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1).item()
    confidence = probabilities[0][predicted_class].item()

    label = "Violent Crime" if predicted_class == 1 else "Non-Violent"
    return label, confidence


# Example usage
video_path = "path/to/your/video.mp4"
prediction, confidence = classify_video(video_path)
print(f"Prediction: {prediction} (Confidence: {confidence:.3f})")
```

### Batch Processing

```python
from pathlib import Path


def process_video_directory(video_dir, output_file="results.txt"):
    """Classify every .mp4 file in video_dir and write one result per line.

    Relies on classify_video() from the Basic Usage example.
    """
    results = []

    for video_file in sorted(Path(video_dir).glob("*.mp4")):
        try:
            prediction, confidence = classify_video(str(video_file))
            results.append({
                "file": video_file.name,
                "prediction": prediction,
                "confidence": confidence,
            })
            print(f"✅ {video_file.name}: {prediction} ({confidence:.3f})")
        except Exception as e:
            # Keep going so one unreadable file does not abort the batch
            print(f"❌ Error processing {video_file.name}: {e}")

    # Save results
    with open(output_file, "w", encoding="utf-8") as f:
        for result in results:
            f.write(f"{result['file']}: {result['prediction']} ({result['confidence']:.3f})\n")

    return results


# Process all videos in a directory
results = process_video_directory("./videos/")
```
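
If the results need to be consumed by other tools, a CSV file may be easier to parse than the free-form text lines above. A minimal sketch using only the standard library, assuming the list-of-dicts shape returned by `process_video_directory`; the `write_results_csv` helper and the sample rows are hypothetical.

```python
import csv
import io


def write_results_csv(results, stream):
    """Write classification results as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=["file", "prediction", "confidence"])
    writer.writeheader()
    for row in results:
        # Round the confidence to three decimals, as in the text output
        writer.writerow({**row, "confidence": f"{row['confidence']:.3f}"})


# Made-up rows in the same shape as the batch results
sample = [
    {"file": "clip_a.mp4", "prediction": "Violent Crime", "confidence": 0.8123},
    {"file": "clip_b.mp4", "prediction": "Non-Violent", "confidence": 0.6641},
]
buffer = io.StringIO()
write_results_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` in place of the `StringIO` buffer.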

## 📈 Technical Specifications

- **Base Model**: MCG-NJU/videomae-base
- **Architecture**: Vision Transformer (ViT) adapted for video
- **Input Resolution**: 224x224 pixels per frame
- **Temporal Resolution**: 16 frames per video clip
- **Output Classes**: 2 (Binary classification)
- **Training Framework**: HuggingFace Transformers
- **Optimization**: AdamW optimizer with learning rate 5e-5
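
For a sense of the input size these specifications imply: one RGB clip is a 16 x 3 x 224 x 224 array, which a VideoMAE-style encoder splits into space-time tubelets. The patch size (16 x 16) and tubelet size (2 frames) below are assumed from the base checkpoint's usual configuration and are not stated in this card, so check the checkpoint's `config.json` before relying on them.

```python
NUM_FRAMES = 16
CHANNELS = 3        # RGB
IMAGE_SIZE = 224
PATCH_SIZE = 16     # assumed (not stated in this card)
TUBELET_SIZE = 2    # assumed (not stated in this card)

values_per_clip = NUM_FRAMES * CHANNELS * IMAGE_SIZE * IMAGE_SIZE
patches_per_frame = (IMAGE_SIZE // PATCH_SIZE) ** 2
tokens_per_clip = (NUM_FRAMES // TUBELET_SIZE) * patches_per_frame

print(f"pixel values per clip: {values_per_clip:,}")
print(f"patches per frame:     {patches_per_frame}")
print(f"tokens per clip:       {tokens_per_clip}")
```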

## ⚠️ Limitations

1. **Dataset Scope**: Trained on a 600-video subset of the UCF Crime dataset, so it may not generalize to all types of violence
2. **Temporal Context**: Uses 16-frame clips, which may miss context in longer sequences
3. **Environmental Bias**: Performance may vary with lighting, camera angle, and video quality
4. **False Positives**: May misclassify intense but non-violent activities (sports, action movies)
5. **Real-time Performance**: Processing time depends on hardware capabilities
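
One way to soften the 16-frame limitation for longer footage is to score several overlapping windows, sampling 16 frames from each, and then aggregate the per-window results. The sketch below is only one possible approach and is not part of the released model; `window_starts`, `aggregate`, the window and stride sizes, and the max-score rule are all illustrative assumptions.

```python
def window_starts(total_frames, window=64, stride=32):
    """Start indices of overlapping windows covering a video of total_frames."""
    if total_frames <= window:
        return [0]
    starts = list(range(0, total_frames - window + 1, stride))
    if starts[-1] + window < total_frames:
        starts.append(total_frames - window)  # make sure the tail is covered
    return starts


def aggregate(violent_probs, threshold=0.5):
    """Flag the video if any window's violent-class probability reaches threshold."""
    peak = max(violent_probs)
    label = "Violent Crime" if peak >= threshold else "Non-Violent"
    return label, peak


starts = window_starts(total_frames=200)
print(starts)

# Hypothetical violent-class probabilities, one per window
label, peak = aggregate([0.12, 0.31, 0.78, 0.40, 0.22, 0.19])
print(label, peak)
```

Taking the maximum favors recall over precision, which may raise the false-positive rate noted in item 4; averaging the window scores is a more conservative alternative.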

## 🔒 Ethical Considerations

### Intended Use
- **Primary**: Research and development in video analysis
- **Secondary**: Security system enhancement with human oversight
- **Educational**: Computer vision and AI safety research

### Prohibited Uses
- **Surveillance without consent**: Do not use for unauthorized monitoring
- **Discriminatory profiling**: Do not use to target specific groups or communities
- **Automated punishment**: Never use for automated legal or disciplinary actions
- **Privacy violation**: Do not use in ways that violate privacy laws or individual rights

### Bias and Fairness
- The model was trained on a specific dataset that may not represent all populations
- Regular evaluation is needed for bias detection and mitigation
- Human oversight is required for critical applications
- Consider demographic representation in deployment scenarios

## 📝 Model Card Information

- **Developed by**: Research Team
- **Model Type**: Video Classification (Binary)
- **Training Data**: UCF Crime Dataset (Subset)
- **Training Date**: 2025-06-02 00:28:28 UTC
- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score
- **Intended Users**: Researchers, Security Professionals, Developers

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{Nikeytas_videomae_crime_detector_maxdata_v1,
  title={VideoMAE Fine-tuned for Crime Detection},
  author={Research Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Nikeytas/videomae-crime-detector-maxdata-v1}
}
```

## 🤝 Contributing

We welcome contributions to improve the model! Please:
1. Report issues with specific examples
2. Suggest improvements for bias reduction
3. Share evaluation results on new datasets
4. Contribute to documentation and examples

## 📞 Contact

For questions, issues, or collaboration opportunities, please open an issue in the model repository or contact the development team.

---

*Last updated: 2025-06-02 00:28:28 UTC*
*Model version: 1.0*
*Framework: HuggingFace Transformers*
