Nagarajan, Senthil Murugan (ORCID: https://orcid.org/0000-0001-9284-7724), Devarajan, Ganesh Gopal (ORCID: https://orcid.org/0000-0003-0036-7841), Jerlin M, Asha, Arockiam, Daniel (ORCID: https://orcid.org/0000-0001-5564-2332), Bashir, Ali Kashif (ORCID: https://orcid.org/0000-0003-2601-9327) and Al Dabel, Maryam M (ORCID: https://orcid.org/0000-0003-4371-8939) (2025) Deep Multi-Source Visual Fusion With Transformer Model for Video Content Filtering. IEEE Journal on Selected Topics in Signal Processing, 19 (4). pp. 613-622. ISSN 1932-4553
Accepted Version. Available under License Creative Commons Attribution.
Abstract
As YouTube content continues to grow, advanced filtering systems are crucial to ensuring a safe and enjoyable user experience. We present MFusTSVD, a multi-modal model for classifying YouTube video content by analyzing text, audio, and video images. MFusTSVD extracts features from audio and video images with specialized methods and processes text data with BERT Transformers. Our key innovation is two new BERT-based multi-modal fusion methods, B-SMTLMF and B-CMTLRMF, which combine features from the different data types and improve the model's understanding of each modality, including detailed audio patterns, leading to better content classification and better separation of speech-related content. In experiments, MFusTSVD consistently outperforms popular models such as Memory Fusion Network, Early Fusion LSTM, Late Fusion LSTM, and the multi-modal Transformer in accuracy, precision, recall, and F-measure across different content types. In particular, MFusTSVD effectively balances precision and recall, which makes it especially useful for identifying inappropriate speech and audio content as well as broader categories, ensuring reliable and robust content moderation.
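To make the multi-modal setup described above concrete, the following is a minimal PyTorch sketch of a fusion classifier that combines text, audio, and video features. It is illustrative only: the feature dimensions, the simple concatenation-based fusion head, and the class count are assumptions, not the paper's design, and the actual B-SMTLMF and B-CMTLRMF fusion methods are not specified on this page.

```python
# Minimal sketch of a multi-modal fusion classifier (assumed design,
# not the paper's B-SMTLMF / B-CMTLRMF methods). Each modality's
# feature vector is projected to a shared space, concatenated, and
# passed through a small classification head.
import torch
import torch.nn as nn

class MultiModalFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512,
                 hidden_dim=256, num_classes=4):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # Fuse by concatenation, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, audio_feat, video_feat):
        fused = torch.cat([
            torch.relu(self.text_proj(text_feat)),
            torch.relu(self.audio_proj(audio_feat)),
            torch.relu(self.video_proj(video_feat)),
        ], dim=-1)
        return self.classifier(fused)

# Usage with dummy batch-of-2 features. In practice the text vector
# would come from a BERT encoder (e.g. the pooled [CLS] embedding),
# and the audio/video vectors from modality-specific extractors.
model = MultiModalFusionClassifier()
text = torch.randn(2, 768)    # BERT-style pooled text embeddings
audio = torch.randn(2, 128)   # e.g. summarized log-mel features
video = torch.randn(2, 512)   # e.g. frame-level CNN embeddings
logits = model(text, audio, video)
print(logits.shape)  # torch.Size([2, 4])
```

Concatenation is the simplest fusion baseline (comparable in spirit to the Early Fusion LSTM the abstract benchmarks against); the paper's contribution lies in replacing this step with its BERT-based fusion methods.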