e-space
Manchester Metropolitan University's Research Repository

    ActionSync Video Transformation: Automated Object Removal and Responsive Effects in Motion Videos using Hybrid CNN and GRU

    Habib, Muhammad Asif (ORCID: https://orcid.org/0000-0002-2675-1975), Raza, Umar (ORCID: https://orcid.org/0000-0002-9810-1285), Jabbar, Sohail (ORCID: https://orcid.org/0009-0007-5805-0659), Farhan, Muhammad (ORCID: https://orcid.org/0000-0002-3649-5717) and Ullah, Farhan (ORCID: https://orcid.org/0000-0002-1030-1275) (2025) ActionSync Video Transformation: Automated Object Removal and Responsive Effects in Motion Videos using Hybrid CNN and GRU. IEEE Access, 13. pp. 149834-149852.

    Published Version, available under a Creative Commons Attribution License.


    Abstract

    Large data volumes, dynamic scenes, and intricate object motions make video analysis extremely difficult, and traditional methods depend on hand-crafted features that do not scale. This work develops an automated, controllable video transformation system powered by deep neural networks. The core of the approach is an automated video transformation pipeline built from systematically connected spatial-temporal neural network components. A 16-layer Convolutional Neural Network (CNN) encoder first extracts hierarchical visual features from individual frames to achieve spatial understanding. The CNN encodings feed a 512-unit Gated Recurrent Unit (GRU) sequencer, which models long-range sequential dynamics of object motions and interactions spanning thousands of frames, providing temporal context. An attentional transformer then integrates the complementary strengths of the CNN and GRU into unified space-time representations that capture video content, object relationships, and interactions across both spatial and temporal dimensions. These context-aware representations inform a specialized controller module, which deliberately adjusts background and foreground object layers based on their modeled relationships. Finally, a flexible effects renderer composites the transformed backgrounds and adjusted objects into novel video sequences whose effects are synchronized to the original timeline. Evaluations on a large 51-category video benchmark demonstrate responsive object removal and background substitution with strong system accuracy; the framework achieves high precision and recall and performs well on error metrics. Ablations confirm that the fused CNN, GRU, and transformer components enable effective context modeling for deliberate video manipulations aligned with the original footage. Both quantitative and qualitative outcomes evidence the system's capacity for automated, adaptive effects generation via joint spatio-temporal reasoning.
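
    For readers who want a concrete picture of the hybrid architecture the abstract describes, the PyTorch sketch below (not the authors' code) shows a per-frame CNN encoder feeding a 512-unit GRU, with multi-head attention fusing the spatial and temporal streams. The class names, the attention head count, and the four-block encoder (a simplified stand-in for the paper's 16-layer CNN) are illustrative assumptions; only the 512 GRU units come from the abstract.

        # Minimal sketch of the CNN + GRU + attention fusion stage (assumptions noted above).
        import torch
        import torch.nn as nn

        class FrameEncoder(nn.Module):
            """Simplified per-frame CNN encoder; the paper specifies a 16-layer network."""
            def __init__(self, feat_dim=512):
                super().__init__()
                layers, ch = [], 3
                for out_ch in (32, 64, 128, 256):   # four conv blocks stand in for 16 layers
                    layers += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                               nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
                    ch = out_ch
                self.conv = nn.Sequential(*layers)
                self.pool = nn.AdaptiveAvgPool2d(1)
                self.proj = nn.Linear(ch, feat_dim)

            def forward(self, x):                    # x: (B*T, 3, H, W)
                h = self.pool(self.conv(x)).flatten(1)
                return self.proj(h)                  # (B*T, feat_dim)

        class SpatioTemporalModel(nn.Module):
            def __init__(self, feat_dim=512, gru_units=512, heads=8):
                super().__init__()
                self.encoder = FrameEncoder(feat_dim)
                self.gru = nn.GRU(feat_dim, gru_units, batch_first=True)
                self.fuse = nn.MultiheadAttention(gru_units, heads, batch_first=True)

            def forward(self, frames):               # frames: (B, T, 3, H, W)
                B, T = frames.shape[:2]
                spatial = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
                temporal, _ = self.gru(spatial)      # (B, T, 512) temporal context
                # Attention queries the temporal stream with the spatial stream,
                # producing fused space-time representations per frame.
                fused, _ = self.fuse(spatial, temporal, temporal)
                return fused                         # (B, T, 512)

        model = SpatioTemporalModel()
        clip = torch.randn(2, 16, 3, 128, 128)       # 2 clips of 16 frames
        print(model(clip).shape)                     # torch.Size([2, 16, 512])

    In the full system, these fused per-frame representations would condition the controller module and effects renderer described in the abstract; those downstream compositing stages are omitted from this sketch.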
