e-space
Manchester Metropolitan University's Research Repository

    ActionSync Video Transformation: Automated Object Removal and Responsive Effects in Motion Videos using Hybrid CNN and GRU

    Habib, Muhammad Asif (ORCID: https://orcid.org/0000-0002-2675-1975), Raza, Umar (ORCID: https://orcid.org/0000-0002-9810-1285), Jabbar, Sohail (ORCID: https://orcid.org/0009-0007-5805-0659), Farhan, Muhammad (ORCID: https://orcid.org/0000-0002-3649-5717) and Ullah, Farhan (ORCID: https://orcid.org/0000-0002-1030-1275) (2025) ActionSync Video Transformation: Automated Object Removal and Responsive Effects in Motion Videos using Hybrid CNN and GRU. IEEE Access, 13. pp. 149834-149852.

    Published Version, available under a Creative Commons Attribution License.


    Abstract

    Large data volumes, dynamic scenes, and intricate object motions make video analysis extremely difficult, and traditional methods depend on hand-crafted features that do not scale. This work develops an automated, controllable video transformation system powered by deep neural networks. The core of the approach is an automated video transformation pipeline built from systematically connected spatial-temporal neural network components. A 16-layer Convolutional Neural Network (CNN) encoder first extracts hierarchical visual features from individual frames to achieve spatial understanding. The CNN encodings feed a 512-unit Gated Recurrent Unit (GRU) sequencer, which models long-range sequential dynamics of object motions and interactions spanning thousands of frames, providing temporal context. An attentional transformer then integrates the complementary strengths of the CNN and GRU into unified space-time representations that capture video content, object relationships, and interactions across both spatial and temporal dimensions. These context-aware representations inform a specialized controller module, which deliberately adjusts background and foreground object layers based on their modeled relationships. Finally, a flexible effects renderer composites the transformed backgrounds and adjusted objects into novel video sequences whose effects are synchronized to the original timeline. Evaluations on a large 51-category video benchmark demonstrate responsive object removal and background substitution with strong system accuracy; the framework achieves high precision and recall and performs well on error metrics. Ablations confirm that the fused CNN, GRU, and transformer components enable effective context modeling for deliberate video manipulations aligned with the original footage. Both quantitative and qualitative outcomes evidence the system's capacity for automated, adaptive effects generation via joint spatio-temporal reasoning.
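
    For readers who want a concrete picture of the hybrid architecture the abstract describes, the PyTorch sketch below (not the authors' code) shows a per-frame CNN encoder feeding a 512-unit GRU, with multi-head attention fusing the spatial and temporal streams. The class names, the attention head count, and the four-block encoder (a simplified stand-in for the paper's 16-layer CNN) are illustrative assumptions; only the 512 GRU units come from the abstract.

        # Minimal sketch of the CNN + GRU + attention fusion stage (assumptions noted above).
        import torch
        import torch.nn as nn

        class FrameEncoder(nn.Module):
            """Simplified per-frame CNN encoder; the paper specifies a 16-layer network."""
            def __init__(self, feat_dim=512):
                super().__init__()
                layers, ch = [], 3
                for out_ch in (32, 64, 128, 256):   # four conv blocks stand in for 16 layers
                    layers += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                               nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
                    ch = out_ch
                self.conv = nn.Sequential(*layers)
                self.pool = nn.AdaptiveAvgPool2d(1)
                self.proj = nn.Linear(ch, feat_dim)

            def forward(self, x):                    # x: (B*T, 3, H, W)
                h = self.pool(self.conv(x)).flatten(1)
                return self.proj(h)                  # (B*T, feat_dim)

        class SpatioTemporalModel(nn.Module):
            def __init__(self, feat_dim=512, gru_units=512, heads=8):
                super().__init__()
                self.encoder = FrameEncoder(feat_dim)
                self.gru = nn.GRU(feat_dim, gru_units, batch_first=True)
                self.fuse = nn.MultiheadAttention(gru_units, heads, batch_first=True)

            def forward(self, frames):               # frames: (B, T, 3, H, W)
                B, T = frames.shape[:2]
                spatial = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
                temporal, _ = self.gru(spatial)      # (B, T, 512) temporal context
                # Attention queries the temporal stream with the spatial stream,
                # producing fused space-time representations per frame.
                fused, _ = self.fuse(spatial, temporal, temporal)
                return fused                         # (B, T, 512)

        model = SpatioTemporalModel()
        clip = torch.randn(2, 16, 3, 128, 128)       # 2 clips of 16 frames
        print(model(clip).shape)                     # torch.Size([2, 16, 512])

    In the full system, these fused per-frame representations would condition the controller module and effects renderer described in the abstract; those downstream compositing stages are omitted from this sketch.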
