e-space
Manchester Metropolitan University's Research Repository

    A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation

    Zhang, X, Han, L (ORCID: https://orcid.org/0000-0003-2491-7473), Davies, S (ORCID: https://orcid.org/0000-0001-5330-5527), Sobeih, T, Han, L and Dancey, D (ORCID: https://orcid.org/0000-0001-7251-8958) (2025) A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation. Neurocomputing, 658. 131745. ISSN 0925-2312

    Published Version. Available under License Creative Commons Attribution. Download (7MB).

    Abstract

    Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labelled datasets pose significant challenges to traditional image-based depth estimation methods. To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability. Experimental evaluations on synthetic and real-world event datasets demonstrate the superiority of our approach, with substantial improvements in Absolute Relative Error (49% reduction) and Square Relative Error (39.77% reduction) compared to existing models. The SDT also achieves a 70.2% reduction in energy consumption (12.43 mJ vs. 41.77 mJ per inference) and reduces model parameters by 42.4% (20.55M vs. 35.68M), making it highly suitable for resource-constrained environments. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications.
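    The cross-modality distillation framework described in the abstract pairs a frozen RGB foundation model (DINOv2) as teacher with an event-driven student network. The sketch below illustrates that feature-matching idea in a minimal PyTorch setting; it is not the paper's implementation. The SpikingStudent class is a hypothetical placeholder (a plain convolutional stand-in, with the spiking dynamics omitted), and the cosine feature-matching loss is one common choice for feature-level distillation, assumed here for illustration. Only the torch.hub entry point and forward_features output keys for DINOv2 are the library's documented API.

    # Minimal sketch of cross-modality feature distillation (assumptions noted above).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpikingStudent(nn.Module):
        """Hypothetical stand-in for the spike-driven transformer encoder."""
        def __init__(self, in_ch=2, feat_dim=384):
            super().__init__()
            # Event frames (ON/OFF polarity channels) -> dense feature map.
            self.encoder = nn.Sequential(
                nn.Conv2d(in_ch, 64, 7, stride=4, padding=3),
                nn.ReLU(),
                nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
            )
        def forward(self, x):
            return self.encoder(x)  # (B, feat_dim, H', W')

    def distillation_loss(student_feats, teacher_feats):
        # Resize teacher features to the student's spatial grid, then match
        # channel directions per location with a cosine objective.
        teacher_feats = F.interpolate(teacher_feats, size=student_feats.shape[-2:],
                                      mode='bilinear', align_corners=False)
        s = F.normalize(student_feats.flatten(2), dim=1)
        t = F.normalize(teacher_feats.flatten(2), dim=1)
        return (1.0 - (s * t).sum(dim=1)).mean()

    # Frozen RGB teacher: DINOv2 ViT-S/14 via the official hub entry point.
    teacher = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()
    student = SpikingStudent(feat_dim=384)

    events = torch.rand(1, 2, 224, 224)  # accumulated event frame (toy input)
    rgb = torch.rand(1, 3, 224, 224)     # paired RGB frame for the teacher

    with torch.no_grad():
        # Patch tokens -> spatial map: 224/14 = 16 tokens per side, dim 384.
        tokens = teacher.forward_features(rgb)['x_norm_patchtokens']
        t_feats = tokens.transpose(1, 2).reshape(1, 384, 16, 16)

    loss = distillation_loss(student(events), t_feats)
    loss.backward()  # gradients flow only into the student

    In this setup only the student is trained: the teacher supplies stable RGB-derived targets, which is what lets the spiking network learn useful representations despite the limited labelled event data the abstract mentions.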
