e-space
Manchester Metropolitan University's Research Repository

    A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation

    Zhang, X, Han, L (ORCID: https://orcid.org/0000-0003-2491-7473), Davies, S (ORCID: https://orcid.org/0000-0001-5330-5527), Sobeih, T, Han, L and Dancey, D (ORCID: https://orcid.org/0000-0001-7251-8958) (2025) A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation. Neurocomputing, 658. 131745. ISSN 0925-2312

    Published Version. Available under License Creative Commons Attribution. Download (7MB).

    Abstract

    Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labelled datasets pose significant challenges to traditional image-based depth estimation methods. To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability. Experimental evaluations on synthetic and real-world event datasets demonstrate the superiority of our approach, with substantial improvements in Absolute Relative Error (49% reduction) and Square Relative Error (39.77% reduction) compared to existing models. The SDT also achieves a 70.2% reduction in energy consumption (12.43 mJ vs. 41.77 mJ per inference) and reduces model parameters by 42.4% (20.55M vs. 35.68M), making it highly suitable for resource-constrained environments. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications.
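    The cross-modality distillation framework described in the abstract pairs a frozen RGB foundation model (DINOv2) as teacher with an event-driven student network. The sketch below illustrates that feature-matching idea in a minimal PyTorch setting; it is not the paper's implementation. The SpikingStudent class is a hypothetical placeholder (a plain convolutional stand-in, with the spiking dynamics omitted), and the cosine feature-matching loss is one common choice for feature-level distillation, assumed here for illustration. Only the torch.hub entry point and forward_features output keys for DINOv2 are the library's documented API.

    # Minimal sketch of cross-modality feature distillation (assumptions noted above).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpikingStudent(nn.Module):
        """Hypothetical stand-in for the spike-driven transformer encoder."""
        def __init__(self, in_ch=2, feat_dim=384):
            super().__init__()
            # Event frames (ON/OFF polarity channels) -> dense feature map.
            self.encoder = nn.Sequential(
                nn.Conv2d(in_ch, 64, 7, stride=4, padding=3),
                nn.ReLU(),
                nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
            )
        def forward(self, x):
            return self.encoder(x)  # (B, feat_dim, H', W')

    def distillation_loss(student_feats, teacher_feats):
        # Resize teacher features to the student's spatial grid, then match
        # channel directions per location with a cosine objective.
        teacher_feats = F.interpolate(teacher_feats, size=student_feats.shape[-2:],
                                      mode='bilinear', align_corners=False)
        s = F.normalize(student_feats.flatten(2), dim=1)
        t = F.normalize(teacher_feats.flatten(2), dim=1)
        return (1.0 - (s * t).sum(dim=1)).mean()

    # Frozen RGB teacher: DINOv2 ViT-S/14 via the official hub entry point.
    teacher = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()
    student = SpikingStudent(feat_dim=384)

    events = torch.rand(1, 2, 224, 224)  # accumulated event frame (toy input)
    rgb = torch.rand(1, 3, 224, 224)     # paired RGB frame for the teacher

    with torch.no_grad():
        # Patch tokens -> spatial map: 224/14 = 16 tokens per side, dim 384.
        tokens = teacher.forward_features(rgb)['x_norm_patchtokens']
        t_feats = tokens.transpose(1, 2).reshape(1, 384, 16, 16)

    loss = distillation_loss(student(events), t_feats)
    loss.backward()  # gradients flow only into the student

    In this setup only the student is trained: the teacher supplies stable RGB-derived targets, which is what lets the spiking network learn useful representations despite the limited labelled event data the abstract mentions.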
