Fan, Xinqi (ORCID: https://orcid.org/0000-0002-8025-016X), Chen, Xueli, Yang, Luoxiao, Yap, Chuin Hong (ORCID: https://orcid.org/0000-0003-2251-9308), Quereshi, Rizwan, Dou, Qi, Yap, Moi Hoon and Shah, Mubarak (2026) Test-Time Retrieval-Augmented Adaptation for Vision-Language Models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Presented at the IEEE/CVF International Conference on Computer Vision, 19-23 October 2025, Honolulu, Hawai'i, USA. (In Press)
Accepted Version
File not available for download. Available under License In Copyright.
Abstract
Vision-language models (VLMs) have shown promise in test-time adaptation tasks due to their remarkable capabilities in understanding and reasoning about visual content through natural language descriptions. However, training VLMs typically demands substantial computational resources, and they often struggle to adapt efficiently to new domains or tasks. Additionally, dynamically estimating the test distribution from streaming data at test time remains a significant challenge. In this work, we propose a novel test-time retrieval-augmented adaptation (TT-RAA) method that enables VLMs to maintain high performance across diverse visual recognition tasks without the need for task-specific training or large computational overhead. During inference, TT-RAA employs a streaming mixture of Gaussian database (SMGD) to continuously estimate test distributions, requiring minimal storage. Then, TT-RAA retrieves the most relevant information from the SMGD, enhancing the original VLM outputs. A key limitation of CLIP-based VLMs is that they are optimized for inter-modal vision-language alignment rather than intra-modal (vision-space) similarity, which leads to larger intra-modal variance. To address this, we propose a multimodal retrieval augmentation module that transforms the SMGD into a unified multimodal space, enabling retrieval that aligns both vision and language modalities. Extensive experiments across both cross-domain and out-of-distribution benchmarks comprising fourteen datasets demonstrate TT-RAA’s superior performance compared to state-of-the-art methods. Ablation studies and hyperparameter analyses further validate the effectiveness of the proposed modules.
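
For readers without access to the accepted manuscript, below is a minimal, self-contained sketch of the two mechanisms the abstract names: an online, streaming per-class Gaussian estimate of the test distribution, and a retrieval branch whose scores are fused with the zero-shot VLM logits. It is written in plain PyTorch; all identifiers (StreamingGaussianDB, tt_raa_step, alpha, tau) and the additive fusion rule are illustrative assumptions, not the authors' implementation, and the paper's multimodal retrieval augmentation module (the mapping of the SMGD into a unified multimodal space) is omitted here.

# Minimal sketch of the ideas in the abstract; all names and the fusion
# rule are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn.functional as F

class StreamingGaussianDB:
    """One running Gaussian (mean + diagonal variance) per class, updated
    online from test-time features -- a stand-in for the paper's streaming
    mixture of Gaussian database (SMGD)."""

    def __init__(self, num_classes, dim):
        self.count = torch.zeros(num_classes)
        self.mean = torch.zeros(num_classes, dim)
        self.m2 = torch.ones(num_classes, dim)  # running sum of squared deviations

    def update(self, feat, pseudo_label):
        # Welford-style online update of the pseudo-labelled class.
        c = pseudo_label
        self.count[c] += 1
        delta = feat - self.mean[c]
        self.mean[c] += delta / self.count[c]
        self.m2[c] += delta * (feat - self.mean[c])

    def retrieve_logits(self, feat):
        # Score the feature against each class Gaussian (simplified
        # log-density; the normalization term is dropped for brevity).
        var = self.m2 / self.count.clamp(min=1).unsqueeze(1)
        diff = feat.unsqueeze(0) - self.mean            # (C, D)
        return -0.5 * (diff ** 2 / (var + 1e-6)).sum(dim=1)

@torch.no_grad()
def tt_raa_step(image_feat, text_feats, db, alpha=0.5, tau=100.0):
    # Fuse zero-shot CLIP-style logits with retrieval logits from the
    # streaming database, then update the database with the new sample.
    image_feat = F.normalize(image_feat, dim=-1)
    zs_logits = tau * image_feat @ text_feats.t()       # zero-shot branch
    ra_logits = db.retrieve_logits(image_feat)          # retrieval branch
    logits = zs_logits + alpha * ra_logits              # assumed fusion rule
    db.update(image_feat, logits.argmax().item())       # streaming update
    return logits

# Toy usage: random vectors stand in for CLIP image/text embeddings.
C, D = 10, 512
text_feats = F.normalize(torch.randn(C, D), dim=-1)    # one prompt per class
db = StreamingGaussianDB(C, D)
for _ in range(5):                                      # simulated test stream
    logits = tt_raa_step(torch.randn(D), text_feats, db)
print(logits.softmax(dim=-1).topk(3))

With a real model, the random vectors would be replaced by the VLM's encoded image and text features; the training-free, low-storage character described in the abstract comes from keeping only per-class running statistics rather than a sample cache.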

