Optimal Pre-training for Vision Transformers in Medical Image Classification

Han, Zihao, De Wilde, Philippe, Santopietro, Marco (2026) Optimal Pre-training for Vision Transformers in Medical Image Classification. In: Advances in Computational Intelligence Systems: Contributions Presented at The 24th UK Workshop on Computational Intelligence (UKCI 2025). Springer Nature. ISBN 978-3-032-07937-4. E-ISBN 978-3-032-07938-1. (doi:10.1007/978-3-032-07938-1_32) (KAR id:112899)

PDF Author's Accepted Manuscript Language: English
Official URL: https://doi-org.chain.kent.ac.uk/10.1007/978-3-032...

Abstract

Modality-adaptive transfer learning is crucial for advancing automated medical image analysis, particularly under data scarcity. In this work, we present a systematic study of modality-aligned pre-training for Vision Transformers (ViT) and Convolutional Neural Networks (CNN) on retinal optical coherence tomography (OCT) classification. Through controlled experiments across a broad range of data regimes (from 10 to 2000 labelled samples per class), we show that ViT models pre-trained on a physics-consistent OCT domain (breast tissue) achieve substantial performance gains in the few-shot setting, dramatically outperforming both ImageNet pre-training and random initialization. Conversely, transferring a retina-OCT-pre-trained ViT to a binary breast-OCT task lifts accuracy from 85.9% to 99.98% with only five training images per class, confirming bidirectional generalizability. Notably, this benefit does not extend to CNNs, which show little or no improvement from modality alignment. Visualization of self-attention maps reveals that modality-aligned ViTs more effectively focus on clinically relevant features when labelled data are limited, whereas all models converge as sample size increases. These findings highlight the critical interplay between network architecture, pre-training strategy, and data modality for medical imaging applications, and provide new insights into the unique transferability of self-attention-based models under real-world clinical constraints.
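
The data regimes described in the abstract (a fixed number of labelled samples per class, from 10 up to 2000) can be illustrated with a small sketch of per-class subsampling. This is not the authors' code: the function name `sample_per_class`, the toy label list, the seed, and the intermediate regime sizes are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_class(labels, k, seed=0):
    """Return sorted indices selecting at most k examples of each class.

    Hypothetical helper: the record does not describe how the authors
    drew their subsets, so this only illustrates the general idea.
    """
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Draw up to k indices from each class without replacement.
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        chosen.extend(rng.sample(pool, min(k, len(pool))))
    return sorted(chosen)


# Toy data (assumed): 4 classes with 50 labelled examples each.
toy_labels = [c for c in range(4) for _ in range(50)]

# Assumed regime sizes; only the endpoints 10 and 2000 appear in the abstract.
regimes = [10, 20, 50]
subsets = {k: sample_per_class(toy_labels, k) for k in regimes}
```

Fixing the seed keeps each subset reproducible, so that differently initialised models could be compared on the same labelled examples within a regime.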

Item Type:	Conference proceeding
DOI/Identification number:	10.1007/978-3-032-07938-1_32
Uncontrolled keywords:	Computer Vision; Intervision; Object vision; Ophthalmology; Predictive medicine; Transformation Optics
Institutional Unit:	Schools > School of Natural Sciences > Biosciences
Funders:	University of Kent (https://ror.org/00xkeyj56)
Depositing User:	Philippe De Wilde
Date Deposited:	28 Jan 2026 14:29 UTC
Last Modified:	04 Feb 2026 03:48 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/112899 (The current URI for this page, for reference purposes)

University of Kent Author Information

Han, Zihao.


De Wilde, Philippe.

Creator's ORCID:	https://orcid.org/0000-0002-4332-1715

Santopietro, Marco.

