Yu, Jongmin, Oh, Hyeontaek, Sun, Zhongtian, Lee, Younkwan, Yang, Jinhong (2025) Real-time, high-fidelity face identity swapping with a vision foundation model. IEEE Access, 13, pp. 157160-157174. E-ISSN 2169-3536. (doi:10.1109/ACCESS.2025.3606518) (KAR id:111837)
Publisher PDF. Language: English.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Official URL: https://doi.org/10.1109/ACCESS.2025.3606518
Abstract
Many recent face-swapping methods based on generative adversarial networks (GANs) or autoencoders achieve strong performance under constrained conditions but degrade significantly in high-resolution or extreme-pose scenarios. Moreover, most existing models generate outputs at limited resolutions (128×128), which fall short of modern visual standards. Diffusion-based approaches have shown promise in handling such challenges but are computationally intensive and unsuitable for real-time applications. In this work, we propose FaceChanger, a real-time face identity swap framework designed to enhance robustness across diverse poses while producing outputs at 256×256, double the linear resolution of typical 128×128 baselines. While maintaining compatibility with conventional GAN- and autoencoder-based pipelines, FaceChanger uniquely incorporates a vision foundation model (VFM) to extract richer semantic features, which can enhance identity preservation, attribute control, and robustness to variations. Specifically, we employ the Contrastive Language-Image Pre-training (CLIP) model to obtain these features, which guide identity preservation and attribute control through newly designed VFM-based visual and textual semantic contrastive losses. Extensive evaluations on benchmarks such as the FaceForensics++ (FF++) dataset, the Multiple Pose, Illumination, and Expression (MPIE) dataset, and the large-pose Flickr face (LPFF) dataset demonstrate that FaceChanger matches or exceeds state-of-the-art performance under standard conditions and significantly outperforms existing methods in high-resolution, pose-intensive scenarios.
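The abstract describes VFM-based semantic contrastive losses computed on CLIP embeddings. The sketch below is a minimal, hypothetical illustration of one such component: an InfoNCE-style contrastive identity loss that pulls the swapped face's embedding toward the source identity's embedding and pushes it away from other identities. The function name, batching scheme, and temperature are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def vfm_contrastive_identity_loss(swap_emb, source_emb, negative_embs, temperature=0.07):
    """Hypothetical InfoNCE-style contrastive loss on VFM (e.g. CLIP) embeddings.

    swap_emb:      (B, D) embeddings of the swapped faces
    source_emb:    (B, D) embeddings of the corresponding source faces (positives)
    negative_embs: (B, K, D) embeddings of other identities (negatives)
    """
    # Cosine similarity is computed on L2-normalised embeddings.
    swap = F.normalize(swap_emb, dim=-1)
    pos = F.normalize(source_emb, dim=-1)
    neg = F.normalize(negative_embs, dim=-1)

    # Positive logit: similarity between each swapped face and its own source identity.
    pos_logit = (swap * pos).sum(dim=-1, keepdim=True)        # (B, 1)
    # Negative logits: similarity between each swapped face and K other identities.
    neg_logits = torch.einsum("bd,bkd->bk", swap, neg)        # (B, K)

    # Index 0 (the source identity) is the correct class for every sample.
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(swap.size(0), dtype=torch.long, device=swap.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for CLIP features (D=512 for ViT-B/32).
    B, K, D = 4, 8, 512
    loss = vfm_contrastive_identity_loss(
        torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D)
    )
    print(loss.item())
```

In practice the embeddings would come from a frozen CLIP image encoder applied to the swapped, source, and distractor faces; an analogous term using text embeddings of attribute prompts could serve as the textual counterpart, but both choices here are assumptions for illustration only.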
| Item Type: | Article |
|---|---|
| DOI/Identification number: | 10.1109/ACCESS.2025.3606518 |
| Uncontrolled keywords: | face identity swap; face swap; vision foundation model; contrastive learning |
| Subjects: | Q Science |
| Institutional Unit: | Schools > School of Computing |
| Former Institutional Unit: | There are no former institutional units. |
| Depositing User: | Zhongtian Sun |
| Date Deposited: | 03 Nov 2025 11:23 UTC |
| Last Modified: | 05 Nov 2025 03:44 UTC |
| Resource URI: | https://kar.kent.ac.uk/id/eprint/111837 (The current URI for this page, for reference purposes) |