ISSN: 2582-9793

ViTARHeat: A Resolution-Agnostic, Wavelet-Enhanced Framework for High-Fidelity Inpainting Detection

Original Research (Published on: 31-Jan-2026)
DOI: https://doi.org/10.54364/AAIML.2026.61275

Adrian-Alin Barglazan, Remus Brad and Stefani Berghia

Adv. Artif. Intell. Mach. Learn., 6 (1):4959-4975

1. Adrian-Alin Barglazan: University "Lucian Blaga" Sibiu, Romania

2. Remus Brad: University "Lucian Blaga" Sibiu, Romania

3. Stefani Berghia: University "Lucian Blaga" Sibiu, Romania



Article History: Received on: 18-Nov-25, Accepted on: 24-Jan-26, Published on: 31-Jan-26

Corresponding Author: Adrian-Alin Barglazan

Email: adrian.barglazan@ulbsibiu.ro

Citation: Adrian-Alin Barglazan, et al. ViTARHeat: A Resolution-Agnostic, Wavelet-Enhanced Framework for High-Fidelity Inpainting Detection. Advances in Artificial Intelligence and Machine Learning. 2026;6(1):275. https://dx.doi.org/10.54364/AAIML.2026.61275


Abstract

    

Digital media authenticity is threatened by sophisticated generative image inpainting models, especially diffusion-based ones. These tools enable malicious removal or alteration of image content, producing photorealistic results whose manipulation is invisible to the human eye. Most inpainting detection methods based on Convolutional Neural Networks (CNNs) require fixed-resolution inputs. This constraint forces high-resolution images to be downsampled, destroying the subtle high-frequency artifacts and noise inconsistencies that are the traces of forgery. This paper introduces ViTARHeat, a dual-stream framework that avoids this downsampling. ViTARHeat’s architecture combines two innovations. First, it uses a Vision Transformer with Any Resolution (ViTAR) to process images at their native resolution, preserving forensic traces. Second, it adds a parallel EWSN branch, which uses the Dual-Tree Complex Wavelet Transform (DT-CWT) as a non-semantic feature extractor to amplify the microscopic texture anomalies and boundary discontinuities left by inpainting. ViTAR provides global semantic context at native resolution, while EWSN provides a high-frequency artifact heatmap. A shared decoder fuses these streams to produce a pixel-level localization mask. We show that ViTARHeat outperforms existing methods on difficult, large-scale benchmarks such as IMD2020, DEFACTO, and IID-Net, achieving state-of-the-art performance. Ablation studies indicate that ViTAR’s resolution-agnosticism and the EWSN’s artifact amplification are key to this performance. All materials for this paper, including the model code, the training, validation, and testing code, and our pretrained models, are available at:

https://github.com/jmaba/transformer-based-image-inpainting-forgery-detection/
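
The idea of a non-semantic, high-frequency artifact heatmap can be illustrated with a much simpler transform than the one the paper uses. The sketch below is not the authors' EWSN branch and does not implement the DT-CWT: it assumes a single-level Haar wavelet as a stand-in, and the function name `haar_detail_heatmap` is hypothetical. It only shows how wavelet detail coefficients can be turned into a map that highlights fine-texture regions at whatever resolution the input happens to have.

```python
import numpy as np


def haar_detail_heatmap(image: np.ndarray) -> np.ndarray:
    """Half-resolution map of high-frequency energy from one Haar level.

    Simplified stand-in for a wavelet artifact heatmap (assumption:
    single-level Haar, not the DT-CWT described in the abstract).
    """
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape
    # Crop to even dimensions so the image splits into 2x2 blocks;
    # no resizing is applied, so any input resolution is accepted.
    img = img[: h - h % 2, : w - w % 2]

    # The four pixels of every 2x2 block.
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]

    # Haar detail sub-bands (the low-pass average is deliberately dropped,
    # which is what makes the map insensitive to smooth image content).
    horiz = (a - b + c - d) / 2.0
    vert = (a + b - c - d) / 2.0
    diag = (a - b - c + d) / 2.0

    # Combine the sub-bands into one energy value per block.
    energy = np.sqrt(horiz**2 + vert**2 + diag**2)
    peak = energy.max()
    # Normalise to [0, 1]; a perfectly flat image stays all zeros.
    return energy / peak if peak > 0 else energy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = np.zeros((64, 64))
    demo[16:32, 16:32] = rng.normal(size=(16, 16))  # synthetic textured patch
    heat = haar_detail_heatmap(demo)
    print(heat.shape, float(heat[8:16, 8:16].mean()))
```

In a dual-stream design like the one the abstract describes, a map of this kind would be one input to the decoder alongside the transformer features; how the paper actually fuses the two streams is not specified here.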

