Ms. Sapna Madan and Mridula Batra
Adv. Artif. Intell. Mach. Learn., XX (XX):-
1. Ms. Sapna Madan: Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India.
2. Mridula Batra: Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India.
DOI: 10.54364/AAIML.2026.63306
Article History: Received on: 19-Feb-26, Accepted on: 15-May-26, Published on: 22-May-26
Corresponding Author: Ms. Sapna Madan
Email: sapnasatija85@gmail.com
Citation: Sapna Madan and Mridula Batra. SIA-CLIP: Learnable Sentiment Incongruity Gating for Efficient Multimodal Sarcasm Detection. Advances in Artificial Intelligence and Machine Learning. 2026. (Ahead of Print) https://dx.doi.org/10.54364/AAIML.2026.63306
Multimodal sarcasm detection is challenging because sarcasm in social media posts often
arises from cross-modal inconsistency between the sentiment of the text and the accompanying
visual content. Current CLIP-based approaches use a largely fixed feature
extractor and treat incongruity either implicitly or explicitly. As a result, these methods
have limited ability to capture sarcasm-specific patterns and often require complex fusion
modules that increase computational cost. To address these gaps, we present SIA-CLIP
(Sentiment-Incongruity Augmented CLIP), a lightweight, interpretable model built around
a novel learnable sentiment incongruity gate. The model combines task-adaptive fine-tuning
of the CLIP backbone, cross-modal attention fusion, and supervised contrastive training
applied to the fused embeddings. At its core, a dynamic gating mechanism
projects the sentiment clash score, defined as the difference between the positive and negative
sentiment scores from Twitter-RoBERTa, into the feature space to explicitly amplify or suppress
sarcasm-related feature representations. Evaluation results on the MMSD2.0 (mmsd-clean) benchmark
show that SIA-CLIP achieves an accuracy of 85.50%, a macro F1-score of 84.67%, and
a sarcastic F1-score of 81.10% on the official test set, with an ROC-AUC of 0.9140 and
strong robustness (88.29% mean accuracy, standard deviation 0.35%) confirmed via stratified
5-fold cross-validation. The proposed model achieves competitive performance relative
to computationally demanding state-of-the-art methods while employing approximately 15
million trainable parameters and providing intrinsic interpretability through gate activation
values. Accompanying code will be made publicly available upon acceptance of this work.
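
The gating idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-dimension weights and biases, and the choice of a `2 * sigmoid(...)` scaling (so that a gate value above 1 amplifies a feature and a value below 1 suppresses it) are all assumptions made for illustration. Only the definition of the clash score as the positive minus the negative sentiment score comes from the text.

```python
import math


def clash_score(p_pos, p_neg):
    """Sentiment clash score: positive minus negative sentiment score."""
    return p_pos - p_neg


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def incongruity_gate(features, score, weights, biases):
    """Project a scalar clash score into feature space and gate each feature.

    Hypothetical form: gate_i = 2 * sigmoid(w_i * score + b_i).
    A gate above 1 amplifies feature i, a gate below 1 suppresses it.
    In a trained model, weights and biases would be learnable parameters.
    """
    if not (len(features) == len(weights) == len(biases)):
        raise ValueError("features, weights and biases must have equal length")
    gates = [2.0 * sigmoid(w * score + b) for w, b in zip(weights, biases)]
    gated = [g * f for g, f in zip(gates, features)]
    return gated, gates


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    s = clash_score(0.8, 0.1)
    fused = [1.0, -2.0, 0.5]
    gated, gates = incongruity_gate(fused, s, [1.5, -1.5, 0.0], [0.0, 0.0, 0.0])
    print("gates:", gates)
    print("gated features:", gated)
```

In this sketch the returned gate values can be inspected directly, which mirrors the kind of interpretability through gate activations that the abstract attributes to the model.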