ISSN: 2582-9793

SIA-CLIP: Learnable Sentiment Incongruity Gating for Efficient Multimodal Sarcasm Detection

Original Research (Published on: 22-May-2026)
DOI: https://doi.org/10.54364/AAIML.2026.63306

Ms. Sapna Madan and Mridula Batra

Adv. Artif. Intell. Mach. Learn., XX (XX):-

1. Ms. Sapna Madan: Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India.

2. Mridula Batra: Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India.



Article History: Received on: 19-Feb-26, Accepted on: 15-May-26, Published on: 22-May-26

Corresponding Author: Ms. Sapna Madan

Email: sapnasatija85@gmail.com

Citation: Sapna Madan and Mridula Batra. SIA-CLIP: Learnable Sentiment Incongruity Gating for Efficient Multimodal Sarcasm Detection. Advances in Artificial Intelligence and Machine Learning. 2026. (Ahead of Print) https://dx.doi.org/10.54364/AAIML.2026.63306


Abstract

    

Multimodal sarcasm detection is challenging because sarcasm often arises from cross-modal inconsistency between text sentiment and visual content, which is common in social media posts. Current CLIP-based approaches rely on a largely fixed feature extractor and model incongruity either implicitly or explicitly. As a result, these methods have limited ability to capture sarcasm-specific patterns and often require complex fusion modules that increase computational cost. To address these gaps, we present SIA-CLIP (Sentiment-Incongruity Augmented CLIP), a lightweight, interpretable model built around a novel learnable sentiment incongruity gate. The model combines task-adaptive fine-tuning of the CLIP backbone, cross-modal attention fusion, and supervised contrastive training applied to the fused embeddings. At its core, a dynamic gating mechanism projects the sentiment clash score, defined as the difference between the positive and negative sentiment scores produced by Twitter-RoBERTa, into the feature space to explicitly amplify or suppress sarcasm-related signals. Evaluation results on the MMSD2.0 (mmsd-clean) benchmark show that SIA-CLIP achieves an accuracy of 85.50%, a macro F1-score of 84.67%, and a sarcastic F1-score of 81.10% on the official test set, with an ROC-AUC of 0.9140. Stratified 5-fold cross-validation confirms strong robustness (88.29% mean accuracy, standard deviation 0.35%). The proposed model achieves competitive performance relative to computationally demanding state-of-the-art methods while employing approximately 15 million trainable parameters and providing intrinsic interpretability through gate activation values. Accompanying code will be made publicly available upon acceptance of this work.
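
The gating idea summarized in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names, the per-feature weights and biases, and the 2*sigmoid parameterisation (which keeps each gate in the range 0 to 2, so that values above 1 amplify a feature and values below 1 suppress it) are assumptions chosen for illustration only. The only element taken from the abstract is the clash score, the positive sentiment score minus the negative one.

```python
import math


def sentiment_clash(p_pos: float, p_neg: float) -> float:
    """Clash score: positive minus negative sentiment score (range [-1, 1] for probabilities)."""
    return p_pos - p_neg


def sigmoid(x: float) -> float:
    """Logistic function; inputs here stay small, so overflow is not a concern."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_features(fused, clash, weights, biases):
    """Scale each fused feature by a gate driven by the scalar clash score.

    Assumed parameterisation (hypothetical, not from the paper):
        gate_i = 2 * sigmoid(w_i * clash + b_i)
    With w_i = b_i = 0 every gate equals 1, so the layer starts as an identity.
    Returns the gated features and the gate values themselves, which can be
    inspected for interpretability.
    """
    if not (len(fused) == len(weights) == len(biases)):
        raise ValueError("fused, weights and biases must have the same length")
    gates = [2.0 * sigmoid(w * clash + b) for w, b in zip(weights, biases)]
    gated = [f * g for f, g in zip(fused, gates)]
    return gated, gates


if __name__ == "__main__":
    # Hypothetical sentiment scores for a caption that reads as strongly positive.
    clash = sentiment_clash(p_pos=0.90, p_neg=0.05)
    fused = [0.4, -1.2, 0.7]        # toy fused text-image embedding
    weights = [1.5, -1.5, 0.0]      # would be learned during training
    biases = [0.0, 0.0, 0.0]
    gated, gates = gate_features(fused, clash, weights, biases)
    print("clash score:", round(clash, 3))
    print("gate values:", [round(g, 3) for g in gates])
    print("gated features:", [round(v, 3) for v in gated])
```

In this toy setup a positive weight amplifies its feature when the text sentiment is strongly positive, a negative weight suppresses it, and a zero weight leaves it unchanged. Reading off the gate values is one simple way to see which features the clash score influenced.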
