ISSN: 2582-9793

Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks

Original Research (Published on: 31-Aug-2023)
DOI: 10.54364/AAIML.2023.1181

Wenping Wang

Adv. Artif. Intell. Mach. Learn., 3 (3):1369–1388

Wenping Wang : Individual Researcher

Article History: Received on: 05-Jun-23, Accepted on: 23-Aug-23, Published on: 31-Aug-23

Corresponding Author: Wenping Wang

Email: wenpingw@alumni.cmu.edu

Citation: Tong Chen, Sicong Liu, Zhiran Chen, Wenyan Hu, Dachi Chen, Yuanxin Wang, Qi Lyu, Cindy X. Le, Wenping Wang (2023). Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks. Adv. Artif. Intell. Mach. Learn., 3 (3):1369–1388.


Abstract

    

Multi-layer transformer architectures have recently come to dominate vision-language tasks. However, massive transformer models are often inaccessible to many researchers because of their sheer size, and they are frequently treated as black boxes with poor interpretability. In this paper, we examine these weaknesses and propose solutions to them. In particular, we select Oscar (Li et al., 2020), a state-of-the-art model, and apply distillation techniques and attention visualization to address these issues. Moreover, we attempt to improve the overall effectiveness of Oscar by making its inferred object tags more useful. Through detailed experiments, we show that we can both improve performance on vision-language tasks and make the models more transparent and accessible to all researchers. We analyze the findings in detail, including the effects of tags and confidence and the training behavior of distillation, and conclude by pointing out future directions.
