Zijiang Yang and Tad Gonsalves
Adv. Artif. Intell. Mach. Learn., 5 (1):3216-3235
Zijiang Yang : Department of Information and Computer Science
Tad Gonsalves : Sophia University
Article History: Received on: 27-Nov-24, Accepted on: 04-Jan-25, Published on: 11-Jan-25
Corresponding Author: Zijiang Yang
Email: z-yang-4w3@eagle.sophia.ac.jp
Citation: Zijiang Yang, Tad Gonsalves (JAPAN) (2025). Flexible Transformer: A Simple Novel Transformer-based Network for Image Classification in Variant Input Image Sizes. Adv. Artif. Intell. Mach. Learn., 5 (1):3216-3235
Convolutional neural networks (CNNs), the most important deep learning networks for computer vision, have undergone a series of developments and improvements for image-related tasks such as object recognition, image classification, and semantic segmentation. In the field of natural language processing (NLP), however, the novel attention-based network Transformer had a profound impact on machine translation, which subsequently led to a boom in attention-based models for computer vision. State-of-the-art models with attention have already shown good performance on computer vision tasks owing to the sophisticated design of their network architectures and advanced computational-efficiency techniques. For example, self-attention learns relationships between segments or words at different positions, which convolutional neural networks capture less directly. Inspired by the Vision Transformer (ViT), we propose a simple novel transformer-based model, called Flexible Transformer, which inherits the properties of attention-based architectures and is flexible for inputs of arbitrary size. In contrast to ViT, the inputs are not pre-processed, e.g., by resizing or cropping, which could lead to distortion or loss of information; instead, they are kept intact. In this paper, we present a novel and simple architecture that meets these requirements. Compared to the state of the art, our model processes inputs of arbitrary image size without any pre-processing or pre-training costs. Moreover, the experimental results show that the model can potentially achieve good results with high accuracy despite limited resources. Even though the results of the Flexible Transformer are not as accurate as those of the Vision Transformer, they demonstrate the potential of a high-performance model for image classification tasks with variable-size images. The significance of this research is that it opens up possibilities for dealing with primitive, unaltered images in deep learning tasks.
Based on the original inputs, reliable results with good accuracy could be obtained if the proposed model is optimized and further trained on large datasets.
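The abstract's central claim is that a single set of attention weights can serve inputs of arbitrary size, because self-attention operates over a variable-length sequence of patch embeddings rather than a fixed spatial grid. The following minimal NumPy sketch illustrates this property in general; it is not the paper's implementation, and the projection dimensions and patch counts are illustrative assumptions only.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a patch sequence.

    x: (n_patches, d) array; n_patches may vary with input image size,
    while the projection matrices Wq, Wk, Wv stay fixed at (d, d).
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (n_patches, n_patches)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                      # (n_patches, d)

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# The same weights handle 9 patches (e.g. a smaller image) or
# 16 patches (a larger image) alike -- no resizing or cropping.
out_small = self_attention(rng.standard_normal((9, d)), Wq, Wk, Wv)
out_large = self_attention(rng.standard_normal((16, d)), Wq, Wk, Wv)
```

The key design point is that none of the learned parameters depend on the sequence length, so images of different sizes, split into different numbers of patches, pass through the identical layer.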