ISSN :2582-9793

Exploring the Limits of Large Language Models: A Systematic Evaluation of Masked Text Processing Ability through MskQA and MskCal

Original Research (Published On: 09-Sep-2025)
DOI : https://doi.org/10.54364/AAIML.2025.53235

Haru-Tada Sato and Fuka Matsuzaki

Adv. Artif. Intell. Mach. Learn., 5 (3):4222-4241

1. Haru-Tada Sato: Department of Data Science, i's Factory Corporation, Ltd.

2. Fuka Matsuzaki: Department of Data Science, i's Factory Corporation, Ltd.



Article History: Received on: 21-May-25, Accepted on: 02-Sep-25, Published on: 09-Sep-25

Corresponding Author: Haru-Tada Sato

Email: satoh@isfactory.co.jp

Citation: Haru-Tada Sato, Fuka Matsuzaki. Exploring the Limits of Large Language Models: A Systematic Evaluation of Masked Text Processing Ability through MskQA and MskCal. Advances in Artificial Intelligence and Machine Learning. 2025;5 (3):235.


Abstract

This paper evaluates the limitations of Large Language Models (LLMs) through their masked text processing capabilities. We introduce two novel tasks: MskQA, which measures reasoning on masked question-answering datasets, and MskCal, which assesses numerical reasoning on masked arithmetic problems. Testing GPT-4o and GPT-4o-mini reveals that LLM performance depends significantly on the masking rate and the availability of semantic information. Our experiments demonstrate that performance decreases as semantic information is reduced, with GPT-4o consistently outperforming GPT-4o-mini, particularly on numerical reasoning tasks. The study shows that LLMs maintain reasonable accuracy at masking rates below 40%, but struggle with heavily masked computational tasks. Our findings illuminate the interaction between background knowledge and reasoning ability in masked text processing, highlighting the need for more robust methods of assessing LLMs' true comprehension capabilities.
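To make the notion of a "masking rate" concrete, the sketch below masks a chosen fraction of the alphanumeric characters in a prompt, as one plausible way such inputs could be constructed. The function `mask_text` and its parameters are illustrative assumptions, not the authors' actual implementation.

```python
import random

def mask_text(text: str, rate: float, mask_char: str = "*", seed: int = 0) -> str:
    """Replace roughly `rate` of the alphanumeric characters with `mask_char`.

    Whitespace and punctuation are left intact so the sentence structure
    (and thus some semantic information) survives the masking.
    """
    rng = random.Random(seed)  # fixed seed for reproducible masking
    chars = list(text)
    # Indices of characters eligible for masking (letters and digits only)
    candidates = [i for i, c in enumerate(chars) if c.isalnum()]
    n_mask = int(len(candidates) * rate)
    for i in rng.sample(candidates, n_mask):
        chars[i] = mask_char
    return "".join(chars)

# Example: a 40% masking rate applied to a small arithmetic prompt
print(mask_text("What is 12 + 34?", 0.4))
```

At a 40% rate, 4 of the 10 alphanumeric characters above are replaced, while the operator, spaces, and question mark remain, leaving partial semantic cues for the model to reason over.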
