Haru-Tada Sato and Fuka Matsuzaki
Adv. Artif. Intell. Mach. Learn., 5 (3):4222-4241
1. Haru-Tada Sato: Department of Data Science, i's Factory Corporation, Ltd.
2. Fuka Matsuzaki: Department of Data Science, i's Factory Corporation, Ltd.
DOI: 10.54364/AAIML.2025.53235
Article History: Received on: 21-May-25, Accepted on: 02-Sep-25, Published on: 09-Sep-25
Corresponding Author: Haru-Tada Sato
Email: satoh@isfactory.co.jp
Citation: Haru-Tada Sato, Fuka Matsuzaki. Exploring the Limits of Large Language Models: A Systematic Evaluation of Masked Text Processing Ability through MskQA and MskCal. Advances in Artificial Intelligence and Machine Learning. 2025;5 (3):235.
This paper evaluates the limitations of Large Language Models (LLMs) through their masked text processing capabilities. We introduce two novel tasks: MskQA, which measures reasoning on masked question-answering datasets, and MskCal, which assesses numerical reasoning on masked arithmetic problems. Testing GPT-4o and GPT-4o-mini reveals that LLM performance depends significantly on the masking rate and on the availability of semantic information. Our experiments demonstrate that performance decreases as semantic information is reduced, with GPT-4o consistently outperforming GPT-4o-mini, particularly on numerical reasoning tasks. The study shows that LLMs maintain reasonable accuracy at masking rates below 40% but struggle with heavily masked computational tasks. Our findings illuminate the interplay between background knowledge and reasoning ability in masked text processing, highlighting the need for more robust methods to assess LLMs' true comprehension capabilities.
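To make the notion of a "masking rate" concrete, the following is a minimal sketch of character-level masking at a fixed rate. The unit of masking, the placeholder symbol, and the function name are assumptions for illustration only; the paper's actual masking scheme may differ.

```python
import random

def mask_text(text: str, mask_rate: float, mask_token: str = "*", seed: int = 0) -> str:
    """Replace a fraction of alphanumeric characters with a mask token.

    Illustrative assumption: masking is applied per character and only to
    letters and digits, leaving spaces and punctuation intact.
    """
    rng = random.Random(seed)
    chars = list(text)
    # Indices of characters eligible for masking (letters and digits).
    eligible = [i for i, c in enumerate(chars) if c.isalnum()]
    n_mask = round(len(eligible) * mask_rate)
    for i in rng.sample(eligible, n_mask):
        chars[i] = mask_token
    return "".join(chars)

# A 40% masking rate, roughly the threshold below which the paper reports
# that LLMs retain reasonable accuracy.
masked = mask_text("What is 12 plus 30?", mask_rate=0.4)
```

At higher rates, progressively less semantic information survives, which is the variable the MskQA and MskCal evaluations control.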