6:40 PM - 7:00 PM
[1B5-OS-41c-04] JDERW: Japanese LLM Deduction benchmark requiring a world model
Keywords: World Model, Deduction, Benchmark
Recent studies suggest that Large Language Models (LLMs) demonstrate capabilities beyond simple next-token prediction, prompting discussion about whether they acquire world models. This paper introduces Basic-JDERW, a deductive reasoning benchmark dataset that requires fundamental world understanding. The dataset comprises 103 QA tasks that demand the application of basic world models, ranging from comprehension of physical phenomena to common sense reasoning and action planning, and is categorized into six types: causal reasoning, temporal reasoning, spatial reasoning, abstract concept reasoning, common sense reasoning, and planning. Through evaluation experiments with eight LLMs, we analyzed model performance across categories and examined correlations with existing benchmarks. Notably, llama3.3-70B-instruct demonstrated superior performance in categories requiring physical understanding, such as temporal and spatial reasoning. This research offers a new perspective on evaluating the basic world understanding glimpsed through LLMs' reasoning abilities and aims to contribute to clarifying the relationship between linguistic reasoning and world comprehension.
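As a rough illustration of how a category-labelled benchmark of this kind could be scored, the sketch below computes per-category accuracy over QA items; the item fields (`category`, `question`, `answer`), the `model_answer_fn` callable, and the exact-match criterion are assumptions for illustration, not the paper's actual dataset schema or evaluation metric.

```python
from collections import defaultdict

# Hypothetical schema for a Basic-JDERW-style item; the real dataset format is
# not given in the abstract, so these field names are illustrative only.
items = [
    {"category": "spatial", "question": "...", "answer": "..."},
    # ... 103 items spread across the six reasoning categories
]


def evaluate(model_answer_fn, items):
    """Return per-category accuracy for a model's answers (exact match assumed)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        prediction = model_answer_fn(item["question"])  # query the LLM under test
        if prediction.strip() == item["answer"].strip():
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

A per-category breakdown like this is what would let one compare, for example, temporal and spatial reasoning scores across the eight evaluated LLMs.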