JSAI2025

Presentation information

Organized Session


[1P3-OS-1a] OS-1

Tue. May 27, 2025 1:40 PM - 3:20 PM Room P (Room 801-2)

Organizers: Kenji Suzuki (Sony Group), Satoshi Hara (The University of Electro-Communications), Hitomi Yanaka (The University of Tokyo), Saku Sugawara (National Institute of Informatics)

3:00 PM - 3:20 PM

[1P3-OS-1a-04] Evaluation of "Evaluation"

On Construct Validity in Evaluation of Language Models

〇Takuma Sato1,2, Saku Sugawara2 (1. Nara Institute of Science and Technology, 2. National Institute of Informatics)

Keywords: Evaluation, Natural Language Processing, Benchmark, Construct Validity

In the research and development of natural language processing (NLP), measuring and evaluating model performance plays a crucial role, which has led to the proposal of numerous benchmarks and performance evaluation tasks.
The quality of each evaluation method is often judged based on practical considerations such as task comprehensiveness and ease of implementation.
However, it is also essential to assess whether a given evaluation method adequately measures the traits of the model it aims to evaluate and whether the interpretations and inferences drawn from the measurement results are justified.
This aspect is critical to the "evaluation of evaluation" itself.
In this paper, we provide an overview of construct validity theory in psychometrics and its validation methods, propose an empirical approach to confirming construct validity, and examine the current state of practice in NLP evaluation.
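To make the idea concrete, one common empirical check from construct validity theory is convergent validity: if two benchmarks claim to measure the same trait, models' scores on them should correlate. The sketch below is purely illustrative and not taken from the paper; the benchmark names and score values are hypothetical.

```python
# Hypothetical convergent-validity check: correlate model scores across two
# benchmarks that are claimed to measure the same underlying trait.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical accuracies of five models on two reading-comprehension benchmarks.
bench_a = [0.62, 0.71, 0.75, 0.80, 0.88]
bench_b = [0.58, 0.69, 0.72, 0.79, 0.90]

r = pearson(bench_a, bench_b)
print(f"convergent validity (Pearson r) = {r:.3f}")
```

A high correlation is consistent with (though not proof of) the two benchmarks measuring the same construct; a low correlation signals that at least one of them may not measure what it claims to.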
