JSAI2024

Presentation information

Poster Session

[4Xin2] Poster session 2

Fri. May 31, 2024 12:00 PM - 1:40 PM Room X (Event hall 1)

[4Xin2-53] Verification of Using LLM for Automating Dialogue Data Evaluation

〇Yuki Kubo1, Tomoya Yamashita1, Masanori Yamada1 (1.NTT Social Informatics Laboratories)

Keywords: Dialogue System, Evaluation, Large Language Model

There are many methods for building dialogue systems, but evaluating dialogues remains challenging. Metrics such as dialogue quality are difficult to quantify and are often assessed by human judgement. Recently, methods that use LLMs to evaluate dialogue data have been proposed. LLM evaluations are relatively similar to human evaluations, but the agreement is not yet sufficient. The Elo rating system, which scores data through pairwise comparisons, is assumed not to depend on differences in standards among evaluators, and is therefore expected to improve accuracy. However, it may fail to do so in some cases, for example when the distribution of evaluation values is biased. In this study, we examine whether the Elo rating system improves evaluation accuracy under various distributions of evaluation values.
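For readers unfamiliar with the mechanism referenced in the abstract, the following is a minimal sketch of the standard Elo update applied to pairwise dialogue comparison. The parameter values (K = 32, initial rating 1500) and the dialogue names are conventional defaults chosen for illustration, not values from the study; in the authors' setting the pairwise outcomes would come from an LLM judging which of two dialogues is better.

```python
# Minimal sketch: Elo rating from pairwise comparisons of dialogue data.
# K-factor and initial rating are conventional defaults (assumptions).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that item A beats item B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return the updated ratings after one pairwise comparison."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Hypothetical usage: (winner, loser) pairs as an LLM judge might produce them.
ratings = {"dialogue_1": 1500.0, "dialogue_2": 1500.0, "dialogue_3": 1500.0}
judgements = [("dialogue_1", "dialogue_2"),
              ("dialogue_1", "dialogue_3"),
              ("dialogue_2", "dialogue_3")]
for winner, loser in judgements:
    ratings[winner], ratings[loser] = elo_update(
        ratings[winner], ratings[loser], a_wins=True)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Because each update depends only on which of the two items won, not on an absolute score assigned by the judge, the resulting ranking is insensitive to differences in rating scale or strictness among evaluators, which is the property the abstract appeals to.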
