Automatic evaluation of open-domain dialogue systems using automatically-augmented references

Yuma Tsuta; Naoki Yoshinaga; Masashi Toyoda

[4Rin1-36] Automatic evaluation of open-domain dialogue systems using automatically-augmented references

〇Yuma Tsuta¹, Naoki Yoshinaga², Masashi Toyoda² (1. University of Tokyo, 2. Institute of Industrial Science)

Keywords: Natural Language Processing, Non-task-oriented Dialogue, Evaluation Metric

In open-domain dialogue, appropriate responses to the same context can vary widely in both content and style.

However, it is difficult to account for this diversity when evaluating responses generated by dialogue systems, because a real conversation typically provides only one reference response per context.

To address this problem, ΔBLEU augments the reference responses with responses drawn from massive dialogue data, each manually annotated with its appropriateness as a response.
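
As a rough illustration of how appropriateness-weighted references can enter an n-gram precision, the following Python sketch scores one hypothesis against several references, each paired with a weight. It is a simplified, single-sentence sketch and not the exact ΔBLEU formulation: the function names, the example weights, and the choice to normalize by the largest reference weight are assumptions made for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def weighted_precision(hypothesis, references, n=1):
    """Weighted n-gram precision of one hypothesis (illustrative only).

    references: list of (tokens, weight) pairs, where weight is the
    appropriateness score assigned to that reference (assumed given).
    """
    hyp_counts = ngrams(hypothesis, n)
    if not hyp_counts or not references:
        return 0.0

    ref_counts = [(ngrams(tokens, n), weight) for tokens, weight in references]
    max_weight = max(weight for _, weight in ref_counts)

    numerator = 0.0
    denominator = 0.0
    for gram, h_count in hyp_counts.items():
        # Credit each matched n-gram with the best weighted clipped count
        # among the references that actually contain it.
        matches = [
            weight * min(h_count, counts[gram])
            for counts, weight in ref_counts
            if counts[gram] > 0
        ]
        if matches:
            numerator += max(matches)
        # Assumption: normalize by the best weight any reference could give.
        denominator += max_weight * h_count

    return numerator / denominator if denominator > 0 else 0.0
```

Under these assumptions, a hypothesis that only matches a low-weight reference earns less credit than one that matches a high-weight reference, which is the intuition behind weighting augmented references.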

Because human annotation is costly, ΔBLEU cannot be used for large-scale evaluation of open-domain dialogue systems, which should be evaluated in a wide variety of contexts.

We propose a fully automatic evaluation method, ΔBLEU-auto, in which a classifier trained on automatically collected data annotates the appropriateness of the augmented references used in ΔBLEU.
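
The idea of replacing manual annotation with a classifier can be sketched as a small Python function that assigns a weight to each candidate reference. This is a minimal sketch under stated assumptions, not the authors' implementation: `annotate_references`, the `scorer` interface (a callable returning a probability in [0, 1]), the linear rescaling to [-1, 1], and the word-overlap `toy_scorer` are all hypothetical stand-ins, since the abstract does not describe the classifier or its training data.

```python
from typing import Callable, List, Tuple


def annotate_references(
    context: str,
    candidates: List[str],
    scorer: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Attach an automatically predicted weight to each candidate reference.

    scorer is a hypothetical classifier interface: it returns the
    probability in [0, 1] that a candidate is an appropriate response.
    """
    weighted = []
    for candidate in candidates:
        p = scorer(context, candidate)
        if not 0.0 <= p <= 1.0:
            raise ValueError("scorer must return a probability in [0, 1]")
        # Assumption: rescale the probability linearly to a weight in [-1, 1].
        weighted.append((candidate, 2.0 * p - 1.0))
    return weighted


def toy_scorer(context: str, candidate: str) -> float:
    """Toy stand-in for a trained classifier: word overlap with the context."""
    context_words = set(context.lower().split())
    candidate_words = set(candidate.lower().split())
    if not candidate_words:
        return 0.0
    return len(context_words & candidate_words) / len(candidate_words)
```

In a real setting the toy scorer would be replaced by a trained model; the point of the interface is only that no human rating is needed to produce the weights.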

Experimental results confirmed that ΔBLEU-auto is comparable to ΔBLEU in terms of correlation with human judgment, and that integrating ΔBLEU-auto into RUBER, the state-of-the-art evaluation method, improves RUBER.

Presentation information

[4Rin1] Interactive 2
