[4Rin1-36] Automatic evaluation of open-domain dialogue systems using automatically-augmented references
Keywords: Natural Language Processing, Non-task-oriented Dialogue, Evaluation Metric
However, when evaluating responses generated by dialogue systems, it is difficult to account for the diversity of valid responses, because in practice only one response can be extracted from a real conversation as the reference.
To address this problem, ΔBLEU augments the reference responses with responses drawn from massive dialogue data, each manually annotated with its appropriateness as a response.
Because human annotation is costly, ΔBLEU cannot be applied to large-scale evaluation of open-domain dialogue systems, which must be evaluated across diverse contexts.
We propose ΔBLEU-auto, a fully automatic evaluation method that annotates the appropriateness of the augmented references used in ΔBLEU with a classifier trained on automatically collected data.
Experimental results confirm that ΔBLEU-auto is comparable to ΔBLEU in terms of correlation with human judgment, and that integrating ΔBLEU-auto into RUBER improves on that state-of-the-art evaluation method.
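For concreteness, the rating-weighted n-gram matching at the heart of ΔBLEU (and hence of ΔBLEU-auto, which only changes how the weights are obtained) can be sketched as below. This is a simplified single-order precision, not the full metric (no brevity penalty, no geometric mean over n-gram orders), and the function name and data layout are our own illustration, not from the paper:

```python
from collections import Counter

def delta_bleu_precision(hypothesis, weighted_refs, n=1):
    """Rating-weighted n-gram precision in the spirit of ΔBLEU.

    hypothesis: list of tokens produced by the dialogue system.
    weighted_refs: list of (tokens, weight) pairs, where weight is the
    (possibly negative) appropriateness score of that reference --
    human-annotated in ΔBLEU, classifier-predicted in ΔBLEU-auto.
    """
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp_counts = ngrams(hypothesis)
    ref_counts = [(ngrams(toks), w) for toks, w in weighted_refs]
    max_weight = max(w for _, w in ref_counts)

    num = 0.0
    den = 0.0
    for gram, count in hyp_counts.items():
        # Denominator: each hypothesis n-gram scaled by the best reference weight.
        den += max_weight * count
        # Numerator: clipped matches against each reference, scaled by its weight;
        # take the best-scoring reference for this n-gram.
        matches = [w * min(count, rc[gram]) for rc, w in ref_counts if gram in rc]
        if matches:
            num += max(matches)
    return num / den if den else 0.0
```

A response matching only a low-rated (or negatively rated) augmented reference is thus rewarded less than one matching a highly rated reference, which is what lets the extended reference set capture response diversity without inflating scores for inappropriate responses.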