[1Win4-101] An Automated Evaluation Method for Business Requirements Using LLMs
Keywords: LLM Evaluation, LLM-as-a-Judge, Evaluation Criteria, System Development
Large language models (LLMs) are expected to improve the efficiency of system development tasks at companies. When companies apply LLMs to system development, they must evaluate how well these models meet task requirements. One approach, known as LLM-as-a-Judge, automatically calculates evaluation scores from criteria designed by human experts. However, designing these criteria requires experts who understand both the business requirements and the characteristics of various LLMs. In this study, we propose a method that automatically evaluates LLMs against business requirements without requiring such specialized knowledge. We ask two or more LLMs to answer business-related questions and then use another LLM to compare these answers and generate business-specific evaluation criteria. We then perform an absolute evaluation using criterion weights computed through the Analytic Hierarchy Process (AHP). We experiment with five LLMs on the task of reviewing system design documents and confirm that GPT-4 series LLMs satisfy the task requirements.
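The AHP weighting step mentioned above can be illustrated with a short sketch. This is not the paper's implementation; the criteria, the pairwise comparison matrix, and the per-criterion scores below are hypothetical, and the weights are derived with the standard geometric-mean approximation of AHP priorities.

```python
# Illustrative AHP sketch (hypothetical values, not the paper's code).

def ahp_weights(pairwise):
    """Approximate AHP priority weights from a pairwise comparison
    matrix using the geometric-mean method: each criterion's weight is
    the geometric mean of its row, normalized to sum to 1."""
    n = len(pairwise)
    geo_means = []
    for row in pairwise:
        prod = 1.0
        for v in row:
            prod *= v
        geo_means.append(prod ** (1.0 / n))
    total = sum(geo_means)
    return [g / total for g in geo_means]

def weighted_score(weights, scores):
    """Absolute evaluation: weighted sum of per-criterion scores."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical pairwise comparisons (Saaty 1-9 scale) among three
# generated criteria, e.g. completeness vs. consistency vs. terminology.
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_weights(pairwise)

# Hypothetical per-criterion scores (scale 1-5) that an LLM judge
# assigned to one candidate answer.
scores = [4, 3, 5]
print(round(weighted_score(weights, scores), 2))
```

Here the first criterion dominates the pairwise matrix, so it receives the largest weight and contributes most to the final absolute score.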