[2Win5-57] A template-based metric for multi-attribute evaluation in Text-to-Image Generation
Keywords: Artificial intelligence, Image synthesis, Diffusion processes, Performance evaluation
Generative models like Stable Diffusion (SD) have gained popularity, necessitating robust methods for evaluating the alignment between text prompts and generated images. An existing Text-Image Alignment Metric (TIAM) employs a template-based approach to comprehensively evaluate alignment in terms of the number and color of objects. However, SD models can represent a diverse range of attributes beyond color, so a more inclusive evaluation is required. This study extends TIAM by incorporating attention maps obtained during the image generation process and vision-language models to assess diverse attributes beyond color, such as size, age, shape, and material. To validate the effectiveness of the proposed method, we conducted a survey comparing its results with human judgments. The results indicate that the proposed method outperforms the baseline, exhibiting a stronger correlation with human assessments. Additionally, we applied the proposed method to SD1.4 to analyze its generation capabilities. The results revealed that generation performance varies depending on the attribute type, and that performance decreases as the number of attributes specified in the prompt increases.
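To make the template-based setup concrete, below is a minimal sketch of how prompts can be expanded over multiple attribute types and scored TIAM-style (fraction of images in which every specified attribute is judged present). The attribute vocabularies, template wording, and the random stand-in for the VLM judgment step are illustrative assumptions, not the paper's actual settings; the attention-map component is omitted.

```python
# Sketch of TIAM-style template expansion extended to multiple attribute
# types, per the abstract. Vocabularies and template are assumptions.
import itertools
import random

ATTRIBUTES = {
    "color":    ["red", "blue", "green"],
    "size":     ["small", "large"],
    "shape":    ["round", "square"],
    "material": ["wooden", "metal"],
}
OBJECTS = ["cup", "chair", "ball"]

def make_prompts(attr_types, objects=OBJECTS):
    """Expand a template over all combinations of the chosen attribute types."""
    pools = [ATTRIBUTES[a] for a in attr_types]
    for obj in objects:
        for values in itertools.product(*pools):
            prompt = f"a photo of a {' '.join(values)} {obj}"
            yield prompt, dict(zip(attr_types, values)), obj

def tiam_score(judgments):
    """TIAM-style success rate: fraction of images in which every
    specified attribute was judged present (binary per attribute)."""
    successes = sum(all(per_attr.values()) for per_attr in judgments)
    return successes / len(judgments)

if __name__ == "__main__":
    prompts = list(make_prompts(["color", "size"]))
    print(len(prompts), "prompts, e.g.:", prompts[0][0])
    # Stand-in for the VLM-based per-attribute check (assumption):
    # random binary judgments instead of querying a real model.
    random.seed(0)
    fake = [{a: random.random() > 0.3 for a in ("color", "size")}
            for _ in prompts]
    print("TIAM-style score:", round(tiam_score(fake), 3))
```

In practice the fake judgments would be replaced by per-attribute yes/no answers from a vision-language model over the generated images, and the score would be aggregated per prompt across random seeds.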