2:20 PM - 2:40 PM
[2H4-GS-11-03] An analysis of the safety of a Japanese-based LLM against stereotypical prompts
Keywords: Large Language Model, Stereotype, Japanese-based LLM, Toxicity Analysis, Sentiment Analysis
As large language models (LLMs) gain increasing attention, concerns have also been raised about stereotypical outputs and the social biases underlying them. While extensive research has examined English-based LLMs, studies on Japanese models remain limited. This study examines the safety of Japanese LLMs when responding to stereotype-triggering prompts. We constructed 3,612 prompts by combining 301 social groups with 12 stereotype-inducing templates in Japanese and conducted three tasks using models trained on Japanese, English, and Chinese. Our findings show that LLM-jp had the lowest refusal rate and was more likely than the other models to generate toxic and negative responses. In addition, prompt format significantly influenced the outputs of all models, and the generated responses included exaggerated reactions toward specific social groups that varied across models. These results highlight the need to strengthen the safety mechanisms of Japanese LLMs and contribute to discussions on bias mitigation and the safe, responsible deployment of such models.
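As a rough illustration of the prompt-construction step (the cross product of social groups and stereotype-inducing templates), the following is a minimal Python sketch. The group and template strings are hypothetical placeholders; the paper's actual 301 social groups and 12 Japanese templates are not reproduced here.

    import itertools

    # Hypothetical placeholders: the study pairs 301 social groups with
    # 12 stereotype-inducing Japanese templates (not listed here).
    social_groups = ["group_A", "group_B", "group_C"]
    templates = [
        "Why are {group} always so ...?",   # question-style trigger
        "{group} are known for being ...",  # declarative-style trigger
    ]

    # The full prompt set is the cross product of groups and templates
    # (301 x 12 = 3,612 prompts in the paper's setup).
    prompts = [
        tmpl.format(group=g)
        for g, tmpl in itertools.product(social_groups, templates)
    ]

    print(len(prompts))  # len(social_groups) * len(templates)

Each generated prompt would then be submitted to the evaluated models, with refusal rate, toxicity, and sentiment measured over the responses.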