〇Yuki Wakai2, Kunihiro Ito1, Hisashi Kashima2
(1. NEC Corporation, 2. Kyoto University)
Keywords: Large language model, AI safety, Trustworthy AI, Jailbreak, In-context defense
Although large language models (LLMs) have been applied across a wide range of fields in society, ``jailbreak'' attacks that exploit their vulnerabilities have raised serious security concerns. Jailbreak attacks use strategically crafted prompts to circumvent an LLM's safeguards and induce outputs that its developers did not intend. Because jailbreak strategies evolve and diversify rapidly, it is nearly impossible to address them comprehensively during the training phase. Growing attention has therefore been paid to in-context defenses, which aim to prevent inappropriate outputs by appending ideal responses or expected behaviors to user inputs. However, in existing methods these additional prompts degrade response quality, for example by causing the model to refuse even benign inputs, which presents a significant obstacle to practical deployment. This paper proposes a novel in-context defense method called ``Role Specification.'' In experiments using Llama-2-7b-chat, the proposed method (1) demonstrated superior defense performance against jailbreaks without compromising response quality, and (2) achieved a more favorable trade-off between response quality and defense performance when combined with existing methods.
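As background, an in-context defense of the kind described above can be implemented by prepending role or behavior instructions to each user query before it is sent to the chat model. The sketch below illustrates this general pattern only; the actual wording of the paper's ``Role Specification'' prompt is not given in the abstract, so `DEFENSE_PREAMBLE` and the `chat` function are hypothetical placeholders.

```python
# Minimal sketch of an in-context defense wrapper for a chat-style LLM.
# DEFENSE_PREAMBLE is a hypothetical placeholder, NOT the paper's actual
# "Role Specification" prompt, which is not disclosed in the abstract.

DEFENSE_PREAMBLE = (
    "You are a helpful assistant. Stay in this role: refuse requests for "
    "harmful, illegal, or unsafe content, but answer benign requests fully."
)

def build_defended_messages(user_input: str) -> list[dict]:
    """Attach the in-context defense to a user query as a system-style message."""
    return [
        {"role": "system", "content": DEFENSE_PREAMBLE},
        {"role": "user", "content": user_input},
    ]

# Example usage with a hypothetical chat-completion function `chat(messages)`:
# reply = chat(build_defended_messages("How do I reset my router password?"))
```

The trade-off the abstract highlights is that stronger defense preambles tend to increase refusals of benign queries, so the content of this prefix must balance safety against response quality.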