JSAI2025

Presentation information

Organized Session

Organized Session » OS-42

[3F5-OS-42b] OS-42

Thu. May 29, 2025 3:40 PM - 5:20 PM Room F (Room 1001)

オーガナイザ:金子 正弘(MBZUAI),小島 武(東京大学),磯沼 大(The University of Edinburgh/東京大学),丹羽 彩奈(MBZUAI),大葉 大輔(ELYZA/東京科学大学),村上 明子(AIセーフティーインスティチュート),関根 聡(情報学研究所),内山 将夫(情報通信研究機構),Danushka Bollegala(The University of Liverpool/Amazon)

4:20 PM - 4:40 PM

[3F5-OS-42b-03] Enhancing In-Context Defenses against Jailbreaking of Large Language Model via Role Specification

〇Yuki Wakai2, Kunihiro Ito1, Hisashi Kashima2 (1. NEC Corporation, 2. Kyoto University)

Keywords:Large language model, AI safety, Trustworthy AI, Jailbreak, In-context defense

Although large language models (LLMs) have been applied in a wide range of fields in society, ``jailbreak'' attacks that exploit their vulnerability have raised serious security concerns. Jailbreak employs strategic prompts to circumvent their safeguards and induces outputs that are not intended by developers. Jailbreak strategies have rapidly evolved and diversified, making it nearly impossible to comprehensively address them during the training phase. Therefore, growing attention has been paid to in-context defenses, which aim to prevent inappropriate outputs by adding ideal responses or expected behaviors to user inputs. However, these additional prompts degrade the quality of responses in existing methods, for example, by causing refusal of responses even to normal inputs, which presents a significant challenge for practical application. This paper proposes a novel method for in-context defenses called ``Role Specification.'' In experiments using Llama-2-7b-chat, our proposed method (1) demonstrated superior defense performance against jailbreaks without compromising response quality, and (2) accomplished a more favorable trade-off between response quality and defense performance in conjunction with existing methods.

Authentication for paper PDF access
A password is required to view paper PDFs. If you are a registered participant, please log on the site from Participant Log In.
You could view the PDF with entering the PDF viewing password bellow.

Password