JSAI2025

Presentation information

Organized Session


[3L6-OS-32] OS-32

Thu. May 29, 2025 5:40 PM - 7:20 PM Room L (Room 1007)

Organizers: Ryota Takatsuki (AI Alignment Network / The University of Tokyo), Gouki Minegishi (The University of Tokyo), Yosuke Miyanishi (CyberAgent / Japan Advanced Institute of Science and Technology), Yu Takagi (National Institute of Informatics)

5:40 PM - 6:00 PM

[3L6-OS-32-01] Mechanistic Interpretability: A New Trend in Interpretability Research

〇Koshiro Aoki1, Ryota Takatsuki2,3, Gouki Minegishi3 (1. Waseda University, 2. AI Alignment Network, 3. The University of Tokyo)

Keywords: Mechanistic Interpretability, Interpretability, Explainability, AI alignment

Mechanistic Interpretability (MI) is an emerging field that aims to uncover the internal mechanisms of AI systems, particularly deep neural networks. MI seeks to identify not only input-output relationships but also the causal structures within models. With the development of large language models, MI has attracted growing attention from the perspectives of AI safety and reliability. However, the field's rapid growth has led researchers to adopt disparate concepts and methods, leaving it without a unified framework. Moreover, the precise meaning of "mechanistic" remains ambiguous, and the distinction between MI and existing interpretability methods has yet to be clearly established. In this paper, we first survey the historical and cultural background of MI, clarify its differences from traditional interpretability approaches, and propose a conceptual framework that organizes key ideas in MI. Additionally, we discuss MI methods and their limitations, ranging from observational to interventional approaches. Finally, we explore current challenges in MI research and offer directions for future work to understand increasingly complex AI systems and ensure their safety.
