[3Win5-83] Introducing Extrinsic Causality for Mechanistic Interpretability
Keywords: causality, interpretability, integrated gradients
Mechanistic Interpretability (MI) aims at the causal interpretation of Language Models (LMs). Typically, MI studies explore a circuit representing a specific concept by deactivating its components, or mimic an LM's inner workings by replacing its activations with a white-box algorithm. Although these approaches provide valuable insights into an LM's inner workings (intrinsic causality), those inner workings are also shaped by the causality of the problem being addressed, or extrinsic causality. To understand this dual causality, I provide a formal framework for estimating the causal effect of extrinsic causality on intrinsic causality. I then reframe my previous findings in terms of this dual-causality problem: the bias of pretraining on the image-text interaction in three Transformer variants performing hateful memes detection. The results show that the framework can quantify how the pretraining bias affects the entire causal relationship of this problem, as well as a hypothetical causal effect in which only a subset of the attention matrix causes the model output. In conclusion, this paper contributes to the MI community by providing initial findings about a missing component in MI: interaction with the world.
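As context for the keyword "integrated gradients", below is a minimal sketch of how an attribution over an attention matrix might be computed, so that the "subset of the attention matrix" mentioned in the abstract could be scored for its effect on the model output. This is not the paper's implementation: it assumes a PyTorch model exposed as a `forward_fn` that accepts a precomputed attention matrix and returns a scalar (e.g., a hateful-meme logit); the zero baseline and the step count are illustrative choices.

```python
# Hypothetical illustration of Integrated Gradients over an attention matrix;
# forward_fn, the zero baseline, and steps=50 are assumptions, not the paper's code.
import torch

def integrated_gradients(forward_fn, attn, baseline=None, steps=50):
    """Approximate IG_ij = (attn_ij - base_ij) * integral_0^1 dF/d attn_ij dalpha
    with a Riemann sum along the straight-line path from baseline to attn."""
    if baseline is None:
        baseline = torch.zeros_like(attn)  # common baseline choice; an assumption here
    total_grads = torch.zeros_like(attn)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolated point on the path, tracked for gradients.
        point = (baseline + alpha * (attn - baseline)).detach().requires_grad_(True)
        output = forward_fn(point)  # must return a scalar model output
        output.backward()
        total_grads += point.grad
    # Per-entry attribution: path difference times the averaged gradient.
    return (attn - baseline) * total_grads / steps
```

Summing the resulting attribution over a chosen subset of attention entries would give one plausible proxy for that subset's contribution to the output, under the stated assumptions.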