Investigating Gender Bias in Multilingual Large Language Models Using Sparse Auto-Encoders

Tota Abe

7:00 PM - 7:20 PM

[3L6-OS-32-05] Investigating Gender Bias in Multilingual Large Language Models Using Sparse Auto-Encoders

〇Tota Abe¹, Namgi Han¹, Yusuke Miyao¹ (1. Univ. of Tokyo)

Keywords: Sparse Auto-Encoder, Gender Bias, Large Language Model, Mechanistic Interpretability

This research investigates how multilingual Large Language Models (LLMs) encode gender biases in English and Japanese.
It is plausible that gender biases manifest differently depending on the language in which an LLM is trained.
However, it remains to be discovered how multilingual LLMs learn and encode gender biases for different languages.
We extract gender bias features for multiple languages using Sparse Auto-Encoders (SAEs) and examine whether the same features are shared across languages.
More specifically, we give multilingual LLMs gender-stereotypical and anti-gender-stereotypical texts.
We extract interpretable features from neuron activations in the inner layers of the LLMs using SAEs and look for features that fire differently between the two kinds of text.
Then, we compare the feature activations between the English and Japanese cases.
The experimental results indicate that gender bias is encoded in distinct parts of multilingual LLMs depending on the language.
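
The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the SAE weights and layer activations below are random stand-ins, the helper names (`sae_encode`, `differential_features`, `jaccard`) are hypothetical, and the choice of mean absolute activation difference with a top-k cutoff and Jaccard overlap is an assumption made only for illustration.

```python
import numpy as np


def sae_encode(acts, W_enc, b_enc):
    """Encoder half of a sparse auto-encoder: ReLU(acts @ W_enc + b_enc)."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)


def differential_features(stereo_acts, anti_acts, W_enc, b_enc, k):
    """Return the indices of the k SAE features whose mean activation
    differs most between stereotypical and anti-stereotypical inputs."""
    f_stereo = sae_encode(stereo_acts, W_enc, b_enc).mean(axis=0)
    f_anti = sae_encode(anti_acts, W_enc, b_enc).mean(axis=0)
    diff = np.abs(f_stereo - f_anti)
    return set(np.argsort(diff)[-k:].tolist())


def jaccard(a, b):
    """Overlap between two feature sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy demo with synthetic data (assumed shapes; no real model is involved).
rng = np.random.default_rng(0)
d_model, n_features, n_texts, k = 16, 64, 32, 5
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)

# Stand-ins for layer activations on (anti-)stereotypical texts per language.
en_stereo, en_anti = rng.normal(size=(2, n_texts, d_model))
ja_stereo, ja_anti = rng.normal(size=(2, n_texts, d_model))

en_feats = differential_features(en_stereo, en_anti, W_enc, b_enc, k)
ja_feats = differential_features(ja_stereo, ja_anti, W_enc, b_enc, k)
print("English/Japanese feature overlap:", jaccard(en_feats, ja_feats))
```

Under these assumptions, a low overlap score would correspond to bias-related features that differ between the two languages, and a high score to shared features.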

