10:00 AM - 10:20 AM
[3O1-OS-16b-04] Verification of World Model Emergence in Language Models
Internal Representation Analysis with Contribution-based Pruning Using Probes
Keywords: World Model, LLM, Internal Representations, Pruning, Interpretability
The emergence of world models in language models has been an active subject of research. One study showed that Othello-GPT, a language model trained to predict legal moves in Othello, spontaneously acquired an internal representation of the game. That study provided insight into the emergence of world models by intervening on internal representations. In this paper, we use Othello-GPT, probes, and SHapley Additive exPlanations (SHAP), which computes each feature's contribution to a prediction.
Using these methods, we quantified the contribution of each inner-layer neuron to the current state of the Othello board. We then pruned neurons in Othello-GPT in order of their contribution values. The accuracy of legal-move prediction remained higher when pruning began with the lowest-contribution neurons than when it began with the highest-contribution neurons. This result suggests that Othello-GPT relies on these internal representations to predict legal moves.
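The contribution-based pruning comparison described above can be illustrated with a minimal numpy sketch. Here, single-neuron ablation serves as a cheap stand-in for SHAP contribution values, and the toy task, layer sizes, and all variable names are illustrative assumptions, not the paper's actual code or model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a hidden layer: 32 "neurons", of which only the
# first 8 carry signal for a binary prediction task.
n_neurons, n_samples = 32, 500
W = np.zeros(n_neurons)
W[:8] = rng.normal(1.0, 0.2, 8)              # informative neurons
X = rng.normal(size=(n_samples, n_neurons))  # hidden activations
y = (X @ W > 0).astype(int)                  # ground-truth labels

def predict(X, mask):
    """Predict with a pruning mask applied to the hidden layer."""
    return ((X * mask) @ W > 0).astype(int)

# Ablation-based contribution score per neuron (a proxy for SHAP):
# how much does accuracy drop when that neuron alone is zeroed out?
base_acc = (predict(X, np.ones(n_neurons)) == y).mean()
contrib = np.empty(n_neurons)
for i in range(n_neurons):
    mask = np.ones(n_neurons)
    mask[i] = 0.0
    contrib[i] = base_acc - (predict(X, mask) == y).mean()

# Prune half the neurons, starting from the lowest-contribution end
# vs. starting from the highest-contribution end.
order = np.argsort(contrib)
k = 16
low_mask, high_mask = np.ones(n_neurons), np.ones(n_neurons)
low_mask[order[:k]] = 0.0    # prune least-contributing neurons
high_mask[order[-k:]] = 0.0  # prune most-contributing neurons

acc_low = (predict(X, low_mask) == y).mean()
acc_high = (predict(X, high_mask) == y).mean()
print(f"prune-lowest accuracy:  {acc_low:.2f}")
print(f"prune-highest accuracy: {acc_high:.2f}")
```

On this toy task, pruning from the low-contribution end leaves accuracy high while pruning from the high-contribution end degrades it, mirroring the asymmetry the paper reports for Othello-GPT.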