10:00 AM - 10:20 AM
[3O1-OS-16b-04] Verification of World Model Emergence in Language Models
Internal Representation Analysis with Contribution-based Pruning Using Probes
Keywords: World Model, LLM, Internal Representations, Pruning, Interpretability
The emergence of world models in language models has been an active subject of research. One study showed that Othello-GPT, a language model trained to predict legal moves in Othello, spontaneously acquired an internal representation of the game. That study provided insight into the emergence of world models by intervening on internal representations. In this paper, we use Othello-GPT, probes, and SHapley Additive exPlanations (SHAP), which computes each feature's contribution to a prediction.
Using these methods, we quantified the contribution of each inner-layer neuron to the current state of the Othello board. We then pruned neurons in Othello-GPT in order of their contribution values. The accuracy of legal-move prediction remained higher when pruning began with the lowest-contribution neurons than when it began with the highest-contribution neurons. This result suggests that Othello-GPT relies on these internal representations to predict legal moves.
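The contribution-based pruning comparison described above can be illustrated with a minimal numpy sketch. Here, single-neuron ablation serves as a cheap stand-in for SHAP contribution values, and the toy task, layer sizes, and all variable names are illustrative assumptions, not the paper's actual code or model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a hidden layer: 32 "neurons", of which only the
# first 8 carry signal for a binary prediction task.
n_neurons, n_samples = 32, 500
W = np.zeros(n_neurons)
W[:8] = rng.normal(1.0, 0.2, 8)              # informative neurons
X = rng.normal(size=(n_samples, n_neurons))  # hidden activations
y = (X @ W > 0).astype(int)                  # ground-truth labels

def predict(X, mask):
    """Predict with a pruning mask applied to the hidden layer."""
    return ((X * mask) @ W > 0).astype(int)

# Ablation-based contribution score per neuron (a proxy for SHAP):
# how much does accuracy drop when that neuron alone is zeroed out?
base_acc = (predict(X, np.ones(n_neurons)) == y).mean()
contrib = np.empty(n_neurons)
for i in range(n_neurons):
    mask = np.ones(n_neurons)
    mask[i] = 0.0
    contrib[i] = base_acc - (predict(X, mask) == y).mean()

# Prune half the neurons, starting from the lowest-contribution end
# vs. starting from the highest-contribution end.
order = np.argsort(contrib)
k = 16
low_mask, high_mask = np.ones(n_neurons), np.ones(n_neurons)
low_mask[order[:k]] = 0.0    # prune least-contributing neurons
high_mask[order[-k:]] = 0.0  # prune most-contributing neurons

acc_low = (predict(X, low_mask) == y).mean()
acc_high = (predict(X, high_mask) == y).mean()
print(f"prune-lowest accuracy:  {acc_low:.2f}")
print(f"prune-highest accuracy: {acc_high:.2f}")
```

On this toy task, pruning from the low-contribution end leaves accuracy high while pruning from the high-contribution end degrades it, mirroring the asymmetry the paper reports for Othello-GPT.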