[2Win5-34] Disentangling Knowledge Acquisition of LLMs through Direct Corpus Exploration
Keywords: LLM, Knowledge Acquisition
While Large Language Models (LLMs) have demonstrated impressive knowledge acquisition during pre-training, the mechanisms of this process remain poorly understood. Previous research has established a correlation between the frequency of knowledge instances in the training corpus and the degree of knowledge acquisition. However, existing methodologies suffer from two key limitations: insufficient experimental validation of the frequency effect, and inadequate consideration of conflicting knowledge within the training data. To address these gaps, we directly investigate the pre-training corpus to unravel the knowledge acquisition process in LLMs. Our experiments demonstrate that a higher frequency of knowledge leads to more robust knowledge acquisition. Furthermore, we find that conflicting knowledge instances within the corpus affect the degree of knowledge acquisition. Notably, our analysis suggests the existence of latent conflicts that may hinder knowledge acquisition even when no conflict is apparent at the surface level.