5:40 PM - 6:00 PM
[3K6-IS-2c-01] Transforming Low-quality Technical Documents into Narrative Sentences for Adapting LLMs to Niche Technical Domains
Keywords: LLM (Large Language Model), Continual Pretraining, Domain Adaptation of LLM, Pretraining Data Quality, Low-quality Data
We investigate whether Large Language Models (LLMs) can effectively learn from industrial data that is limited in quantity and often lacks the coherent narrative flow found in general-purpose training corpora.
Our objective is to address the distinctive challenges posed by such industrial data, which can hinder domain adaptation.
To do so, we propose a data-quality-based evaluation method that derives question-answer pairs from the training corpus, identifies the source chunk underlying each pair, and labels that chunk as high- or low-quality based on surface features such as structure, repetition, and punctuation. We then measure how much each labeled subset contributes to domain-adapted performance.
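As a rough illustration, a surface-feature labeler of this kind could be sketched as follows. The specific features and thresholds are our own assumptions for the sketch, not the paper's actual criteria:

```python
import re

def label_chunk_quality(chunk: str,
                        min_punct_ratio: float = 0.01,
                        max_repeat_ratio: float = 0.3) -> str:
    """Label a corpus chunk as 'high' or 'low' quality from simple
    surface features (structure, repetition, punctuation).

    Feature set and thresholds are illustrative assumptions only.
    """
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    if not lines:
        return "low"

    # Repetition: fraction of duplicated lines (e.g. boilerplate rows).
    repeat_ratio = 1.0 - len(set(lines)) / len(lines)

    # Punctuation: sentence-ending marks per character; narrative prose
    # has regular sentence boundaries, fragmentary logs and tables do not.
    punct_ratio = len(re.findall(r"[.!?]", chunk)) / max(len(chunk), 1)

    # Structure: many very short lines suggest lists/tables, not prose.
    short_line_ratio = sum(len(ln.split()) < 4 for ln in lines) / len(lines)

    if (repeat_ratio > max_repeat_ratio
            or punct_ratio < min_punct_ratio
            or short_line_ratio > 0.5):
        return "low"
    return "high"
```

With labels assigned this way, each question-answer pair inherits the label of its source chunk, so performance can be broken down by subset.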
Results show that LLMs derive most of their domain knowledge from high-quality data, suggesting that low-quality data is underutilized.
To overcome this limitation, we introduce a multi-step chain-of-thought approach that refines low-quality text into coherent narratives while preserving essential information.
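A minimal sketch of such a multi-step rewrite pipeline, assuming a generic prompt-completion interface, might look like the following. Here `llm` is a hypothetical callable (prompt in, text out), and the step prompts are our assumptions, not the authors' exact chain:

```python
# Each step's output becomes the next step's input, so the chunk is
# first decomposed into facts, then ordered, then rendered as prose.
STEPS = [
    "List every factual statement in the following text, one per line:\n{text}",
    "Group these facts by topic and order them logically:\n{text}",
    "Rewrite the ordered facts as coherent narrative paragraphs, "
    "adding no information that is not in the list:\n{text}",
]

def refine_low_quality_chunk(llm, chunk: str) -> str:
    """Transform a fragmentary chunk into narrative prose step by step."""
    text = chunk
    for template in STEPS:
        text = llm(template.format(text=text))
    return text
```

The intermediate fact-listing step is what aims to preserve essential information while the final step restores narrative flow.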
This transformation significantly boosts performance: win rates rise from 59% to 73% on domain relevance and from 32% to 55% on correctness.
Overall, our findings highlight the importance of data quality and offer a practical strategy for enhancing LLM effectiveness in real-world industrial settings.