JSAI2025

Presentation information

International Session

[3K6-IS-2c] Machine learning

Thu. May 29, 2025 5:40 PM - 7:20 PM Room K (Room 1006)

Chair: 三浦 輝久

5:40 PM - 6:00 PM

[3K6-IS-2c-01] Transforming Low-quality Technical Documents into Narrative Sentences for Adapting LLMs to Niche Technical Domains

〇Ekant Muljibhai Amin1, Yuta Koreeda1, Yasuhiro Sogawa1 (1. Advanced AI Innovation Center, Hitachi, Ltd.)

Keywords: LLM (Large Language Model), Continual Pretraining, Domain adaptation of LLM, Pretraining Data Quality, Low-quality Data

We investigate whether Large Language Models (LLMs) can effectively learn from industrial data that is limited in quantity and often lacks the coherent narrative flow found in general-purpose training corpora.
Our objective is to address the distinctive challenges posed by such industrial data, which can hinder domain adaptation.
To do so, we propose a data-quality-based evaluation method that derives question-answer pairs from the training corpus, identifies the corresponding source chunk for each pair, and labels that chunk as high or low quality based on features such as structure, repetition, and punctuation. We then measure how each labeled subset contributes to domain-adapted performance.
Results show that LLMs derive most of their domain knowledge from high-quality data, suggesting that low-quality data is underutilized.
To overcome this limitation, we introduce a multi-step chain-of-thought approach that refines low-quality text into coherent narratives while preserving essential information.
This transformation significantly boosts performance: domain-relevance win rates increase from 59% to 73%, and correctness from 32% to 55%.
Overall, our findings highlight the importance of data quality and offer a practical strategy for enhancing LLM effectiveness in real-world industrial settings.
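
The evaluation idea in the abstract (label each question's source chunk as high or low quality, then compare performance per label) might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract names structure, repetition, and punctuation as features but does not define them, so the feature definitions, the thresholds, and the function names `quality_features`, `label_chunk`, and `accuracy_by_quality` are all assumptions.

```python
import re
from collections import Counter


def quality_features(chunk: str) -> dict:
    """Compute three simple surface features of a text chunk (assumed definitions)."""
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    words = re.findall(r"\w+", chunk.lower())
    if not lines or not words:
        # Empty chunks get the worst possible value for every feature.
        return {"repetition": 1.0, "terminated": 0.0, "words_per_line": 0.0}
    return {
        # Repetition: share of word tokens that repeat an earlier word.
        "repetition": 1.0 - len(set(words)) / len(words),
        # Punctuation: share of non-empty lines ending like a sentence.
        "terminated": sum(ln.endswith((".", "!", "?")) for ln in lines) / len(lines),
        # Structure: short lines suggest tables, logs, or bullet fragments.
        "words_per_line": len(words) / len(lines),
    }


def label_chunk(chunk: str, max_repetition: float = 0.5,
                min_terminated: float = 0.5,
                min_words_per_line: float = 5.0) -> str:
    """Label a chunk 'high' or 'low' quality using hypothetical thresholds."""
    f = quality_features(chunk)
    is_high = (f["repetition"] <= max_repetition
               and f["terminated"] >= min_terminated
               and f["words_per_line"] >= min_words_per_line)
    return "high" if is_high else "low"


def accuracy_by_quality(records) -> dict:
    """Group (source_chunk, is_correct) pairs by label and return accuracy per label."""
    totals, correct = Counter(), Counter()
    for chunk, is_correct in records:
        label = label_chunk(chunk)
        totals[label] += 1
        correct[label] += int(is_correct)
    return {label: correct[label] / totals[label] for label in totals}
```

Each record pairs the source chunk of a derived question with whether the adapted model answered that question correctly; a large accuracy gap between the two labels would be consistent with the finding that low-quality data is underutilized.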
