5:40 PM - 6:00 PM
[3K6-IS-2c-01] Transforming Low-quality Technical Documents into Narrative Sentences for Adapting LLMs to Niche Technical Domains
Keywords: LLM (Large Language Model), Continual Pretraining, Domain Adaptation of LLM, Pretraining Data Quality, Low-quality Data
We investigate whether Large Language Models (LLMs) can effectively learn from industrial data that is limited in quantity and often lacks the coherent narrative flow found in general-purpose training corpora.
Our objective is to address the distinctive challenges posed by such industrial data, which can hinder domain adaptation.
To do so, we propose a data-quality-based evaluation method that derives question-answer pairs from the training corpus, identifies the source chunk underlying each pair, and labels that chunk as high- or low-quality based on surface features such as structure, repetition, and punctuation. We then measure how much each labeled subset contributes to domain-adapted performance.
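As a rough illustration, a surface-feature labeler of this kind could be sketched as follows. The specific features and thresholds are our own assumptions for the sketch, not the paper's actual criteria:

```python
import re

def label_chunk_quality(chunk: str,
                        min_punct_ratio: float = 0.01,
                        max_repeat_ratio: float = 0.3) -> str:
    """Label a corpus chunk as 'high' or 'low' quality from simple
    surface features (structure, repetition, punctuation).

    Feature set and thresholds are illustrative assumptions only.
    """
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    if not lines:
        return "low"

    # Repetition: fraction of duplicated lines (e.g. boilerplate rows).
    repeat_ratio = 1.0 - len(set(lines)) / len(lines)

    # Punctuation: sentence-ending marks per character; narrative prose
    # has regular sentence boundaries, fragmentary logs and tables do not.
    punct_ratio = len(re.findall(r"[.!?]", chunk)) / max(len(chunk), 1)

    # Structure: many very short lines suggest lists/tables, not prose.
    short_line_ratio = sum(len(ln.split()) < 4 for ln in lines) / len(lines)

    if (repeat_ratio > max_repeat_ratio
            or punct_ratio < min_punct_ratio
            or short_line_ratio > 0.5):
        return "low"
    return "high"
```

With labels assigned this way, each question-answer pair inherits the label of its source chunk, so performance can be broken down by subset.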
Results show that LLMs derive most of their domain knowledge from high-quality data, suggesting that low-quality data is underutilized.
To overcome this limitation, we introduce a multi-step chain-of-thought approach that refines low-quality text into coherent narratives while preserving essential information.
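A minimal sketch of such a multi-step rewrite pipeline, assuming a generic prompt-completion interface, might look like the following. Here `llm` is a hypothetical callable (prompt in, text out), and the step prompts are our assumptions, not the authors' exact chain:

```python
# Each step's output becomes the next step's input, so the chunk is
# first decomposed into facts, then ordered, then rendered as prose.
STEPS = [
    "List every factual statement in the following text, one per line:\n{text}",
    "Group these facts by topic and order them logically:\n{text}",
    "Rewrite the ordered facts as coherent narrative paragraphs, "
    "adding no information that is not in the list:\n{text}",
]

def refine_low_quality_chunk(llm, chunk: str) -> str:
    """Transform a fragmentary chunk into narrative prose step by step."""
    text = chunk
    for template in STEPS:
        text = llm(template.format(text=text))
    return text
```

The intermediate fact-listing step is what aims to preserve essential information while the final step restores narrative flow.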
This transformation significantly boosts performance: win rates rise from 59% to 73% on domain relevance and from 32% to 55% on correctness.
Overall, our findings highlight the importance of data quality and offer a practical strategy for enhancing LLM effectiveness in real-world industrial settings.