JSAI 2025: The 39th Annual Conference of the Japanese Society for Artificial Intelligence

Presentation Information

International Session


[3K6-IS-2c] Machine learning

Thursday, May 29, 2025, 17:40 – 19:20, Room K (Conference Room 1006)

Chair: 三浦 輝久

17:40 – 18:00

[3K6-IS-2c-01] Transforming Low-quality Technical Documents into Narrative Sentences for Adapting LLMs to Niche Technical Domains

〇Ekant Muljibhai Amin1, Yuta Koreeda1, Yasuhiro Sogawa1 (1. Advanced AI Innovation Center, Hitachi, Ltd.)

Keywords: LLM (Large Language Model), Continual Pretraining, Domain adaptation of LLM, Pretraining Data Quality, Low-quality Data

We investigate whether Large Language Models (LLMs) can effectively learn from industrial data that is limited in quantity and often lacks the coherent narrative flow found in general-purpose training corpora.
Our objective is to address the distinctive challenges posed by such industrial data, which can hinder domain adaptation.
To do so, we propose a data-quality-based evaluation method: it derives question–answer pairs from the training corpus, identifies the source chunk each pair comes from, and labels that chunk as high- or low-quality based on features such as structure, repetition, and punctuation. We then measure how each labeled subset contributes to domain-adapted performance.
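The abstract does not specify the exact labeling rules, but a heuristic along these lines can illustrate the idea. The function below is a hypothetical sketch, not the authors' implementation: the feature choices (token repetition rate, sentence-punctuation rate) and the thresholds `max_repetition` and `min_punct_rate` are assumptions standing in for the paper's structure/repetition/punctuation criteria.

```python
import re

def label_chunk_quality(chunk: str,
                        max_repetition: float = 0.3,
                        min_punct_rate: float = 0.005) -> str:
    """Label a text chunk 'high' or 'low' quality from surface features.

    A chunk is 'high' quality when its tokens are mostly non-repeated
    and it contains sentence-ending punctuation at a plausible rate,
    i.e. it reads like narrative prose rather than a log or a table.
    """
    tokens = chunk.split()
    if not tokens:
        return "low"
    # Repetition: fraction of tokens that duplicate an earlier token.
    repetition = 1.0 - len(set(tokens)) / len(tokens)
    # Punctuation rate: sentence-ending marks per character, a rough
    # proxy for coherent sentence structure.
    punct_rate = len(re.findall(r"[.!?]", chunk)) / len(chunk)
    if repetition <= max_repetition and punct_rate >= min_punct_rate:
        return "high"
    return "low"
```

Under these heuristics, a repetitive log fragment such as `"ERR 404 ERR 404 ERR 404"` would be labeled low-quality, while an ordinary prose sentence would be labeled high-quality.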
Results show that LLMs derive most of their domain knowledge from high-quality data, suggesting that low-quality data is underutilized.
To overcome this limitation, we introduce a multi-step chain-of-thought approach that refines low-quality text into coherent narratives while preserving essential information.
This transformation significantly boosts performance: domain-relevance win-rates increase from 59% to 73%, and correctness from 32% to 55%.
Overall, our findings highlight the importance of data quality and offer a practical strategy for enhancing LLM effectiveness in real-world industrial settings.
