JSAI2025

Presentation information

General Session » GS-2 Machine learning

[3S6-GS-2] Machine learning:

Thu. May 29, 2025 5:40 PM - 7:20 PM Room S (Room 701-2)

Chair: 渡邊 千紘 (NTT)

6:20 PM - 6:40 PM

[3S6-GS-2-03] Continuous Japanese Pre-Training for Qwen2.5-32B/7B

〇Shinya Otani1, Kyo Hattori1, Keisuke Fujimoto1, Kentaro Nakanishi1, Tomoki Manabe1, Hiroshi Kiyota1, Shogo Muranushi1, Takuma Kume1, Masafumi Kinoshita1 (1. ABEJA, Inc.)

Keywords:AI, LLM, Generative AI

In this study, we performed Japanese-focused continuous pre-training on the Qwen model series developed by Alibaba Cloud, namely "Qwen2.5-32B-Instruct" and "Qwen2.5-7B-Instruct," and evaluated its effectiveness on Japanese tasks. To balance high performance with parameter sizes that remain practical for real-world applications, we used a mixed Japanese-English dataset of roughly 100 billion tokens for the continuous pre-training. We further applied a ChatVector-based model merging approach to strengthen instruction-following capabilities. Evaluations on MT-Bench-Japanese and ELYZA-tasks-100 showed that the 32B model achieved scores of 8.294 and 4.37, respectively, demonstrating performance comparable to closed-source large language models. The combined benchmark results even surpassed those of Qwen2.5-72B-Instruct, confirming the benefits of Japanese-focused continuous pre-training. On the other hand, some outputs still contain Chinese text, suggesting a residual influence from the ChatVector merge or from the base model's training data. Future work will include removing mixed-language data and applying both domain- and task-specific post-training to further improve performance and address these issues.
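The abstract mentions merging via ChatVector: in that approach, the "chat vector" is the parameter-wise difference between an instruction-tuned model and its base model, which is then added to the continually pre-trained weights to transfer instruction-following ability. The sketch below illustrates the idea with Hugging Face Transformers; the checkpoint path for the Japanese continued-pretraining output, the merge ratio, and any layer exclusions are assumptions, since the abstract does not specify the exact recipe used in this work.

```python
# Minimal sketch of ChatVector-style weight merging (illustrative, not the authors' recipe).
import torch
from transformers import AutoModelForCausalLM

BASE = "Qwen/Qwen2.5-7B"                  # original base model
INSTRUCT = "Qwen/Qwen2.5-7B-Instruct"     # instruction-tuned counterpart
CONTINUED = "path/to/japanese-continued-pretrain"  # hypothetical local checkpoint
RATIO = 1.0                               # assumed scaling of the chat vector

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
inst = AutoModelForCausalLM.from_pretrained(INSTRUCT, torch_dtype=torch.bfloat16)
cont = AutoModelForCausalLM.from_pretrained(CONTINUED, torch_dtype=torch.bfloat16)

# chat vector = instruct weights - base weights; adding it to the continually
# pre-trained model grafts instruction-following onto the Japanese-adapted weights.
with torch.no_grad():
    for p_cont, p_base, p_inst in zip(
        cont.parameters(), base.parameters(), inst.parameters()
    ):
        # Embedding and LM-head layers are often excluded when vocabularies differ;
        # that detail is omitted here for brevity.
        p_cont.add_((p_inst - p_base) * RATIO)

cont.save_pretrained("qwen2.5-7b-ja-chatvector-merged")
```

Because the arithmetic is purely parameter-wise, it requires no additional training, which is why it is attractive as a lightweight way to restore chat behavior after continued pre-training.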
