6:20 PM - 6:40 PM
[3S6-GS-2-03] Continuous Japanese Pre-Training for Qwen2.5-32B/7B
Keywords: AI, LLM, Generative AI
In this study, we conducted Japanese-focused continuous pre-training on the Qwen models developed by Alibaba Cloud, namely “Qwen2.5-32B-Instruct” and “Qwen2.5-7B-Instruct,” and evaluated their effectiveness on Japanese tasks. To balance high performance with parameter sizes that remain feasible for real-world applications, we performed continuous pre-training on a mixed Japanese–English dataset of roughly 100 billion tokens. We further applied a merging approach based on ChatVector to enhance instruction-following capabilities. In evaluations with MT-Bench-Japanese and ELYZA-tasks-100, the 32B model achieved scores of 8.294 and 4.37, respectively, demonstrating performance comparable to closed large language models. The combined benchmark results even surpassed those of Qwen2.5-72B-Instruct, confirming the benefits of Japanese-focused continuous pre-training. On the other hand, some outputs still contain Chinese text, suggesting influence from the ChatVector or from the base model's training data. Future work includes removing mixed-language data and applying both domain- and task-specific post-training to further improve performance and address these issues.
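For illustration, the following is a minimal sketch of a ChatVector-style merge under the assumption of Hugging Face Transformers checkpoints; the model identifiers, the path to the continually pre-trained model, and the scaling factor alpha are illustrative assumptions, not the authors' exact configuration.

import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoints: official Qwen2.5 base/instruct models plus a
# hypothetical Japanese continually pre-trained (CPT) model.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16)
instruct = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16)
cpt = AutoModelForCausalLM.from_pretrained("path/to/japanese-cpt-qwen2.5-7b", torch_dtype=torch.bfloat16)

alpha = 1.0  # assumed scaling factor for the chat vector

base_sd = base.state_dict()
instruct_sd = instruct.state_dict()
cpt_sd = cpt.state_dict()

# Chat vector = (instruct - base); adding it to the continually pre-trained
# weights transfers instruction-following behaviour to the merged model.
merged_sd = {}
for name, weight in cpt_sd.items():
    if name in base_sd and name in instruct_sd and weight.shape == base_sd[name].shape:
        merged_sd[name] = weight + alpha * (instruct_sd[name] - base_sd[name])
    else:
        # Parameters with mismatched shapes (e.g. resized embeddings) are kept as-is.
        merged_sd[name] = weight

cpt.load_state_dict(merged_sd)
cpt.save_pretrained("merged-japanese-qwen2.5-7b-instruct")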