10:00 AM - 10:20 AM
[3G1-GS-6-04] Development of a Large Language Model Emphasizing Japanese Dialogue and Text Generation Performance
Report on “Tanuki” LLM Development Project Through Public Recruitment and Open Collaboration
Keywords: LLM, Synthetic data, GENIAC
In recent years, large language models (LLMs) have advanced rapidly worldwide, underscoring the growing importance of building LLM development capabilities within Japan. This paper presents an LLM development project led by the Matsuo and Iwasawa Lab as part of the GENIAC project, whose primary goal is to foster domestic expertise and reinforce national development capacity. Volunteers recruited from the public worked with the lab to build 8B and 8×8B models from scratch. When we began our research in April 2024, domestically developed models still faced challenges in dialogue and text generation. To address this, our approach focused on improving dialogue and composition performance through synthetic data. Evaluations using the widely recognized “Japanese MT-Bench” indicated that our 8B model surpassed existing 10B-class models, while our 8×8B model performed on par with GPT-3.5, placing it at the forefront of domestically developed LLMs. Both models and their training code have been released under the Apache License 2.0, contributing to academic research and industrial applications of Japanese LLMs.