Development of a Large Language Model Emphasizing Japanese Dialogue and Text Generation Performance

○Katsuhiko Nishizawa

10:00 AM - 10:20 AM

[3G1-GS-6-04] Development of a Large Language Model Emphasizing Japanese Dialogue and Text Generation Performance

Report on “Tanuki” LLM Development Project Through Public Recruitment and Open Collaboration

○Katsuhiko Nishizawa¹, Kan Hatakeyama², Takao Mori³, Minami Someya⁴, Yasushi Nishijima, Kazutaka Nishimae⁵, Susumu Ota⁶, Keno Harada⁷, Yohei Kobashi⁷, Takeshi Kojima⁷, Yusuke Iwasawa⁷, Yutaka Matsuo⁷ (1. Panasonic Holdings Corporation, 2. Institute of Science Tokyo, 3. Denso Corporation, 4. INSTITUTE of INFORMATION SECURITY, 5. Cross-Industrial Data Science Laboratories, 6. Tokyo University of Technology, 7. The University of Tokyo)

Keywords: LLM, Synthetic data, GENIAC

In recent years, large language models (LLMs) have advanced rapidly worldwide, making it increasingly important to cultivate development capabilities within Japan. This paper reports on an LLM development project led by the Matsuo and Iwasawa Lab as part of the GENIAC project, whose primary goal is to foster domestic expertise and strengthen national development capacity. Volunteers recruited from the public worked with the lab to build 8B and 8×8B models from scratch. When the project began in April 2024, domestically developed models still faced challenges in dialogue and text generation. Our approach therefore focused on improving dialogue and composition through synthetic data. Evaluations on the widely used “Japanese MT-Bench” indicated that our 8B model surpassed existing 10B-class models, while our 8×8B model performed on par with GPT-3.5, placing it at the forefront of domestically developed LLMs. Both models and their training code have been released under the Apache License 2.0, contributing to academic research and industrial applications of Japanese LLMs.


Presentation information

[3G1-GS-6] Language media processing:
