JSAI2024


[2O1-GS-3] Knowledge utilization and sharing

Wed. May 29, 2024 9:00 AM - 10:40 AM Room O (Music studio hall)

Chair: Kai Ishikawa (NEC Corporation) [Online]

9:20 AM - 9:40 AM

[2O1-GS-3-02] KOGITUNE: Distributed Dataset Framework for Training Large Language Models

〇Nao Souma1, Momoka Obara1, Kimio Kuramitsu1, Takahiro Katagiri2, Yasuhiko Yokote3, Yutaka Ishikawa4 (1. Japan Women’s University, 2. Nagoya University, 3. RIKEN, 4. National Institute of Informatics)

Keywords: Large Language Models, Training Framework

The performance of large language models depends on massive, high-quality preprocessed datasets, often exceeding hundreds of gigabytes. Because building datasets of this scale is difficult for a single organization, a distributed framework spanning multiple organizations is needed. KOGITUNE has been designed to facilitate the training of Large Language Models (LLMs) with such distributed datasets. The main concept is to perform dataset preprocessing and tensorization independently on external machines and deliver the results on demand to the GPU side, so that high GPU utilization is maintained during training. KOGITUNE also includes practical features such as adjusting the mixing ratios of multiple corpora. This paper presents the design and implementation of KOGITUNE and reports on the experience of developing LLMs ranging from 0.06B to 1.3B parameters with it.
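As a rough illustration of this producer-consumer idea, the sketch below prefetches already-tensorized blocks into a bounded buffer on a background thread and samples among corpora by configurable mixing ratios. This is not the actual KOGITUNE API, which the abstract does not show; the names MixedCorpusLoader and fake_corpus are hypothetical, and real delivery would happen over the network from external machines rather than from in-process generators.

```python
import queue
import random
import threading


class MixedCorpusLoader:
    """Streams pre-tensorized training blocks from several corpora.

    A background thread keeps a bounded buffer filled, choosing the next
    source corpus by the configured mixing ratios, so the consumer (the
    GPU side, in the setting the abstract describes) never blocks on
    preprocessing or I/O.
    """

    def __init__(self, corpora, ratios, prefetch=8):
        self.corpora = corpora          # name -> iterator of tensorized blocks
        self.names = list(corpora)
        self.weights = [ratios[n] for n in self.names]
        self.buffer = queue.Queue(maxsize=prefetch)
        threading.Thread(target=self._produce, daemon=True).start()

    def _produce(self):
        while True:
            # Sample the next source corpus according to the mixing ratios.
            name = random.choices(self.names, weights=self.weights)[0]
            try:
                self.buffer.put(next(self.corpora[name]))
            except StopIteration:
                self.buffer.put(None)   # sentinel: stop at first exhausted corpus
                return

    def __iter__(self):
        while (block := self.buffer.get()) is not None:
            yield block


def fake_corpus(tag, n_blocks):
    """Hypothetical stand-in for a shard of already-tensorized token blocks."""
    for i in range(n_blocks):
        yield [f"{tag}-{i}"] * 4


loader = MixedCorpusLoader(
    {"web": fake_corpus("web", 7), "code": fake_corpus("code", 3)},
    ratios={"web": 0.7, "code": 0.3},
)
for block in loader:
    print(block[0])   # consumer loop; a real trainer would feed these to the GPU
```

The point of the pattern is that producer and consumer are decoupled by the buffer: as long as the external preprocessing side keeps the buffer non-empty, GPU throughput is independent of preprocessing and transfer speed.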
