9:20 AM - 9:40 AM
[2O1-GS-3-02] KOGITUNE: Distributed Dataset Framework for Training Large Language Models
Keywords: Large Language Models, Training Framework
The performance of large language models depends on massive datasets, often exceeding hundreds of gigabytes, that must be preprocessed to high quality. Because it is difficult for a single organization to build datasets at this scale, a distributed framework spanning multiple organizations is needed. KOGITUNE is designed to facilitate the training of large language models (LLMs) with such distributed datasets. Its main concept is to perform dataset preprocessing and tensorization independently on external machines and then deliver the results on demand to the GPU side, aiming for high GPU utilization during training. KOGITUNE also provides practical features, such as the ability to adjust the mixing ratios of multiple corpora. This paper presents the design and implementation of KOGITUNE and reports our experience developing LLMs ranging from 0.06B to 1.3B parameters with it.
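To make the core idea concrete, the following is a minimal, hypothetical sketch of on-demand delivery with corpus mixing, not the actual KOGITUNE API. All names (`RemoteCorpus`, `fetch_block`, `mixed_stream`) are illustrative assumptions; it assumes corpora were already tensorized into fixed-length token blocks on external machines, and shows how a GPU-side loader might stream those blocks while honoring per-corpus mixing ratios.

```python
# Hypothetical sketch of the abstract's design, NOT the real KOGITUNE API.
# Assumes preprocessing/tensorization already happened on external machines.
import random
from typing import Iterator, List

import numpy as np


class RemoteCorpus:
    """Stands in for a remote store of pre-tensorized token blocks.

    In a real system each block would be fetched over the network from the
    preprocessing side; here blocks are fabricated locally for illustration.
    """

    def __init__(self, name: str, block_len: int = 512) -> None:
        self.name = name
        self.block_len = block_len

    def fetch_block(self) -> np.ndarray:
        # Placeholder for an on-demand fetch of one tensorized block.
        return np.random.randint(0, 32_000, size=self.block_len, dtype=np.int32)


def mixed_stream(corpora: List[RemoteCorpus],
                 ratios: List[float],
                 seed: int = 0) -> Iterator[np.ndarray]:
    """Yield token blocks, sampling each corpus according to its mixing ratio."""
    rng = random.Random(seed)
    total = sum(ratios)
    weights = [r / total for r in ratios]
    while True:
        corpus = rng.choices(corpora, weights=weights, k=1)[0]
        yield corpus.fetch_block()


if __name__ == "__main__":
    # e.g. 70% web text, 30% code, streamed block by block to the trainer,
    # so the GPU never waits on preprocessing.
    stream = mixed_stream([RemoteCorpus("web"), RemoteCorpus("code")],
                          ratios=[0.7, 0.3])
    for _ in range(3):
        block = next(stream)
        print(block.shape, block.dtype)
```

Under this reading, decoupling tensorization from training means the GPU side only ever performs cheap block fetches, which is how the framework could sustain high utilization.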