[1Win4-67] Event-Driven GPUDirect Inference for Reducing Overhead in Inference Serving
Keywords: Inference Serving System, GPUDirect RDMA, DOCA
In this study, we developed a novel event-driven streaming GPU computing system that integrates DOCA GPUNetIO and CUDA Graph to support an AI-driven cyber-physical system operating on NTT's next-generation data center infrastructure (IOWN). The goal is to enable concurrent execution of multiple models while minimizing latency overhead and GPU power consumption. Compared with existing methods, the proposed approach reduces inference overhead by 20% and increases throughput by 173.2%. Furthermore, by employing event-driven inference, our system can process inference requests for up to five models simultaneously without resource contention.