[1Win4-67] Event-Driven GPUDirect Inference for Reducing Overhead in Inference Serving
Keywords: Inference Serving System, GPUDirect RDMA, DOCA
In this study, we developed a novel event-driven streaming GPU computing system that integrates DOCA GPUNetIO and CUDA Graph to support an AI-driven cyber-physical system operating on NTT's next-generation data center infrastructure (IOWN). The goal is to enable concurrent execution of multiple models while minimizing latency overhead and GPU power consumption. Compared with existing methods, the proposed approach reduces inference overhead by 20% and increases throughput by 173.2%. Furthermore, by employing event-driven inference, our system can process inference requests for up to five models simultaneously without resource contention.