How the CUDA Runtime Works
Suppose Process A is an application that executes GPU tasks.
When Process A calls the CUDA Runtime API to launch work on the GPU, the CUDA Runtime implicitly creates a separate GPU context for that process. The GPU scheduler then selects a context and executes its kernels.
- Kernels within the same context can run concurrently, for example when they are launched on different CUDA streams (see the sketch after this list).
- Kernels from different contexts are time-sliced by the scheduler, so they execute sequentially, not concurrently.
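To make the first bullet concrete, here is a minimal CUDA sketch (the kernel name `spin` and the launch sizes are illustrative, not from the original text): two kernels launched on separate streams belong to one process's context, so the GPU may overlap them, whereas launching both on the default stream would serialize them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Busy-loop kernel so the overlap window is long enough to observe.
__global__ void spin(float *out, int iters) {
    float v = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main() {
    float *a, *b;
    cudaMalloc(&a, 1024 * sizeof(float));
    cudaMalloc(&b, 1024 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same process => same context: these two launches are eligible
    // to execute concurrently on the GPU (resources permitting).
    spin<<<4, 256, 0, s1>>>(a, 1 << 20);
    spin<<<4, 256, 0, s2>>>(b, 1 << 20);

    cudaDeviceSynchronize();
    printf("done\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```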
Why MPS is Needed
In a multi-process application with GPU-bound workloads, each process creates its own context if MPS is not used. Since kernels from different contexts are scheduled sequentially, overall GPU utilization and performance can be far lower than expected. The sketch below shows this no-MPS case.
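A hedged sketch of that situation, assuming a POSIX system: the parent forks two workers before any CUDA call (a CUDA context does not survive `fork`, so each child must initialize CUDA itself), which gives each child its own context. Without MPS their kernels are time-sliced rather than overlapped. The `spin` kernel is the same illustrative one as above.

```cuda
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void spin(float *out, int iters) {
    float v = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

static void worker() {
    float *buf;
    cudaMalloc(&buf, 1024 * sizeof(float));  // first CUDA call: creates this
                                             // process's own context
    spin<<<4, 256>>>(buf, 1 << 20);
    cudaDeviceSynchronize();
    cudaFree(buf);
}

int main() {
    for (int i = 0; i < 2; ++i) {
        if (fork() == 0) { worker(); return 0; }  // child: separate context
    }
    while (wait(nullptr) > 0) {}                  // parent: reap children
    return 0;
}
```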
What MPS Does
MPS (Multi-Process Service) addresses this limitation by creating a single global context that multiple processes can share.
Each process can still use the same CUDA Runtime API without any code changes. Instead of directly creating its own context, the process connects to the MPS server. The MPS server:
- Intercepts CUDA Runtime/Driver API calls
- Manages the global context
- Allows kernels from different processes to execute concurrently on the GPU
This results in significantly better performance for GPU-heavy multi-process workloads, as the sketch below illustrates.
(from the NVIDIA MPS documentation)
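To illustrate "no code changes", here is a minimal sketch of an ordinary CUDA client, with the usual MPS control sequence in its header comment. The `nvidia-cuda-mps-control` daemon and the `CUDA_MPS_PIPE_DIRECTORY`/`CUDA_MPS_LOG_DIRECTORY` variables are documented by NVIDIA; the specific paths and the binary name `multi_process_app` are assumptions for illustration.

```cuda
// The MPS client is ordinary CUDA code: no API changes are needed. What
// changes is the environment the process runs in. A typical sequence
// (directory paths are illustrative, not required values):
//
//   export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe   # where clients find the server
//   export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
//   nvidia-cuda-mps-control -d                     # start the MPS control daemon
//   ./multi_process_app                            # same binary as before
//   echo quit | nvidia-cuda-mps-control            # shut the daemon down
//
// With the daemon running, each process's CUDA Runtime connects to the MPS
// server instead of creating a private context, so kernels from different
// processes can overlap on the GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Any CUDA call triggers context setup; under MPS this transparently
    // attaches the process to the server-managed shared context.
    cudaFree(0);
    int dev = -1;
    cudaGetDevice(&dev);
    printf("CUDA initialized on device %d\n", dev);
    return 0;
}
```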