How the CUDA Runtime Works
Suppose Process A is an application that executes GPU tasks.
When Process A calls the CUDA Runtime API to launch work on the GPU, the CUDA Runtime implicitly creates a separate GPU context for that process. The GPU scheduler then selects a context and executes its kernels.
- Kernels within the same context can run concurrently, for example when they are launched on different CUDA streams (see the sketch after this list).
- Kernels from different contexts are time-sliced by the scheduler, so they execute sequentially, not concurrently.
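To make the first bullet concrete, here is a minimal CUDA sketch (the kernel name `spin` and the launch sizes are illustrative, not from the original text): two kernels launched on separate streams belong to one process's context, so the GPU may overlap them, whereas launching both on the default stream would serialize them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Busy-loop kernel so the overlap window is long enough to observe.
__global__ void spin(float *out, int iters) {
    float v = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main() {
    float *a, *b;
    cudaMalloc(&a, 1024 * sizeof(float));
    cudaMalloc(&b, 1024 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same process => same context: these two launches are eligible
    // to execute concurrently on the GPU (resources permitting).
    spin<<<4, 256, 0, s1>>>(a, 1 << 20);
    spin<<<4, 256, 0, s2>>>(b, 1 << 20);

    cudaDeviceSynchronize();
    printf("done\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```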
Why MPS is Needed
In a multi-process application with GPU-bound workloads, each process creates its own context if MPS is not used. Since kernels from different contexts are scheduled sequentially, overall GPU utilization and performance can be far lower than expected. The sketch below shows this no-MPS case.
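A hedged sketch of that situation, assuming a POSIX system: the parent forks two workers before any CUDA call (a CUDA context does not survive `fork`, so each child must initialize CUDA itself), which gives each child its own context. Without MPS their kernels are time-sliced rather than overlapped. The `spin` kernel is the same illustrative one as above.

```cuda
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void spin(float *out, int iters) {
    float v = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

static void worker() {
    float *buf;
    cudaMalloc(&buf, 1024 * sizeof(float));  // first CUDA call: creates this
                                             // process's own context
    spin<<<4, 256>>>(buf, 1 << 20);
    cudaDeviceSynchronize();
    cudaFree(buf);
}

int main() {
    for (int i = 0; i < 2; ++i) {
        if (fork() == 0) { worker(); return 0; }  // child: separate context
    }
    while (wait(nullptr) > 0) {}                  // parent: reap children
    return 0;
}
```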
What MPS Does
MPS (Multi-Process Service) addresses this limitation by creating a single global context that multiple processes can share.
Each process can still use the same CUDA Runtime API without any code changes. Instead of directly creating its own context, the process connects to the MPS server. The MPS server:
- Intercepts CUDA Runtime/Driver API calls
- Manages the global context
- Allows kernels from different processes to execute concurrently on the GPU
This results in significantly better performance for GPU-heavy multi-process workloads, as the sketch below illustrates.
(from the NVIDIA MPS documentation)
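To illustrate "no code changes", here is a minimal sketch of an ordinary CUDA client, with the usual MPS control sequence in its header comment. The `nvidia-cuda-mps-control` daemon and the `CUDA_MPS_PIPE_DIRECTORY`/`CUDA_MPS_LOG_DIRECTORY` variables are documented by NVIDIA; the specific paths and the binary name `multi_process_app` are assumptions for illustration.

```cuda
// The MPS client is ordinary CUDA code: no API changes are needed. What
// changes is the environment the process runs in. A typical sequence
// (directory paths are illustrative, not required values):
//
//   export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe   # where clients find the server
//   export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
//   nvidia-cuda-mps-control -d                     # start the MPS control daemon
//   ./multi_process_app                            # same binary as before
//   echo quit | nvidia-cuda-mps-control            # shut the daemon down
//
// With the daemon running, each process's CUDA Runtime connects to the MPS
// server instead of creating a private context, so kernels from different
// processes can overlap on the GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Any CUDA call triggers context setup; under MPS this transparently
    // attaches the process to the server-managed shared context.
    cudaFree(0);
    int dev = -1;
    cudaGetDevice(&dev);
    printf("CUDA initialized on device %d\n", dev);
    return 0;
}
```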