Preface: TensorRT is NVIDIA’s general-purpose inference SDK that compiles and optimizes a wide variety of AI models (CNNs, computer vision, traditional neural networks) to run as fast as possible on NVIDIA GPUs.
TensorRT-LLM is a specialized, open-source library built on top of TensorRT specifically tailored to optimize and execute Large Language Models (LLMs).
Background: How the Diagram Corresponds to the Vulnerability
The diagram maps out how improper memory management between the host (CPU) and device (GPU) exposes a system to this flaw:
- Static Buffer Allocation: Step #3 allocates a rigid GPU memory space using cuda.mem_alloc(input_data.nbytes). This sets up a buffer size based entirely on the initial shape of the input_data.
- Untrusted Runtime Input: As shown in text boxes 3 and 4, if a remote attacker sends a maliciously crafted input that modifies the shape or size at runtime, the application fails to recalculate the allocation bounds.
- Out-of-Bounds Copy: When Step #4 (cuda.memcpy_htod) executes, it forces the larger data stream into the pre-allocated smaller buffer. This overflows the boundary and writes data directly into adjacent GPU memory locations, causing a classic CWE-787 Out-of-bounds Write.
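The size mismatch described in the three steps above can be worked through with plain arithmetic. The following is a minimal sketch, not code from the advisory: the shapes, the float32 element size, and the helper name tensor_nbytes are assumptions chosen for illustration, and no GPU is touched.

```python
from math import prod

FLOAT32_BYTES = 4  # assumed element size; the original does not state a dtype


def tensor_nbytes(shape, itemsize=FLOAT32_BYTES):
    """Byte size of a dense tensor with the given shape (hypothetical helper)."""
    return prod(shape) * itemsize


# Hypothetical shapes: the buffer is sized once for the initial input,
# then a later request arrives with a larger batch dimension.
initial_shape = (1, 3, 224, 224)
runtime_shape = (4, 3, 224, 224)

allocated = tensor_nbytes(initial_shape)   # what cuda.mem_alloc would have reserved
requested = tensor_nbytes(runtime_shape)   # what the unchecked copy would write
overflow = max(0, requested - allocated)   # bytes landing past the buffer end

print(f"allocated={allocated} requested={requested} overflow={overflow}")
```

With these assumed shapes the copy would write roughly three times the buffer's size beyond its end, which is the out-of-bounds region the third bullet refers to.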
Remediations
- Update the Software: NVIDIA released an advisory specifying that upgrading to TensorRT v10.16.1 or newer mitigates these risks.
- Input Boundary Checks: Always strictly validate input dimensions before initiating data copies to device memory.
- Leverage Native Profiles: If deploying models with varying input dimensions, use TensorRT’s built-in optimization profiles for dynamic shapes rather than manually copying data through raw device pointers without size verification.
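The boundary-check and profile-bounds remediations above could be combined roughly as follows. This is a hedged sketch under stated assumptions: validate_shape and checked_copy are hypothetical helper names, the min/max bounds stand in for whatever range an optimization profile declares, and copy_fn is a placeholder for the real transfer call (for example cuda.memcpy_htod in PyCUDA). The sketch itself uses only the standard library so the logic can be exercised without a GPU.

```python
def validate_shape(shape, min_shape, max_shape):
    """Reject any shape outside the per-dimension [min, max] bounds.

    Hypothetical helper; the bounds are assumed to mirror the range the
    deployment's optimization profile was built with.
    """
    if not (len(shape) == len(min_shape) == len(max_shape)):
        raise ValueError(f"rank mismatch: got {len(shape)} dimensions")
    for axis, (dim, lo, hi) in enumerate(zip(shape, min_shape, max_shape)):
        if not isinstance(dim, int) or not lo <= dim <= hi:
            raise ValueError(f"dimension {axis}={dim!r} outside [{lo}, {hi}]")


def checked_copy(copy_fn, dest, dest_capacity, src, src_nbytes):
    """Run copy_fn(dest, src) only if src fits in the destination buffer.

    dest_capacity is the byte count passed to the original allocation;
    copy_fn is an assumed stand-in for the host-to-device copy.
    """
    if src_nbytes > dest_capacity:
        raise ValueError(
            f"refusing copy: {src_nbytes} bytes into a {dest_capacity}-byte buffer"
        )
    copy_fn(dest, src)


if __name__ == "__main__":
    # Usage with a fake copy function so the example runs anywhere.
    log = []
    validate_shape((2, 3, 224, 224), (1, 3, 224, 224), (8, 3, 224, 224))
    checked_copy(lambda d, s: log.append((d, len(s))), "dev_buf", 16, b"\x00" * 16, 16)
    try:
        checked_copy(lambda d, s: log.append((d, len(s))), "dev_buf", 16, b"\x00" * 32, 32)
    except ValueError as exc:
        print("blocked:", exc)
```

Rejecting the request before the copy, rather than reallocating silently, keeps the failure visible; whether to reallocate instead is a design choice that depends on the deployment.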
Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5836