Category Archives: AI and ML

CVE-2026-24162 – About NVIDIA Merlin Transformers4Rec for Linux platform  (1st Jun 2026)

Preface: Data engineers perform seamless preprocessing, a foundational stage where they gather messy, raw data from diverse sources, clean it (handling missing values, outliers, inconsistencies), integrate disparate datasets, and transform it into a unified, structured format, making it ready and reliable for data scientists to perform advanced feature engineering (creating new, meaningful features) and ultimately build better machine learning models. This ensures a high-quality, consistent input, preventing “garbage in, garbage out” for the modeling phase.

Background: NVIDIA Merlin relies directly on RAPIDS cuDF to handle high-performance, GPU-accelerated dataframe operations for recommender systems. The specific ecosystem library used for this within Merlin is NVTabular. Together, NVTabular and RAPIDS (cuDF/cuML) cover preprocessing and feature engineering.

For example, you can load interaction data into cuDF, feed it through a Merlin processing pipeline, and extract the resulting GPU data arrays to train a cuML machine learning model.

cuML is a suite of GPU-accelerated machine learning algorithms and mathematical primitives within the NVIDIA RAPIDS ecosystem, designed to act as a fast, drop-in replacement for Scikit-learn. It allows data scientists to achieve 10-50x faster training times on large datasets by leveraging GPU parallelism.

Where do serialization risks actually arise in cuML?

An “improper deserialization of untrusted data” vulnerability (like those involving Python’s pickle module) only occurs if you later attempt to load a previously saved model or object from an unknown or unverified source.

To avoid this class of vulnerability, NVIDIA and the broader ML ecosystem recommend moving away from arbitrary Python object pickling. Instead, systems should use:

• Safetensors: For saving deep learning model weights safely, since the format stores only tensor data and metadata and offers no code execution pathway.

• ONNX: For a standardized, non-executable model format.
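
To make the risk concrete, here is a minimal sketch using only Python's standard pickle module. The Payload class, the AllowlistUnpickler class, and safe_loads are hypothetical names for illustration and are not Transformers4Rec or cuML code. The payload calls the harmless os.getcwd; a real attacker would name something destructive instead.

```python
import io
import os
import pickle


class Payload:
    """Hypothetical stand-in for attacker-supplied data."""

    def __reduce__(self):
        # pickle calls this callable at load time. os.getcwd is harmless,
        # but an attacker could just as easily name os.system.
        return (os.getcwd, ())


class AllowlistUnpickler(pickle.Unpickler):
    """Rejects every global that is not explicitly allowlisted."""

    ALLOWED = set()  # plain containers and scalars need no globals at all

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)


def safe_loads(data: bytes):
    """Load a pickle while refusing any reference to importable code."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


blob = pickle.dumps(Payload())

# Plain pickle.loads runs the embedded callable during deserialization.
unsafe_result = pickle.loads(blob)

# The restricted loader refuses the same bytes.
try:
    safe_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

# Data-only pickles still load, because they reference no globals.
plain = safe_loads(pickle.dumps({"weights": [0.1, 0.2]}))
```

An allowlisting loader narrows the attack surface but is easy to get wrong, which is why data-only formats remain the safer default for files from unknown sources.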

Vulnerability details: CVE-2026-24162 NVIDIA Transformers4Rec for Linux contains a vulnerability where an attacker could cause improper deserialization of untrusted data. A successful exploit of this vulnerability might lead to code execution, data tampering, and information disclosure.

Official announcement: Please refer to the link for details – https://nvd.nist.gov/vuln/detail/CVE-2026-24162

CVE-2026-24212: NVIDIA Isaac Launchable contains a vulnerability (29th May 2026)

Preface: The primary purpose of Isaac Launchable is to provide a turn-key, web-browser-based cloud setup via NVIDIA Brev for developers who lack local hardware. Tesla operates its own multi-billion-dollar on-premise supercomputers (like the Tesla Dojo cluster and massive custom NVIDIA H100/H200 data centers). They do not need a standardized, plug-and-play browser template to rent individual cloud GPUs. Tesla utilizes NVIDIA Isaac Sim—a robotics simulation and synthetic data generation platform—for developing and training its AI-powered robots.

Background: The core design objective of the isaac-launchable project (commonly referred to as “Launchable”) is to democratize and simplify access to NVIDIA’s heavy-duty robotics simulation tools by removing local hardware barriers and complex installation configurations. In an Isaac Launchable cloud environment (running inside the NVIDIA Brev container ecosystem), control commands are sent to a robot within a script executed inside the cloud-hosted VS Code terminal. The command pipeline relies on Isaac Lab and the Omniverse Physics Engine (PhysX). The cloud python script computes the robot’s target state (e.g., target joint positions, velocities, or joint efforts) and writes them directly to the simulation’s articulation buffers.

Instead of fighting for “market share” against other companies, Isaac Launchable competes with traditional local setups.

•Traditional Method: Manual Docker and local container workflows (e.g., standard ROS 2 setups on native Linux machines).

•Launchable Method: Zero-friction cloud deployment. Its “market share” is growing rapidly among researchers, universities, and agile startups who do not have the capital to purchase dedicated $10,000+ RTX enterprise workstations but need immediate access to physics training environments.

Vulnerability details: According to the NVIDIA Security Advisory, CVE-2026-24212 is specifically classified as CWE-319 (Cleartext Transmission of Sensitive Information) within the NVIDIA Isaac Launchable component for Linux.

  • The vulnerable mechanism: The issue lies within the background communication channel or telemetry transit layer managed by the isaac-launchable utility itself. It transmits internal credentials, API keys, or security tokens in unencrypted plaintext over the network.
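
As a general defence against CWE-319, client code can refuse to attach credentials to any request that would not travel over TLS. The sketch below shows that idea only; build_auth_headers is a hypothetical helper and says nothing about how isaac-launchable is implemented.

```python
from urllib.parse import urlsplit


def build_auth_headers(url: str, token: str) -> dict:
    """Return auth headers only when the target URL uses TLS."""
    scheme = urlsplit(url).scheme.lower()
    if scheme != "https":
        # Sending the token here would expose it in cleartext on the wire.
        raise ValueError(f"refusing to send credentials over {scheme or 'unknown'}://")
    return {"Authorization": f"Bearer {token}"}
```

Failing closed at the point where the header is built keeps a misconfigured endpoint from silently leaking a token.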

Official announcement: Please refer to the link for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5830

CVE-2026-24188: About NVIDIA TensorRT (26th May 2026)

Preface: TensorRT is NVIDIA’s general-purpose inference SDK that compiles and optimizes a wide variety of AI models (CNNs, computer vision, traditional neural networks) to run as fast as possible on NVIDIA GPUs.

TensorRT-LLM is a specialized, open-source library built on top of TensorRT specifically tailored to optimize and execute Large Language Models (LLMs).

Background: How the diagram corresponds to the vulnerability

The diagram maps out how improper memory management between the host (CPU) and device (GPU) exposes a system to this flaw:

  1. Static Buffer Allocation: Step #3 allocates a rigid GPU memory space using cuda.mem_alloc(input_data.nbytes). This sets up a buffer size based entirely on the initial shape of the input_data.
  2. Untrusted Runtime Input: As shown in text boxes 3 and 4, if a remote attacker sends a maliciously crafted input that modifies the shape or size at runtime, the application fails to recalculate the allocation bounds.
  3. Out-of-Bounds Copy: When Step #4 (cuda.memcpy_htod) executes, it forces the larger data stream into the pre-allocated smaller buffer. This overflows the boundary and writes data directly into adjacent GPU memory locations, causing a classic CWE-787 Out-of-bounds Write.

Remediations

  • Update the Software: NVIDIA released an advisory specifying that upgrading to TensorRT v10.16.1 or newer mitigates these risks.
  • Input Boundary Checks: Always strictly validate input dimensions before initiating data copies to device memory.
  • Leverage Native Profiles: If deploying models with varying input dimensions, use TensorRT’s built-in optimization profiles for dynamic shapes rather than manually overriding raw host-to-device pointers without size verification.
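
The input boundary check can be sketched without a GPU. In the example below, DeviceBuffer is a hypothetical stand-in for a fixed-size device allocation and checked_copy_to_device plays the role of a guarded host-to-device copy. Neither is a TensorRT or PyCUDA API.

```python
class DeviceBuffer:
    """Hypothetical stand-in for a GPU allocation with a fixed byte size."""

    def __init__(self, nbytes: int):
        self.nbytes = nbytes
        self.data = bytearray(nbytes)


def checked_copy_to_device(dst: DeviceBuffer, src: bytes) -> None:
    """Copy src into dst only if it fits inside the allocation."""
    if len(src) > dst.nbytes:
        # Without this check, a larger runtime input would spill past the buffer.
        raise ValueError(f"input is {len(src)} bytes but buffer holds {dst.nbytes}")
    dst.data[: len(src)] = src


buf = DeviceBuffer(8)                      # sized from the initial input shape
checked_copy_to_device(buf, b"\x01" * 8)   # same size: accepted
```

The point is that the size comparison uses the size of the data actually received at runtime, not the size assumed when the buffer was allocated.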

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5836

CVE-2025-33255: About NVIDIA TensorRT-LLM (22nd May 2026)

Preface: DeepSpeed MII, an open-source Python library developed by Microsoft, aims to make powerful model inference accessible, emphasizing high throughput, low latency, and cost efficiency. TensorRT-LLM, an open-source framework from NVIDIA, is designed for optimizing and deploying large language models on NVIDIA GPUs.

Background: TensorRT-LLM is a library developed by NVIDIA to optimize and run large language models (LLMs) efficiently on NVIDIA GPUs. It provides a Python API to define and manage these models, ensuring high performance during inference.

The Python Executor within TensorRT-LLM is a component that orchestrates the execution of inference tasks. It manages the scheduling and execution of requests, ensuring that the GPU resources are utilized efficiently. The Python Executor handles various tasks such as batching requests, managing model states, and coordinating with other components like the model engine and the scheduler.

MPI (Message Passing Interface) helps distribute workloads across multiple GPUs by allowing independent CPU processes to manage different GPUs and coordinate their operations. Because GPUs cannot communicate directly across network nodes, MPI coordinates the sending and receiving of data between nodes while utilizing hardware-accelerated paths to shift workloads off the CPU.

Vulnerability details: CVE-2025-33255: NVIDIA TensorRT-LLM for any platform contains a vulnerability in the MPI server, where an attacker could cause an unsafe deserialization. A successful exploit of this vulnerability might lead to code execution, denial of service, data tampering, or information disclosure.

Note: To completely mitigate the risk shown in the attached diagram, ensure your deployment workflow includes these two final rules:

  1. Isolate MPI Traffic: Set up your cluster so that the network fabric connecting Nodes 1–4 sits on a private, isolated VLAN or subnet with no external internet ingress.
  2. Upgrade the Image: Verify that your docker pull command grabs a TensorRT-LLM container image version released after the May 2026 security patch advisory.
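
Beyond the two rules above, a common way to harden inter-process messaging against unsafe deserialization is to exchange data-only messages and authenticate each one before parsing. The sketch below uses JSON plus an HMAC tag from the Python standard library. It is an assumption about a reasonable design, not a description of NVIDIA's patch, and pack and unpack are hypothetical names.

```python
import hashlib
import hmac
import json

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def pack(message: dict, key: bytes) -> bytes:
    """Serialize to JSON and prepend an HMAC-SHA256 tag."""
    body = json.dumps(message, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return tag + body


def unpack(frame: bytes, key: bytes) -> dict:
    """Verify the tag first, then parse; JSON cannot carry executable objects."""
    tag, body = frame[:TAG_LEN], frame[TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message authentication failed")
    return json.loads(body.decode("utf-8"))


key = b"shared-secret-for-illustration"
frame = pack({"op": "generate", "request_id": 7}, key)
message = unpack(frame, key)
```

Verifying before parsing means a peer without the shared key cannot get any bytes into the parser at all.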

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5805

CVE-2026-24207: About NVIDIA Triton Inference Server (21st May 2026)

Preface: The NVIDIA Triton Inference Server natively supports gRPC as one of its primary communication protocols for the client API. Furthermore, gRPC can also be used for health checks, statistics, and model loading/unloading operations, not just inference requests. Inference requests arrive at the server via HTTP/REST, gRPC, or the C API and are then routed to the appropriate per-model scheduler.

Background: NVIDIA’s security bulletin did not provide details. I speculate the cause of CVE-2026-24207 is as follows:

The Bypass Logic

A standard gRPC request path is canonical: /package.Service/Method. If an attacker crafts a raw HTTP/2 frame where the :path pseudo-header is package[.]Service/Method (missing the leading /), the following happens:

Step 1 – Routing Success: The gRPC server sees the request and correctly identifies which handler to trigger, even without the leading slash.

Step 2 – Match Failure: The authorization engine (like grpc/authz) checks the path against its rules. It looks for a literal match for /package[.]Service/Method. Since the incoming path is package[.]Service/Method, the Deny rule does not trigger.

Step 3 – Fallback Triggered: Because the specific deny rule failed to match, the engine falls back to its next rule, which is typically a “catch-all” Allow rule.
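
The three steps can be reproduced with a toy model. Everything below is hypothetical (the rule set, the handler table, and the function names) and only illustrates the speculated mismatch between a lenient router and a literal-match authorizer. It is not gRPC or Triton code.

```python
DENY_RULES = {"/package.Service/Method"}
HANDLERS = {"package.Service/Method": "handler"}


def route(path: str):
    """Lenient router: resolves the handler with or without the leading slash."""
    return HANDLERS.get(path.lstrip("/"))


def naive_authorize(path: str) -> bool:
    """Literal deny match, then a catch-all allow."""
    if path in DENY_RULES:
        return False
    return True


def strict_authorize(path: str) -> bool:
    """Reject non-canonical paths before evaluating any rule."""
    if not path.startswith("/"):
        return False
    return path not in DENY_RULES


crafted = "package.Service/Method"      # leading slash removed by the attacker
bypassed = route(crafted) is not None and naive_authorize(crafted)
```

Here bypassed is true because the router and the authorizer disagree about what the path means; strict_authorize closes the gap by refusing anything that is not in canonical form.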

My question is this: gRPC-Go (google[.]golang[.]org/grpc) has an authorization bypass vulnerability affecting all versions prior to 1.79.3. However, Triton’s gRPC functionality is primarily implemented in src/grpc/grpc_server[.]cc. Can I therefore say that the CVE-2026-24207 vulnerability occurs on the client side rather than the server side? For edge deployments, Triton Server is also provided as a shared library whose API allows the full functionality of the server to be integrated directly into an application. What are your thoughts on this?

If you are using the standard Triton Inference Server binary (which is built in C++), it uses the C++ gRPC implementation, not the Go version. Therefore, it is not vulnerable to CVE-2026-24207 on the server side.

Vulnerability details: CVE-2026-24207 – NVIDIA Triton Inference Server contains a vulnerability where an attacker could cause an authentication bypass. A successful exploit of this vulnerability might lead to code execution, escalation of privileges, data tampering, denial of service, or information disclosure.

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5828

CVE-2026-46300 (Fragnesia) is a Linux kernel privilege escalation in the XFRM ESP-in-TCP subsystem. Does it affect GX-grade supercomputers? (18th May 2026)

Preface: If BlueField DPU supports configuring IPsec rules using strongSwan 5.9.0bf, does it use kernel IPsec in ARM?

Yes, when using strongSwan 5.9.0bf on the BlueField DPU, it utilizes the Linux kernel IPsec stack (xfrm) running on the ARM cores to manage and configure security associations, which can then be offloaded to the hardware acceleration engines.

Background: The only scenario where a GPU or advanced SoC interacts with the Linux kernel’s XFRM subsystem is during IPsec Network Offloading (SmartNICs / DPUs).

If an enterprise SoC or Data Processing Unit (like an NVIDIA BlueField DPU) handles high-speed network traffic, the Linux XFRM subsystem can act as a control plane. It passes the encryption policies (SAs and SPIs) down to the chip’s network engine so that standard internet IPsec traffic can be encrypted at wire speed directly on the network interface card (NIC) hardware rather than taxing the main host CPU.

Vulnerability details: Fragnesia is a Linux local privilege escalation vulnerability that is a member of the Dirty Frag vulnerability class.

Are there any remedies available for CVE-2026-46300?

Patch Your Kernel:

Update your Linux kernel immediately. Patches were released by major distributions (AlmaLinux, Ubuntu, Red Hat, Debian, Amazon Linux) around May 14-16, 2026.

Apply Temporary Mitigation (If Patching is Delayed): Disable the vulnerable modules (esp4, esp6, and rxrpc) to block the exploit.

Run: sudo rmmod esp4 esp6 rxrpc

Create blacklist file: echo -e "install esp4 /bin/false\ninstall esp6 /bin/false\ninstall rxrpc /bin/false" | sudo tee /etc/modprobe.d/fragnesia.conf

Clear Page Cache: If you suspect a machine was targeted before patching, run sync; echo 3 | sudo tee /proc/sys/vm/drop_caches to evict potentially corrupted cached pages.
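
To confirm the temporary mitigation took effect, you can check whether any of the three modules is still loaded. The helper below is a small sketch that parses the text of /proc/modules; the function name is hypothetical. Note that code built directly into the kernel does not appear in that file, so an empty result is not proof of safety on every system.

```python
RISKY_MODULES = ("esp4", "esp6", "rxrpc")


def loaded_risky_modules(proc_modules_text: str, risky=RISKY_MODULES) -> list:
    """Return the risky modules that appear in /proc/modules content."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in risky if name in loaded]


if __name__ == "__main__":
    try:
        with open("/proc/modules", encoding="utf-8") as fh:
            print(loaded_risky_modules(fh.read()))
    except OSError:
        print("/proc/modules is not available on this system")
```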

Official announcement: Please refer to the link for details – https://github.com/v12-security/pocs/tree/main/fragnesia

A more imaginative assumption on TDXRay: Microarchitectural Side-Channel Analysis of Intel TDX for Real-World Workloads (15th May 2026)

Preface: In these scenarios (see attached diagram), microarchitectural side-channel attacks targeting Intel TDX can directly impact and jeopardize the security of AMD accelerators.

Even though the AMD Instinct APU operates on a completely different silicon package, the two architectures are fundamentally tied together by a shared software stack, device driver interface, and physical interconnect fabric.

The specific risks regarding how TDXRay and cross-domain side-channel leakage bypass the hardware boundary in the diagram are detailed below:

Technical details:

1. Host-Side Driver Leakage (The Primary Target)

As illustrated in the attached diagram, the ROCm Driver and HIP Runtime execute inside the Intel TDX Virtual Machine / Trust Domain.

• When primitives like those found in the TDXRay research paper (e.g., page-level or cache-line tracking) are utilized by an untrusted host hypervisor, they target the Intel CPU’s caches and memory controller.

• Because the Intel CPU must actively prepare, schedule, and feed data arrays (h_a, h_b) to the AMD accelerator, the memory access patterns of the ROCm driver itself are leaked.

• An attacker can infer exactly when the AMD kernel is being launched, what memory addresses are being mapped, and the size or stride of the datasets being transferred.
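
A toy calculation shows why page-level tracking alone is informative. Assume, purely for illustration, a page-aligned lookup table with 4-byte entries and 4 KiB pages. An observer who only learns which page was touched can already narrow a secret index to 1,024 candidates. The function names are hypothetical and nothing here models real TDX or ROCm behaviour.

```python
PAGE_SIZE = 4096   # bytes per page
ENTRY_SIZE = 4     # bytes per table entry (assumed)


def observed_page(secret_index: int) -> int:
    """Page number an observer sees when table[secret_index] is read."""
    return (secret_index * ENTRY_SIZE) // PAGE_SIZE


def candidates_from_page(page: int) -> range:
    """Indices consistent with an observed page number."""
    per_page = PAGE_SIZE // ENTRY_SIZE
    return range(page * per_page, (page + 1) * per_page)


page = observed_page(2500)
candidates = candidates_from_page(page)
```

Cache-line tracking shrinks the candidate set further, since a 64-byte line holds only 16 such entries.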

2. Interconnect Fabric Bottlenecks & Shared Cache Timing

The highlighted section in the diagram notes that memcpy can leak info via cache and memory controller interaction.

• During hipMemcpyHostToDevice or hipMemcpyDeviceToHost, data travels across the PCIe Gen 5 / CXL Interconnect Fabric.

• If a malicious actor on the host hypervisor induces resource contention on the shared Intel CPU core or memory bus, they can observe subtle latency shifts.

• By monitoring the timing delays of the Intel CPU waiting for the AMD APU to complete its tasks (hipDeviceSynchronize), the attacker can infer secret-dependent execution paths inside the AMD hardware without ever probing the AMD chip directly.

3. The Cross-Domain Threat Model (AMD SEV-SNP Parallel)

According to AMD’s Official Security Bulletin (AMD-SB-3044) published regarding the TDXRay findings, these types of microarchitectural host-side tracing methodologies fall within a category of behaviors that affect both Intel TDX and AMD SEV-SNP.

If an application leaks data structure layouts through its memory access patterns on the Intel host, the fact that the actual matrix operations happen on an AMD chip does not protect the workflow’s overall confidentiality.

Official announcement: Please refer to the link for details – https://www.amd.com/en/resources/product-security/bulletin/amd-sb-3044.html

CVE-2026-43284: Dirty Frag tricks the IPsec/TCP stack into doing the “dirty work” (13th May 2026)

Preface: The “Dirty Frag” attack chains two separate flaws in the Linux kernel’s networking stack: one in the ESP (Encapsulating Security Payload) protocol used by IPsec and another in the RxRPC protocol used for the AFS distributed file system. If you do not use IPsec, disabling its modules removes one of the major attack paths.

Background: The “Dirty Frag” vulnerability is deemed difficult to patch immediately due to its exploitation of a long-standing core Linux kernel optimization, which initially lacked official, widespread patches upon disclosure. While disabling ESP modules helps, effective mitigation requires blacklisting both ESP and RxRPC modules, or patching the kernel directly.

How to mitigate vulnerabilities:

Step 1: Block the ESP and RxRPC modules: Create a configuration file (e.g., /etc/modprobe.d/dirtyfrag.conf) to ensure the modules cannot be auto-loaded by an exploit:

bash

install esp4 /bin/false
install esp6 /bin/false
install rxrpc /bin/false

Step 2: Unload current modules: Remove the modules if they are currently active in memory:

bash

sudo modprobe -r esp4 esp6 rxrpc
 

Step 3: Clear the Page Cache: The exploit works by corrupting the page cache. After applying the blocks, clear the cache to ensure no malicious changes persist in RAM:

bash

sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
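
After Step 1 it is worth confirming that the configuration file really contains a blocking rule for each module, since a typo leaves a module loadable. The sketch below checks the text of such a file; missing_blocks is a hypothetical helper and only recognises the exact "install <name> /bin/false" form used in this post.

```python
REQUIRED = ("esp4", "esp6", "rxrpc")


def missing_blocks(conf_text: str, required=REQUIRED) -> list:
    """Return modules that lack an 'install <name> /bin/false' rule."""
    blocked = set()
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop trailing comments
        if len(fields) == 3 and fields[0] == "install" and fields[2] == "/bin/false":
            blocked.add(fields[1])
    return [name for name in required if name not in blocked]


example_conf = "install esp4 /bin/false\ninstall esp6 /bin/false\n"
still_unblocked = missing_blocks(example_conf)
```

In this example the rxrpc rule was left out, so still_unblocked reports it.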
 

Official announcement: Please refer to the link for details – https://nvd.nist.gov/vuln/detail/CVE-2026-43284

To address the vulnerability identified in CVE-2026-24222 (and the related SSRF risk in CVE-2026-24231) – 5th May 2026

Preface: While NVIDIA has not “dropped” support for the core OpenClaw framework, in some specific cases they have moved away from its standard form.

Background: Because NemoClaw “bakes” certain variables into the sandbox configuration during onboarding, if they are not correctly scoped or sanitized, they remain accessible to the agent process even though it should be isolated.

As a result, this allows an attacker to exfiltrate critical secrets (like the NVIDIA_API_KEY or TELEGRAM_BOT_TOKEN mentioned) through the agent’s existing communication channels.

To address the vulnerability identified in CVE-2026-24222 (and the related SSRF risk in CVE-2026-24231), administrators should use the following CLI flags during sandbox creation or update. These flags, introduced in NemoClaw v0.0.18, are designed to strictly control which host environment variables are “baked” into the sandbox environment.

For details, see attached diagram.
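
The underlying idea, passing the sandbox only the variables it has been explicitly granted, can be sketched in a few lines. This is a generic allowlist illustration with a hypothetical build_sandbox_env helper and placeholder values. It does not reproduce NemoClaw's flags or internals.

```python
def build_sandbox_env(host_env: dict, allowlist: frozenset) -> dict:
    """Copy only explicitly allowlisted variables into the sandbox."""
    return {name: value for name, value in host_env.items() if name in allowlist}


host_env = {
    "PATH": "/usr/bin",
    "LANG": "C.UTF-8",
    "NVIDIA_API_KEY": "example-secret",      # placeholder value
    "TELEGRAM_BOT_TOKEN": "example-token",   # placeholder value
}
sandbox_env = build_sandbox_env(host_env, frozenset({"PATH", "LANG"}))
```

An allowlist fails safe: a secret added to the host environment later stays out of the sandbox unless someone deliberately grants it.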

Vulnerability details:

CVE-2026-24222 NVIDIA NemoClaw contains a vulnerability in the sandbox environment initialization component where a remote attacker may cause improper access control by sending prompt-injected content that causes the agent to read and exfiltrate host environment variables not properly restricted during sandbox creation. A successful exploit of this vulnerability may lead to information disclosure.

CVE-2026-24231 NVIDIA NemoClaw contains a vulnerability in the validateEndpointUrl() SSRF protection component where an attacker may cause a server-side request forgery by supplying a crafted endpoint URL referencing the 0[.]0[.]0[.]0/8 address range via a blueprint configuration file or CLI flag. A successful exploit of this vulnerability may lead to information disclosure.
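
For illustration, an endpoint check that covers the 0.0.0.0/8 range alongside the usual internal ranges might look like the sketch below, built on Python's standard ipaddress module. The function names are hypothetical and this is not NemoClaw's validateEndpointUrl(). To stay self-contained it rejects hostnames outright, whereas a production validator would resolve the name and check every resulting address.

```python
import ipaddress
from urllib.parse import urlsplit

THIS_NETWORK = ipaddress.ip_network("0.0.0.0/8")


def ip_is_forbidden(ip) -> bool:
    """True for addresses an outbound request should never target."""
    return (
        ip in THIS_NETWORK
        or ip.is_unspecified
        or ip.is_loopback
        or ip.is_private
        or ip.is_link_local
    )


def endpoint_allowed(url: str) -> bool:
    """Accept only http(s) URLs whose host is a permitted IP literal."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parts.hostname)
    except ValueError:
        # Hostname rather than an IP literal: rejected in this sketch.
        return False
    return not ip_is_forbidden(ip)
```

Checking the whole 0.0.0.0/8 block explicitly matters because a filter that only tests for 0.0.0.0 itself would let addresses such as 0.1.2.3 through.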

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5837

Recommended Action:
NVIDIA has released a software update for NVIDIA NemoClaw to address these issues. Users should update to v0.0.18 or later immediately, as the fixes close the information disclosure paths described above.

CVE-2026-24178: About NVIDIA NVFlare Dashboard (29th Apr 2026)

Preface: NVIDIA FLARE allows researchers and data scientists to adapt existing ML/DL workflows to a federated learning paradigm.

Background: A critical Insecure Direct Object Reference (IDOR) vulnerability was identified in the NVIDIA NVFlare Dashboard (CVE-2026-24178). In federated learning environments—where privacy is paramount (e.g., HIPAA-compliant medical research)—this flaw allowed unauthorized users to bypass access controls and interact with data belonging to other participants.

The Dashboard’s RESTful API previously relied on user-supplied identifiers (such as job_id or user_id) to retrieve records. While the system verified that a user was logged in (Authentication), it failed to verify if that user actually owned or was authorized to access the specific record requested (Authorization). This allowed an attacker to simply change a numeric ID in an API request to view, modify, or delete sensitive information outside their scope.

Vulnerability details: NVIDIA NVFlare Dashboard contains a vulnerability in the user management and authentication system where an unauthenticated attacker may cause authorization bypass through user-controlled key. A successful exploit of this vulnerability may lead to privilege escalation, data tampering, information disclosure, code execution, and denial of service.

Remediation: The Patch
The vulnerability is fully addressed in NVIDIA FLARE SDK v2.7.2. The fix implements Attribute-Based Access Control (ABAC) by:

  • Decoupling Trust: The backend no longer trusts the ID provided in the request URL/body as the sole source of authority.
  • Enforcing Ownership: Every database query now automatically injects an owner_id or org_id filter derived from a secure, server-side session.
  • Silent Rejection: Unauthorized requests now correctly return a 403 Forbidden error, ensuring data isolation between collaborating parties.
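
The ownership rule can be sketched with an in-memory SQLite table. The schema, the get_job helper, and the status tuples are hypothetical and only illustrate the pattern of filtering by a server-side session identity. They are not NVFlare Dashboard code.

```python
import sqlite3


def get_job(conn, session_user_id: int, job_id: int):
    """Fetch a job only if it belongs to the user bound to the session."""
    row = conn.execute(
        "SELECT id, owner_id, payload FROM jobs WHERE id = ? AND owner_id = ?",
        (job_id, session_user_id),   # owner comes from the session, not the request
    ).fetchone()
    if row is None:
        return 403, None
    return 200, {"id": row[0], "payload": row[2]}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, owner_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [(1, 10, "site-A job"), (2, 20, "site-B job")],
)

own = get_job(conn, session_user_id=10, job_id=1)
other = get_job(conn, session_user_id=10, job_id=2)
```

User 10 receives their own job but gets a 403 for job 2, even though that job exists, because the owner filter is applied inside the query rather than trusted from the request.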

Official announcement: Please refer to the link for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5819