Category Archives: AI and ML

CVE-2025-23310: The NVIDIA Triton Inference Server for Windows and Linux suffers from a stack buffer overflow due to specially crafted input. (5th Aug 2025)

Preface: The NVIDIA Triton Inference Server API supports both HTTP/REST and gRPC protocols. These protocols allow clients to communicate with the Triton server for various tasks such as model inferencing, checking server and model health, and managing model metadata and statistics.
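As a sketch of the HTTP/REST side, the snippet below builds a request against Triton’s KServe-v2-style endpoints (`/v2/health/ready`, `/v2/models/{model}/infer`). The server address and the model/tensor names are illustrative assumptions, and the payload is only constructed, not sent:

```python
import json

# Illustrative server address; a real deployment would differ.
TRITON = "http://localhost:8000"

def health_url():
    # GET on this endpoint returns 200 when the server is ready for inference.
    return f"{TRITON}/v2/health/ready"

def infer_request(model_name, input_name, values):
    # Body for POST /v2/models/{model}/infer: named input tensors
    # with explicit shape and datatype, per the KServe v2 protocol.
    body = {
        "inputs": [{
            "name": input_name,
            "shape": [1, len(values)],
            "datatype": "FP32",
            "data": values,
        }]
    }
    return f"{TRITON}/v2/models/{model_name}/infer", json.dumps(body)

url, payload = infer_request("densenet_onnx", "data_0", [0.1, 0.2, 0.3])
print(url)
print(payload)
```

The same request can be issued over gRPC with the generated Triton client stubs; the HTTP form is shown here only because it is easiest to inspect.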

Background: NVIDIA Triton™ Inference Server, part of the NVIDIA AI platform and available with NVIDIA AI Enterprise, is open-source software that standardizes AI model deployment and execution across every workload.

The Asynchronous Server Gateway Interface (ASGI) is a calling convention for web servers to forward requests to asynchronous-capable Python frameworks and applications. It is built as a successor to the Web Server Gateway Interface (WSGI).
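To make that calling convention concrete, here is a minimal ASGI application driven entirely in memory; the connection scope and channels are hand-built for the demo, with no real web server involved:

```python
import asyncio

# Minimal ASGI application: an async callable that receives the connection
# scope plus receive/send channels and emits response events.
async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"ok"})

# Drive the app directly with an in-memory send channel.
async def demo():
    events = []
    async def send(event):
        events.append(event)
    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}
    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return events

events = asyncio.run(demo())
print(events[0]["status"], events[1]["body"])  # 200 b'ok'
```

In production the same callable would be served by an ASGI server such as uvicorn; the events it emits are identical.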

NVIDIA Triton Inference Server integrates a built-in web server to expose its functionality and allow clients to interact with it. This web server is fundamental to how Triton operates and provides access to its inference capabilities on both Windows and Linux environments.

Vulnerability details: CVE-2025-23310 – NVIDIA Triton Inference Server for Windows and Linux contains a vulnerability where an attacker could cause stack buffer overflow by specially crafted inputs. A successful exploit of this vulnerability might lead to remote code execution, denial of service, information disclosure, and data tampering.

Official announcement: Please refer to the link for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5687

The whole world is paying attention to Nvidia, but supercomputers using AMD are the super ones! (July 28, 2025)

Preface: The El Capitan system at the Lawrence Livermore National Laboratory, California, USA remains the No. 1 system on the TOP500. The HPE Cray EX255a system was measured with 1.742 Exaflop/s on the HPL benchmark. El Capitan has 11,039,616 cores and is based on AMD 4th generation EPYC™ processors with 24 cores at 1.8 GHz and AMD Instinct™ MI300A accelerators. It uses the HPE Slingshot interconnect for data transfer and achieves an energy efficiency of 58.9 Gigaflops/watt. The system also achieved 17.41 Petaflop/s on the HPCG benchmark, which makes it the new leader on this ranking as well (TOP500 list, June 2025).
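A quick sanity check of those figures: dividing the HPL performance by the energy efficiency gives the implied power draw during the benchmark run.

```python
# Back-of-the-envelope check of El Capitan's implied power draw from the
# figures above: Rmax = 1.742 Exaflop/s, efficiency = 58.9 Gigaflops/watt.
rmax_flops = 1.742e18          # 1.742 Exaflop/s
eff_flops_per_watt = 58.9e9    # 58.9 Gigaflops/watt

power_watts = rmax_flops / eff_flops_per_watt
power_mw = power_watts / 1e6
print(f"Implied HPL power draw: {power_mw:.1f} MW")
```

That works out to roughly 30 MW, consistent with the power budget of a leadership-class machine.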

Background: Does El Capitan Use Docker or Kubernetes? El Capitan does not use Docker directly, but it does use Kubernetes—specifically:

Kubernetes is deployed on Rabbit and worker nodes. It is part of a stateless orchestration layer integrated with the Tri-Lab Operating System Stack (TOSS).

Kubernetes is used alongside Flux (the resource manager) and Rabbit (the near-node storage system) to manage complex workflows.

Why Kubernetes Instead of Docker Alone?

While Docker is lightweight and flexible, Kubernetes offers orchestration, which is critical for:

  • Managing thousands of concurrent jobs.
  • Coordinating data movement and storage across Rabbit nodes.
  • Supporting AI/ML workflows and in-situ analysis.

But Kubernetes has a larger memory and CPU footprint than Docker alone.

Technical details: HPE Cray Operating System (COS) is a specialized version of SUSE Linux Enterprise Server designed for high-performance computing, rather than being a variant of Red Hat Enterprise Linux. It’s built to run large, complex applications at scale and enhance application efficiency, reliability, management, and data access. While COS leverages SUSE Linux, it incorporates features tailored for supercomputing environments, such as enhanced memory sharing, power monitoring, and advanced kernel debugging.

What Does Cray Modify?
Cray (now part of HPE) primarily:
-Modifies the Linux kernel for performance tuning, scalability, and hardware support
-Adds HPC-specific enhancements, such as:
  • Optimized scheduling
  • NUMA-aware memory management
  • High-speed interconnect support (e.g., Slingshot)
  • An enhanced I/O and storage stack
-Integrates with the Cray Shasta architecture and Slingshot interconnect

These modifications are layered on top of SUSE Linux, meaning the base OS remains familiar and enterprise-grade, but is tailored for supercomputing.

End.

Our world is full of challenges and hardships. But you must be happy every day!

Security Focus: CVE‑2025‑23284 NVIDIA vGPU software contains a vulnerability (25-07-2025)

Preface: Memory Allocation Flow:

  1. User-space request (e.g., CUDA malloc or OpenGL buffer allocation).
  2. Driver calls memmgrCreateHeap_IMPL() to create a memory heap.
  3. Heap uses pmaAllocatePages() to get physical memory.
  4. Virtual address space is mapped using UVM or MMU walker.
  5. Memory is returned to user-space or GPU context.
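The five steps above can be sketched as a toy model. The function names echo the driver symbols mentioned in the text, but the logic is purely illustrative and is not NVIDIA’s implementation:

```python
# Toy model of the allocation flow above. Names mirror the driver symbols
# in the text; everything else is illustrative, not real driver code.
class Heap:
    def __init__(self):
        self.pages = []

def memmgrCreateHeap_IMPL():
    # Step 2: the driver creates a memory heap for the request.
    return Heap()

def pmaAllocatePages(heap, num_pages):
    # Step 3: the heap obtains physical pages from the physical memory allocator.
    pages = [{"phys_addr": 0x1000 * i} for i in range(num_pages)]
    heap.pages.extend(pages)
    return pages

def map_virtual(pages):
    # Step 4: map the physical pages into a virtual range (UVM / MMU walker).
    base = 0x7F00_0000_0000
    return {base + i * 0x1000: p["phys_addr"] for i, p in enumerate(pages)}

def cuda_malloc(num_pages):
    # Step 1 (user-space request) through step 5 (mapping handed back
    # to user space or the GPU context).
    heap = memmgrCreateHeap_IMPL()
    pages = pmaAllocatePages(heap, num_pages)
    return map_virtual(pages)

mapping = cuda_malloc(4)
print(len(mapping))  # 4 mapped pages
```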

Background:

An OS-agnostic binary is a compiled program designed to run on multiple operating systems without requiring separate builds for each. This means the binary file can be executed on different OS platforms without modification, achieving a level of portability that’s not common with traditional compiled software.

The core loadable module within the NVIDIA vGPU software package is the NVIDIA kernel driver, specifically named nvidia[.]ko. This module facilitates communication between the guest virtual machine (VM) and the physical NVIDIA GPU. It’s split into two main components: an OS-agnostic binary and a kernel interface layer. The OS-agnostic component, for example, nv-kernel[.]o_binary for the nvidia[.]ko module, is provided as a pre-built binary to save time during installation. The kernel interface layer is specific to the Linux kernel version and configuration.

Vulnerability details:

CVE-2025-23285: NVIDIA vGPU software contains a vulnerability in the Virtual GPU Manager, where a malicious guest could cause a stack buffer overflow. A successful exploit of this vulnerability might lead to code execution, denial of service, information disclosure, or data tampering.

CVE-2025-23283: NVIDIA vGPU software for Linux-style hypervisors contains a vulnerability in the Virtual GPU Manager, where a malicious guest could cause a stack buffer overflow. A successful exploit of this vulnerability might lead to code execution, denial of service, escalation of privileges, information disclosure, or data tampering.

Official announcement: Please see the url for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5670

CVE-2023-4969 – Researchers from Trail of Bits reported a potential vulnerability titled “LeftoverLocals”; this GPU design weakness has proven fickle! (21-07-2025)

Preface: “LeftoverLocals” allows recovery of data from GPU local memory created by other processes on Apple, Qualcomm, AMD, and Imagination GPUs. LeftoverLocals affects the security posture of the entire GPU application, especially LLMs and machine learning models running on affected GPU platforms. NVD published the entry on January 16, 2024. So far, AMD appears to be the only company actively taking remediation measures.

Background: Researchers from Trail of Bits published their “LeftoverLocals” article on 16th January 2024. AMD took corrective action on the following schedule.

2025-07-18: Updated the Mitigation section for AMD Radeon Graphics

2025-06-23: Updated the Mitigation section for Data Center Graphics, AMD Radeon Graphics, and revised Client Processors table

2025-04-07: Updated the Mitigation section for Data Center Graphics, AMD Radeon Graphics, and Client Processors

2025-02-11: Updated the Mitigation section – Data Center Graphics

2025-01-15: Mitigation section has been updated and AMD Ryzen™ AI 300 Series Processor (formerly codenamed “Strix Point”) FP8 has been added to the Client Processors list

2024-11-07: Mitigation has been updated for MI300 and MI300A

Updated driver version from 24.x.y to 25.x.y

2024-10-30: Updated mitigation targets

2024-08-02: Updated AMD Software: Adrenalin Edition and PRO Edition versions.

Removed: AMD Ryzen™ 3000 Series Processors with Radeon™ Graphics (Not affected)

Added: AMD Ryzen™ 8000 Series Processors with Radeon™ Graphics and AMD Ryzen™ 7030 Series Processors with Radeon™ Graphics

2024-07-30: Updated the Mitigation section of AMD Radeon™ Graphics and Client Processors product tables

Updated Data Center Graphics Inter-VM and Bare Metal/Intra-VM Mitigation product tables

Updated the Mitigation section’s month for the driver update rollout

2024-05-07: Added Vega products and Mitigation section with Product tables

2024-01-26: Updated Graphics and Data Center Graphics products

2024-01-16: Initial publication

Vulnerability details: CVE-2023-4969: A GPU kernel can read sensitive data from another GPU kernel (even from another user or app) through an optimized GPU memory region called “local memory” on various architectures.

Official announcement: Please refer to the official link for details – https://www.amd.com/en/resources/product-security/bulletin/amd-sb-6010.html


CVE-2025-23270: NVIDIA Jetson Linux contains a vulnerability in UEFI Management mode (20th July 2025)

Preface: UEFI Management Mode (MM) is a highly privileged firmware execution environment, the UEFI specification’s generalization of x86 System Management Mode to other architectures such as the Arm cores used in Jetson. It runs outside the operating system’s control and should not be confused with the UEFI setup menu, which you enter by pressing a key (like F2, F10, or Del) during boot to configure settings such as boot order and device selection.

Background: CUDA is a parallel computing platform and programming model developed by NVIDIA, designed to leverage the power of GPUs for general-purpose computing. Linux for Tegra (L4T) is NVIDIA’s customized Linux distribution based on Ubuntu, optimized for their Tegra family of system-on-chips (SoCs), including those used in Jetson development kits. Essentially, L4T provides the operating system and necessary drivers for running CUDA-enabled applications on NVIDIA’s embedded platforms.

NVIDIA Jetson Linux is a customized version of the Linux operating system specifically designed for NVIDIA Jetson embedded computing modules. It provides a complete software stack, including the Linux kernel, bootloader, drivers, and libraries, tailored for the Jetson platform’s hardware and intended for edge AI and robotics applications.

Vulnerability details:

CVE-2025-23270 NVIDIA Jetson Linux contains a vulnerability in UEFI Management mode, where an unprivileged local attacker may cause exposure of sensitive information via a side channel vulnerability. A successful exploit of this vulnerability might lead to code execution, data tampering, denial of service, and information disclosure.

CVE-2025-23269 NVIDIA Jetson Linux contains a vulnerability in the kernel where an attacker may cause an exposure of sensitive information due to a shared microarchitectural predictor state that influences transient execution. A successful exploit of this vulnerability may lead to information disclosure.

Official announcement: Please see the link for details

https://nvidia.custhelp.com/app/answers/detail/a_id/5662

“When an error occurs, data remains in cache memory. When the OS starts, a malicious program stored on the device can then read that residual data from shared memory.”

CVE-2025-23266 and CVE-2025-23267: NVIDIA Container Toolkit design weakness (16-07-2025)

Preface: Docker Compose is a tool that makes it easier to define and manage multi-container Docker applications. It simplifies running interconnected services, such as a frontend, backend API, and database, by allowing them to be launched and controlled together.

Docker Compose also manages the container lifecycle. Container lifecycle management is the critical process of overseeing the creation, deployment, and operation of a container until its eventual decommissioning.

Background: Docker Compose v2.30.0 has introduced lifecycle hooks, making it easier to manage actions tied to container start and stop events. This feature lets developers handle key tasks more flexibly while keeping applications clean and secure.

Vulnerability details:

CVE-2025-23266: NVIDIA Container Toolkit for all platforms contains a vulnerability in some hooks used to initialize the container, where an attacker could execute arbitrary code with elevated permissions. A successful exploit of this vulnerability might lead to escalation of privileges, data tampering, information disclosure, and denial of service.

CVE-2025-23267: NVIDIA Container Toolkit for all platforms contains a vulnerability in the update-ldcache hook, where an attacker could cause a link following by using a specially crafted container image. A successful exploit of this vulnerability might lead to data tampering and denial of service.

Official announcement: Please refer to url for details

https://nvidia.custhelp.com/app/answers/detail/a_id/5659

Ref: Does Disabling Hooks Disable Container Lifecycle Management?

Hooks – In this context, hooks are scripts or binaries that run during container lifecycle events (e.g., prestart, poststart). The CUDA compatibility hook injects libraries or environment variables needed for CUDA apps.

Disabling the Hook – Prevents the automatic injection of CUDA compatibility libraries into containers. This does not disable the entire container lifecycle, but it removes one automation step in the lifecycle.

CVE-2025-53818: Command Injection in MCP Server github-kanban-mcp-server (15th July 2025)

Preface: Is it a good thing when artificial intelligence uses open-source software? Yes, using open-source software is generally considered a positive for artificial intelligence development. It fosters collaboration, transparency, and faster innovation, while also potentially reducing costs and biases. However, it’s crucial to acknowledge potential risks like misuse and the need for responsible development practices.

Background: The Model Context Protocol (MCP) is an open standard, open-source framework designed to standardize how AI models, particularly large language models (LLMs), interact with external tools, systems, and data sources. Think of it as a universal adapter, similar to USB-C, for AI applications, allowing them to easily connect to and utilize various data and tools.

A Kanban MCP Server is a server component that manages Kanban boards using the Model Context Protocol (MCP). It allows AI assistants and other systems to interact with and manipulate Kanban boards programmatically, enabling automation and integration of workflows.

Vulnerability details: GitHub Kanban MCP Server is a Model Context Protocol (MCP) server for managing GitHub issues in Kanban board format and streamlining LLM task management. Versions 0.3.0 and 0.4.0 of the MCP Server are vulnerable to command injection attacks in some of their MCP tool definitions and implementations. The MCP Server exposes the tool `add_comment`, which relies on the Node.js child-process API `exec` to execute the GitHub (`gh`) command; `exec` is unsafe and vulnerable when concatenated with untrusted user input.

Workaround: As of time of publication, no known patches are available.

But you can securely rewrite the vulnerable handleAddComment function using execFile or the GitHub REST API to avoid command injection risks.

Workaround 1: Using execFile (Safer Shell Execution)

execFile does not invoke a shell, so special characters in inputs (like ;, &&, etc.) are treated as literal arguments, not commands.
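The same principle, sketched in Python rather than Node.js (the `gh` arguments and issue number are illustrative, and nothing is actually executed here): passing an argv list with `shell=False` keeps shell metacharacters inert, analogous to `execFile`.

```python
# Contrast the vulnerable string-building pattern with a safe argv list.
# The gh arguments are illustrative, not the real MCP Server code.

def unsafe_command(issue, comment):
    # Vulnerable pattern (what a shell-based `exec` would receive):
    # untrusted input concatenated into a single shell string.
    return f'gh issue comment {issue} --body "{comment}"'

def safe_argv(issue, comment):
    # Safe pattern: an argv list for subprocess.run(argv, shell=False);
    # each element is one literal argument, so metacharacters are inert.
    return ["gh", "issue", "comment", str(issue), "--body", comment]

malicious = '"; rm -rf / #'
print(unsafe_command(42, malicious))  # a shell would run the injected command
print(safe_argv(42, malicious))       # the payload stays one harmless argument
```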

Workaround 2: Using GitHub REST API via @octokit/rest

– No shell involved.

– Fully typed and authenticated.

– GitHub officially supports and maintains this SDK.

Official announcement: Please refer to url for details –

https://nvd.nist.gov/vuln/detail/CVE-2025-53818

AMD-based AI systems combining AMD rocBLAS and Intel MKL can rank among the fastest supercomputers in the world (14-07-2025)

Preface: Supercomputers rely on math libraries to efficiently handle the complex numerical computations required for scientific simulations and modeling. These libraries provide optimized routines for linear algebra, numerical analysis, and other mathematical operations, enabling supercomputers to perform these calculations much faster than with general-purpose code.

While math libraries are a crucial component, they are not the sole key to boosting overall AI performance on supercomputers. Supercomputers excel at AI due to their parallel processing capabilities, specialized hardware like GPUs and TPUs, and efficient memory management, not just the math libraries they use. Math libraries are essential for performing the calculations required by AI algorithms, but they rely on the underlying hardware architecture and software infrastructure of the supercomputer to deliver that performance.

Background: AMD rocBLAS 6.0.2 is a version of AMD’s library for Basic Linear Algebra Subprograms (BLAS) optimized for AMD GPUs within the ROCm platform. It provides high-performance, robust implementations of BLAS operations, similar to legacy BLAS but adapted for GPU execution using the HIP programming language. Specifically, version 6.0.2 is a point release that includes minor bug fixes to improve the stability of applications using AMD’s MI300 GPUs. It also introduces new driver features for system qualification on partner server offerings.

Using AMD rocBLAS and Intel MKL (2016 or later) together can be beneficial because MKL, while optimized for Intel CPUs, can sometimes perform suboptimally on AMD CPUs. rocBLAS, on the other hand, is specifically optimized for AMD GPUs, providing a performance boost on AMD hardware.

Why Mix rocBLAS and MKL?

  • rocBLAS: Optimized for AMD GPUs (part of the ROCm stack).
  • MKL: Optimized for Intel CPUs, but still useful for certain CPU-bound tasks.
  • Mixing: You can selectively use each library for the operations where it performs best.
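One hedged way to picture that mixing is a thin dispatch layer that prefers a GPU backend and falls back to a CPU BLAS path. The functions below are placeholders, not real rocBLAS or MKL bindings:

```python
def cpu_matmul(a, b):
    # CPU path: stands in for an MKL-backed BLAS call (e.g., dgemm).
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][x] * b[x][j] for x in range(k)) for j in range(m)]
            for i in range(n)]

def gpu_matmul(a, b):
    # GPU path: placeholder for a rocBLAS GEMM; no ROCm device in this sketch.
    raise NotImplementedError("no ROCm device available")

def matmul(a, b, prefer_gpu=False):
    # Dispatch to whichever backend suits the operation, falling back
    # to the CPU library when the GPU path is unavailable.
    if prefer_gpu:
        try:
            return gpu_matmul(a, b)
        except NotImplementedError:
            pass
    return cpu_matmul(a, b)

a = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(matmul(a, identity, prefer_gpu=True))  # falls back to the CPU path here
```

In a real system the dispatch decision would also weigh data placement, since moving matrices between host and GPU memory can cost more than the GEMM itself.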

End.

Nvidia security focus – Rowhammer attack potential risk – July 2025 (11th July 2025)

Preface: The Rowhammer effect, a hardware vulnerability in DRAM chips, was first publicly presented and analyzed in June 2014 at the International Symposium on Computer Architecture (ISCA). This research, conducted by Yoongu Kim et al., demonstrated that repeatedly accessing a specific row in a DRAM chip can cause bit flips in nearby rows, potentially leading to security breaches.

Background: Nvidia has shifted from “copy on flip” to asynchronous copy mechanisms in their GPU architecture, particularly with the Ampere architecture and later. This change allows for more efficient handling of data transfers between memory and the GPU, reducing latency and improving overall performance, especially in scenarios with high frame rates or complex computations.

When System-Level ECC is enabled, it prevents attackers from successfully executing Rowhammer attacks by ensuring memory integrity. The memory controller detects and corrects bit flips, making it nearly impossible for an attacker to exploit them for privilege escalation or data corruption.
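The principle behind that correction can be illustrated with a toy Hamming(7,4) code, which detects and corrects any single flipped bit. Real System-Level ECC schemes (e.g., SECDED over much wider words) are more elaborate; this is only a sketch of the idea:

```python
def hamming74_encode(d):
    # d: 4 data bits. Parity bits go at positions 1, 2, 4 (1-indexed).
    d3, d5, d6, d7 = d
    p1 = d3 ^ d5 ^ d7
    p2 = d3 ^ d6 ^ d7
    p4 = d5 ^ d6 ^ d7
    return [p1, p2, d3, p4, d5, d6, d7]

def hamming74_correct(c):
    # Recompute parity; the syndrome is the 1-indexed position of a
    # single flipped bit (0 means no error), which we flip back.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 * 1 + s2 * 2 + s4 * 4
    c = c[:]
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the flipped bit
    return c, syndrome

word = hamming74_encode([1, 0, 1, 1])
hammered = word[:]
hammered[4] ^= 1                      # a Rowhammer-style single bit flip
fixed, pos = hamming74_correct(hammered)
print(pos, fixed == word)             # 5 True
```

This is why enabling System-Level ECC raises the bar so sharply: an attacker would need to induce multiple coordinated flips within one protected word to slip past detection.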

Technical details: Modern DRAMs, including the ones used by NVIDIA, are potentially susceptible to Rowhammer. The now decade-old Rowhammer problem has been well known for CPU memories (e.g., DDR, LPDDR). Recently, researchers at the University of Toronto demonstrated a successful Rowhammer exploitation on a NVIDIA A6000 GPU with GDDR6 memory where System-Level ECC was not enabled. In the same paper, the researchers showed that enabling System-Level ECC mitigates the Rowhammer problem. 

Official announcement: Please see the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5671

AMD releases details about Transient Scheduler Attack (TSA) – 9 Jul 2025

Preface: CPU transient instructions refer to instructions that are speculatively executed by a processor’s out-of-order execution engine, but which may ultimately be discarded and not reflected in the processor’s architectural state. These instructions are executed based on predictions about control flow or data dependencies, and if the prediction is incorrect, the results of these transient instructions are discarded.

Background: Transient Scheduler Attacks (TSA) are new speculative side channel attacks related to the execution timing of instructions under specific microarchitectural conditions. In some cases, an attacker may be able to use this timing information to infer data from other contexts, resulting in information leakage.

Vulnerability details:

CVE-2024-36350 – A transient execution vulnerability in some AMD processors may allow an attacker to infer data from previous stores, potentially resulting in the leakage of privileged information.

CVE-2024-36357 – A transient execution vulnerability in some AMD processors may allow an attacker to infer data in the L1D cache, potentially resulting in the leakage of sensitive information across privileged boundaries.

CVE-2024-36348 – A transient execution vulnerability in some AMD processors may allow a user process to infer the control registers speculatively even if the UMIP (User-Mode Instruction Prevention) feature is enabled, potentially resulting in information leakage.

CVE-2024-36349 – A transient execution vulnerability in some AMD processors may allow a user process to infer TSC_AUX even when such a read is disabled, potentially resulting in information leakage.

Official announcement: Please see the link for details –

https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7029.html