Category Archives: AI and ML

CVE-2025-23272: About NVIDIA nvJPEG library (6th Oct 2025)

Preface: The nvJPEG library provides low-latency decoding, encoding, and transcoding for common JPEG formats used in computer vision applications such as image classification, object detection and image segmentation.

Background: The nvJPEG library accepts a JPEG image data stream as input, retrieves the image's width and height from the stream, and uses that information to manage GPU memory allocation and decoding.

To use the nvJPEG library, start by calling the helper functions for initialization. Create an nvJPEG library handle with one of the helper functions nvjpegCreateSimple() or nvjpegCreateEx(), then create the JPEG state with the helper function nvjpegJpegStateCreate(). See nvJPEG Type Declarations and nvjpegJpegStateCreate().
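
A minimal sketch of this initialization sequence, called from Python via ctypes. It assumes libnvjpeg.so is on the loader path; the function names come from the nvJPEG C API quoted above, and the handles are treated as opaque pointers.

import ctypes

nvjpeg = ctypes.CDLL("libnvjpeg.so")

handle = ctypes.c_void_p()      # nvjpegHandle_t
jpeg_state = ctypes.c_void_p()  # nvjpegJpegState_t

# Both helpers return an nvjpegStatus_t; 0 means NVJPEG_STATUS_SUCCESS.
status = nvjpeg.nvjpegCreateSimple(ctypes.byref(handle))
assert status == 0, f"nvjpegCreateSimple failed: {status}"

status = nvjpeg.nvjpegJpegStateCreate(handle, ctypes.byref(jpeg_state))
assert status == 0, f"nvjpegJpegStateCreate failed: {status}"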

The nvJPEG library provides high-performance, GPU accelerated JPEG decoding functionality for image formats commonly used in deep learning and hyperscale multimedia applications.

Ref: Arrays in C/C++ are zero-indexed, meaning that if an array has `n` elements, valid indices range from `0` to `n-1`. Accessing an index outside this range leads to out-of-bounds access. Pointers in C/C++ provide direct memory manipulation capabilities, but this power comes with the risk of “out-of-bounds” access.
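
As a hypothetical illustration of this bug class (not nvJPEG code), Python's ctypes module exposes the same unchecked pointer access that C/C++ programs have:

import ctypes

buf = (ctypes.c_ubyte * 4)(1, 2, 3, 4)  # 4-element buffer: valid indices 0..3
p = ctypes.cast(buf, ctypes.POINTER(ctypes.c_ubyte))

print(p[3])    # in bounds: prints 4
print(p[100])  # out-of-bounds read: undefined behavior; may leak adjacent memory or crash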

Vulnerability details: NVIDIA nvJPEG library contains a vulnerability where an attacker can cause an out-of-bounds read by means of a specially crafted JPEG file. A successful exploit of this vulnerability might lead to information disclosure or denial of service.

Official announcement: For more details, please click the link.

https://nvd.nist.gov/vuln/detail/CVE-2025-23272

CVE-2025-10657: About Enhanced Container Isolation (2nd Oct 2025)

Preface: Standardized AI/ML model packaging: With OCI artifacts, models can be versioned, distributed, and tracked like container images. This promotes consistency and traceability across environments. Docker Desktop, specifically through its Docker Model Runner feature, can be used to run various AI models, particularly Large Language Models (LLMs) and other AI models that can be packaged as OCI Artifacts.

OCI Artifacts are arbitrary files associated with software applications, extending the standardized OCI (Open Container Initiative) image format to include content beyond container images, such as Helm charts, Software Bill of Materials (SBOMs), digital signatures, and provenance data. These artifacts leverage the same fundamental OCI structure of manifest, config, and layers and are stored and distributed using OCI-compliant registries and tools like the ORAS CLI.
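
As a sketch of that workflow, a model file can be pushed to a registry as an OCI artifact with the ORAS CLI; the registry, repository, and file names below are hypothetical, and oras must be installed and logged in to the target registry.

import subprocess

# Push a model file to an OCI registry as an artifact (file[:media-type]).
subprocess.run(
    ["oras", "push",
     "registry.example.com/models/my-llm:1.0",
     "model.gguf:application/octet-stream"],
    check=True,
)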

Background: A container desktop, such as Docker Desktop, acts as a local development environment and a management host for CI/CD pipelines by providing consistent, isolated environments for building, testing, and deploying containerized applications. It enables developers to package applications with their dependencies into portable containers, eliminating “works on my machine” issues and ensuring application uniformity across development, testing, and production. This simplifies the entire software delivery process, accelerating the development lifecycle by integrating container management directly into the developer’s workflow.

Vulnerability details: In a hardened Docker environment with Enhanced Container Isolation (ECI, https://docs.docker.com/enterprise/security/hardened-desktop/enhanced-container-isolation/) enabled, an administrator can use the command restrictions feature (https://docs.docker.com/enterprise/security/hardened-desktop/enhanced-container-isolation/config/#command-restrictions) to restrict the commands that a container with a Docker socket mount may issue on that socket. Due to a software bug, the configured restrictions were ignored when passed to ECI, allowing any command to be executed on the socket. This grants excessive privileges by permitting unrestricted access to powerful Docker commands. The vulnerability affects only Docker Desktop 4.46.0 users who have ECI enabled and are using the Docker socket command restrictions feature. In addition, since ECI blocks mounting the Docker socket into containers by default, it affects only containers that the administrator has explicitly allowed to mount the socket.
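
To see why unrestricted socket access is dangerous, the sketch below runs inside a container that has /var/run/docker.sock mounted and drives the host's Docker engine through the docker Python SDK; with the restrictions bypassed, nothing stops such a process from pulling images or starting privileged containers. Illustration only.

import docker

client = docker.DockerClient(base_url="unix:///var/run/docker.sock")
# Any Docker engine command is now available to the contained process.
for container in client.containers.list():
    print(container.name, container.image.tags)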

Official announcement: For more details, please see the link –

https://nvd.nist.gov/vuln/detail/CVE-2025-10657

CVE-2025-55780: AI LLM developers should not underestimate this MuPDF design flaw! (29-09-2025)

Preface: LLMs are built on machine learning, specifically a type of neural network called a transformer model. How do LLMs read PDFs? The first step is to extract the text blocks from the PDF using pdfplumber. Each text block comes with its coordinates, which allows their spatial relationships to be analyzed. A “window” is then created around each text block to capture its surrounding context.
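
A minimal sketch of that first step with pdfplumber; the file name is hypothetical.

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for word in pdf.pages[0].extract_words():
        # Each word carries x0/x1/top/bottom coordinates, which is what
        # makes spatial “windowing” around a text block possible.
        print(word["text"], word["x0"], word["top"])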

Background: MuPDF is not widely known by consumers as a popular standalone application, but it is popular and growing in popularity among developers, particularly those working with Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, due to its powerful and lightweight nature.

Large Language Models (LLMs) do not directly “read” PDF files in their native binary format. Instead, they interact with the extracted content of the PDF. MuPDF, through its Python binding PyMuPDF (or its specialized variant PyMuPDF4LLM), plays a crucial role in this process by enabling efficient and accurate extraction of information from PDFs.
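
A minimal sketch of both extraction paths; the file name is hypothetical.

import fitz          # PyMuPDF
import pymupdf4llm   # LLM-oriented variant

# Plain-text extraction with PyMuPDF.
doc = fitz.open("paper.pdf")
text = "".join(page.get_text() for page in doc)

# PyMuPDF4LLM renders the same document as Markdown, which is easier for
# LLM/RAG pipelines to chunk and embed.
markdown = pymupdf4llm.to_markdown("paper.pdf")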

Vulnerability details: A null pointer dereference occurs in the function break_word_for_overflow_wrap() in MuPDF 1.26.4 when rendering a malformed EPUB document. Specifically, the function calls fz_html_split_flow() to split a FLOW_WORD node, but does not check if node->next is valid before accessing node->next->overflow_wrap, resulting in a crash if the split fails or returns a partial node chain.

Official announcement: For more details, see the link

https://nvd.nist.gov/vuln/detail/CVE-2025-55780

CVE-2025-23348 and CVE-2025-23349: About NVIDIA Megatron-LM (26-09-2025)

Preface: For years, OpenAI’s GPT series has been a dominant force, while NVIDIA’s Megatron-LM has provided a powerful framework for training these massive models.

NVIDIA Megatron-LM faces competition from several other frameworks, especially Microsoft DeepSpeed, Hugging Face Accelerate, JAX/Flax, and PyTorch Lightning.

Both PyTorch Lightning and NVIDIA Megatron-LM are built on top of the PyTorch library. PyTorch provides the fundamental tensor operations and deep learning primitives, while these frameworks add abstractions and tools for more efficient and scalable model development and training.

Background: The full GPT pre-training process:

A script such as pretrain_gpt[.]py orchestrates the following major steps to train the model from scratch on billions of parameters and terabytes of data. The process has four steps (a schematic sketch follows the list):

  1. Data preparation
  2. Distributed setup
  3. Core training loop
  4. Model saving and evaluation
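
A schematic, single-process outline of these four steps, shrunk to a toy PyTorch model. This is not Megatron-LM's actual pretrain_gpt code; every name and shape here is a placeholder.

import torch

# 1. Data preparation (random batches standing in for a tokenized corpus)
data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(10)]

# 2. Distributed setup would go here (torch.distributed.init_process_group
#    plus Megatron's tensor/pipeline-parallel groups); skipped in this
#    single-process sketch.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# 3. Core training loop
for x, y in data:
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# 4. Model saving and evaluation
torch.save(model.state_dict(), "checkpoint.pt")
print("final loss:", loss.item())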

The design objective of a script like orqa/unsupervised/nq.py is to prepare the GPT model for open-domain question answering (QA), a task that is not typically a part of standard, large-scale unsupervised pre-training. The script specifically uses the Natural Questions (NQ) dataset to enhance the model’s ability to retrieve information from a large corpus of documents and generate answers, all without the direct use of a labeled QA dataset for this step.

Vulnerability details:

CVE-2025-23348: NVIDIA Megatron-LM for all platforms contains a vulnerability in the pretrain_gpt script, where malicious data created by an attacker may cause a code injection issue. A successful exploit of this vulnerability may lead to code execution, escalation of privileges, information disclosure, and data tampering.

CVE-2025-23349: NVIDIA Megatron-LM for all platforms contains a vulnerability in the tasks/orqa/unsupervised/nq.py component, where an attacker may cause a code injection. A successful exploit of this vulnerability may lead to code execution, escalation of privileges, information disclosure, and data tampering.

Official announcement: Please refer to the link for more details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5698

Chypnosis on FPGAs – AMD is investigating whether specific devices and components are affected and plans to provide updates as new findings emerge. (22nd Sep 2025)

Preface: AMD uses FPGAs (Field-Programmable Gate Arrays) in High-Performance Computing (HPC) by offering accelerator cards and adaptive SoCs that allow users to program custom hardware for HPC workloads in fields like machine learning, data analytics, and scientific simulations.

AMD manufactures FPGA-based accelerator cards that enable users to program applications directly onto the FPGA, eliminating the lengthy card design process. These cards install as-is in servers, accelerating workloads in financial computing, machine learning, computational storage, and data analytics.

Background: The XADC is an integrated, on-chip block within certain AMD (formerly Xilinx) FPGAs that performs analog-to-digital conversion (ADC) and also includes on-chip sensors for voltage and temperature monitoring. The FPGA provides the programmable logic to process the digitized data from the XADC, use it for control, or access it through the FPGA’s interconnects like the Dynamic Reconfiguration Port (DRP) or JTAG interface.

Xilinx ADCs (XADCs), particularly flash ADCs, have disadvantages related to high power consumption, large physical size, and limited resolution due to the large number of comparators required for higher bit depth. Non-linearity can also introduce signal distortion and measurement errors, while the integration of ADCs directly into FPGAs may not be feasible for all applications due to the required external components.

Security Focus of an Academic Research Paper: Attacks on the Programmable Logic (PL) in AMD Artix™ 7 Series FPGA Devices.

Artix 7 FPGAs vs. Artix™ UltraScale+ – key differences at a glance:

The main difference is that Artix™ UltraScale+ FPGAs are a newer, higher-performance family built on a 16nm FinFET process, offering improved power efficiency, higher transceiver speeds, and more advanced features like enhanced DSP blocks and hardened memory, while the Artix 7 FPGAs are older devices built on a 28nm process. UltraScale+ also features ASIC-class clocking, supports faster memory interfaces like LPDDR4x and DDR4, and includes advanced security features.

Vulnerability details: The academic research paper introducing the new approach demonstrates the attack on the programmable logic (PL) in AMD Artix™ 7-Series FPGA devices. It shows that the on-chip XADC-based voltage monitor is too slow to detect and/or execute a tamper response to clear memory contents. Furthermore, the authors show that detection circuits developed to detect clock freezing are ineffective as well. In general, the attack can be applied to all ICs that lack effective tamper responses to clear sensitive data in the case of an undervoltage event.

Official announcement: Please see the link for details –

https://www.amd.com/en/resources/product-security/bulletin/amd-sb-8018.html

CVE-2025-23316 and CVE-2025-23268: About NVIDIA Triton Inference Server (18th Sep 2025)

Preface: AI deployment is accelerated by hardware advancements (especially GPUs), ML platforms and MLOps for automation, the use of pre-trained models via transfer learning, containerization and orchestration for scalability, cloud infrastructure providing on-demand resources, and industry collaborations and specialized data partners to streamline various stages of the AI lifecycle.

Background: NVIDIA Triton Inference Server is an open-source inference serving platform whose primary goal is to simplify and accelerate the deployment of AI models in production environments. It aims to provide a unified platform capable of serving models from various machine learning frameworks, such as TensorFlow, PyTorch, ONNX Runtime, and custom backends, enabling flexibility and interoperability.

The “model name” parameter in NVIDIA Triton Inference Server is a crucial identifier used to specify which model a client wishes to interact with for inference requests.

Client API Usage: When using Triton client libraries (e.g., tritonclient[.]grpc or tritonclient[.]http), the model_name parameter is typically a required argument in functions used to send inference requests.
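
A minimal sketch using the HTTP client; the server URL, model name, and input name/shape are hypothetical and depend on the deployed model repository.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.zeros((1, 4), dtype=np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

# model_name selects which model in the repository serves this request.
result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))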

Both backends (Python and DALI) are part of Triton’s modular architecture. The Python backend often acts as a wrapper or orchestrator for other backends, including DALI.

Vulnerability details:

CVE-2025-23316: NVIDIA Triton Inference Server for Windows and Linux contains a vulnerability in the Python backend, where an attacker could cause a remote code execution by manipulating the model name parameter in the model control APIs. A successful exploit of this vulnerability might lead to remote code execution, denial of service, information disclosure, and data tampering.

CVE-2025-23268: NVIDIA Triton Inference Server contains a vulnerability in the DALI backend where an attacker may cause an improper input validation issue. A successful exploit of this vulnerability may lead to code execution.

Official announcement: Please see the link for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5691

Point of view – NVIDIA’s NVDebug tool, about CVE-2025-23342 and CVE-2025-23343 (12th Sep 2025)

Preface: Debug logs may contain user IDs and passwords to provide diagnostic information for failed login attempts, authentication failures, or to trace user activity within an application, but this is a significant security risk and should be avoided. Security best practices dictate that sensitive information like passwords should never be logged in cleartext. Instead, logging should only include non-sensitive user identifiers to help with troubleshooting without exposing credentials.
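
A minimal sketch of that practice with Python's standard logging module; the logger name and function are illustrative.

import logging

logger = logging.getLogger("auth")

def log_failed_login(user_id: str) -> None:
    # Record only the non-sensitive identifier; never the submitted password.
    logger.warning("Failed login attempt for user_id=%s", user_id)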

Background: NVIDIA’s NVDebug tool is part of the broader Nsight Systems tool suite and relies on the NVIDIA Data Center GPU Manager (DCGM) library, specifically utilizing it for data collection and diagnostics to assist in troubleshooting and monitoring NVIDIA GPUs.

  • NVDebug is a tool for debugging and profiling NVIDIA GPUs, particularly in data center environments.
  • DCGM is a library for managing and monitoring NVIDIA GPUs in clusters and data centers.

NVDebug uses the DCGM library to gather essential diagnostic data, logs, and health information from the GPUs, enabling detailed analysis of the system’s state and performance.

Vulnerability details:

CVE-2025-23342: The NVIDIA NVDebug tool contains a vulnerability that may allow an actor to gain access to a privileged account. A successful exploit of this vulnerability may lead to code execution, denial of service, escalation of privileges, information disclosure, and data tampering.

CVE-2025-23343: The NVIDIA NVDebug tool contains a vulnerability that may allow an actor to write files to restricted components. A successful exploit of this vulnerability may lead to information disclosure, denial of service, and data tampering.

Official announcement: Please see the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5696

The incorrect authorization described in CVE-2025-23256 may be triggered or facilitated by the underlying flaw CVE-2025-38456. (11th Sep 2025)

Preface:

  • IPMI is a standardized interface for hardware management, operating via the Baseboard Management Controller (BMC).
  • It supports both in-band (local) and out-of-band (remote) access.
  • BlueField’s reliance on OpenIPMI and IPMItool makes it susceptible to kernel-level vulnerabilities.

Background: The Intelligent Platform Management Interface (IPMI) is a standard interface for hardware management, used by system administrators to control devices and monitor sensors. This requires an IPMI controller, the Baseboard Management Controller (BMC), and manager software (for example, IPMItool). IPMI provides an interface to manage these functions on a local (in-band) or remote (out-of-band) system.
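
As a small illustration of in-band management, IPMItool can be driven from a script; the ipmitool binary and access to the local BMC device are assumed. For out-of-band access, ipmitool instead takes -I lanplus -H <bmc-host> -U <user>.

import subprocess

# List sensor readings through the local (in-band) BMC interface.
result = subprocess.run(["ipmitool", "sensor", "list"],
                        capture_output=True, text=True, check=True)
print(result.stdout)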

Vulnerability details:

This advisory explores a potential causal relationship between two recent vulnerabilities:

  • CVE-2025-23256 – A high-severity vulnerability in the NVIDIA BlueField DPU management interface, allowing local attackers to bypass authorization and modify configurations.

https://nvidia.custhelp.com/app/answers/detail/a_id/5655

  • CVE-2025-38456 – A moderate-severity vulnerability in the Linux IPMI subsystem, involving memory corruption due to mishandled pointers in ipmi_create_user().

https://nvd.nist.gov/vuln/detail/CVE-2025-38456

Recommendations

  1. Patch Kernel IPMI Subsystem: Ensure CVE-2025-38456 is mitigated in all systems running BlueField.
  2. Update BlueField Firmware: Apply NVIDIA’s latest firmware updates addressing CVE-2025-23256.
  3. Audit IPMI Access Controls: Review and restrict local access to /dev/ipmi0 and IPMItool (a quick permission check is sketched below).
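
A quick audit of who can reach the local IPMI device, supporting recommendation 3. The device path and the fields printed are illustrative.

import grp
import os
import pwd
import stat

st = os.stat("/dev/ipmi0")
print("mode :", stat.filemode(st.st_mode))
print("owner:", pwd.getpwuid(st.st_uid).pw_name)
print("group:", grp.getgrgid(st.st_gid).gr_name)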

CVE-2025-23257 and CVE-2025-23258: About NVIDIA DOCA (4th Sep 2025)

Preface: An NVIDIA endless “collect-export” loop refers to the standard, continuous operation of the DOCA Telemetry Service (DTS), where telemetry data is perpetually collected and then exported. While high-frequency telemetry (HFT) offers an external, triggered alternative, the standard DTS flow is designed to run indefinitely, collecting data from the Sysfs provider and potentially exporting it via Prometheus or Fluent Bit.

Background: CUDA (Compute Unified Device Architecture) and DOCA (Data Center Infrastructure-on-a-Chip Architecture) are both NVIDIA SDKs, but they serve distinct purposes and target different hardware.

CUDA SDK: Primarily designed for general-purpose computing on NVIDIA GPUs. It enables developers to program accelerated computing applications by leveraging the parallel processing power of GPUs.

DOCA SDK: Built specifically for NVIDIA BlueField Data Processing Units (DPUs) and SuperNICs, aiming to accelerate data center infrastructure tasks. It enables offloading infrastructure-related workloads from the host CPU to the DPU.

DOCA Telemetry Service (DTS) is a DOCA Service for collecting and exporting telemetry data. It can run on hosts and BlueField, collecting data from built-in providers and external telemetry applications. The service supports various providers, including sysfs, ethtool, ifconfig, PPCC, DCGM, NVIDIA SMI, and more.

Ref: The binary data can be read using the /opt/mellanox/collectx/bin/clx_read app, packaged in collectx-clxapidev, a DOCA dependency package.

Vulnerability details:

CVE-2025-23257: NVIDIA DOCA contains a vulnerability in the collectx-clxapidev Debian package that could allow an actor with low privileges to escalate privileges. A successful exploit of this vulnerability might lead to escalation of privileges.

CVE-2025-23258: NVIDIA DOCA contains a vulnerability in the collectx-dpeserver Debian package for arm64 that could allow an attacker with low privileges to escalate privileges. A successful exploit of this vulnerability might lead to escalation of privileges.

Official announcement: Please see the link for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5655

CVE-2025-23307: NVIDIA NeMo Curator for all platforms contains a vulnerability (28th Aug 2025)

Preface: NeMo Curator, part of the NVIDIA NeMo software suite for managing the AI agent lifecycle, is a Python library specifically designed for fast and scalable data processing and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT).

Background: To install the NeMo Curator library, run the following commands:

  • git clone https://github[.]com/NVIDIA/NeMo-Curator[.]git
  • cd NeMo-Curator
  • pip install --extra-index-url https://pypi[.]nvidia[.]com ".[cuda12x]"

Data download: The downloading pipeline in NeMo Curator consists of the following classes (a minimal sketch follows the list):

  • DocumentDownloader: Abstract class for downloading remote data to disk.
  • DocumentIterator: Abstract class for reading dataset raw records from the disk.
  • DocumentExtractor: Abstract class for extracting text records, as well as any relevant metadata from the records on the disk.
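
A minimal, self-contained sketch of the three roles above. These are stand-ins mirroring the documented responsibilities, not NeMo Curator's actual class definitions.

from abc import ABC, abstractmethod

class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url: str) -> str:
        """Fetch a remote resource and return the local file path."""

class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path: str):
        """Yield raw records read from the file on disk."""

class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, record) -> dict:
        """Return the text and relevant metadata for one raw record."""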

Vulnerability details: NVIDIA NeMo Curator for all platforms contains a vulnerability where a malicious file created by an attacker could allow code injection. A successful exploit of this vulnerability might lead to code execution, escalation of privileges, information disclosure, and data tampering.

Ref: The vulnerability arises when malicious files—such as JSONL files—are loaded by NeMo Curator. If these files are crafted to exploit weaknesses in how NeMo Curator parses or processes them, they can inject executable code. Two attack patterns stand out (a parsing sketch follows the list):

  • Embedded malicious payloads in JSONL files.
  • JSON injection attacks exploiting parsing logic.
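
The following is a hypothetical illustration of this bug class, not NeMo Curator's actual parsing code. json.loads only builds data structures; the injection risk appears when a loader evaluates file content instead, e.g. with eval():

import json

def load_jsonl(path: str) -> list:
    records = []
    with open(path) as f:
        for line in f:
            records.append(json.loads(line))  # safe: parses data, never executes it
            # records.append(eval(line))      # unsafe: a crafted line runs as code
    return records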

Official announcement: Please see the link for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5690