Category Archives: AI and ML

Nvidia security focus – Rowhammer attack potential risk – July 2025 (11th July 2025)

Preface: The Rowhammer effect, a hardware vulnerability in DRAM chips, was first publicly presented and analyzed in June 2014 at the International Symposium on Computer Architecture (ISCA). This research, conducted by Yoongu Kim et al., demonstrated that repeatedly accessing a specific row in a DRAM chip can cause bit flips in nearby rows, potentially leading to security breaches.

Background: Nvidia has shifted from “copy on flip” to asynchronous copy mechanisms in their GPU architecture, particularly with the Ampere architecture and later. This change allows for more efficient handling of data transfers between memory and the GPU, reducing latency and improving overall performance, especially in scenarios with high frame rates or complex computations.

When System-Level ECC is enabled, it prevents attackers from successfully executing Rowhammer attacks by ensuring memory integrity. The memory controller detects and corrects bit flips, making it nearly impossible for an attacker to exploit them for privilege escalation or data corruption.

Technical details: Modern DRAM devices, including the ones used by NVIDIA, are potentially susceptible to Rowhammer. The now decade-old Rowhammer problem has been well known for CPU memories (e.g., DDR, LPDDR). Recently, researchers at the University of Toronto demonstrated a successful Rowhammer exploitation on an NVIDIA A6000 GPU with GDDR6 memory where System-Level ECC was not enabled. In the same paper, the researchers showed that enabling System-Level ECC mitigates the Rowhammer problem.
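
Before treating System-Level ECC as a Rowhammer mitigation, it is worth verifying that it is actually enabled on the GPU in question. A minimal sketch, assuming the standard nvidia-smi CLI is on PATH and the driver exposes ECC for that GPU:

```python
import subprocess

def ecc_mode_enabled(gpu_index: int = 0) -> bool:
    """Query the current ECC mode of one GPU via nvidia-smi.

    Uses the documented 'ecc.mode.current' query attribute; raises
    if nvidia-smi is not installed or the GPU index is invalid.
    """
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=ecc.mode.current", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().lower() == "enabled"

if __name__ == "__main__":
    print("System-Level ECC enabled:", ecc_mode_enabled(0))
    # Enabling ECC requires admin rights and takes effect after a
    # GPU reset or reboot:  nvidia-smi -i 0 -e 1
```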

Official announcement: For technical details, see the link – https://nvidia.custhelp.com/app/answers/detail/a_id/5671

AMD releases details about Transient Scheduler Attack (TSA) – 9 Jul 2025

Preface: CPU transient instructions refer to instructions that are speculatively executed by a processor’s out-of-order execution engine, but which may ultimately be discarded and not reflected in the processor’s architectural state. These instructions are executed based on predictions about control flow or data dependencies, and if the prediction is incorrect, the results of these transient instructions are discarded.

Background: Transient Scheduler Attacks (TSA) are new speculative side channel attacks related to the execution timing of instructions under specific microarchitectural conditions. In some cases, an attacker may be able to use this timing information to infer data from other contexts, resulting in information leakage.
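
TSA itself exploits CPU-internal execution timing, but the underlying principle, inferring secrets purely from how long an operation takes, can be illustrated with a deliberately simplified and unrelated example: a non-constant-time comparison whose running time reveals how many leading bytes of a guess are correct. This is a generic timing side channel, not the TSA mechanism itself:

```python
import time

SECRET = b"hunter2!"

def insecure_compare(guess: bytes) -> bool:
    # Early exit makes run time proportional to the number of
    # leading bytes that match -- the timing side channel.
    for a, b in zip(SECRET, guess):
        if a != b:
            return False
        time.sleep(0.001)  # exaggerate the per-byte cost for the demo
    return len(guess) == len(SECRET)

def time_guess(guess: bytes) -> float:
    t0 = time.perf_counter()
    insecure_compare(guess)
    return time.perf_counter() - t0

# The attacker never sees SECRET, only timings: a guess sharing more
# leading bytes with the secret measurably takes longer to reject.
print(time_guess(b"zzzzzzzz"))  # fast: first byte already differs
print(time_guess(b"huntzzzz"))  # slower: four bytes match first
```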

Vulnerability details:

CVE-2024-36350 – A transient execution vulnerability in some AMD processors may allow an attacker to infer data from previous stores, potentially resulting in the leakage of privileged information.

CVE-2024-36357 – A transient execution vulnerability in some AMD processors may allow an attacker to infer data in the L1D cache, potentially resulting in the leakage of sensitive information across privileged boundaries.

CVE-2024-36348 – A transient execution vulnerability in some AMD processors may allow a user process to infer the control registers speculatively even if the UMIP (User-Mode Instruction Prevention) feature is enabled, potentially resulting in information leakage.

CVE-2024-36349 – A transient execution vulnerability in some AMD processors may allow a user process to infer TSC_AUX even when such a read is disabled, potentially resulting in information leakage.

Official announcement: Please see the link for details –

https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7029.html

CVE-2025-38085: About hugetlb.c in the Linux kernel. (29-06-2025)

Preface: Does Big Data use the TLB in the Linux kernel?

Yes, big data applications in Linux utilize the Translation Lookaside Buffer (TLB) as a crucial component of memory management. The TLB speeds up address translation by caching recently used virtual-to-physical address mappings. Applications like databases, which often handle large datasets and have specific memory access patterns, can benefit from the TLB’s ability to reduce the overhead of accessing physical memory.

Background: The Linux kernel’s mm/hugetlb directory contains the code for Huge TLB (Translation Lookaside Buffer) support. This feature allows the kernel to use larger page sizes (like 2MB or 1GB instead of the usual 4KB) for memory management, potentially improving performance by reducing TLB misses.
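
On a Linux host you can see whether huge pages are configured by reading /proc/meminfo. A minimal sketch (these field names are standard; the counts and page size vary per system):

```python
# Read the kernel's huge-page configuration from /proc/meminfo.
fields = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(fields):
            print(line.strip())
# Typical output on a default x86-64 box:
#   HugePages_Total:       0
#   HugePages_Free:        0
#   Hugepagesize:       2048 kB
```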

Ref: syscalls are part of the operating system kernel and provide an interface for user space programs to request services from the kernel. User space refers to the memory area where applications run, while kernel space is where the operating system’s core and privileged operations reside.

Vulnerability details: In the Linux kernel, the following vulnerability has been resolved: “mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race”. huge_pmd_unshare() drops a reference on a page table that may have previously been shared across processes, potentially turning it into a normal page table used in another process in which unrelated VMAs can afterwards be installed. If this happens in the middle of a concurrent gup_fast(), gup_fast() could end up walking the page tables of another process. While I don’t see any way in which that immediately leads to kernel memory corruption, it is really weird and unexpected. Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one(), just like we do in khugepaged when removing page tables for a THP collapse.

Official announcement: Please see the link for details – https://nvd.nist.gov/vuln/detail/CVE-2025-38085

My comment: If your system is running a stable, older Linux kernel that predates the tlb_remove_table_sync_one() addition, the kernel will not call tlb_remove_table_sync_one() because the function does not exist in that version. The patched kernel enforces stricter synchronization, which could affect performance or expose latent bugs. Weigh these trade-offs when deciding whether to patch or remain unchanged.
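
A quick way to check whether your running kernel already carries this helper is to look for the symbol in /proc/kallsyms. A sketch, with caveats: symbol names are readable by unprivileged users even though the addresses are zeroed, and a miss can also mean the kernel was built without kallsyms rather than without the fix:

```python
SYMBOL = "tlb_remove_table_sync_one"

def kernel_has_symbol(name: str) -> bool:
    # /proc/kallsyms lines look like: "<addr> <type> <symbol> [module]"
    with open("/proc/kallsyms") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3 and parts[2] == name:
                return True
    return False

print(f"{SYMBOL} present:", kernel_has_symbol(SYMBOL))
```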

CVE-2025-23260: About NVIDIA AIStore on Kubernetes (26-06-2025)

Preface: AI and machine learning workloads rely on optimized object storage to handle the massive, unstructured datasets needed for training and operation. NVIDIA AIStore (AIS) aims to overcome the limitations of traditional filesystems in handling large AI datasets by providing a distributed storage system that can handle the demands of modern AI models.

Background: An AIStore (AIS) target node primarily stores and manages user data, object replicas, and erasure-coded slices. It also handles bucket metadata and other persistent data structures. Essentially, it acts as a storage server within an AIS cluster.

To set up a service account for NVIDIA AIStore running inside Kubernetes, especially for storage services, you’ll typically follow these steps (a minimal sketch follows the list):

(1) The AIS Operator manages the lifecycle of AIStore clusters, including storage provisioning and access control.

(2) Create a Kubernetes Service Account.

(3) Bind Roles to the Service Account.

(4) Configure AIStore to Use the Service Account.

(5) Ensure Persistent Volumes Are Set Up.
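
As referenced above, here is a minimal sketch of steps (2) and (3) using the official kubernetes Python client. The namespace, account, and role names are illustrative, and the rule deliberately grants a narrow, namespaced Role rather than a ClusterRole, which is exactly the kind of over-broad binding CVE-2025-23260 turns into an escalation path:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

NAMESPACE = "ais"          # illustrative names; adjust to your cluster
SA_NAME = "aistore-sa"
ROLE_NAME = "aistore-storage-role"

# (2) Create the ServiceAccount.
core = client.CoreV1Api()
core.create_namespaced_service_account(
    namespace=NAMESPACE,
    body=client.V1ServiceAccount(
        metadata=client.V1ObjectMeta(name=SA_NAME)),
)

# (3) Bind a *namespaced* Role with only the verbs storage needs;
# avoid attaching a broad ClusterRole to this ServiceAccount.
rbac = client.RbacAuthorizationV1Api()
rbac.create_namespaced_role(
    namespace=NAMESPACE,
    body=client.V1Role(
        metadata=client.V1ObjectMeta(name=ROLE_NAME),
        rules=[client.V1PolicyRule(
            api_groups=[""],
            resources=["persistentvolumeclaims", "configmaps"],
            verbs=["get", "list", "watch"])],
    ),
)
rbac.create_namespaced_role_binding(
    namespace=NAMESPACE,
    body=client.V1RoleBinding(
        metadata=client.V1ObjectMeta(name=f"{ROLE_NAME}-binding"),
        # Note: older python clients name this class V1Subject.
        subjects=[client.RbacV1Subject(
            kind="ServiceAccount", name=SA_NAME, namespace=NAMESPACE)],
        role_ref=client.V1RoleRef(
            api_group="rbac.authorization.k8s.io",
            kind="Role", name=ROLE_NAME),
    ),
)
```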

Vulnerability details: NVIDIA AIStore contains a vulnerability in the AIS Operator where a user may gain elevated k8s cluster access by using the ServiceAccount attached to the ClusterRole. A successful exploit of this vulnerability may lead to information disclosure.

Official announcement: Please see the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5660

CVE-2025-23264 and CVE-2025-23265: About NVIDIA Megatron-LM (25-06-2025)

Preface: What Does “Linear” Mean in Machine Learning? In the context of machine learning and neural networks:

A linear function is one where the relationship between inputs and outputs can be represented as a straight line (in 2D), or more generally, a hyperplane in higher dimensions.
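
A minimal numeric illustration: a “linear” layer (strictly, an affine one) computes y = Wx + b, which traces out a hyperplane in the input space:

```python
import numpy as np

# A linear (affine) map from 3 inputs to 2 outputs: y = Wx + b.
W = np.array([[0.5, -1.0, 2.0],
              [1.5,  0.0, 0.3]])
b = np.array([0.1, -0.2])

x = np.array([1.0, 2.0, 3.0])
y = W @ x + b
print(y)  # [4.6  2.2]
```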

Background: NVIDIA Megatron-LM is an open-source framework designed for training large transformer models, particularly those with billions of parameters, across distributed GPU architectures. It leverages techniques like tensor and pipeline parallelism to enable efficient training of these massive models.

* Pipeline parallelism is when different stages of a process are executed on separate devices simultaneously. For instance, in the context of machine learning, various layers of a model can be distributed across different devices to create a pipeline.
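
A toy sketch of the idea, using two worker threads as stand-ins for two devices: while “device 1” runs the second half of the model on micro-batch n, “device 0” is already running the first half on micro-batch n+1. Real frameworks such as Megatron-LM do this across GPUs, not threads; this is only a conceptual illustration:

```python
import threading, queue

def layer1(x): return x * 2          # toy "first half" of the model
def layer2(x): return x + 1          # toy "second half"

q01 = queue.Queue()                  # channel between stage 0 and stage 1
results = queue.Queue()

def stage0(batches):
    for x in batches:
        q01.put(layer1(x))           # "device 0" moves on to the next batch
    q01.put(None)                    # end-of-stream marker

def stage1():
    while (x := q01.get()) is not None:
        results.put(layer2(x))       # "device 1" overlaps with stage 0

t0 = threading.Thread(target=stage0, args=([1, 2, 3, 4],))
t1 = threading.Thread(target=stage1)
t0.start(); t1.start(); t0.join(); t1.join()
print([results.get() for _ in range(4)])  # [3, 5, 7, 9]
```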

Vulnerability details:

CVE-2025-23264: NVIDIA Megatron-LM for all platforms contains a vulnerability in a python component where an attacker may cause a code injection issue by providing a malicious file. A successful exploit of this vulnerability may lead to Code Execution, Escalation of Privileges, Information Disclosure and Data Tampering.

CVE-2025-23265: NVIDIA Megatron-LM for all platforms contains a vulnerability in a python component where an attacker may cause a code injection issue by providing a malicious file. A successful exploit of this vulnerability may lead to Code Execution, Escalation of Privileges, Information Disclosure and Data Tampering.
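
NVIDIA’s bulletin does not spell out the affected component, but the vulnerability class is easy to demonstrate generically: Python deserialization formats such as pickle execute code while loading, so “providing a malicious file” is enough for code injection. A deliberately harmless illustration (the os.system payload is the attacker-controlled part; this is not the actual Megatron-LM code path):

```python
import pickle, os

class Malicious:
    # pickle calls __reduce__ during loading, so whatever it
    # returns is *executed*, not merely stored.
    def __reduce__(self):
        return (os.system, ("echo code ran during load",))

blob = pickle.dumps(Malicious())     # the "malicious file" contents
pickle.loads(blob)                   # victim loads it -> command runs

# Takeaway: never pickle.load / torch.load files from untrusted
# sources; prefer formats that carry data only (e.g., safetensors).
```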

Official announcement: Please see the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5663

AMD Fixed CVE-2024-21969 (23rd June 2025)

CVE-2024-21969: Whispering Pixels: Exploiting Uninitialized Register Accesses in Modern GPUs.

Preface: How to Enable Secure GPU Mode (Register Clearing)

This mode is supported on the following AMD GPUs:

  • Radeon RX 5000, 6000, 7000, 9000 series
  • Radeon PRO W5000, W6000, W7000 series
  • Radeon AI PRO 9000 series
  • Radeon VII, RX Vega
  • Instinct MI210, MI250, MI300X, etc.

Background: The proliferation of graphics processing units (GPUs) has brought unprecedented computing power. Researchers have found multiple register-based vulnerabilities across different GPU implementations, dubbed “whispering pixels”. The vulnerability poses unique challenges to an adversary due to the opaque scheduling and register remapping algorithms present in GPU firmware, complicating the reconstruction of leaked data.

GPU Programming: An application has to use vendor-provided libraries in order to translate a shader from its high-level source code to an architecture-dependent binary code. Vendors provide these libraries for a variety of high-level languages.

Vulnerability details: Improper clearing of GPU registers could allow a malicious shader to read left-over pixel data leading to loss of confidentiality.
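
A host-side analogy of the flaw (plain Python, not GPU code): a shared “register file” that is not cleared between users lets the second user read whatever the first left behind:

```python
# Toy model: one register file shared by consecutive "shaders".
registers = [0.0] * 8

def shader_a(regs):
    regs[0:4] = [0.9, 0.2, 0.7, 0.1]   # writes pixel data, never clears it

def shader_b(regs):
    return regs[0:4]                    # reads registers it never wrote

shader_a(registers)
print("leaked:", shader_b(registers))   # leaked: [0.9, 0.2, 0.7, 0.1]

# The fix AMD describes: clear registers between processes.
def context_switch(regs):
    regs[:] = [0.0] * len(regs)
```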

Mitigation (13th Aug 2024): AMD plans to create a new operating mode designed to prevent processes from running in parallel on the GPU, and to clear registers between processes on supported products.

Last Updated Date (23-06-2025): AMD has created a new operating mode designed to prevent processes from running in parallel on the GPU, and to clear registers between processes on supported products.  This mode is not enabled by default and needs to be set by an administrator. AMD expects performance impacts if the new mode is enabled in environments where multiple processes would have been running simultaneously on the GPU.  The performance impact will be related to the number of processes that would have been running in parallel.  Additionally, a lesser performance impact may arise due to the additional clearing of registers between processes.

Instructions for enabling the new mode can be found in the relevant release notes and/or product documentation.

AMD started rolling out mitigation options beginning in May 2024 through applicable driver updates.

Official announcement: Please refer to the website for details – https://www.amd.com/en/resources/product-security/bulletin/amd-sb-6013.html

CVE-2025-23252 – NVIDIA has released a software update for the NVIDIA® NVDebug tool to address the security issue. (18-06-2025)

Preface: The NVDebug tool, used for NVIDIA GPU debugging, relies on the NVIDIA Data Center GPU Manager (DCGM) library. Specifically, it utilizes DCGM version 2.2.x or later. DCGM is a suite of tools for managing and monitoring NVIDIA GPUs in data center and cluster environments.

Background: The NVDebug tool is part of the NVIDIA Nsight Systems and Nsight Graphics development tools. These tools are designed for debugging and profiling GPU-accelerated applications, including those using CUDA and other graphics APIs. It’s useful for debugging both CPU and GPU code, especially for CUDA applications.

Nsight Systems can collect logs for both Nsight Compute and Nsight Graphics. Nsight Systems is a system-wide performance analysis tool, while Nsight Compute focuses on kernel-level profiling and Nsight Graphics specializes in graphics application debugging and profiling. Nsight Systems can gather data that is relevant to both, and the collected data can be analyzed within the respective tools.

Vulnerability details: The NVIDIA NVDebug tool contains a vulnerability that may allow an actor to gain access to restricted components. A successful exploit of this vulnerability may lead to information disclosure.

Ref: CVE-2021-34398: NVIDIA DCGM, all versions prior to 2.2.9, contains a vulnerability in the DIAG module where any user can inject shared libraries into the DCGM server, which is usually running as root, which may lead to privilege escalation, total loss of confidentiality and integrity, and complete denial of service. (Public on May 29, 2025)

Point of view: If your system uses NVDebug, it’s very likely that it also includes or interacts with a version of the DCGM library, and it could therefore be affected by vulnerabilities in DCGM versions prior to 2.2.9.
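
To check whether a host carries a DCGM build older than the fixed 2.2.9, you can parse the version the dcgmi CLI reports. A sketch, assuming dcgmi is installed and that its --version output contains a dotted version string (the exact output format varies between DCGM releases):

```python
import re, subprocess

FIXED = (2, 2, 9)

# Raises FileNotFoundError if dcgmi is not installed on this host.
out = subprocess.run(["dcgmi", "--version"],
                     capture_output=True, text=True).stdout
match = re.search(r"(\d+)\.(\d+)\.(\d+)", out)
if match:
    ver = tuple(int(g) for g in match.groups())
    print("DCGM", ".".join(map(str, ver)),
          "vulnerable (CVE-2021-34398)" if ver < FIXED else "patched")
else:
    print("could not parse dcgmi output:", out)
```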

Official announcement: Please refer to the supplier announcement –

https://nvidia.custhelp.com/app/answers/detail/a_id/5651

CVE-2025-36852: “first-to-cache wins” jeopardizes Bucket-based object storage! (12th June 2025)

Preface: Cloud service providers rely on caching for performance and efficiency: frequently accessed data is retrieved quickly without touching the slower underlying storage layer. “First-to-cache wins” describes the write side of such systems: the first value stored under a given key is the one served to every subsequent request for that key. This reduces latency, enhances user experience, and can also lower the costs associated with accessing and processing data.

Background: In general hardware caching, the “winning” location is CPU-embedded memory, specifically the L1 and L2 caches located directly on the processor chip; the L1 cache is the fastest and smallest, the L2 cache slower but larger, and both are much faster than physical memory (RAM). In the context of this CVE, however, the cache in question is a shared, bucket-based remote build cache: the first artifact written under a given cache key is the one every later build receives.

Do well-known cloud service providers implement the concept of “first-to-cache wins”?

Example 1: S3 is designed to store and manage data objects (files, images, etc.) in the cloud. While S3 itself doesn’t have a “first-to-cache wins” mechanism, it can be used as a layer of caching, especially when combined with other caching services like Amazon ElastiCache for Redis.

Example 2: Google Cloud Storage does not operate on a strict “first-to-cache wins” theory. Instead, it uses a more nuanced caching system that considers factors like the Cache-Control metadata, object type, and the location of the read operation.

Is the Vulnerability Valid?

Yes, it’s valid and serious in the context of build systems that:

- Use remote caching with object storage.

- Allow untrusted contributors (e.g., via pull requests).

- Do not enforce cache validation or isolation between trusted and untrusted environments.

Vulnerability details: A critical security vulnerability exists in remote cache extensions for common build systems utilizing bucket-based remote cache (such as those using Amazon S3, Google Cloud Storage, or similar object storage) that allows any contributor with pull request privileges to inject compromised artifacts from an untrusted environment into trusted production environments without detection. The vulnerability exploits a fundamental design flaw in the “first-to-cache wins” principle, where artifacts built in untrusted environments (feature branches, pull requests) can poison the cache used by trusted environments (protected branches, production deployments). This attack bypasses all traditional security measures including encryption, access controls, and checksum validation because the poisoning occurs during the artifact construction phase, before any security measures are applied.
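
The flaw is easy to model: if the cache key is derived only from the build inputs, an untrusted branch that builds first gets to decide what a later trusted build receives. A toy simulation (all names are illustrative):

```python
import hashlib

remote_cache = {}  # stands in for the shared S3/GCS bucket

def cache_key(source: str) -> str:
    # Key depends only on build inputs, not on who built them.
    return hashlib.sha256(source.encode()).hexdigest()

def build(source: str, artifact: str) -> str:
    key = cache_key(source)
    if key in remote_cache:          # cache hit: artifact reused blindly
        return remote_cache[key]
    remote_cache[key] = artifact     # first-to-cache wins
    return artifact

# 1. The attacker's pull-request pipeline builds the same sources
#    first and uploads a tampered artifact under the legitimate key.
build("app-v1 sources", "backdoored-binary")

# 2. The protected-branch pipeline later builds identical sources,
#    hits the poisoned entry, and ships the attacker's artifact.
print(build("app-v1 sources", "clean-binary"))  # -> backdoored-binary

# Mitigation direction: include the trust domain in the key, or only
# let trusted pipelines write to the shared cache.
```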

Official announcement: Please refer to the link for details –

https://www.tenable.com/cve/CVE-2025-36852

https://nx.app/files/cve-2025-06

CVE-2025-2884 – Design weakness in the Trusted Platform Module (TPM) 2.0 reference implementation code. (11th June 2025)

Preface: The main difference between AMD’s Trusted Platform Module (TPM) and those from other manufacturers is how it’s implemented: AMD offers a firmware TPM (fTPM), while many other manufacturers, including Intel, also offer a dedicated hardware TPM (dTPM).

Background: TPM refers to a Trusted Platform Module, a specialized chip that securely stores cryptographic keys used for encryption and decryption, enhancing overall system security. AMD’s approach often involves a firmware TPM (fTPM), the counterpart of Intel’s Platform Trust Technology (PTT); both implement TPM functionality within the system’s firmware rather than on a dedicated physical chip.

The AMD Ryzen Embedded 7000 series processors indeed integrate advanced security features, including:

  • AMD Secure Processor (ASP): A dedicated security co-processor embedded directly into the CPU die.
  • Firmware TPM (fTPM): Implemented in firmware and runs on the ASP.
  • Microsoft Pluton: A hardware-based security processor integrated into the silicon, designed to work alongside ASP and fTPM for enhanced protection.

Ref: The most common TPM today is the firmware TPM function provided through the Trusted Execution Environment (TEE) of an Intel Core or AMD Ryzen CPU and exposed via the motherboard’s UEFI firmware. fTPM is available on Intel processors from Broadwell (5th generation) onward and on AMD Ryzen processors. This is the most common approach because the TPM function can be used without purchasing a separate module.
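
On Linux you can check whether a TPM (discrete or firmware) is exposed, and which spec version it implements, by looking under /sys/class/tpm. A sketch; the tpm_version_major attribute is present on reasonably recent kernels:

```python
import os

TPM_DIR = "/sys/class/tpm"

if not os.path.isdir(TPM_DIR):
    print("no TPM exposed by the kernel")
else:
    for dev in sorted(os.listdir(TPM_DIR)):
        ver_file = os.path.join(TPM_DIR, dev, "tpm_version_major")
        if os.path.exists(ver_file):
            with open(ver_file) as f:
                print(dev, "implements TPM", f.read().strip() + ".0")
        else:
            print(dev, "present (spec version attribute not exposed)")
```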

Vulnerability details: An out-of-bounds read vulnerability exists in the TPM 2.0 reference implementation’s Module Library, allowing a read past the end of a buffer in a TPM 2.0 routine. An attacker who can successfully exploit this vulnerability can read sensitive data stored in the TPM and/or impact the availability of the TPM.

Official announcement: Please see the link for details – https://www.amd.com/en/resources/product-security/bulletin/amd-sb-4011.html

CVE-2025-0036: A potential vulnerability exists with the configuration of the SSS (Secure Stream Switch) – 5th Jun 2025

Preface: AMD’s Versal Adaptive SoCs are designed for high-performance computing, offering a blend of programmable logic, processing system, and AI engines, along with advanced memory and interfaces. They excel in cloud, network, and edge applications by combining heterogeneous compute with a wide range of hard IP. This architecture enables outstanding performance/watt and adapts to changing requirements, making them suitable for various applications like AI, data centers, and network acceleration.

Background: In Versal™ Adaptive SoC devices, the Platform Loader and Manager (PLM) implements runtime (post-boot) software services that allow a remote processor to command the PLM to execute cryptographic operations – including AES, SHA3, RSA, and ECDSA – on behalf of the remote processor. These operations require the Secure Stream Switch (SSS) to be configured such that the Direct Memory Access (DMA) hardware can send data to and read data from these cryptographic engines.

Ref: Some crypto engines (like AES, SHA3) are integrated into the PMC for secure boot and runtime services. Others may be instantiated in the PL for custom cryptographic acceleration.

Vulnerability details: A potential vulnerability exists with the configuration of the SSS because the PLM does not clear the SSS configuration after a cryptographic operation completes. This allows an improper SSS configuration when setting up the SSS for any following cryptographic command.
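
A conceptual sketch of the bug class (plain Python, not PLM firmware): a switch whose configuration survives from one cryptographic command into the next, versus one that is cleared after each operation:

```python
class StreamSwitch:
    """Toy stand-in for the SSS: routes DMA to one crypto engine."""
    def __init__(self):
        self.route = None

    def configure(self, engine: str):
        self.route = engine

    def clear(self):
        self.route = None

sss = StreamSwitch()

# Command 1 configures the switch for AES and never clears it.
sss.configure("AES")

# Command 2 assumes a clean switch; with no clear() in between it
# inherits the stale AES routing -- the improper configuration the
# advisory describes.
print("state seen by next command:", sss.route)  # AES, not None

# The fix direction: clear the configuration after every operation.
sss.clear()
```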

Official announcement: For more details, please refer to the following link – https://www.amd.com/en/resources/product-security/bulletin/amd-sb-8011.html