Category Archives: AI and ML

CVE-2025-33228: About NVIDIA Nsight Systems (23rd Jan 2026)

Preface: Nsight Systems is a tool for developers who need to understand the big picture of application execution on heterogeneous systems, especially in scenarios involving data transfer bottlenecks between the CPU and GPU or scaling across multiple nodes.

Background: In NVIDIA Nsight Systems, process_nsys_rep_cli[.]py is an internal Python script used primarily for post-processing and report generation from raw profiling data. 

While users typically interact with the nsys command-line tool, this script is invoked behind the scenes during the following operations.

Why This Might Connect to Nsight Systems?

Nsight Systems allows exporting [.]nsys-rep files and then processing them with scripts like process_nsys_rep_cli[.]py.

If the CLI or scripts read commands or code from user-provided files without validation, it could lead to:

-Command injection (similar to os[.]system()).

-Code execution (similar to exec()).

The design flaw could be that Nsight Systems assumes [.]nsys-rep or related files are safe, but if an attacker crafts a malicious file and you run the processing script, it could execute harmful commands.

Vulnerability details: CVE-2025-33228 NVIDIA Nsight Systems contains a vulnerability in the gfx_hotspot recipe, where an attacker could cause an OS command injection by supplying a malicious string to the process_nsys_rep_cli[.]py script if the script is invoked manually. A successful exploit of this vulnerability might lead to code execution, escalation of privileges, data tampering, denial of service, and information disclosure.

Official announcement: Please refer to the link for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5755

CVE-2025-33233 – About NVIDIA Merlin Transformers4Rec for all platform  22nd Jan 2026

Official Updated 01/20/2026

Preface: Data engineers perform seamless preprocessing, a foundational stage where they gather messy, raw data from diverse sources, clean it (handling missing values, outliers, inconsistencies), integrate disparate datasets, and transform it into a unified, structured format, making it ready and reliable for data scientists to perform advanced feature engineering (creating new, meaningful features) and ultimately build better machine learning models. This ensures a high-quality, consistent input, preventing “garbage in, garbage out” for the modeling phase.

Background: Transformers4Rec is pre-installed in the merlin-pytorch container available from the NVIDIA GPU Cloud (NGC) catalog. This container is part of the NVIDIA Merlin ecosystem and is specifically designed to support sequential and session-based recommendation tasks using PyTorch.

The workflow can show you where we speculated design weakness of CVE-2025-33233.

NVTabular for preprocessingPyTorch for trainingTriton for serving—means PyTorch is a critical component. If its loading function is insecure, Merlin’s container is exposed regardless of NVIDIA’s own code. The workflow can display the location of suspected design flaw CVE-2025-33233.

If Transformers4Rec internally uses torch.load (which is common for loading PyTorch models) and relies on weights_only=True for safety, then CVE-2025-32434 could be the root cause or at least a contributing factor.

NVIDIA might have classified it as a separate CVE because the exploit path involves their product’s integration with PyTorch, making it a product-level exposure rather than just a dependency issue.

Vulnerability details: CVE-2025-33233 NVIDIA Merlin Transformers4Rec for all platforms contains a vulnerability where an attacker could cause a code injection issue. A successful exploit of this vulnerability might lead to code execution, escalation of privileges, information disclosure, and data tampering.

Official announcement: Please refer to the following link for details-

https://nvidia.custhelp.com/app/answers/detail/a_id/5761

CVE-2026-21452: About MessagePack for Java (7th Jan 2026)

Preface: Aerospike is a specific, high-performance NoSQL database, and benchmarks generally show it to be significantly faster than many other clustered NoSQL solutions like Cassandra and MongoDB.

The term “NoSQL” refers to a broad category of databases with varying performance characteristics, so a direct comparison is more nuanced than a simple yes/no answer.

Aerospike uses MessagePack as its default, internal serialization format for Lists and Maps (Collection Data Types or CDTs); it is not an optional configuration you need to enable in the core database itself.

Background: MessagePack is a compact binary serialization format designed to be more memory-efficient than text-based formats like JSON. For the Java implementation, its memory requirements depend on whether you are using the standard heap-based process or advanced off-heap optimizations.

The MessagePack serialization process primarily utilizes JVM Virtual Memory, which encompasses several different pools:

JVM Heap Memory, Off-Heap / Native Memory and OS Page Cache.

About EXT32?

•         In binary serialization formats (like Mashpack), EXT32 is a type identifier (byte 0xDD) indicating a subsequent 32-bit binary block or extension.

•         It’s used for efficiency, compacting data better than text formats (JSON, XML) by representing data directly as bytes.

Serialized EXT32 objects can require more memory in the JVM heap, primarily due to how standard Java MessagePack libraries manage large payloads during deserialization. While the MessagePack format itself is compact, the serialization and deserialization process in Java introduces specific memory overheads for the EXT32 type:

Large Payload Buffering (Heap Exhaustion) EXT32 is designed for large extension data, supporting payloads up to 4 GiB in size.

Vulnerability details: A known issue in msgpack-java (prior to v0.9.11) was that the library would trust the declared length in the EXT32 header and immediately attempt to allocate a matching byte array on the JVM heap.

Impact: If an EXT32 object declares a massive size, it can trigger rapid heap exhaustion or an OutOfMemoryError before the data is even fully read.

Official announcement: Please refer to the link for details.

https://www.tenable.com/cve/CVE-2026-21452

About 3rd part design weakness impact Intel® Xeon® 6 Processors with P-cores with Intel® TDX Connect (29-12-2025)

Last revised: 12/09/2025

Preface: Intel’s Xeon 6 processors represent a fascinating shift in the landscape of data center computing, moving toward a hybrid architecture that optimizes for different workloads with specialized cores. The P-core version, codenamed Granite Rapids, built entirely of Performance-cores for heavy compute and AI workloads, is accurate and highlights a significant technological leap in server processing capabilities. This new generation aims to deliver unprecedented performance and efficiency to meet the increasing demands of modern data centers, which are grappling with massive data volumes and the computational intensity of artificial intelligence.

Background: Intel® TDX Connect is specifically highlighted as a key feature on Intel® Xeon® 6 Processors with P-cores (Performance-cores) to enable confidential computing for connected devices like GPUs. Intel’s P6 architecture, as part of modern high-speed systems using PCI Express (PCIe), relies on SERDES (Serializer/Deserializer) technology, especially for PCIe 3.0 and newer, to handle high data rates through serial links, though P6 itself refers to older processor generations, the concept of using SERDES for high-speed I/O like PCIe is fundamental, with newer Intel CPUs using advanced SerDes for PCIe 4.0, 5.0, and 6.0 to achieve massive bandwidth for AI and data centers.

Does the Intel P6 use PCIe SERDES?

Yes, Intel’s P6 architecture, as part of modern high-speed systems using PCI Express (PCIe), relies on SERDES (Serializer/Deserializer) technology, especially for PCIe 3.0 and newer, to handle high data rates through serial links, though P6 itself refers to older processor generations, the concept of using SERDES for high-speed I/O like PCIe is fundamental, with newer Intel CPUs using advanced SerDes for PCIe 4.0, 5.0, and 6.0 to achieve massive bandwidth for AI and data centers.

Vulnerability details: [CVE-2025-9612] Improper validation of integrity check value in PCI Port for some Intel® platforms with Integrity and Data Encryption (IDE) for PCIe Base Specification Revision 5 or higher within Ring 0: Bare Metal OS may allow an information disclosure and escalation of privilege. System software adversary with a privileged user combined with a high complexity attack may enable escalation of privilege. This result may potentially occur via local access when attack requirements are present without special internal knowledge and requires no user interaction. The potential vulnerability may impact the confidentiality (low), integrity (low) and availability (none) of the vulnerable system, resulting in subsequent system confidentiality (none), integrity (none) and availability (none) impacts.

Official announcement: Please refer to the link for details –

https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-01409.html

CVE-2025-33223: NVIDIA Isaac Launchable contains a vulnerability (29th Dec 2025)

Official Updated 12/22/2025 09:21 AM

Preface: The ability to launch NVIDIA Isaac Lab via NVIDIA Brev (Cloud) is fundamentally driven by the need to democratize access to high-performance robotics simulation and AI development environments, circumventing significant hardware and setup barriers. This collaboration between Isaac Lab and Brev offers a streamlined, low-friction pathway for developers and researchers to leverage powerful, preconfigured GPU resources in the cloud.

Background: Isaac Lab requires a compatible version of Isaac Sim to run. An “Isaac Lab Launchable” is an installation option, such as via NVIDIA Brev (Cloud), to quickly get the environment running. The Launchable provides the correct Isaac Sim/Python setup, but you still use env_config[.]yaml within your scripts to define what runs on that platform.

In essence, Issac Lab use env_config[.]yaml to specify tasks (like Isaac-Ant-v0) within your Python training scripts (e.g., train[.]py)The environment command

isaaclab/scripts/reinforcement_learning/skrl/train[.]py –task=Isaac-Ant-v0 specifically targets the Isaac-Ant-v0 task. If train[.]py or related scripts dynamically construct shell commands from these inputs without validation, that’s a classic command injection risk.

Vulnerability details: CVE-2025-33223 – NVIDIA Isaac Launchable contains a vulnerability where an attacker could cause an execution with unnecessary privileges. A successful exploit of this vulnerability might lead to code execution, escalation of privileges, denial of service, information disclosure and data tampering.

Official announcement: Please refer to the link for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5749

CVE-2025-33226: NVIDIA NeMo Framework for all platforms contains a vulnerability (24th Dec 2025)

Official Updated 12/12/2025 02:17 PM

Preface: NVIDIA NeMo is a versatile, end-to-end platform used across a vast spectrum of industries, primarily to build, customize, and deploy generative AI agents and applications, such as custom chatbots and specialized assistants. Its primary users are companies in the computer software industry, but its use cases span many different sectors, including energy, telecommunications, financial services, healthcare, and retail.

NeMo is the framework for building models, while NIM (NVIDIA Inference Microservices) provides pre-packaged tools to easily deploy and manage those (and other) AI models as APIs, with NIM often using NeMo’s customized models for inference, creating a unified ecosystem.

Background: Uploading model checkpoints from the NVIDIA NeMo framework to NVIDIA NIM is essential for streamlining the transition from model development and training to optimized, scalable, production-ready inference. This integration merges the development power of NeMo with the robust, deployment-focused capabilities of NIM.

The process is more than a simple file transfer; it is a critical step in a comprehensive, end-to-end AI lifecycle management strategy. NVIDIA NeMo is a powerful framework for building, training, and fine-tuning large language models (LLMs) and other generative AI models, producing specialized .nemo checkpoints that contain model configurations and weights. NVIDIA NIM, or NVIDIA Inference Microservices, then takes these trained, domain-specific models and packages them into prebuilt, optimized, and secure microservices for deployment across various environments, whether in the cloud, on-premises data centers, or at the edge.

The restore_from() method is a function used for model loading and restoration. It allows you to: 

  • Load a previously trained model instance, including its weights and configuration, from a saved .nemo file.
  • Resume training from a specific point or perform fine-tuning/inference with a pre-trained model

Vulnerability details: CVE-2025-33226: NVIDIA NeMo Framework for all platforms contains a vulnerability where malicious data created by an attacker may cause a code injection. A successful exploit of this vulnerability may lead to code execution, escalation of privileges, information disclosure, and data tampering.

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5736

CVE-2025-33235 is specifically about a vulnerability in NVIDIA’s implementation, not PyTorch or NVRx directly (23rd Dec 2025)

Official Updated 12/12/2025 02:20 PM

Preface: Given Google’s efforts to make its hardware compatible with PyTorch, why is the NVIDIA Resiliency Extension (NVRx) still needed? The answer lies in the complementarity of these technologies, the realities of large-scale distributed systems, and the practical necessity to ensure training efficiency and fault tolerance, especially when using NVIDIA GPUs, which remain the mainstream platform for AI development. While Google is working to make its Tensor Processing Units (TPUs) more compatible with PyTorch to provide an alternative, the NVIDIA Resiliency Extension aims to address the specific and pressing challenges encountered when performing large-scale distributed PyTorch training on NVIDIA’s widely used hardware, such as hardware on cloud platforms like Google Cloud.

Background: The open-source machine learning framework PyTorch was originally developed by researchers at Facebook AI Research (now Meta AI). It was first publicly released in September 2016. PyTorch-based workloads are artificial intelligence (AI) and machine learning (ML) tasks and applications that are developed, trained, and run using the PyTorch open-source deep learning framework.

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads. Users can modularly integrate NVRx capabilities into their own infrastructure to maximize AI training productivity at scale. The NVIDIA Resiliency Extension (NVRx) is a specific Python package that integrates several solutions to minimize downtime and maximize the effective training time for PyTorch-based workloads running on NVIDIA infrastructure. NVRx is a Python package that framework developers and users can modularly integrate into their own infrastructure. It is used in major NVIDIA frameworks like the NVIDIA NeMo Framework for building large language models, providing the underlying machinery for their resilient training capabilities.

NVRx provides utilities that run the actual checkpoint saving routines in the background. It employs mechanisms, often leveraging torch.multiprocessing, to fork a separate, temporary process dedicated to handling the I/O operations after the data has been quickly staged from GPU memory to CPU buffers.

Vulnerability details: CVE-2025-33235 – NVIDIA Resiliency Extension for Linux contains a vulnerability in the checkpointing core, where an attacker may cause a race condition. A successful exploit of this vulnerability may lead to information disclosure, data tampering, denial of service, or escalation of privileges.

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5746

About CVE-2025-32210: NVIDIA Isaac Lab, a story can tell! (22-12-2025)

Official Updated 12/12/2025 02:19 PM

Preface: The primary difficulties in robotics learning involve bridging the gap between simulation and reality (the “reality gap”), enabling robust perception and decision-making in unpredictable environments, handling complex physical interactions like contact dynamics, ensuring safety and security, addressing high costs and lack of standardization, and overcoming workforce skill gaps, all while managing computational and power limitations.

Background: Isaac Lab uses NVIDIA Isaac Sim’s capabilities for realistic virtual environments, allowing researchers to train complex robot behaviors efficiently through parallel simulation and data generation, then transfer these policies to physical robots. The primary difficulties Isaac Lab tackles are:

While the physical world remains the definitive testbed, the acquisition of physical interaction data with robots is expensive, time-consuming, and often necessitates specialized instrumentation. These limitations are especially acute in rare but safety-critical situations. Events such as high-speed collisions, hardware malfunctions, or navigation in unpredictable human environments are difficult to reproduce and pose significant risks to equipment and human safety.

NVIDIA Isaac Sim manages computational and power limitations through GPU-accelerated design, customizable performance settings, and scalable deployment options, allowing users to balance performance, fidelity, and resource consumption. This approach empowers developers to tailor the simulation environment to their specific hardware capabilities and project needs, from local workstations to cloud-based multi-GPU setups.

Key contributions of Isaac Lab

• Modular and scalable framework: Built on NVIDIA Omniverse, enabling high-fidelity, GPU-accelerated simulation for complex robots and tasks.

• Advanced sensor simulation: Supports tiled RTX rendering, Warp-based custom sensors, and physics-based data for rich observation spaces.

• Seamless teleoperation and data collection: Integrates spacemouse, VR headsets, and other devices for large-scale demonstration capture.

• Extensive environment suite: Provides diverse, ready-to-use environments for reinforcement learning, imitation learning, and sim-to-real research.

Vulnerability details: CVE-2025-32210 – NVIDIA Isaac Lab contains a deserialization vulnerability.  A successful exploit of this vulnerability might lead to code execution.

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5733

About: CVE-2025-33214 – NVIDIA NVTabular for Linux and CVE-2025-33213 – NVIDIA Merlin Transformers4Rec for Linux (15th Dec 2025)

Preface: Suppose you’re using cuML’s model persistence feature to load a serialized model from disk or a remote source. If the source is not trusted or validated, and the deserialization uses pickle or similar unsafe methods, it could execute arbitrary code.

The attached diagram demonstrates arbitrary code execution via pickle, which aligns with CVE-2025-33214 and likely CVE-2025-33213 if input validation is missing.

Background: NVTabular is a component of NVIDIA Merlin, an open source framework for building and deploying recommender systems and works with the other Merlin components including Merlin Models, HugeCTR and Merlin Systems to provide end-to-end acceleration of recommender systems on the GPU.

NVTabular requires Python version 3.7+. Additionally, GPU support requires:

  • CUDA version 11.0+
  • NVIDIA Pascal GPU or later (Compute Capability >=6.0)
  • NVIDIA driver 450.80.02+
  • Linux or WSL

When running NVTabular on the Criteo 1TB Click Logs Dataset using a single V100 32GB GPU, feature engineering and preprocessing was able to be completed in 13 minutes. Furthermore, when running NVTabular on a DGX-1 cluster with eight V100 GPUs, feature engineering and preprocessing was able to be completed within three minutes. Combined with HugeCTR, the dataset can be processed and a full model can be trained in only six minutes.

NVIDIA Merlin™ accelerates the entire pipeline, from ingesting and training to deploying GPU-accelerated recommender systems. Merlin NVTabular is a feature engineering and preprocessing library designed to effectively manipulate terabytes of recommender system datasets and significantly reduce data preparation time. It provides efficient feature transformations, preprocessing, and high-level abstraction that accelerates computation on GPUs using the RAPIDS™ cuDF library.

Vulnerability details:

CVE-2025-33214 – NVIDIA NVTabular for Linux contains a vulnerability in the Workflow component, where a user could cause a deserialization issue. A successful exploit of this vulnerability might lead to code execution, denial of service, information disclosure, and data tampering.

CVE-2025-33213 – NVIDIA Merlin Transformers4Rec for Linux contains a vulnerability in the Trainer component where a user may cause a deserialization issue. A successful exploit of this vulnerability may lead to code execution, denial of service, information disclosure, and data tampering.

Official announcement: Please refer to the following link for details-

https://nvidia.custhelp.com/app/answers/detail/a_id/5739

CVE-2025-9612/9613/9614: AMD is concerned that a defect in non-AMD PCIe IDE could affect certain AMD products. (12th Dec 2025)

Preface: The security concerns regarding data integrity in Peripheral Component Interconnect Express (PCIe) transactions are a critical aspect of modern computing, particularly in data centers, cloud environments, and automotive systems where sensitive information is routinely handled. Historically, PCIe interfaces were considered relatively secure due to their placement inside a server enclosure, but the rise of disaggregated systems, confidential computing, and sophisticated physical attacks has changed this perspective entirely. As an interconnect that links the CPU to various peripherals like GPUs, SSDs, and network adapters, any vulnerability can have far-reaching consequences, leading to data corruption, unauthorized access, or system compromise.

Background: AMD EPYC processors use an I/O Die (IOD) to manage all external interfaces, connecting to CPU Dies (CCDs) via high-speed Global Memory Interconnect (GMI) links and handling numerous DDR5 memory channels, PCIe Gen5, and CXL lanes, with SERDES (Serializer/Deserializer) technology underpinning these fast connections for massive bandwidth and low latency in data-intensive workloads, allowing for up to 12 memory channels and 128 PCIe lanes per socket in recent generations.

AMD SERDES technology significantly enhances the physical-layer data integrity and signal quality in PCIe transactions, but it is distinct from higher-level security features like encryption. SERDES technology is a foundational element that ensures reliable data transmission at extremely high speeds.

Affected Products and Mitigation:

From security point of view, it expect additional details from the PCIe SIG and plan to update this security notice as more information is available.  At this time, AMD believes the following products may be impacted.

AMD EPYC™ 9005 Series Processors

AMD EPYC™ Embedded 9005 Series Processors

Ref: PCI-SIG (Peripheral Component Interconnect Special Interest Group) is the electronics industry consortium that defines and maintains the standards for PCI, PCI-X, and PCI Express (PCIe) computer buses, ensuring high-speed, interoperable connectivity for components like graphics cards, SSDs, and network adapters in computers and data centers. This non-profit group, with hundreds of member companies, develops specifications, promotes compliance, and fosters an open ecosystem for PCIe technology, allowing different manufacturers’ products to work together seamlessly.

Vulnerability details:

CVE-2025-9612 (non-AMD) : An issue was discovered in the PCI Express (PCIe) Integrity and Data Encryption (IDE) specification, where insufficient guidance on Transaction Layer Packet (TLP) ordering and tag uniqueness may allow encrypted packets to be replayed or reordered without detection. This can enable local or physical attackers on the PCIe bus to violate data integrity protections.

CVE-2025-9613 (non-AMD): A vulnerability was discovered in the PCI Express (PCIe) Integrity and Data Encryption (IDE) specification, where insufficient guidance on tag reuse after completion timeouts may allow multiple outstanding Non-Posted Requests to share the same tag. This tag aliasing condition can result in completions being delivered to the wrong security context, potentially compromising data integrity and confidentiality.

CVE-2025-9614 (non-AMD): An issue was discovered in the PCI Express (PCIe) Integrity and Data Encryption (IDE) specification, where insufficient guidance on re-keying and stream flushing during device rebinding may allow stale write transactions from a previous security context to be processed in a new one. This can lead to unintended data access across trusted domains, compromising confidentiality and integrity.

Official announcement: Please refer to the link for details –

https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7056.html