Category Archives: AI and ML

CVE-2024-21969: (AMD security focus) Whispering Pixels – Exploiting Uninitialized Register Accesses in Modern GPUs (14th Aug 2024)

Preface: The new AMD Radeon Instinct MI50 hints at the capabilities of AMD’s future GPUs. A study has shown that the MI50 is capable of handling scientific and ML applications.

Background: The proliferation of graphics processing units (GPUs) has brought unprecedented computing power.

Multiple register-based vulnerabilities, the so-called whispering pixels, have been found across different GPU implementations.

The vulnerability poses unique challenges to an adversary due to opaque scheduling and register remapping algorithms present in the GPU firmware, complicating the reconstruction of leaked data.

GPU Programming: An application has to use vendor-provided libraries in order to translate a shader from its high-level source code to an architecture-dependent binary code. Vendors provide these libraries for a variety of high-level languages.

Vulnerability details: Improper clearing of GPU registers could allow a malicious shader to read left-over pixel data leading to loss of confidentiality.

Mitigation: AMD plans to create a new operating mode designed to prevent processes from running in parallel on the GPU, and to clear registers between processes on supported products.

Official announcement: Please refer to the website for details – https://www.amd.com/en/resources/product-security/bulletin/amd-sb-6013.html

NVIDIA Mellanox OS, ONYX, Skyway, MetroX-2 and MetroX-3 XC contain a vulnerability in web support (07 Aug 2024)

Preface: CGI is a standard protocol that allows web servers to execute external programs or scripts, typically written in languages like Perl or Python, in response to client requests.

Path Traversal: Exploiting lax file path validation, attackers navigate outside the intended directory, accessing restricted files or directories.

Background: MLNX-OS is a next-generation switch operating system for data centers with storage, enterprise, high-performance computing and cloud fabrics. Building networks with MLNX-OS enables scaling to thousands of compute and storage nodes with monitoring and provisioning capabilities, whether they are InfiniBand or Virtual Protocol Interconnect (VPI).

With its robust layer-3 protocol stack, built-in monitoring and visibility tools, and high-availability mechanisms, NVIDIA Onyx is an ideal network operating system for enterprise and cloud data centers.

The NVIDIA Skyway gateway appliance provides 1.6Tb/s throughput, enabling scalable and efficient connectivity from InfiniBand data centers to external Ethernet-based infrastructures and storage.

The NVIDIA MetroX-3 XC long-haul system seamlessly and securely extends the reach of the NVIDIA Quantum InfiniBand networking platform, providing high data throughput, In-Network Computing, and native remote direct-memory access (RDMA) communications. Enhancing data security, MetroX-3 XC provides encrypted connectivity over long distances and dense wavelength-division multiplexing (DWDM) infrastructures.

Extending InfiniBand connectivity to 10 or 40 kilometers, MetroX-2 systems enable high data throughput and native remote direct memory access (RDMA) communications.

Vulnerability details: NVIDIA Mellanox OS, ONYX, Skyway, and MetroX-3 XC contain a vulnerability in the web support, where an attacker can cause a CGI path traversal by a specially crafted URI. A successful exploit of this vulnerability might lead to escalation of privileges and information disclosure.
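The usual defence against this class of bug is to resolve the requested path and confirm it still sits under the intended directory. The following is a minimal Python sketch of that check, not NVIDIA's code: the `WEB_ROOT` location and the `resolve_inside_root` helper are hypothetical names chosen for illustration.

```python
from pathlib import Path
from urllib.parse import unquote

# Hypothetical web root, chosen only for illustration.
WEB_ROOT = Path("/var/www/cgi-data")


def resolve_inside_root(requested_uri_path: str, root: Path = WEB_ROOT) -> Path:
    """Return the resolved file path, or raise ValueError if it leaves root."""
    root_resolved = root.resolve()
    # Decode percent-escapes once so that "%2e%2e/" is seen as "../".
    decoded = unquote(requested_uri_path)
    # Treat the request as relative to the root, then collapse ".." and symlinks.
    candidate = (root_resolved / decoded.lstrip("/")).resolve()
    try:
        candidate.relative_to(root_resolved)
    except ValueError:
        raise ValueError(f"path escapes web root: {requested_uri_path!r}") from None
    return candidate
```

A request such as `/status/report.txt` resolves normally, while `/../../etc/passwd` (or its percent-encoded form) raises `ValueError`. The sketch decodes the URI only once; a real handler would also need to consider double encoding and the server's own URL normalisation.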

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5563

About CVE-2024-7553: Improper Access Control in MongoDB (8th Aug 2024)

Preface: What Is a Document Database? A document-oriented database is a special type of key-value store where keys can only be strings. Moreover, the document is encoded using standards like JSON or related languages like XML. You can also store PDFs, image files, or text documents directly as values.

Background: As a document database, MongoDB makes it easy for developers to store structured or unstructured data. It uses a JSON-like format to store documents. Most breaches involving MongoDB occur because of a dangerous combination: authentication is disabled and the MongoDB instance is exposed to the internet.

Vulnerability details: Incorrect validation of files loaded from a local untrusted directory may allow local privilege escalation if the underlying operating system is Windows. This may result in the application executing arbitrary behaviour determined by the contents of untrusted files.

Impact: This issue affects MongoDB Server v5.0 versions prior to 5.0.27, MongoDB Server v6.0 versions prior to 6.0.16, MongoDB Server v7.0 versions prior to 7.0.12, MongoDB Server v7.3 versions prior to 7.3.3, MongoDB C Driver versions prior to 1.26.2 and MongoDB PHP Driver versions prior to 1.18.1. Required Configuration: Only environments with Windows as the underlying operating system are affected by this issue.
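The server version thresholds above can be checked mechanically. The sketch below is a hedged illustration only: `FIXED_SERVER` and `server_is_affected` are hypothetical names, the table copies just the server series listed in the Impact paragraph, and plain `x.y.z` version strings are assumed.

```python
from typing import Optional

# First fixed release per server series, copied from the Impact paragraph.
FIXED_SERVER = {
    (5, 0): (5, 0, 27),
    (6, 0): (6, 0, 16),
    (7, 0): (7, 0, 12),
    (7, 3): (7, 3, 3),
}


def server_is_affected(version: str, on_windows: bool) -> Optional[bool]:
    """True/False for listed series; None if the series is not listed above."""
    if not on_windows:
        # The advisory states that only Windows environments are affected.
        return False
    parts = tuple(int(p) for p in version.split("."))
    if len(parts) != 3:
        raise ValueError(f"expected x.y.z, got {version!r}")
    fixed = FIXED_SERVER.get(parts[:2])
    if fixed is None:
        return None  # series not covered by the text; consult the advisory
    return parts < fixed
```

For example, `server_is_affected("6.0.15", True)` reports an affected build, while `6.0.16` on Windows or any version on Linux does not. Returning `None` for unlisted series is a deliberate choice so that an unknown release is never silently reported as safe.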

Official announcement: Please refer to the link for details – https://nvd.nist.gov/vuln/detail/CVE-2024-7553

CVE-2024-3056 – About Podman (5th-Aug-2024)

Preface: The Podman --shm-size option sets the size of /dev/shm. A unit can be b (bytes), k (kibibytes), m (mebibytes), or g (gibibytes). If the unit is omitted, the system uses bytes. If the size is omitted, the default is 64m. When size is 0, there is no limit on the amount of memory used for IPC by the container. This option conflicts with --ipc=host.
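The size rules above can be expressed in a few lines. This is a simplified sketch that mirrors only the rules stated in the preface, not Podman's actual parser; `parse_shm_size`, `UNIT_BYTES` and `DEFAULT_SIZE` are hypothetical names.

```python
from typing import Optional

# Unit suffixes and the default, as described in the preface.
UNIT_BYTES = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
DEFAULT_SIZE = "64m"


def parse_shm_size(value: Optional[str] = None) -> Optional[int]:
    """Return the /dev/shm size in bytes, or None when the size is unlimited."""
    text = DEFAULT_SIZE if not value else value.strip().lower()
    if text[-1] in UNIT_BYTES:
        number, factor = text[:-1], UNIT_BYTES[text[-1]]
    else:
        number, factor = text, 1  # no unit given: plain bytes
    size = int(number) * factor
    # A size of 0 means "no limit" according to the preface.
    return None if size == 0 else size
```

With these rules an omitted size yields 67108864 bytes (64 MiB), `"1g"` yields 1073741824 bytes, and `"0"` yields `None` to signal that no limit applies.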

IPC: Shared Memory

Two processes communicating via shared memory.

shm_server.c — simply creates the string and shared memory portion.

shm_client.c — attaches itself to the created shared memory portion and uses the string (printf).
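The server/client pattern described above can be sketched in Python with the standard `multiprocessing.shared_memory` module (Python 3.8 or later). This is an illustrative analogue, not the C programs named above; `server_create`, `client_read` and `demo` are hypothetical names, and both roles run in one process here only to keep the example short.

```python
from multiprocessing import shared_memory

MESSAGE = b"hello from the server"


def server_create(message: bytes) -> shared_memory.SharedMemory:
    """Server role: create a named segment and copy the string into it."""
    segment = shared_memory.SharedMemory(create=True, size=len(message))
    segment.buf[:len(message)] = message
    return segment


def client_read(name: str, length: int) -> bytes:
    """Client role: attach to the existing segment by name and read the string."""
    segment = shared_memory.SharedMemory(name=name)
    try:
        return bytes(segment.buf[:length])
    finally:
        segment.close()  # detach only; the server owns the segment


def demo() -> bytes:
    segment = server_create(MESSAGE)
    try:
        return client_read(segment.name, len(MESSAGE))
    finally:
        segment.close()
        segment.unlink()  # remove the segment once it is no longer needed
```

The explicit `unlink()` matters for the vulnerability discussed below: on Linux these segments are backed by /dev/shm, and a segment that is never unlinked keeps consuming memory after its creator exits.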

Background: Podman, Podman Desktop, and other open standards-based container tools make Red Hat Enterprise Linux a powerful container host that delivers production-grade support, stability, and security features as well as a path forward to Kubernetes and Red Hat OpenShift.

Vulnerability details: A flaw was found in Podman. This issue may allow an attacker to create a specially crafted container that, when configured to share the same IPC with at least one other container, can create a large number of IPC resources in /dev/shm. The malicious container will continue to exhaust resources until it is out-of-memory (OOM) killed. While the malicious container’s cgroup will be removed, the IPC resources it created are not. Those resources are tied to the IPC namespace that will not be removed until all containers using it are stopped, and one non-malicious container is holding the namespace open. The malicious container is restarted, either automatically or by attacker control, repeating the process and increasing the amount of memory consumed. With a container configured to restart always, such as `podman run --restart=always`, this can result in a memory-based denial of service of the system.

Official announcement: Please refer to the link for details – https://nvd.nist.gov/vuln/detail/CVE-2024-3056

CVE-2023-33976: Check for correct values rank in UpperBound and LowerBound. (30th Jul 2024)

Preface: Segmentation faults (segfaults) are a common error that occurs when a program tries to access a restricted area of memory. Segfaults can occur for a wide variety of reasons: usage of uninitialized pointers, out-of-bounds memory accesses, memory leaks, buffer overflows, etc.

Background: TensorFlow can be used to develop models for various tasks, including natural language processing, image recognition, handwriting recognition, and different computational-based simulations such as partial differential equations.

Vulnerability details: TensorFlow is an end-to-end open source platform for machine learning. `array_ops.upper_bound` causes a segfault when not given a rank 2 tensor.

The shape function in array_ops.cc for those ops requires that argument to have rank 2, but that function is bypassed when switching between graph and eager modes, allowing for invalid arguments to pass through and, in the test case, cause a segfault.
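The missing check is easy to picture outside TensorFlow. The sketch below is a pure-Python illustration that validates rank before doing any work, using nested lists in place of tensors; `rank` and `upper_bound` are hypothetical helpers, not the TensorFlow implementation or its fix.

```python
from bisect import bisect_right


def rank(x) -> int:
    """Number of nested list/tuple levels, e.g. [[1, 2]] has rank 2."""
    depth = 0
    while isinstance(x, (list, tuple)):
        depth += 1
        x = x[0] if x else None
    return depth


def upper_bound(sorted_inputs, values):
    """Row-wise upper bound that rejects anything that is not rank 2."""
    for name, arg in (("sorted_inputs", sorted_inputs), ("values", values)):
        if rank(arg) != 2:
            raise ValueError(f"{name} must be rank 2, got rank {rank(arg)}")
    if len(sorted_inputs) != len(values):
        raise ValueError("sorted_inputs and values must have the same number of rows")
    # For each row, find the insertion point to the right of equal elements.
    return [
        [bisect_right(row, v) for v in row_values]
        for row, row_values in zip(sorted_inputs, values)
    ]
```

Calling `upper_bound([[0, 3, 9, 9, 10]], [[2, 4, 9]])` returns `[[1, 2, 4]]`, while a rank-1 argument such as `[0, 3, 9]` raises `ValueError` instead of reaching the search code.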

Solution: The fix will be included in TensorFlow 2.13, and the commit will also be cherry-picked onto TensorFlow 2.12.

Official announcement: Please refer to the official announcement for details – https://www.tenable.com/cve/CVE-2023-33976

Regarding CVE-2024-0108: The vendor disclosed few details. Does the situation below match what the CVE describes? (25/07/2024)

Preface: What is an example of autonomous AI?

Autonomous intelligence is artificial intelligence (AI) that can act without human intervention, input, or direct supervision. It’s considered the most advanced type of artificial intelligence. Examples may include smart manufacturing robots, self-driving cars, or care robots for the elderly.

Background: What is Jetson AGX Xavier used for?

As the world’s first computer designed specifically for autonomous machines, Jetson AGX Xavier has the performance to handle the visual odometry, sensor fusion, localization and mapping, obstacle detection, and path-planning algorithms that are critical to next-generation robots.

Vulnerability details: NVIDIA Jetson Linux contains a vulnerability in NvGPU where error handling paths in GPU MMU mapping code fail to clean up a failed mapping attempt. A successful exploit of this vulnerability may lead to denial of service, code execution, and escalation of privileges.

Official announcement: Please refer to the official announcement for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5555

CVE-2024-6960: H2O Model Deserialization RCE (21st July 2024)

Preface: TensorFlow provides a flexible framework for deep learning tasks, but may not be as optimized as H2O for handling large datasets.

Background: H2O uses Iced classes as the primary means of moving Java Objects around the cluster.

Auto-serializer base-class using a delegator pattern (the faster option is to byte-code gen directly in all Iced classes, but this requires all Iced classes go through a ClassLoader).

Iced is a marker class, and Freezable is the companion marker interface. Marked classes have a 2-byte integer type associated with them, and an auto-genned delegate class created to actually do byte-stream and JSON serialization and deserialization. Byte-stream serialization is extremely dense (includes various compressions), and typically memory-bandwidth bound to generate.

Vulnerability details: The H2O machine learning platform uses “Iced” classes as the primary means of moving Java Objects around the cluster. The Iced format supports inclusion of serialized Java objects. When a model is deserialized, any class is allowed to be deserialized (no class whitelist). An attacker can construct a crafted Iced model that uses Java gadgets and leads to arbitrary code execution when imported to the H2O platform.
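The missing control is an allow-list applied while deserializing. H2O's Iced format is Java, so the sketch below is only a Python analogue of the idea, using the standard `pickle.Unpickler.find_class` hook; `SAFE_BUILTINS`, `RestrictedUnpickler` and `restricted_loads` are illustrative names and say nothing about how H2O itself should be patched.

```python
import builtins
import io
import pickle

# Names the loader is willing to resolve; everything else is refused.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream asks for a class or function by name.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads(), but only allow-listed globals can be resolved."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain data such as `{"weights": [1, 2, 3]}` round-trips unchanged, whereas a payload whose `__reduce__` points at a callable like `eval` is rejected with `UnpicklingError` before anything runs. An allow-list narrows the attack surface but does not make untrusted model files safe by itself.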

Official announcement: Please refer to the official announcement for details – https://nvd.nist.gov/vuln/detail/CVE-2024-6960

A critical step in exploiting a buffer overflow is determining the offset where important program control information is overwritten. The Linux kernel vulnerability CVE-2024-41011 has been resolved. (18-07-2024)

Preface: The PAGE_SIZE macro defined in the Linux kernel source determines the page size. Its definition is in the kernel header file /usr/src/kernels/5.14.0-22.el9.x86_64/include/asm-generic/page.

Background: MMIO stands for Memory-Mapped Input/Output. In Linux, MMIO is a mechanism used by devices to interface with the CPU that involves mapping their control registers and buffers directly into the processor’s memory address space.

This enables the CPU to access device registers and exchange data with devices using load and store instructions, just as if they were conventional memory locations. Graphics cards, network interfaces, and storage controllers all employ MMIO to effectively conduct input and output tasks.

Vulnerability details: drm/amdkfd: don’t allow mapping the MMIO HDP page with large pages. We don’t get the right offset in that case. The GPU has an unused 4K area of the register BAR space into which you can remap registers. We remap the HDP flush registers into this space to allow userspace (CPU or GPU) to flush the HDP when it updates VRAM. However, on systems with >4K pages, we end up exposing PAGE_SIZE of MMIO space.
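The arithmetic behind the exposure can be worked through in a few lines. This is a back-of-the-envelope sketch, not driver code: the helper names are hypothetical, and `mapping_allowed` only mirrors the intent stated in the patch title (refuse the mapping when pages are larger than 4K).

```python
# Size of the unused register area that holds the remapped HDP flush registers.
HDP_REMAP_SIZE = 4 * 1024


def exposed_mmio_bytes(page_size: int) -> int:
    """A mapping covers whole pages, so round the 4K area up to page_size."""
    pages = -(-HDP_REMAP_SIZE // page_size)  # ceiling division
    return pages * page_size


def overexposed_bytes(page_size: int) -> int:
    """MMIO bytes beyond the intended 4K area that userspace could reach."""
    return exposed_mmio_bytes(page_size) - HDP_REMAP_SIZE


def mapping_allowed(page_size: int) -> bool:
    """Assumed policy: only map when one page is no larger than the 4K area."""
    return page_size <= HDP_REMAP_SIZE
```

With 4K pages nothing extra is exposed. With 64K pages the same mapping covers 65536 bytes, so 61440 bytes of unrelated register space become reachable, which is why refusing the mapping is the simpler option on such systems.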

Official announcement: Please refer to the official announcement for details – https://nvd.nist.gov/vuln/detail/CVE-2024-41011

CVE-2024-41009: bpf – Fix overrunning reservations in ringbuf (17th July 2024)

Preface: Consumer and producer counters are put into separate pages to allow each position to be mapped with different permissions. This prevents a user-space application from modifying the position and ruining in-kernel tracking. The permissions of the pages depend on who is producing samples: user-space or the kernel. Starting from Linux 5.8, BPF provides a new BPF data structure (BPF map): BPF ring buffer (ringbuf). It is a multi-producer, single-consumer (MPSC) queue and can be safely shared across multiple CPUs simultaneously.

Background: The first core skill point is “BPF Hooks”, that is, where in the kernel BPF programs can be loaded. There are nearly 10 types of hooks in the current Linux kernel, as shown below:

kernel functions (kprobes)

userspace functions (uprobes)

system calls

fentry/fexit

Tracepoints

network devices (tc/xdp)

network routes

TCP congestion algorithms

sockets (data level)

Vulnerability details: For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in [0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, let’s allocate a chunk B with size 0x3000. This will succeed because consumer_pos was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask` check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data pages. This means that chunk B at [0x4000,0x4008] is chunk A’s header. bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header’s pg_off to then locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk B modified chunk A’s header, then bpf_ringbuf_commit() refers to the wrong page and could cause a crash.
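The numbers in this walkthrough can be reproduced with a small model. The sketch below is a simplified Python model of the position arithmetic only, not kernel code: `RingbufModel` and `wraps_onto` are hypothetical names, and the optional `check_pending` test (comparing against the oldest uncommitted record) is an assumption about the general shape of the fix, not a copy of it.

```python
HDR_SZ = 8            # size of the per-record header
U64 = (1 << 64) - 1   # positions are unsigned, so differences wrap


def round_up(n: int, align: int) -> int:
    return (n + align - 1) // align * align


class RingbufModel:
    def __init__(self, size: int, check_pending: bool):
        self.mask = size - 1
        self.consumer_pos = 0   # writable by user space in this scenario
        self.producer_pos = 0
        self.pending_pos = 0    # start of the oldest uncommitted record
        self.check_pending = check_pending

    def reserve(self, size: int):
        """Return (start, end) of the new record, or None if refused."""
        length = round_up(size + HDR_SZ, 8)
        new_prod = self.producer_pos + length
        if ((new_prod - self.consumer_pos) & U64) > self.mask:
            return None
        if self.check_pending and ((new_prod - self.pending_pos) & U64) > self.mask:
            return None
        start, self.producer_pos = self.producer_pos, new_prod
        return (start, new_prod)


def wraps_onto(oldest, newest, size: int) -> bool:
    """True if the newest record wraps around onto the start of the oldest."""
    return newest[1] - oldest[0] > size
```

Setting `consumer_pos` to 0x3000 and reserving 0x3000 bytes twice on the unchecked model yields chunk A at (0x0, 0x3008) and chunk B at (0x3008, 0x6010), and `wraps_onto` confirms that B reaches back over A's header. With `check_pending=True` the second reservation is refused, because nothing in A has been committed yet.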

Official announcement: Please refer to the official announcement for details – https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=47416c852f2a04d348ea66ee451cbdcf8119f225

About CVE-2024-41008: When a design weakness is discovered in a GPU, it is now not limited to affecting graphics cards! Machine learning should be on alert! (16th July 2024)

Preface: In computer science, reference counting is a programming technique of storing the number of references, pointers, or handles to a resource, such as an object, a block of memory, disk space, and others. In garbage collection algorithms, reference counts may be used to deallocate objects that are no longer needed.

Background: The drm/amdgpu driver supports all AMD Radeon GPUs based on the Graphics Core Next (GCN) architecture.

AI and Machine Learning Development on a Local Desktop with AMD Radeon™ Graphics Cards

AMD now supports RDNA™ 3 architecture-based GPUs for desktop based AI and ML workflows using AMD ROCm™ software. Developers can work with ROCm 6.1 software for Radeon on Linux® systems using PyTorch®, TensorFlow and ONNX Runtime. Added support for WSL 2 (Windows® Subsystem for Linux) now also enables users to develop with AMD ROCm™ software on a Windows® system, eliminating the need for dual boot set ups.

Vulnerability details: The CVE entry does not describe the vulnerability in detail, and AMD only provides the patch change details. Perhaps the design weakness in CVE-2024-41008 is related to garbage collection.

This patch changes the handling and lifecycle of vm->task_info object.

The major changes are:

  • vm->task_info is a dynamically allocated ptr now, and its usage is reference counted.
  • introducing two new helper funcs for task_info lifecycle management
    • amdgpu_vm_get_task_info: reference counts up task_info before returning this info
    • amdgpu_vm_put_task_info: reference counts down task_info
  • last put to task_info() frees task_info from the vm.
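The get/put lifecycle listed above follows a common reference-counting pattern. The sketch below illustrates that pattern in Python and is not the amdgpu driver code: `TaskInfo`, `Vm`, the `freed` flag and the lock are hypothetical stand-ins for the kernel's structures and its refcount primitives.

```python
import threading
from typing import Optional


class TaskInfo:
    def __init__(self, pid: int, name: str):
        self.pid = pid
        self.name = name
        self.refcount = 1   # the reference held by the VM itself
        self.freed = False  # stands in for the memory being released


class Vm:
    def __init__(self, pid: int, name: str):
        self._lock = threading.Lock()
        self.task_info: Optional[TaskInfo] = TaskInfo(pid, name)

    def get_task_info(self) -> Optional[TaskInfo]:
        """Count the reference up before handing task_info to the caller."""
        with self._lock:
            info = self.task_info
            if info is not None:
                info.refcount += 1
            return info

    def put_task_info(self, info: TaskInfo) -> None:
        """Count the reference down; the last put frees task_info from the VM."""
        with self._lock:
            info.refcount -= 1
            if info.refcount == 0:
                info.freed = True
                if self.task_info is info:
                    self.task_info = None
```

A caller that obtains the object through `get_task_info()` can keep using it safely even while the VM is being torn down, because the object is only marked freed after every holder, including the VM, has called `put_task_info()`.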

Official announcement: Please refer to the vendor announcement for details – https://nvd.nist.gov/vuln/detail/CVE-2024-41008