All posts by admin

About NVIDIA Security Bulletin – CVE-2023-31029 and CVE-2023-31030 (14th Jan 2024)

Preface: Artificial intelligence performs better when humans are involved in data collection, annotation, and validation. But why is artificial intelligence ubiquitous in the human world? Can we limit the use of AI?

Background: The NVIDIA DGX A100 system comes with a baseboard management controller (BMC) for monitoring and controlling various hardware devices on the system. It monitors system sensors and other parameters. Kernel-based Virtual Machine (KVM) is an open source virtualization technology built into Linux. KVM lets you turn Linux into a hypervisor that allows a host machine to run multiple, isolated virtual environments called guests or virtual machines (VMs). In a BMC context, however, "KVM" usually refers to keyboard, video and mouse (remote console) redirection, so the "host KVM daemon" named in the bulletin may be that service rather than the Linux hypervisor.

What is a Virtio-net device? Virtio-net device emulation enables users to create VirtIO-net emulated PCIe devices in the system where the NVIDIA® BlueField® DPU is connected.

Vulnerability details:

CVE-2023-31029 – NVIDIA DGX A100 baseboard management controller (BMC) contains a vulnerability in the host KVM daemon, where an unauthenticated attacker may cause a stack overflow by sending a specially crafted network packet. A successful exploit of this vulnerability may lead to arbitrary code execution, denial of service, information disclosure, and data tampering.

CVE-2023-31030 – NVIDIA DGX A100 BMC contains a vulnerability in the host KVM daemon, where an unauthenticated attacker may cause a stack overflow by sending a specially crafted network packet. A successful exploit of this vulnerability may lead to arbitrary code execution, denial of service, information disclosure, and data tampering.
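The vendor has not published the vulnerable code, so the following is only a minimal sketch of the general bug class described above: a network handler that copies an attacker-controlled payload into a fixed-size stack buffer. The wire format (a 1-byte length field followed by the payload), the function name handle_packet and the buffer size PAYLOAD_MAX are all hypothetical and are not taken from the NVIDIA BMC firmware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_MAX 64  /* hypothetical size of the on-stack buffer */

/* Hypothetical wire format: 1-byte length field followed by the payload.
 * Returns the payload length on success, -1 if the packet is rejected. */
int handle_packet(const uint8_t *pkt, size_t pkt_len, uint8_t *out, size_t out_cap)
{
    uint8_t payload[PAYLOAD_MAX];   /* fixed-size stack buffer */

    if (pkt == NULL || out == NULL || pkt_len < 1)
        return -1;

    size_t claimed = pkt[0];        /* attacker-controlled length field */

    /* The checks a vulnerable handler would omit: */
    if (claimed > pkt_len - 1)      /* claims more data than was received */
        return -1;
    if (claimed > sizeof(payload) || claimed > out_cap)  /* would overflow */
        return -1;

    memcpy(payload, pkt + 1, claimed);
    memcpy(out, payload, claimed);
    return (int)claimed;
}
```

Without the two length checks, a packet whose length field exceeds PAYLOAD_MAX would write past the end of the stack buffer, which is the kind of condition the bulletin calls a stack overflow.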

Official announcement: Please refer to the link for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5510

My comment: The vendor published this vulnerability but did not provide full details. Do you think the details in the attached diagram show the actual cause?

CVE-2023-5091: Mali GPU Kernel Driver allows improper GPU processing operations (8th Jan 2024)

Preface: According to news in October 2023, experts speculated that commercial spyware exploited a security vulnerability in the Arm Mali GPU driver to compromise some people’s devices. The vulnerability was described as requiring local access. But how does an attacker plant malware on a smartphone without remote access? Hard to say! Phishing and social engineering techniques may be involved.

Background: About four years ago, the mainstream mobile GPUs were PowerVR, Mali, and Adreno (Qualcomm). Apple used a customized version of PowerVR in the early days. However, Apple has since developed its own GPU, and the PowerVR design is now owned by Canyon Bridge Capital Partners. Mali is ARM’s series of graphics acceleration IP cores.

The first version of the Mali microarchitecture is called Utgard. Later there were versions called Midgard (second generation), Bifrost (third generation), and Valhall (fourth generation). Valhall was launched in the second quarter of 2019. The main series are Mali-G57 and Mali-G77.

However, commercial spyware has exploited a security hole in Arm’s Mali GPU drivers to compromise some people’s devices, according to news from Oct 2023.

ARM decided last September (2023) not to disclose any details of CVE-2023-5091 to the public. The official announcement was finally published on January 8, 2024.

Vulnerability details: Use After Free vulnerability in Arm Ltd Valhall GPU Kernel Driver allows a local non-privileged user to make improper GPU processing operations to gain access to already freed memory. This issue affects Valhall GPU Kernel Driver: from r37p0 through r40p0.

Official announcement: Please refer to the link for details – https://nvd.nist.gov/vuln/detail/CVE-2023-5091

CVE-2024-21318 – SharePoint Enterprise Server 2016, 2019 and Subscription Edition design limitation (10th Jan 2024)

Preface: Under normal circumstances, CVE IDs are assigned sequentially each year. Microsoft announced CVE-2024-21318 on January 9, 2024. It is only the start of a new year, and such a high number makes me speculate that many design weaknesses were found in 2023 but were still waiting to be verified. Because of the huge amount of data, they had to wait for an official CVE reference number, so they were carried forward to 2024. This brings the total to five figures.

Background: Microsoft did not disclose details. Therefore, the technical details are not yet clear. Do you think the SharePoint Add-in is one of the possible factors in this matter?

A SharePoint Add-in is a self-contained piece of functionality that extends the capabilities of SharePoint websites to solve a well-defined business problem. Add-ins don’t have custom code that runs on SharePoint servers. Instead, all custom logic moves “up” to the cloud, or “down” to client computers, or “over” to an on-premises server that is outside the SharePoint farm or SharePoint Online subscription. Keeping custom code off SharePoint servers provides reassurance to SharePoint administrators that the add-in can’t harm their servers or reduce the performance of their SharePoint Online websites.

Business logic in a SharePoint Add-in can access SharePoint data through one of the several client APIs included in SharePoint. Which API you use for your add-in depends on certain other design decisions you make.

Vulnerability details: Microsoft SharePoint Server Remote Code Execution Vulnerability. Technical details unknown.

Remedy: Applying the patch can eliminate this problem. Possible mitigations were released immediately after the vulnerability was disclosed.

Official announcement: Please refer to the link for details –

https://msrc.microsoft.com/update-guide/vulnerability/CVE-2024-21318

Supply constraints and product attribute design. It is expected that two camps will be operated in the future. (9th JAN 2024)

Preface: When the high-performance cluster (HPC) was born, it was destined to compete with traditional mainframe technology. The major component of HPC is the multicore processor, that is, the GPU. For example, the NVIDIA GA100 GPU is composed of multiple GPU Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and HBM2 memory controllers. Compare this with the best of the best: the world’s fastest public supercomputer, Frontier, has about 37,000 AMD Instinct MI250X GPUs.

How to break through traditional computer technology and go from serial to parallel processing: CPUs are fast, but they work by quickly executing a series of tasks, which requires a lot of interactivity. This is known as serial processing. GPU parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Over time came the revolution in GPU processor technology and high-performance clusters. Red Hat created a High Performance Cluster system configuration whose overall performance is close to that of a supercomputer processor using crossbar switches. But the bottleneck lies in how to transform traditional software applications from serial processing to parallel processing.

Reflection of reality in the technological world: It is commonly agreed that GPU manufacturer Nvidia holds a strong market share worldwide. The Nvidia A100 processor delivers strong performance on intensive AI tasks and deep learning and, as the older generation, is usually the more budget-friendly option. The newer H100, with optimizations such as TensorRT-LLM support and NVLink, surpasses the A100, especially in the LLM area. Large Language Models (LLMs) have revolutionised the field of natural language processing. As these models grow in size and complexity, the computational demands for inference also increase significantly. To tackle this challenge, leveraging multiple GPUs becomes essential.

Supply constraints and product attribute design create headaches for web hosting providers: CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). But converting serial C code to data-parallel code is a difficult problem. To address this limitation, Nvidia developed the NVIDIA CUDA Compiler (NVCC), a proprietary compiler intended for use with CUDA.
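To illustrate why converting serial C code to data-parallel code can be hard, here is a minimal sketch in plain C contrasting two loops. The function names prefix_sum and scale are illustrative only. The first loop carries a dependency from one iteration to the next; the second has fully independent iterations and is the kind of loop that maps naturally onto one GPU thread per element.

```c
#include <stddef.h>

/* Loop-carried dependency: iteration i needs the result of iteration i-1,
 * so the iterations cannot simply be handed to separate GPU threads. */
void prefix_sum(const int *in, int *out, size_t n)
{
    int running = 0;
    for (size_t i = 0; i < n; i++) {
        running += in[i];
        out[i] = running;
    }
}

/* Independent iterations: each output depends only on its own input.
 * In CUDA the loop body would become the kernel and i the thread index. */
void scale(const int *in, int *out, size_t n, int factor)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```

Parallel prefix-sum algorithms do exist, but they require restructuring the computation rather than a mechanical translation of the loop, which is the difficulty the paragraph above refers to.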

Using the CUDA Toolkit you can accelerate your C or C++ applications by updating the computationally intensive portions of your code to run on GPUs. To accelerate your applications, you can call functions from drop-in libraries as well as develop custom applications using languages including C, C++, Fortran and Python.

But you cannot use CUDA without an Nvidia graphics card. CUDA is a framework developed by Nvidia that allows people with an Nvidia graphics card to use GPU acceleration for deep learning, and not having an Nvidia graphics card defeats that purpose. (Refer to attached Diagram Part 1).

If a web hosting service provider does not use NVIDIA products, is it possible to use another brand of GPU processor for AI machine learning? Yes, you can select OpenCilk.

OpenCilk (http://opencilk.org) is a new open-source platform to support task-parallel programming in C/C++. (Refer to attached Diagram Part 2)

Referring to the above details, the current atmosphere of technological development makes people foresee that two camps will operate in the future: the Nvidia camp and the non-Nvidia camp. This is why I have observed that web hosting service providers are giving themselves headaches over this new trend in the technology game.

CVE-2023-34326: Potential risk allowing access to unintended memory regions (8th JAN 2024)

Preface: In fact, by the time a vulnerability is released to the public, the design limitations and/or flaws have already been fixed. You may ask, what room is there to discuss the discovered vulnerabilities? As you know, an increasing number of vendors comply with CVE policies but do not disclose the technical details. If your focus is understanding, then even if the vendor doesn’t release any details, you can still learn about specific techniques as you investigate. The techniques you learn can expand your horizons.

Background: AMD-Vi represents an I/O memory management unit (IOMMU) that is embedded in the chipset of the AMD Opteron 6000 Series platform. IOMMU is a key technology in extending the CPU’s virtual memory to GPUs to enable heterogeneous computing. AMD-Vi (also known as AMD IOMMU) also allows for PCI passthrough.

DMA mapping is a conversion from virtual addressed memory to a memory which is DMA-able on physical addresses (actually bus addresses).

DMA remapping maps virtual addresses in DMA operations to physical addresses in the processor’s memory address space. Similar to MMU, IOMMU uses a multi-level page table to keep track of the IOVA-to-PA mappings at different page-size granularity (e.g., 4-KiB, 2-MiB, and 1-GiB pages). The hardware also implements a cache (aka IOTLB) of page table entries to speed up translations.
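As a rough illustration of how a multi-level page table splits an I/O virtual address (IOVA), the sketch below assumes the common x86-style layout of a 12-bit page offset and 9 index bits per table level. That layout and the helper names iova_index and iova_offset are assumptions for illustration; the actual AMD-Vi table format has more options than are shown here.

```c
#include <stdint.h>

#define PAGE_SHIFT  12   /* 4-KiB page: the low 12 bits are the offset */
#define INDEX_BITS   9   /* 512 entries per table level (assumed layout) */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)

/* Index into the table at `level` (1 = leaf table) for a given IOVA. */
unsigned iova_index(uint64_t iova, unsigned level)
{
    unsigned shift = PAGE_SHIFT + INDEX_BITS * (level - 1);
    return (unsigned)((iova >> shift) & INDEX_MASK);
}

/* Byte offset inside a mapping of size 1 << page_shift:
 * 12 for 4 KiB, 21 for 2 MiB, 30 for 1 GiB. */
uint64_t iova_offset(uint64_t iova, unsigned page_shift)
{
    return iova & ((UINT64_C(1) << page_shift) - 1);
}
```

With this layout, a 2-MiB mapping simply stops the walk one level early, so the level-1 index bits become part of the page offset (page_shift 21), and a 1-GiB mapping stops two levels early (page_shift 30).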

AMD processors use two distinct IOTLBs for caching Page Directory Entry (PDE) and Page Table Entry (PTE) (AMD, 2021; Kegel et al., 2016).

Ref: If your application scenario does not require virtualization, then disable AMD Virtualization Technology. With virtualization disabled, also disable AMD IOMMU, because it can cause differences in memory access latency. Finally, also disable SR-IOV.

Vulnerability details: The caching invalidation guidelines from the AMD-Vi specification (48882, Rev 3.07-PUB, Oct 2022) are incorrect on some hardware, as devices will malfunction (see stale DMA mappings) if some fields of the DTE are updated but the IOMMU TLB is not flushed. Such stale DMA mappings can point to memory ranges not owned by the guest, thus allowing access to unintended memory regions.
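The effect of a missing flush can be shown with a deliberately simplified toy model. This is not the AMD-Vi data structure (the real issue concerns fields of the device table entry); the struct toy_iommu and its functions are hypothetical and only demonstrate why a cached translation keeps pointing at the old memory until it is invalidated.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 4

/* Toy model: a table of IOVA page -> physical address, plus a cached copy. */
struct toy_iommu {
    uint64_t table[NPAGES];      /* authoritative mappings */
    uint64_t tlb[NPAGES];        /* cached translations */
    bool     tlb_valid[NPAGES];
};

uint64_t toy_translate(struct toy_iommu *mmu, unsigned page)
{
    if (!mmu->tlb_valid[page]) {             /* miss: walk the table */
        mmu->tlb[page] = mmu->table[page];
        mmu->tlb_valid[page] = true;
    }
    return mmu->tlb[page];                   /* hit: table is not consulted */
}

/* Remap without invalidating: the cached entry stays live. */
void toy_remap_no_flush(struct toy_iommu *mmu, unsigned page, uint64_t pa)
{
    mmu->table[page] = pa;
}

/* Remap and invalidate, so the next translation sees the new mapping. */
void toy_remap_and_flush(struct toy_iommu *mmu, unsigned page, uint64_t pa)
{
    mmu->table[page] = pa;
    mmu->tlb_valid[page] = false;
}
```

After toy_remap_no_flush, a translation that was already cached still returns the old physical address. If that memory has since been handed to another owner, the device can reach memory the guest no longer owns.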

Official announcement: Please refer to the link for details – https://nvd.nist.gov/vuln/detail/CVE-2023-34326

Android Security Bulletin – Released January 2024, covers a vulnerability in August 2023 (CVE-2023-21651) – 4th Jan 2024

Preface: Android traditionally releases a security bulletin once a month. However, when design limitations relate to other suppliers, the vulnerability details also include the responses from the relevant manufacturers. Therefore, Qualcomm also released its assessment of the severity of these problems.

I was not paying attention to this vulnerability in August 2023. Out of personal interest, I’ll take this opportunity to dig into the details of this vulnerability. If you are interested, please be my guest.

Background: The full name of TEE is trusted execution environment, which is an area on the CPU of mobile devices (smart phones, tablets, smart TVs). The role of this area is to provide a more secure space for data and code execution, and to ensure their confidentiality and integrity.

Other TEE operating systems are traditionally supplied as binary blobs by third-party vendors or developed internally. Developing internal TEE systems or licensing a TEE from a third-party can be costly to System-on-Chip (SoC) vendors and OEMs.

Trusty is a secure Operating System (OS) that provides a Trusted Execution Environment (TEE) for Android. A Trusty application is defined as a collection of binary files (executables and resource files), a binary manifest, and a cryptographic signature. At runtime, Trusty applications run as isolated processes in unprivileged mode under the Trusty kernel.

The Qualcomm Trusted Execution Environment software cryptographic library is implemented as a software-hybrid module that is part of the Snapdragon SoC architecture. The physical boundary of the module is the single chip.

Vulnerability details: Memory Corruption in Core due to incorrect type conversion or cast in secure_io_read/write function in TEE.
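Qualcomm has not published the code behind secure_io_read/write, so the following is only a generic sketch of how an incorrect type conversion can defeat a bounds check. REGION_SIZE and both function names are hypothetical. The flawed variant narrows a 32-bit length to 16 bits before checking it, so a large length wraps to a small one and passes.

```c
#include <stddef.h>
#include <stdint.h>

#define REGION_SIZE 4096u   /* hypothetical size of the secure I/O region */

/* Flawed check: the 32-bit length is narrowed to 16 bits before the
 * comparison, so 0x10010 is seen as 0x0010 and is accepted. */
int range_ok_truncating(uint32_t offset, uint32_t len)
{
    uint16_t short_len = (uint16_t)len;          /* lossy conversion */
    return offset < REGION_SIZE && short_len <= REGION_SIZE - offset;
}

/* Safer check: compare at full width and avoid computing offset + len,
 * which could itself overflow. */
int range_ok(uint32_t offset, uint32_t len)
{
    return offset < REGION_SIZE && len <= REGION_SIZE - offset;
}
```

If the subsequent copy uses the original 32-bit length while the check used the truncated one, the access runs far past the region, which is one way a cast can lead to memory corruption.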

Official announcement: Please refer to the link for details –

Android: https://source.android.com/docs/security/bulletin/2024-01-01

Qualcomm: https://docs.qualcomm.com/product/publicresources/securitybulletin/august-2023-bulletin.html

CVE-2023-43514 – Use After Free in DSP Services (3rd JAN 2024)

Preface: Is Qualcomm Snapdragon based on Arm? Based on its brand-new ARM CPU core ‘Oryon’, developed from its Nuvia acquisition, Qualcomm’s Snapdragon X Elite SoC is built on TSMC’s 4nm process node. The CPU uses the Armv8.7 instruction set and features 12 high-performance ‘Oryon’ cores clocked at 3.8GHz.

Background: How is ioctl called from user space? To invoke a device’s ioctl commands, the user-space program first opens the device, then calls ioctl() with the appropriate command and any necessary arguments. On the driver side, older Linux kernels declared the handler along these lines: static int mydrvr_ioctl (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);
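A minimal user-space sketch of the call pattern follows. Because a real driver's device node and command numbers are specific to that driver, this example uses the standard FIONREAD request, which asks how many bytes are waiting to be read on a file descriptor. The helper name bytes_pending is illustrative only.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the number of unread bytes queued on fd, or -1 on error.
 * FIONREAD is a standard request; a driver-specific command would be
 * issued the same way, with the third argument pointing at whatever
 * structure that driver defines. */
int bytes_pending(int fd)
{
    int n = 0;
    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}
```

For a device driver, fd would come from open() on the device node. The kernel then routes the command number and argument to the driver's ioctl handler, which is where argument validation has to happen.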

Ref: A kbase_context object is responsible for managing resources for each driver file that is opened and is unique for each file handle. In particular, the kbase_context manages different types of memory that are shared between the GPU devices and user space applications.

Ref: DSPs are optimized in two key areas compared to classic CPUs. They accelerate common DSP mathematical operations in hardware and boast specific memory architectures designed for real-time data streams. A DSP is designed for performing mathematical functions like “add”, “subtract”, “multiply” and “divide” very quickly.

Vulnerability details: Memory corruption while invoking IOCTL calls from user space for internal mem MAP and internal mem UNMAP.

Consequence: Use After Free vulnerability in DSP Services
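The DSP driver code is not reproduced here, so the following is only a generic sketch of how MAP and UNMAP style operations can avoid a use after free. All names (toy_map, toy_unmap, toy_map_len, MAX_MAPS) are hypothetical. The key point is that UNMAP clears the table slot when it frees the record, and every later access looks the handle up again instead of keeping a saved pointer.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_MAPS 8

struct mapping { size_t len; };              /* hypothetical bookkeeping */
static struct mapping *maps[MAX_MAPS];       /* handle -> live mapping */

/* MAP: allocate a record and return its handle, or -1 on failure. */
int toy_map(size_t len)
{
    for (int h = 0; h < MAX_MAPS; h++) {
        if (maps[h] == NULL) {
            maps[h] = malloc(sizeof *maps[h]);
            if (maps[h] == NULL)
                return -1;
            maps[h]->len = len;
            return h;
        }
    }
    return -1;
}

/* UNMAP: free the record and clear the slot. Leaving the stale pointer
 * in maps[h] is what would let a later call touch freed memory. */
int toy_unmap(int h)
{
    if (h < 0 || h >= MAX_MAPS || maps[h] == NULL)
        return -1;                           /* unknown or already unmapped */
    free(maps[h]);
    maps[h] = NULL;
    return 0;
}

/* Later lookups go through the table, never through a saved pointer. */
long toy_map_len(int h)
{
    if (h < 0 || h >= MAX_MAPS || maps[h] == NULL)
        return -1;
    return (long)maps[h]->len;
}
```

A real driver would also need locking or reference counting so that a MAP and an UNMAP racing on the same handle from two threads cannot free a record while it is still in use.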

Official announcement: Please refer to the link for details – https://docs.qualcomm.com/product/publicresources/securitybulletin/january-2024-bulletin.html

CVE-2023-33025: Speculate what would cause a vulnerability to become a critical risk level (1st JAN 2024)

Preface: VoLTE stands for Voice over Long-Term Evolution or Voice over LTE. VoLTE makes it possible to place voice calls over the LTE/4G mobile network. Previously, 4G was limited to surfing the Internet. When it came to calls, your phone would automatically switch to 3G or 2G.

Background: A 5G modem-RF system is a combination of two different technologies that work together to enable 5G communication. The modem is the part of the system that processes the digital signals, including encoding and decoding data, and managing the connection to the network.

Voice over LTE, or VoLTE, is a digital packet technology that uses 4G LTE networks to route voice traffic and transmit data. From a technical point of view, VoLTE uses “Internet data,” whereas traditional voice calls are circuit-switched.

Ref: For example: Qualcomm Snapdragon X55 5G Modem-RF System is a comprehensive modem-to-antenna solution designed to allow OEMs to build 5G multimode devices for a new era of connected experiences.

Vulnerability details: Memory corruption in Data Modem when a non-standard SDP body is processed during a VoLTE call.

Vulnerability Type:  CWE-120 Buffer Copy Without Checking Size of Input (‘Classic Buffer Overflow’)
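The modem firmware is not public, so the sketch below only illustrates the CWE-120 pattern in the context of SDP, whose body consists of text lines of the form "<type>=<value>". The function name sdp_copy_value and its parameters are hypothetical. The point is the size check before the copy, which is what a non-standard (for example, oversized) field would otherwise bypass.

```c
#include <stddef.h>
#include <string.h>

/* Copies the value of one SDP line ("<type>=<value>") into dst.
 * Returns the value length, or -1 if the line is malformed or too long. */
int sdp_copy_value(const char *line, char type, char *dst, size_t dst_cap)
{
    if (line == NULL || dst == NULL || dst_cap == 0)
        return -1;
    if (line[0] != type || line[1] != '=')   /* not the expected field */
        return -1;

    const char *value = line + 2;
    size_t len = strcspn(value, "\r\n");     /* value ends at end of line */

    if (len >= dst_cap)                      /* the check CWE-120 is about */
        return -1;

    memcpy(dst, value, len);
    dst[len] = '\0';
    return (int)len;
}
```

Rejecting the oversized field, rather than truncating it silently, keeps the parser's view of the message consistent with what the peer actually sent.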

Official announcement: Please refer to the link for details – https://docs.qualcomm.com/product/publicresources/securitybulletin/january-2024-bulletin.html

Will the development of artificial intelligence technology bring a battle for compiler hegemony? (29th Dec 2023)

Preface: Competitors of LLVM include GCC, Microsoft Visual C++, and the Intel C++ Compiler. NVIDIA’s CUDA Compiler (NVCC) is based on the widely used LLVM open source compiler infrastructure. Furthermore, Tesla engineers wrote their own LLVM-backed JIT neural compiler for Dojo.

Background: Rather than relying on the raw speed of a single core, GPUs rely on numerous cores to pull data from memory, perform parallel calculations on it, and push the data back out for use. If you compile code with a regular compiler that does not target GPU execution, the code will always execute on the CPU. The GPU driver and compiler interact to ensure that the program executes correctly on the GPU. For example, you can compile CUDA code for one architecture even when your node hosts a GPU of a different architecture.

A full build of LLVM and Clang will need around 15-20 GB of disk space. The exact space requirements will vary by system.

NVIDIA’s CUDA Compiler (NVCC) is based on the widely used LLVM open source compiler infrastructure. Developers can create or extend programming languages with support for GPU acceleration using the NVIDIA Compiler SDK.

Technical details: LLVM is a low-level, register-based virtual machine. It is designed to abstract the underlying hardware and draw a clean line between a compiler back-end (machine code generation) and front-end (parsing, etc.). LLVM is a set of compiler and toolchain technologies that can be used to develop a frontend for any programming language and a backend for any instruction set architecture.

Ref: LLVM Pass framework is an important component of LLVM infrastructure, and it performs code transformations and optimizations at LLVM IR level.

LLVM IR is the language used by the LLVM compiler for program analysis and transformation. It’s an intermediate step between the source code and machine code, serving as a kind of lingua franca that allows different languages to utilize the same optimization and code generation stages of the LLVM compiler.

Looking Ahead: But facing the prospect of cyber security, perhaps new compilers will join this battle in the future.

Processor technology perspective: Unified Memory with shared page tables (28th Dec 2023)

Preface: NVIDIA Ada Lovelace architecture GPUs are designed to deliver performance for professional graphics, video, AI and computing. The GPU is based on the Ada Lovelace architecture, which is different from the Hopper architecture used in the H100 GPU.

As of October 2022, NVLink is being phased out in NVIDIA’s new Ada Lovelace architecture. Neither the GeForce RTX 4090 nor the RTX 6000 Ada supports NVLink.

Background: The NVIDIA Grace Hopper Superchip pairs a power-efficient, high-bandwidth NVIDIA Grace CPU with a powerful NVIDIA H100 Hopper GPU using NVLink-C2C to maximize the capabilities for strong-scaling high-performance computing (HPC) and giant AI workloads.

NVLink-C2C is the enabler for Nvidia’s Grace-Hopper and Grace Superchip systems, with 900GB/s link between Grace and Hopper, or between two Grace chips.

Technical details: One of the major differences in many-core versus multicore architectures is the presence of two different memory spaces: a host space and a device space. In the case of NVIDIA GPUs, the device is supplied with data from the host via one of the multiple memory management API calls provided by the CUDA framework, such as cudaMallocManaged and cudaMemcpy. Modern systems, such as the Summit supercomputer, have the capability to avoid the use of CUDA calls for memory management and access the same data on GPU and CPU. This is done via the Address Translation Services (ATS) technology that gives a unified virtual address space for data allocated with malloc and new if there is an NVLink connection between the two memory spaces.

My comment: CUDA is a proprietary parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). Under normal circumstances, when dynamic memory is allocated and released while the program is running, it may cause memory fragmentation. Over time, this fragmentation can result in insufficient contiguous memory blocks for new allocations, resulting in memory allocation failures or unexpected behaviour. So, it’s hard to say that design limitations won’t arise in the future!
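Fragmentation can be demonstrated with a toy first-fit allocator over a small fixed pool. This is a generic illustration in plain C, not a model of the CUDA allocator; POOL_UNITS and the pool_* functions are hypothetical. It shows the situation described above: enough free memory in total, but no contiguous run large enough for the request.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_UNITS 8

static bool used[POOL_UNITS];   /* one flag per fixed-size unit of the pool */

/* First fit: find n contiguous free units; return start index or -1. */
int pool_alloc(size_t n)
{
    if (n == 0 || n > POOL_UNITS)
        return -1;
    for (size_t start = 0; start + n <= POOL_UNITS; start++) {
        size_t run = 0;
        while (run < n && !used[start + run])
            run++;
        if (run == n) {
            for (size_t i = 0; i < n; i++)
                used[start + i] = true;
            return (int)start;
        }
    }
    return -1;
}

/* Releases n units starting at start (caller passes what it allocated). */
void pool_free(int start, size_t n)
{
    for (size_t i = 0; i < n; i++)
        used[(size_t)start + i] = false;
}

/* Total number of free units, contiguous or not. */
size_t pool_free_units(void)
{
    size_t count = 0;
    for (size_t i = 0; i < POOL_UNITS; i++)
        count += !used[i];
    return count;
}
```

If eight single-unit allocations fill the pool and every second one is then freed, pool_free_units() reports four free units, yet pool_alloc(2) fails because no two free units are adjacent.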

Reference: In CUDA, kernel code is written using the __global__ qualifier and is called from the host code to be executed on the GPU. In summary, cudaMalloc is used in the host code to allocate memory on the GPU, while malloc in host code allocates memory on the CPU. When malloc is called inside kernel code, it allocates from the device heap on the GPU.