Category Archives: AI and ML

CVE-2024-47670: ocfs2 – add bounds checking (10-10-2024)

Preface: OCFS2 is a file system. It allows users to store and retrieve data. The data is stored in files that are organized in a hierarchical directory tree. It is a POSIX-compliant file system that supports the standard interfaces and the behavioral semantics spelled out by that specification.

Background: OCFS2 is a useful clustered file system that has many general purpose uses beyond Oracle workloads. Utilizing shared storage, it can be used for many general computing tasks where shared clustered storage is required.

OCFS2 supports block sizes from 512 bytes to 4 KB. In addition, it supports cluster sizes, which we can also call the allocation unit, in the 4 KB to 1 MB range.

Vulnerability details: In the Linux kernel, the following vulnerability has been resolved: ocfs2: add bounds checking to ocfs2_xattr_find_entry(). Add a paranoia check to make sure the scan does not stray beyond the valid memory region containing ocfs2 xattr entries while looking for a match. This prevents out-of-bounds access in the case of crafted images.
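The idea behind the fix can be shown with a short generic sketch (Python here, not the kernel C code; the entry size and layout are hypothetical): the scan must confirm that each candidate entry lies entirely inside the region before touching it.

ENTRY_SIZE = 16   # hypothetical fixed per-entry size in bytes

def find_entry(region: bytes, wanted_name: bytes):
    """Scan fixed-size entries, refusing to read past the region's end."""
    offset = 0
    # The added bounds check: the whole entry must fit inside the region,
    # so a crafted image cannot steer the scan into out-of-bounds memory.
    while offset + ENTRY_SIZE <= len(region):
        entry = region[offset:offset + ENTRY_SIZE]
        if entry.startswith(wanted_name):
            return offset
        offset += ENTRY_SIZE
    return None   # no match inside the valid region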

Official announcement: Please refer to the link for details –

https://www.tenable.com/cve/CVE-2024-47670

About CVE-2024-33066: Memory corruption while redirecting log file to any file location with any file name, said Qualcomm (8th Oct 2024)

Preface: To redirect the standard error output of a command to a file in the Linux shell, you can use the “2>” operator followed by the name of the file where you want to redirect the stderr. Additionally, you can combine the stderr and stdout streams using the “2>&1” operator if you want to redirect both to the same file.
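The same redirections can be reproduced from Python with the standard library, which makes the semantics easy to see (a minimal sketch; `ls /nonexistent` is just a convenient way to produce an error message):

import subprocess

# Equivalent of `ls /nonexistent 2> err.log`: only stderr goes to the file.
with open("err.log", "w") as err:
    subprocess.run(["ls", "/nonexistent"], stderr=err)

# Equivalent of `ls /nonexistent > all.log 2>&1`: stderr is merged into
# stdout, so both streams end up in the same file.
with open("all.log", "w") as out:
    subprocess.run(["ls", "/nonexistent"], stdout=out, stderr=subprocess.STDOUT)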

Background: Qualcomm Snapdragon X65 5G Modem-RF System is the world’s first 10 Gigabit 5G and first 3GPP Release 16 modem-to-antenna solution. It is designed with an upgradable architecture to rapidly commercialize 5G Release 16 and extend 5G in mobile broadband, fixed wireless, industrial IoT and 5G private network applications.

Vulnerability details: Memory corruption while redirecting log file to any file location with any file name.

Official announcement: Please refer to the link for details – https://docs.qualcomm.com/product/publicresources/securitybulletin/october-2024-bulletin.html

CVE-2024-47561: Apache Avro Java SDK: Arbitrary Code Execution when reading Avro Data (Java SDK)

Preface: Kafka understands only byte arrays. Kafka acts as a broker to convert and transmit data over the network between producers and consumers. But it needs a mechanism to convert data into a format that Kafka, producers, and consumers can understand.

Background: Apache Avro is a powerful data serialization framework that provides many useful features. It uses the AVRO file format, which is a compact binary format suitable for evolving data schemas. For example, it supports schema enforcement and schema transformations, which are essential for data integrity and compatibility.
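Although this CVE concerns the Java SDK, the serialization flow is easiest to show in a few lines of Python (a sketch assuming the avro pip package; the record schema here is invented for illustration, and older avro-python3 releases spell the parser schema.Parse):

import io
import avro.schema
from avro.io import BinaryEncoder, DatumWriter

# An invented record schema for illustration.
schema = avro.schema.parse(
    '{"type": "record", "name": "User",'
    ' "fields": [{"name": "name", "type": "string"}]}'
)

# Serialize one record to Avro's compact binary form.
buf = io.BytesIO()
DatumWriter(schema).write({"name": "alice"}, BinaryEncoder(buf))
payload = buf.getvalue()   # bytes, ready to hand to a Kafka producer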

Vulnerability details: Schema parsing in the Java SDK of Apache Avro 1.11.3 and previous versions allows bad actors to execute arbitrary code. Users are recommended to upgrade to version 1.11.4 or 1.12.0, both of which fix this issue.

Official announcement: Please refer to the link for details – https://lists.apache.org/thread/c2v7mhqnmq0jmbwxqq3r5jbj1xg43h5x

CVE-2024-0123, CVE-2024-0124 & CVE-2024-0125 Interference from the development of supercomputers and artificial intelligence (3rd Oct 2024)

Preface: OpenAI revealed that the project cost $100 million, took 100 days, and used 25,000 NVIDIA A100 GPUs. Each server equipped with these GPUs uses approximately 6.5 kW, so an estimated 50 GWh of energy is consumed during training.

Background: Parallel processing is a method in computing of running two or more processors (CPUs) to handle separate parts of an overall task. Breaking up different parts of a task among multiple processors helps reduce the amount of time it takes to run a program. GPUs render images more quickly than a CPU because of their parallel processing architecture, which allows them to perform multiple calculations across streams of data simultaneously. The CPU is the brain of the operation, responsible for giving instructions to the rest of the system, including the GPU(s).

NVIDIA CUDA provides a simple C/C++-based interface. The CUDA compiler exploits the parallelism built into the CUDA programming model as it compiles your program into GPU code.
CUDA is a parallel computing platform and programming model created by NVIDIA for developing software that runs on its parallel processors. It serves as an alternative to running simulations on traditional CPUs.
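To give a feel for that programming model from Python, here is a minimal sketch using the third-party Numba package (an assumption on my part, not something the bulletin mentions; it needs an NVIDIA GPU and CUDA driver to actually run). Each array element is processed by its own GPU thread:

import numpy as np
from numba import cuda

@cuda.jit
def scale(out, x, k):
    i = cuda.grid(1)           # this thread's global index in the grid
    if i < x.size:             # guard: the grid may be larger than the array
        out[i] = k * x[i]

x = np.arange(1_000_000, dtype=np.float32)
out = np.zeros_like(x)
threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](out, x, 2.0)   # one thread per element

The kernel body is scalar code; the parallelism comes entirely from launching it across many threads, which is exactly what the CUDA compiler exploits.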

Vulnerability details:

CVE-2024-0123 – NVIDIA CUDA Toolkit for Windows and Linux contains a vulnerability in the nvdisasm command line tool, where an attacker may cause improper validation of input by tricking the user into running nvdisasm on a malicious ELF file. A successful exploit of this vulnerability may lead to denial of service. (CWE‑1285 – Improper Validation of Specified Index, Position, or Offset in Input)

CVE-2024-0124 – NVIDIA CUDA Toolkit for Windows and Linux contains a vulnerability in the nvdisasm command line tool, where a user can cause nvdisasm to read freed memory by running it on a malformed ELF file. A successful exploit of this vulnerability might lead to a limited denial of service. (CWE-416 – Use After Free)

CVE-2024-0125 – NVIDIA CUDA Toolkit for Windows and Linux contains a vulnerability in the nvdisasm command line tool, where a user can cause a NULL pointer dereference by running nvdisasm on a malformed ELF file. A successful exploit of this vulnerability might lead to a limited denial of service. (CWE-476 – NULL Pointer Dereference)

Official announcement: Please refer to the vendor announcement for details –

https://nvidia.custhelp.com/app/answers/detail/a_id/5577

CVE-2024-0116: NVIDIA Triton Inference Server contains a vulnerability where a user may cause an out-of-bounds read (2nd Oct 2024)

Preface: Some systems which implement malloc() may not release memory back to the operating system right away, which can look like a memory leak.

Background: Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports an HTTP/REST and gRPC protocol that allows remote clients to request inferencing for any model being managed by the server.

Vulnerability details: NVIDIA Triton Inference Server contains a vulnerability where a user may cause an out-of-bounds read issue by releasing a shared memory region while it is in use. A successful exploit of this vulnerability may lead to denial of service.
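The failure mode, releasing a shared memory region that a request is still using, can be sketched with Python's standard library (a conceptual analogue only; this is not Triton's code or its client API):

import threading
from multiprocessing import shared_memory

class GuardedRegion:
    """Refuse to release a shared memory region while readers are active."""

    def __init__(self, name, size):
        self._shm = shared_memory.SharedMemory(name=name, create=True, size=size)
        self._readers = 0
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            self._readers += 1
        return self._shm.buf           # a view a request may read from

    def release_reader(self):
        with self._lock:
            self._readers -= 1

    def unregister(self):
        # The hardening the advisory implies: a release while the region
        # is still in use must fail cleanly, otherwise later reads touch
        # memory that is no longer valid (the out-of-bounds read).
        with self._lock:
            if self._readers > 0:
                raise RuntimeError("region is still in use")
            self._shm.close()
            self._shm.unlink()

region = GuardedRegion("demo_region", 1024)
buf = region.acquire()
try:
    region.unregister()                # refused: a reader is still active
except RuntimeError as err:
    print("refused:", err)
region.release_reader()
region.unregister()                    # now safe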

Official announcement: Please refer to the vendor announcement for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5565

CVE-2024-0132: NVIDIA Container Toolkit 1.16.1 or earlier contains a Time-of-check Time-of-Use (TOCTOU) vulnerability (25th Sep 2024)

Preface: In software development, time-of-check to time-of-use (TOCTOU, TOCTTOU or TOC/TOU) is a class of software bugs caused by a race condition involving the checking of the state of a part of a system (such as a security credential) and the use of the results of that check.
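A minimal POSIX-flavoured sketch of the bug class (the path and the check are invented for illustration):

import os

path = "/tmp/report.txt"
with open(path, "w") as f:            # ensure the file exists for the demo
    f.write("demo")

# Time of check: inspect the path...
if os.access(path, os.R_OK):
    # ...race window: an attacker can swap the path (e.g. for a symlink)...
    # Time of use: open() acts on whatever the path points to *now*.
    with open(path) as f:
        data = f.read()

# Safer pattern: open first, then validate the object actually opened.
fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)   # refuse to follow symlinks
try:
    info = os.fstat(fd)               # check the opened file, not the path
finally:
    os.close(fd)

Checking the file descriptor instead of the path closes the window, because the descriptor cannot be swapped after the open.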

Background: The NVIDIA container stack is architected so that it can be targeted to support any container runtime in the ecosystem. The components of the stack include:

-The NVIDIA Container Runtime (nvidia-container-runtime)

-The NVIDIA Container Runtime Hook (nvidia-container-toolkit / nvidia-container-runtime-hook)

-The NVIDIA Container Library and CLI (libnvidia-container1, nvidia-container-cli)

The components of the NVIDIA container stack are packaged as the NVIDIA Container Toolkit.

The NVIDIA Container Toolkit is a key component in enabling Docker containers to leverage the raw power of NVIDIA GPUs. This toolkit allows for the integration of GPU resources into your Docker containers.

Vulnerability details: NVIDIA Container Toolkit 1.16.1 or earlier contains a Time-of-check Time-of-Use (TOCTOU) vulnerability when used with default configuration where a specifically crafted container image may gain access to the host file system. This does not impact use cases where CDI is used. A successful exploit of this vulnerability may lead to code execution, denial of service, escalation of privileges, information disclosure, and data tampering.

Official announcement: Please refer to the vendor announcement for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5582

CVE-2024-39928: Vulnerability in Apache Linkis – Spark EngineConn (25-09-2024)

Preface: Apache Linkis is a computation middleware that acts as a layer between upper-level applications and underlying engines, such as Apache Spark, Apache Hive and Apache Flink. It started as an Apache Incubator project in 2021 and graduated to a Top Level Project in January 2023.

Background: Linkis provides standardized interfaces (REST, JDBC, WebSocket etc.) to easily connect to various underlying engines (Spark, Presto, Flink, etc.), and acts as a proxy between the upper applications layer and underlying engines layer.

Ref: Py4J is a popular library integrated within PySpark that lets Python interface dynamically with JVM objects (such as RDDs). Apache Spark comes with an interactive shell for Python, as it does for Scala; the Python shell is known as “PySpark”. To use PySpark you must have Python installed on your machine.
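A minimal local session shows the moving parts (a sketch; it assumes the pyspark package and a JVM are installed):

from pyspark.sql import SparkSession

# Starting a local session launches a JVM and a Py4J gateway; the gateway
# is protected by the startup token discussed in the vulnerability below.
spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6], computed in the JVM
spark.stop()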

Vulnerability details: In Apache Linkis <= 1.5.0, the Spark EngineConn is affected by a weak-randomness vulnerability: the random-string token generated when starting Py4j is produced with Commons Lang’s RandomStringUtils, which is not cryptographically secure.

Remedy: Users are recommended to upgrade to version 1.6.0, which fixes this issue.
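The weakness is easiest to see by contrast (a generic Python sketch, not Linkis code; the Java analogue of the secrets module is SecureRandom):

import random
import secrets
import string

# The risky pattern: a token from a general-purpose PRNG. Commons Lang's
# RandomStringUtils sits on java.util.Random, whose output can be predicted
# once an attacker recovers the generator's internal state.
weak_token = "".join(random.choices(string.ascii_letters + string.digits, k=32))

# The usual remedy: a token from a cryptographically secure generator.
strong_token = secrets.token_urlsafe(32)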

Official announcement: Please refer to the vendor announcement for details – https://lists.apache.org/thread/g664n13nb17rsogcfrn8kjgd8m89p8nw

CVE-2024-8375: A vulnerability has been found in Google DeepMind Reverb object deserialization. (20-09-2024)

Preface: As companies and researchers leave TensorFlow and move to PyTorch, Google seems interested in moving its products to JAX to solve some of TensorFlow’s pain points, such as the complexity of the API and the difficulty of training on custom chips such as TPUs.

PyTorch optimizes performance by taking advantage of Python’s native support for asynchronous execution. In TensorFlow, you have to manually code and fine-tune every operation to be performed on a specific device to allow for distributed training.

Background: TensorFlow is an open-source Python library designed by Google for developing machine learning models and deep learning neural networks.

Reverb is an efficient and easy-to-use data storage and transport system designed for machine learning research. Reverb is primarily used as an experience replay system for distributed reinforcement learning algorithms but the system also supports multiple data structure representations such as FIFO, LIFO, and priority queues.

Vulnerability details: This is a retroactive issue for the already-fixed security vulnerability. The Reverb Server stores Tensors represented by protos. These protos contain type information as well as a string field called “tensor_content”. When the Reverb client communicates with the server, it unpacks these protos by turning them back into tensors.

Reverb supports the VARIANT datatype, which is supposed to represent an arbitrary object in C++. When a tensor proto of type VARIANT is unpacked, memory is first allocated to store the entire tensor, and a constructor is called on each instance. Afterwards, Reverb copies the content of tensor_content into the previously mentioned pre-allocated memory, which results in the bytes in tensor_content overwriting the vtable pointers of all the objects that were previously allocated.

Reverb exposes two relevant gRPC endpoints: InsertStream and SampleStream. By default, neither is authenticated, and there is no authorization. The attacker can insert such a stream into the server’s database; when a client next calls SampleStream it will unpack the tensor into RAM, and when any method on that object is called (including its destructor) the attacker gains control of the program counter.
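A minimal server and client, adapted from the public Reverb README (this sketch assumes the dm-reverb package, which is Linux-only; the table name and values are illustrative):

import reverb

# Nothing below configures authentication: by default the InsertStream and
# SampleStream gRPC endpoints accept any caller that can reach the port,
# which is what gives the deserialization bug its reach.
server = reverb.Server(tables=[
    reverb.Table(
        name='replay',
        sampler=reverb.selectors.Uniform(),
        remover=reverb.selectors.Fifo(),
        max_size=100,
        rate_limiter=reverb.rate_limiters.MinSize(1),
    ),
])

client = reverb.Client(f'localhost:{server.port}')
client.insert([0, 1], priorities={'replay': 1.0})    # uses InsertStream
print(list(client.sample('replay', num_samples=1)))  # uses SampleStream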

Official announcement: Please refer to the vendor announcement for details – https://www.tenable.com/cve/CVE-2024-8375

CVE-2024-44961: Forward soft recovery errors to userspace (5th Sep 2024)

Preface: The AMD Radeon Instinct™ MI50 server accelerator, designed on the world’s first 7nm FinFET technology process, brings customers a full feature set based on the industry’s newest technologies. The MI50 is AMD’s workhorse accelerator offering, ideal for large-scale deep learning. It delivers up to 26.5 TFLOPS of native half-precision (FP16) or up to 13.3 TFLOPS of single-precision (FP32) peak floating-point performance with INT8 support. Combined with 16GB or 32GB of high-bandwidth HBM2 ECC memory, the MI50 gives customers the finely balanced performance needed for enterprise-class, mid-range compute capable of training complex neural networks for a variety of demanding deep learning applications in a cost-effective design.

Background: The drm/amdgpu driver supports all AMD Radeon GPUs based on the Graphics Core Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.

CDNA (Compute DNA) is a compute-centered graphics processing unit (GPU) microarchitecture designed by AMD for datacenters.

AMD CDNA architecture is supported by AMD ROCm™, an open software stack that includes a broad set of programming models, tools, compilers, libraries, and runtimes for AI and HPC solution development targeting AMD Instinct accelerators.

Vulnerability details: In the Linux kernel, the following vulnerability has been resolved: drm/amdgpu: Forward soft recovery errors to userspace. As discussed before [1], soft recovery errors should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers, cascading us into a hard reset.

[1] https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

(cherry picked from commit 434967aadbbbe3ad9103cc29e9a327de20fdba01)

Official announcement: Please refer to the website for details – https://nvd.nist.gov/vuln/detail/CVE-2024-44961

CVE‑2024-0110: Supercomputer and AI development Interlude (II)  (30th Aug 2024)

Preface: OpenAI revealed that the project cost $100 million, took 100 days, and used 25,000 NVIDIA A100 GPUs. Each server equipped with these GPUs uses approximately 6.5 kW, so an estimated 50 GWh of energy is consumed during training.

Background: Parallel processing is a method in computing of running two or more processors (CPUs) to handle separate parts of an overall task. Breaking up different parts of a task among multiple processors helps reduce the amount of time it takes to run a program. GPUs render images more quickly than a CPU because of their parallel processing architecture, which allows them to perform multiple calculations across streams of data simultaneously. The CPU is the brain of the operation, responsible for giving instructions to the rest of the system, including the GPU(s).

NVIDIA CUDA provides a simple C/C++-based interface. The CUDA compiler exploits the parallelism built into the CUDA programming model as it compiles your program into GPU code.
CUDA is a parallel computing platform and programming model created by NVIDIA for developing software that runs on its parallel processors. It serves as an alternative to running simulations on traditional CPUs.

Vulnerability details:

CVE-2024-0110: NVIDIA CUDA Toolkit contains a vulnerability in the command `cuobjdump` where a user may cause an out-of-bounds write by passing in a malformed ELF file. A successful exploit of this vulnerability may lead to code execution or denial of service.

CWE-787 (Out-of-bounds Write) – Impact: code execution, denial of service. Severity: Medium.

Ref: The integer overflow may result in an out-of-bounds write.
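The generic shape of that bug (a Python sketch of the C pattern, not NVIDIA’s code; the header layout and limit are invented):

import struct

MAX_SECTION = 0x10000   # hypothetical parser limit

def parse_table(header: bytes) -> None:
    count, entry_size = struct.unpack_from("<II", header)
    # Buggy pattern: in 32-bit C arithmetic the product wraps modulo 2**32,
    # so a crafted header can make `total` pass the size check while a
    # per-entry loop still writes `count` entries past the buffer.
    total = (count * entry_size) & 0xFFFFFFFF
    if total > MAX_SECTION:
        raise ValueError("section too large")
    # Safe pattern: bound the operands before multiplying.
    if entry_size == 0 or count > MAX_SECTION // entry_size:
        raise ValueError("implausible entry table")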

Official announcement: Please refer to the vendor announcement for details – https://nvidia.custhelp.com/app/answers/detail/a_id/5564