Preface: NVIDIA Run:ai is a Kubernetes-native platform for managing and optimizing AI workloads, acquired by NVIDIA in 2024. It provides dynamic orchestration for GPU resources, supporting flexible resource allocation to improve resource utilization and accelerate the AI development lifecycle in hybrid, on-premises, and cloud environments. Prior to its acquisition by NVIDIA in December 2024, Run:ai was an Israeli software startup founded in 2018, focused on AI infrastructure management. Its core product is Kubernetes-based software for managing and optimizing GPU workloads for AI applications, helping enterprises utilize hardware more efficiently. The company’s software assists enterprises in managing GPU resources in on-premises, cloud, and hybrid environments. Run:ai does not “change its role in machine learning,” but rather focuses on the key challenges of AI workload management and GPU orchestration throughout the broader Machine Learning Operations (MLOps) lifecycle.
Background: NVIDIA Run:ai is a Kubernetes-based platform that acts as a GPU orchestrator to maximize the efficiency of AI and machine learning workloads. It addresses challenges like managing expensive GPU resources by enabling dynamic, policy-based scheduling, allowing for the sharing of GPUs across teams and projects, and optimizing workload performance for training, tuning, and inference. Run:ai integrates into existing hybrid or on-premises AI infrastructure to improve GPU utilization and accelerate AI development cycles.
NVIDIA DGX Cloud primarily leverages Kubernetes (K8s) for orchestrating and managing AI workloads.
While DGX systems historically used Docker for containerization, DGX Cloud, particularly for advanced AI workloads and resource management, relies on Kubernetes for its scalability, high-performance computing capabilities, and efficient GPU resource allocation. This is often integrated with other NVIDIA software like NVIDIA NeMo and NVIDIA Run:ai, and deployed on cloud services such as Amazon Elastic Kubernetes Service (Amazon EKS).
Vulnerability details: NVIDIA RunAI for all platforms contains a vulnerability where a user could cause an improper restriction of communications channels on an adjacent network. A successful exploit of this vulnerability might lead to escalation of privileges, data tampering, and information disclosure.
Official announcement: Please refer to the link for details –