
Preface: NeMo Curator, part of the NVIDIA NeMo software suite for managing the AI agent lifecycle, is a Python library specifically designed for fast and scalable data processing and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT).
Background: To install the NeMo Curator library from source, run the following commands:
- git clone https://github.com/NVIDIA/NeMo-Curator.git
- cd NeMo-Curator
- pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
Data download: The download pipeline in NeMo Curator consists of the following classes:
- DocumentDownloader: Abstract class for downloading remote data to disk.
- DocumentIterator: Abstract class for reading raw dataset records from disk.
- DocumentExtractor: Abstract class for extracting text records, as well as any relevant metadata, from the records on disk.
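To make the division of labour between the three classes concrete, here is a minimal, self-contained sketch. It does not import NeMo Curator; the method names (download, iterate, extract), their signatures, the concrete subclasses and the run_pipeline helper are all assumptions made for illustration and may differ from the library's real interfaces.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Iterator, List, Optional


# Illustrative stand-ins only: method names and signatures are assumptions,
# not the actual NeMo Curator API.
class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url: str) -> str:
        """Fetch remote data and return the local file path."""


class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path: str) -> Iterator[str]:
        """Yield raw records read from a file on disk."""


class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, content: str) -> Optional[Dict[str, str]]:
        """Turn one raw record into text plus metadata, or None to skip it."""


class LocalCopyDownloader(DocumentDownloader):
    """Toy downloader: treats the 'url' as a local path and copies the file."""

    def __init__(self, download_dir: str) -> None:
        self.download_dir = Path(download_dir)

    def download(self, url: str) -> str:
        source = Path(url)
        destination = self.download_dir / source.name
        destination.write_bytes(source.read_bytes())
        return str(destination)


class JsonlIterator(DocumentIterator):
    """Yields each non-empty line of a JSONL file as one raw record."""

    def iterate(self, file_path: str) -> Iterator[str]:
        with open(file_path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


class JsonTextExtractor(DocumentExtractor):
    """Keeps records that are JSON objects with a string 'text' field."""

    def extract(self, content: str) -> Optional[Dict[str, str]]:
        record = json.loads(content)
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            return None
        return {"id": str(record.get("id", "")), "text": record["text"]}


def run_pipeline(
    url: str,
    downloader: DocumentDownloader,
    iterator: DocumentIterator,
    extractor: DocumentExtractor,
) -> List[Dict[str, str]]:
    """Chain the three stages: download, then iterate, then extract."""
    local_path = downloader.download(url)
    documents = []
    for raw_record in iterator.iterate(local_path):
        document = extractor.extract(raw_record)
        if document is not None:
            documents.append(document)
    return documents
```

The point of the sketch is that file contents written by the downloader flow unmodified into the iterator and extractor, so those two stages are where untrusted input is first interpreted.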
Vulnerability details: NVIDIA NeMo Curator for all platforms contains a vulnerability where a malicious file created by an attacker could allow code injection. A successful exploit of this vulnerability might lead to code execution, escalation of privileges, information disclosure, and data tampering.
Ref: The vulnerability arises when NeMo Curator loads malicious files, such as JSONL files. If a file is crafted to exploit weaknesses in how NeMo Curator parses or processes it, the file can inject executable code. Possible attack vectors include:
- Embedded malicious payloads in JSONL files.
- JSON injection attacks exploiting parsing logic.
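As a generic hardening illustration for the two vectors listed above, the sketch below treats every JSONL line as untrusted: it parses with the standard-library json module only (no pickle or eval), enforces a size limit, and rejects any record that does not match a strict allow-list of fields. This is not NVIDIA's fix, and the advisory does not describe the root cause at this level of detail; the field names, the size limit and the helper names are assumptions chosen for the example.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple

# Assumed limits and schema for this example; adjust to the real dataset.
MAX_LINE_BYTES = 1_000_000
ALLOWED_FIELDS = {"id": (str, int), "text": (str,)}


class UntrustedRecordError(ValueError):
    """Raised when a JSONL line fails validation."""


def parse_untrusted_record(line: str) -> Dict[str, Any]:
    """Parse one JSONL line and accept it only if it matches ALLOWED_FIELDS."""
    if len(line.encode("utf-8")) > MAX_LINE_BYTES:
        raise UntrustedRecordError("line exceeds size limit")
    try:
        # json.loads only builds plain data (dict, list, str, numbers, ...);
        # it never executes code contained in the input.
        record = json.loads(line)
    except (json.JSONDecodeError, RecursionError) as exc:
        raise UntrustedRecordError("invalid JSON") from exc
    if not isinstance(record, dict):
        raise UntrustedRecordError("record is not a JSON object")
    unexpected = set(record) - set(ALLOWED_FIELDS)
    if unexpected:
        raise UntrustedRecordError(f"unexpected fields: {sorted(unexpected)}")
    for field, allowed_types in ALLOWED_FIELDS.items():
        if field not in record:
            raise UntrustedRecordError(f"missing field: {field}")
        value = record[field]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed_types):
            raise UntrustedRecordError(f"bad type for field: {field}")
    return record


def load_untrusted_jsonl(lines: Iterable[str]) -> Tuple[List[Dict[str, Any]], int]:
    """Return (accepted records, number of rejected lines); blank lines are skipped."""
    accepted: List[Dict[str, Any]] = []
    rejected = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            accepted.append(parse_untrusted_record(line))
        except UntrustedRecordError:
            rejected += 1
    return accepted, rejected
```

Validation of this kind limits what a crafted record can carry into later processing stages, but it does not replace upgrading to the patched release named in the official advisory.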
Official announcement: Please see the link for details –