Preface: LLMs are built on machine learning: specifically, a type of neural network called a transformer model. How do LLMs read PDFs? The first step was to extract the text blocks from the PDF using pdfplumber . Each text block came with its coordinates, which allowed to analyze their spatial relationships. Next, I created a “window” around each text block to capture its surrounding context.
Background: MuPDF is not widely known by consumers as a popular standalone application, but it is popular and growing in popularity among developers, particularly those working with Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, due to its powerful and lightweight nature.
Large Language Models (LLMs) do not directly “read” PDF files in their native binary format. Instead, they interact with the extracted content of the PDF. MuPDF, through its Python binding PyMuPDF (or its specialized variant PyMuPDF4LLM), plays a crucial role in this process by enabling efficient and accurate extraction of information from PDFs.
Vulnerability details: A null pointer dereference occurs in the function break_word_for_overflow_wrap() in MuPDF 1.26.4 when rendering a malformed EPUB document. Specifically, the function calls fz_html_split_flow() to split a FLOW_WORD node, but does not check if node->next is valid before accessing node->next->overflow_wrap, resulting in a crash if the split fails or returns a partial node chain.
Official announcement: For more details, see the link