Product Requirements Document: PDF Text Extractor
Introduction
This document outlines the requirements for a software solution designed to automate the extraction of text from scanned PDF documents. The system will process image-based PDFs, extract their textual content, and save the results as formatted markdown files. The primary goals are to provide a reliable, automated, and local solution for digitizing document content for use with tools like Obsidian.
Project Goals
The core objective is to create a robust, file-based automation system that performs the following tasks:
- Process a designated input directory for new PDF files when the script is run.
- Extract text, including tabular data, from image-based PDFs using Optical Character Recognition (OCR).
- Detect the primary language (English or Hungarian) of the document to optimize OCR.
- Generate a well-structured markdown file for each successful extraction.
- Manage input files by moving them to success or failure directories.
- Log all events and errors.
Functional Requirements
File System Structure
The application will operate using a predefined folder structure:
- inbox/: The directory where all new PDF files are placed for processing.
- processed/: The destination for PDF files that have been successfully processed.
- texts/: The destination for the generated markdown files.
- failed/: The destination for PDF files that the system failed to process.
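The folder layout above could be prepared at the start of each run with something like the following minimal sketch. The helper names `ensure_folders` and `pending_pdfs` are hypothetical, and placing all four folders under a single root directory is an assumption.

```python
from pathlib import Path

# Folder names are taken from the structure listed above.
FOLDERS = ("inbox", "processed", "texts", "failed")


def ensure_folders(root: Path) -> dict[str, Path]:
    """Create any missing working folders under root and return their paths."""
    paths = {name: root / name for name in FOLDERS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def pending_pdfs(inbox: Path) -> list[Path]:
    """List the PDF files waiting in the inbox, in sorted order."""
    return sorted(
        p for p in inbox.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Matching the extension case-insensitively means files such as Scan.PDF are picked up as well.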
Processing Logic
The system will be run manually. When executed, the script will process all PDF files found in the inbox/ directory. It will iterate through each file and initiate the following sequence:
- Read and Analyze: The system will open the PDF file to determine its content.
- Language Detection: Before OCR, the system will detect the document’s main language as either English or Hungarian. This can be based on the presence of accented characters (diacritics) or a list of common stop words.
- OCR and Extraction: Using the recommended docling tool, the system will perform OCR on the images within the PDF to extract all text.
  - Tables: Tabular data will be recognized and converted into a markdown table format.
  - Images: Any non-textual images (e.g., photos) will be ignored and not extracted.
- Markdown Generation: If text is successfully extracted, a markdown file will be created. Its filename must match the original PDF's base name, with the extension changed to .md (e.g., SampleDoc.pdf becomes SampleDoc.md).
- YAML Frontmatter: Each markdown file will include the following YAML frontmatter at the top:
  ---
  title: [Document Title]
  source: [Original PDF Filename]
  detected_language: [en or hu]
  page_count: [Number of pages]
  processed_at: [ISO 8601 Timestamp]
  ---
- Error Handling:
  - If a PDF is found to contain only non-textual images (no extractable text), a markdown file will not be created, and the PDF will be moved to the processed/ folder.
  - If the system fails to process a PDF for any reason, the file will be moved to the failed/ folder.
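The sequence above can be sketched roughly as follows. This is an illustration under several assumptions, not a definitive implementation:

- The `extract` callable is a hypothetical stand-in for the docling OCR step and returns the extracted markdown text plus a page count.
- The stop-word lists and the 2% diacritic threshold are illustrative guesses.
- The document title is taken from the PDF's base name.
- The four working folders already exist.
- For simplicity, the language is detected on the extracted text. The requirement places detection before OCR, which would need a text sample from an earlier pass.

```python
import json
import re
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Tuple

# Accented letters used in Hungarian but not in English.
HU_CHARS = set("áéíóöőúüűÁÉÍÓÖŐÚÜŰ")
# Short illustrative stop-word lists (assumption); real lists would be longer.
HU_STOP = {"és", "hogy", "nem", "egy", "az", "van", "meg", "ez"}
EN_STOP = {"the", "and", "of", "to", "in", "that", "for", "with"}


def detect_language(text: str) -> str:
    """Return 'hu' or 'en' based on diacritic share and stop-word counts."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hu_hits = sum(w in HU_STOP for w in words)
    en_hits = sum(w in EN_STOP for w in words)
    letters = [c for c in text if c.isalpha()]
    accent_share = sum(c in HU_CHARS for c in letters) / max(len(letters), 1)
    # The 2% threshold is an assumption, not a value from the requirements.
    if accent_share > 0.02 or hu_hits > en_hits:
        return "hu"
    return "en"


def build_frontmatter(title: str, source: str, lang: str, pages: int) -> str:
    """Render the YAML frontmatter fields listed above."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    # JSON-style quoting keeps titles containing ':' or '#' valid in YAML.
    return (
        "---\n"
        f"title: {json.dumps(title, ensure_ascii=False)}\n"
        f"source: {json.dumps(source, ensure_ascii=False)}\n"
        f"detected_language: {lang}\n"
        f"page_count: {pages}\n"
        f"processed_at: {stamp}\n"
        "---\n\n"
    )


def process_pdf(pdf: Path, root: Path,
                extract: Callable[[Path], Tuple[str, int]]) -> str:
    """Process one PDF and return the name of the folder it was moved to."""
    try:
        text, pages = extract(pdf)  # hypothetical OCR step
        if text.strip():
            lang = detect_language(text)
            header = build_frontmatter(pdf.stem, pdf.name, lang, pages)
            out_path = root / "texts" / f"{pdf.stem}.md"
            out_path.write_text(header + text, encoding="utf-8")
        # PDFs with no extractable text also go to processed/, without a .md.
        target = "processed"
    except Exception:
        target = "failed"
    shutil.move(str(pdf), str(root / target / pdf.name))
    return target
```

Passing the OCR step in as a callable keeps the routing logic testable without docling installed. A real run would supply a wrapper around docling's converter, whose exact usage should be checked against the docling documentation.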
Post-Processing
A post-processing step will be implemented to fix any markdown linting errors to ensure the generated files are clean and correctly formatted.
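The requirements do not name a linter or a rule set, so the following is only a sketch of what such a step might cover. The function name `tidy_markdown` and the choice of rules are assumptions. It strips trailing whitespace, collapses repeated blank lines, and ends the file with a single newline. Because it treats all trailing spaces as noise, it would also remove intentional two-space hard line breaks.

```python
def tidy_markdown(text: str) -> str:
    """Apply a few common markdown lint fixes outside fenced code blocks."""
    out: list[str] = []
    in_fence = False
    for raw in text.splitlines():
        # A line starting with ``` opens or closes a fenced code block.
        if raw.lstrip().startswith("```"):
            in_fence = not in_fence
        # Leave the contents of code fences untouched.
        line = raw if in_fence else raw.rstrip()
        # Collapse runs of blank lines into a single blank line.
        if not in_fence and line == "" and out and out[-1] == "":
            continue
        out.append(line)
    # Remove blank lines at the end, then finish with exactly one newline.
    while out and out[-1] == "":
        out.pop()
    return "\n".join(out) + "\n"
```

A dedicated linter with an auto-fix mode would cover many more rules; this sketch only shows the shape of the step.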
Technical Requirements
Technology Stack
- Core Tool: docling will be used for the OCR and document analysis.
- Platform: The solution must be able to run locally on a MacBook Pro.
- Hardware Acceleration: The system will attempt to utilize the GPU for processing to improve performance. If GPU acceleration is not available or fails, it will fall back to using the CPU.
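The GPU-then-CPU fallback can be expressed as a small retry wrapper. In this sketch the device labels "gpu" and "cpu" are placeholders, and `task` is a hypothetical callable that runs the OCR step on the named device. How a device is actually selected depends on the OCR library and is not shown here.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def run_with_fallback(task: Callable[[str], T],
                      devices: Sequence[str] = ("gpu", "cpu")) -> T:
    """Try task on each device in order and return the first success."""
    last_error: Optional[Exception] = None
    for device in devices:
        try:
            return task(device)
        except Exception as exc:
            # A real script would log the failure before trying the next device.
            last_error = exc
    raise RuntimeError(f"all devices failed: {list(devices)}") from last_error
```

For example, `run_with_fallback(lambda device: convert(pdf, device))` would try the GPU first and retry on the CPU if that attempt raises, where `convert` is again a hypothetical function.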
Logging
- A plain text log file (run.log) will be maintained.
- The log will record key events, including file detection, language detection, successful processing, errors, and file movements, along with timestamps.
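One way to meet these logging requirements is Python's standard logging module writing to run.log. The logger name `pdf_extractor` and the line format below are assumptions; the requirements only ask for a plain text file with timestamps.

```python
import logging
from pathlib import Path


def make_logger(log_path: Path) -> logging.Logger:
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("pdf_extractor")  # name is an assumption
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate lines if called twice
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A call such as `make_logger(Path("run.log")).info("moved %s to %s", "SampleDoc.pdf", "processed/")` would then append one timestamped line per event.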
Success Metrics
The project will be considered successful if it achieves the following:
- High Success Rate: A minimum of 95% of valid, text-containing PDFs are successfully processed.
- Accuracy: The extracted text and tables are highly accurate and readable.
- Automation: Once a run is started, the system processes every file in inbox/ without further manual intervention.
- Compatibility: The generated markdown files are fully compatible with Obsidian.