I'm working on a project where I need to extract only the bold text from PDF files using Python. At first, I tried using libraries like PyMuPDF (fitz) and pdfminer, extracting the PDF as HTML and analyzing tags or CSS font-weight: bold styles.
However, I realized that most of the PDFs I receive are either scanned images or, even when they contain embedded text, do not preserve bold/italic formatting internally. The text looks bold visually, but the underlying text data is flat, with no font-weight, tags, or metadata indicating style.
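For reference, the embedded-text check described above can be sketched roughly as follows. It relies on the span dictionaries returned by PyMuPDF's `page.get_text("dict")`, where each span carries a `font` name and an integer `flags` field whose bit 4 (value 16) marks bold. The helper names (`is_bold_span`, `bold_text_from_page_dict`, `bold_text_from_pdf`) are hypothetical, and matching "bold" in the font name is only a heuristic. This is a minimal sketch, not a definitive implementation, and it returns nothing useful for scanned pages or for files whose fonts carry no weight information.

```python
BOLD_FLAG = 1 << 4  # bit 4 of a PyMuPDF span's "flags" marks bold (value 16)


def is_bold_span(span):
    """Heuristic: bold flag is set, or the font name mentions a bold weight."""
    font = span.get("font", "").lower()
    return bool(span.get("flags", 0) & BOLD_FLAG) or "bold" in font


def bold_text_from_page_dict(page_dict):
    """Collect the text of bold spans from a page.get_text("dict") result."""
    found = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if text and is_bold_span(span):
                    found.append(text)
    return found


def bold_text_from_pdf(path):
    """Return one list of bold strings per page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(path) as doc:
        return [bold_text_from_page_dict(page.get_text("dict")) for page in doc]
```

If this returns empty lists for a page that visibly contains bold text, the style information is not in the file and has to be recovered from the rendered image instead.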
I also considered using OCR solutions like ocrmypdf, which successfully extract selectable text from image-based PDFs. However, they do not differentiate normal text from bold or italic — the OCR engines (like Tesseract) just output plain text without style information.
My goal is not just to extract text, but specifically to detect and separate bold parts from the rest, even if it requires image analysis.
Is there a practical way to do this in Python?
Specifically:
Can OpenCV or similar libraries be used to detect bold text visually (based on line thickness, font weight, or stroke density)?
Are there OCR engines or extensions that preserve font style (bold, italic) after recognition?
Would it make sense to use LLMs with Vision capabilities (like GPT-4 Vision, Claude 3 Vision) to identify bold text directly from the rendered page?
Is there any best practice or pipeline recommendation for projects like this?
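To make the image-analysis question concrete, here is a minimal sketch of the kind of stroke-thickness heuristic I have in mind. It assumes each word has already been cropped and binarized (for example with OpenCV thresholding plus OCR word boxes) into a grid of 0/1 values where 1 means ink. The function names and the `ratio=1.3` threshold are assumptions for illustration, not a tested method. It is written in pure Python so the logic is easy to follow; real code would likely operate on NumPy arrays.

```python
from statistics import median


def horizontal_run_lengths(binary):
    """Lengths of consecutive ink pixels (value 1) along each row."""
    runs = []
    for row in binary:
        count = 0
        for px in row:
            if px:
                count += 1
            elif count:
                runs.append(count)
                count = 0
        if count:  # run that reaches the right edge of the row
            runs.append(count)
    return runs


def stroke_width(binary):
    """Median horizontal run length, a rough proxy for stroke thickness."""
    runs = horizontal_run_lengths(binary)
    return median(runs) if runs else 0


def flag_bold_words(word_images, ratio=1.3):
    """Indices of words whose stroke width exceeds ratio * the overall median.

    ratio=1.3 is an assumed threshold and would need tuning per document.
    """
    widths = [stroke_width(img) for img in word_images]
    inked = [w for w in widths if w > 0]
    if not inked:
        return []
    baseline = median(inked)
    return [i for i, w in enumerate(widths) if w > ratio * baseline]
```

The median is used because horizontal strokes produce a few very long runs that would distort a mean. Comparing against the page-wide median also assumes that most words are regular weight and roughly the same font size, which may not hold for headings.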
I appreciate any advice, experience, or code samples you could share. Thanks in advance!
[I used ChatGPT to help me write this question, based on my chat history from several days of trying to solve this problem]
Edit
Example PDFs for testing: https://drive.google.com/drive/folders/1yYl3fbTdYaw7cyVC9Zrl_YB0AgE4l35c?usp=sharing
About the project: I'm developing a microservice (with Python (PyMuPDF) and Flask) for a client who needs to extract questions from public exams (I'm in Brazil). Each PDF exam has around 30 to 100 questions. I'm using LLMs to process the questions and extract details such as the discipline, the correct answer, and the subject. Right now I'm sending the raw text (from PyMuPDF) to the LLM, but I can't process questions that rely on bold text and indentation. I think it's the last step of the project.