
I'm working on a project where I need to extract only the bold text from PDF files using Python. At first, I tried using libraries like PyMuPDF (fitz) and pdfminer, extracting the PDF as HTML and analyzing tags or CSS font-weight: bold styles.
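For reference, the kind of check I was doing with PyMuPDF looks like the sketch below: `page.get_text("dict")` returns nested blocks/lines/spans, and each span carries a `flags` bitfield where bit 4 (value 16) marks a bold font. This only helps for PDFs that really embed font info; the file path is just a placeholder.

```python
def is_bold(flags: int) -> bool:
    """PyMuPDF span flags: bit 4 (value 16) marks a bold font."""
    return bool(flags & 16)

def bold_spans(pdf_path):
    """Yield (text, is_bold) for every text span in the PDF."""
    import fitz  # PyMuPDF; imported lazily so is_bold() is usable without it
    doc = fitz.open(pdf_path)
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):
                for span in line["spans"]:
                    yield span["text"], is_bold(span["flags"])
```

As described above, this yields nothing useful on scanned PDFs or PDFs where the "bold" look is baked into the glyphs rather than the font metadata.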

However, I realized that most PDFs I receive are scanned images, or even when they contain embedded text, they do not preserve bold/italic formatting internally. They just look bold visually, but the underlying text data is flat — without any font-weight, tags, or metadata indicating style.

I also considered using OCR solutions like ocrmypdf, which successfully extract selectable text from image-based PDFs. However, they do not differentiate normal text from bold or italic — the OCR engines (like Tesseract) just output plain text without style information.

My goal is not just to extract text, but specifically to detect and separate bold parts from the rest, even if it requires image analysis.

Is there a practical way to do this in Python?

Specifically:

Can OpenCV or similar libraries be used to detect bold text visually (based on line thickness, font weight, or stroke density)?

Are there OCR engines or extensions that preserve font style (bold, italic) after recognition?

Would it make sense to use LLMs with Vision capabilities (like GPT-4 Vision, Claude 3 Vision) to identify bold text directly from the rendered page?

Is there any best practice or pipeline recommendation for projects like this?
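For the visual route (question 1 above), the heuristic I'm imagining is stroke thickness: erode the binarized ink of each word until it vanishes, and count the erosion steps — bold glyphs survive more erosions than regular ones. This is an untested pure-NumPy sketch for illustration; in practice you would crop word boxes from an OCR pass (e.g. `pytesseract.image_to_data`), binarize them, and normalize the count by glyph height before comparing.

```python
import numpy as np

def erode(mask: np.ndarray) -> np.ndarray:
    """One step of binary erosion with a 3x3 square structuring element."""
    out = mask.copy()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def stroke_thickness(mask: np.ndarray) -> int:
    """Count erosions until the ink disappears -- a proxy for stroke width.

    mask is a boolean array where True = ink. A zero border is added so
    np.roll's wraparound cannot connect opposite edges.
    """
    mask = np.pad(mask.astype(bool), 1)
    steps = 0
    while mask.any():
        mask = erode(mask)
        steps += 1
    return steps
```

A word whose median stroke thickness is noticeably above the page's baseline (say, 1.5x) would be a bold candidate; the exact threshold has to be tuned per document.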

I appreciate any advice, experience, or code samples you could share. Thanks in advance!

[I used ChatGPT to help me generate this question based on my chat history from days of trying to solve this problem]

Edit

PDFs for example and tests : https://drive.google.com/drive/folders/1yYl3fbTdYaw7cyVC9Zrl_YB0AgE4l35c?usp=sharing

About the project: I'm developing a microservice (with Python (PyMuPDF) and Flask) for a client who needs to extract questions from public exams (I'm in Brazil). Each PDF exam has around 30 to 100 questions. I'm using LLMs to process the questions and extract information (like discipline, the correct answer, and subject). Right now I'm sending raw text (PyMuPDF) to the LLM, but I can't process questions with bold and indentation. I think it's the last step of the project.
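Since the pipeline already uses PyMuPDF, one option I'm considering for the born-digital PDFs is to keep the bold information by converting each page's `get_text("dict")` spans into lightweight markdown before sending the text to the LLM. The span list below is fabricated for illustration; if I recall correctly, the `pymupdf4llm` package also offers a ready-made `to_markdown()` for this.

```python
def spans_to_markdown(spans):
    """Turn a list of PyMuPDF-style span dicts into markdown,
    wrapping bold runs (flags bit 4 set) in **...**."""
    parts = []
    for span in spans:
        text = span["text"]
        if span["flags"] & 16:  # bit 4 = bold in PyMuPDF
            text = f"**{text}**"
        parts.append(text)
    return "".join(parts)

# Fabricated example of what get_text("dict") spans might contain:
sample = [
    {"text": "Question 1. ", "flags": 16},
    {"text": "What is the capital of Brazil?", "flags": 4},
]
print(spans_to_markdown(sample))  # -> **Question 1. **What is the capital of Brazil?
```

The LLM then sees explicit `**...**` markers instead of flat text, which should make the bold-dependent questions parseable. This still does nothing for scanned pages, which would need the OCR/visual route.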

  • The problem is that every bold font will have different proportions. Hence the historic solution has been to take a suitable snippet of monotone pixels and, from the offered range of light, medium, and emboldened weights, make a human decision. It was a lot easier in the days of wooden-block printing, when all fonts were bold and metal was lighter.
    – K J
    Commented 5 hours ago
  • Well, this is a difficult problem. You've provided no examples of PDFs with bold and normal text, nor did you try to tell us why you want to do this. This is important because a solution that may work for one document/reason might not work at all for another.
    – KIKO Software
    Commented 5 hours ago
  • Thanks KIKO Software, I added some context.
    Commented 5 hours ago
  • Interesting: so the first one has 3 bold fonts. File: prova1.pdf. PDF Producer: cairo 1.16.0 (cairographics.org). Fonts: Arial-BoldItalicMT (TrueType; Ansi; embedded), Arial-BoldMT (TrueType (CID); Identity-H; embedded), Arial-BoldMT (TrueType; Ansi; embedded). I wonder if there should be more.
    – K J
    Commented 2 hours ago

1 Answer


Hey, don't worry, you can use my code:

import os
from pdf2image import convert_from_path


def pdf_to_images(pdf_path, output_folder, dpi=300, image_format='JPEG', poppler_path=None):
    """
    Convert each page of a PDF to an image file.

    Args:
        pdf_path (str): Path to the PDF file.
        output_folder (str): Folder to save the output images.
        dpi (int): Resolution for the output images.
        image_format (str): Format of the output images (e.g., 'JPEG', 'PNG').
        poppler_path (str): Optional; path to the Poppler binaries.
    """
    # Create output folder if it doesn't exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Convert PDF to a list of image objects
    pages = convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)

    # Save each page as an image file in sequential order
    for i, page in enumerate(pages, start=1):
        image_filename = os.path.join(output_folder, f"page_{i}.{image_format.lower()}")
        page.save(image_filename, image_format)
        print(f"Saved {image_filename}")


import pytesseract
from PIL import Image
import google.generativeai as genai

# Configure the API key for the generativeai module.
# For a free model you can use a Gemini API key; it is free.
GOOGLE_API_KEY = 'yourAPIKEY'
genai.configure(api_key=GOOGLE_API_KEY)


def extract_text_tesseract(image_path):
    """Fallback example using Tesseract if you still need it."""
    try:
        img = Image.open(image_path)
        text = pytesseract.image_to_string(img, config='--psm 4')
        return text.strip()
    except Exception as e:
        return f"Error: {str(e)}"


def extract_text_genai(image_path):
    """
    Use Generative AI (Gemini) to extract data from the image,
    with a prompt geared towards your target structure.
    """
    image = Image.open(image_path)
    model = genai.GenerativeModel('gemini-1.5-flash')

    prompt = """
    You are provided an image of a :
    Your task is to extract the following structure:
    {
    }
    Do not include any extra text, markdown, or explanation. Return only the JSON.
    """

    # Send the prompt and image to the model
    response = model.generate_content(contents=[prompt, image])
    return response.text

What you can do here is craft a good prompt, and your problem will be solved.

