PDF page & word count, recursive searching of directory tree, output to excel

Question

I wrote this program as part of a work-related problem but with a focus on improving my Python skills. The program was needed to do a word count of PDF files. The problem was there were a large number of PDF files (over 1000) which were scattered over a drive. I made use of a number of modules to achieve this.

Quick description

The program uses glob.iglob to search through a directory tree for PDF files. Once it has a PDF file, it extracts the content of each page to a list, saved_text. Using len, it works out the number of pages in the PDF. It then uses the split() method to do a word count for the full document. It returns to the top of the loop and repeats until all PDF files have been counted.

I have heavily commented my code.

Import, file paths input/output

    import PyPDF2
    import glob
    import xlsxwriter

    # ------------------------ Input file path ------------------------ #
    input_file_path = r"\\...pdf"
    input_file_count_path = r"\\.."
    # ------------------------ Output file path ----------------------- #
    output_file_path = "word_count.xlsx"

def main

    def main(input_file_path, output_file_path, file_count):
        # lists
        total_words_list = []  # word count
        total_pages_list = []  # page count
        file_name_list = []

        print('PyPDF2 import complete')
        for pdfFileObj in glob.iglob(input_file_path):
            try:
                # save the file path to list
                file_name_list.append(pdfFileObj)
                # To get a PdfFileReader object that represents this PDF, call PyPDF2.PdfFileReader() and pass it pdfFileObj. Store this PdfFileReader object in pdfReader.
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
                number_of_pages = pdfReader.getNumPages()
                saved_text = []
                # loop to count number pages in document
                for page_number in range(number_of_pages):
                    page = pdfReader.getPage(page_number)
                    # Once you have your Page object, call its extractText() method to return a string of the page’s text. The text extraction isn’t perfect.
                    page_content = page.extractText()
                    # append to list
                    saved_text.append(page_content)
                # count the total number of pages on the document
                total_pages = len(saved_text)
                # add number of pages from document 'total_pages' to total_pages_list which will keep record for all documents.
                total_pages_list.append(total_pages)
                # print number of pages
                print("total number of pages are: ", total_pages)
                # create new var
                word_count_total = 0
                # loop count words and output total words on document
                for i in saved_text:
                    #The split() method splits a string into a list. which means 'len' can be used to count the number of words per page.
                    word_count = len(i.split())  # Might be some counting issues, need to test further.
                    #Totals up the count of words for all pages in the PDF.
                    word_count_total = word_count + word_count_total
                # takes total word count for each document and adds to a list
                total_words_list.append(word_count_total)
                print("Total word count for your file is: ", word_count_total)
                print(pdfFileObj)
            except:
                print("------there was a problem with the file, it will be skipped-------")
                continue  # Deals with ENCRYPTED files.
        return(total_words_list, total_pages_list, file_name_list, input_file_path, output_file_path)

There is a separate function that writes the data lists to an Excel file using xlsxwriter.

def save_to_file

    # ----------------- function to save to file ------------------------ #
    def save_to_file (total_words_list, total_pages_list, file_name_list, input_file_path, output_file_path):
        print("output to excel started")
        # makes a new excel file
        workbook = xlsxwriter.Workbook(output_file_path)
        # add_worksheet method called on workbook object
        worksheet = workbook.add_worksheet()
        # start from first cell
        row = 1
        # loop word count, page count and file location then write to file
        for a, b, c in zip(total_pages_list, total_words_list, file_name_list):
            # write to file using write method
            worksheet.write(row, 0, a )
            worksheet.write(row, 1, b )
            worksheet.write(row, 2, c )
            row += 1
        workbook.close()
        print("---output to excel file complete, program all finished---")

    main(input_file_path, output_file_path, file_count)

Areas of concern:

Is my code clean? Could it be more concise?
Being new to Python, I am sure to have overlooked something simple.
The program is very slow when running on deep directory trees, e.g. depth 4+.

Reinderien · Accepted Answer · 2019-07-08 12:57:42Z

Performance

The problem was there were a large number of PDF files (over 1000) which were scattered over a drive.

This is the perfect scenario for a parallel application. Spin up a few workers and have them run through the files, perhaps sorted by size and evenly distributed so that the workload is also evenly distributed.
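One way to read "sorted by size and evenly distributed" is a greedy split: take the files largest first and hand each one to whichever worker has the least work so far. A minimal sketch, assuming file size is a reasonable proxy for parsing time; `split_by_size` and its parameters are illustrative names, not part of the original program:

```python
import heapq
import os


def split_by_size(paths, n_workers, size_of=os.path.getsize):
    """Partition paths into n_workers lists with roughly equal total size."""
    # Heap entries are (total_size_so_far, worker_index), so the
    # least-loaded worker is always the one popped next.
    heap = [(0, i) for i in range(n_workers)]
    buckets = [[] for _ in range(n_workers)]
    # Largest files first: placing big items early keeps the split balanced.
    sized = sorted(((size_of(p), p) for p in paths), reverse=True)
    for size, path in sized:
        total, idx = heapq.heappop(heap)
        buckets[idx].append(path)
        heapq.heappush(heap, (total + size, idx))
    return buckets
```

Each returned list could then be handed to one worker. With a pool that pulls files one at a time, simply submitting the largest files first gives a similar effect with less code.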

File paths

    input_file_path = r"\\...pdf"
    input_file_count_path = r"\\.."

This is puzzling, and probably not what you actually want. Does your filename contain a literal backslash? How many of those dots mean the upper directory? You may be better off using some Python path functions to form these paths, especially since you're on Windows (?)
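The "Python path functions" could be `pathlib`, for example. A minimal sketch, assuming all PDFs live somewhere below a single root directory; `find_pdfs` is a hypothetical helper name:

```python
from pathlib import Path


def find_pdfs(root):
    """Yield every .pdf file below root, at any depth."""
    # rglob("*.pdf") is shorthand for glob("**/*.pdf"): it walks the whole tree.
    for path in Path(root).rglob("*.pdf"):
        if path.is_file():
            yield path
```

It would be called with the real root directory in place of the elided path from the question. Whether the pattern also matches an upper-case `.PDF` extension depends on the platform.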

Function length

Your main is long and complex, and should be subdivided into more functions.

Never `except:`

This is a deadly trap for beginners. Ctrl+C (program break) is represented as an exception, so this effectively prevents the user from killing your program with the keyboard. Use except Exception instead. Also, you should be outputting what went wrong, even if you decide to continue on with the other files.
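A minimal sketch of that pattern, using the standard logging module to report what went wrong. `try_file` and `handler` are hypothetical names standing in for the per-file work inside main:

```python
import logging

logger = logging.getLogger(__name__)


def try_file(handler, path):
    """Run handler(path); on an ordinary error, log it and return None."""
    try:
        return handler(path)
    except Exception:
        # KeyboardInterrupt and SystemExit do not derive from Exception,
        # so Ctrl+C still stops the program.
        # logger.exception records the message plus the full traceback.
        logger.exception("problem with %s, skipping", path)
        return None
```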

Multiple return

First of all, your return doesn't need parens, because multiple return uses an implicit tuple.

Also, given the large number of return values, you're better off representing this result as an object.
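One lightweight way to do this is a named tuple. A sketch with illustrative field names and made-up example values, not taken from the original code:

```python
from typing import NamedTuple


class PdfStats(NamedTuple):
    """Result for one PDF (field names are illustrative)."""
    file_name: str
    pages: int
    words: int


# Example values only.
stats = PdfStats(file_name="report.pdf", pages=3, words=1200)
print(stats.pages)             # fields are read by name
name, pages, words = stats     # and it still unpacks like a plain tuple
```

A list of such objects, one per file, would replace the three parallel lists.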

Clean up after yourself

This:

workbook.close()

won't be executed if there's an exception before it. Instead, put your workbook in a with block.
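A sketch of the write loop inside a with block. It relies on `xlsxwriter.Workbook` being usable as a context manager, which its documentation shows. `write_rows` is a hypothetical name, and the `workbook_factory` parameter is an assumption added only so the function can be exercised without creating a real file:

```python
def write_rows(workbook_factory, output_file_path, rows):
    """Write (pages, words, file_name) rows; the workbook is closed even on error."""
    with workbook_factory(output_file_path) as workbook:
        worksheet = workbook.add_worksheet()
        # Start at row 1, as the original code does (row 0 is left empty).
        for row, (pages, words, file_name) in enumerate(rows, start=1):
            worksheet.write(row, 0, pages)
            worksheet.write(row, 1, words)
            worksheet.write(row, 2, file_name)
```

With the real library this would be called as `write_rows(xlsxwriter.Workbook, "word_count.xlsx", rows)`.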

Thanks for the feedback. I have tried to parallelise using concurrent.futures.ProcessPoolExecutor but have not got it working to date. Sorry, I cleaned up the file paths too much and have now added the **\ back. Right, I will try to break up main. Multiple return: great, I was not sure how to handle this. Would a namedtuple() be the way to go, maybe? — Cam, Commented Jul 8, 2019 at 14:27
namedtuple is a lightweight option, yes. A heavier option is to write a class. — Reinderien, Commented Jul 8, 2019 at 14:36

Matthias Huschle · Accepted Answer · 2019-07-08 17:39:15Z

This already has a good answer, but I noticed some other things that you should be aware of.

save_to_file

input_file_path is never used
In simple loops it can be a valid choice to use single-letter names for iteration items. Using a, b, c here is confusing, however: it is not clear that they are what you intend them to be, especially since their order differs from both the comment directly above and the function signature.

main

file_count is never used
The return tuple contains two of the call arguments entirely unchanged. Why should a caller want to have them in the return object? So only three of the return objects are relevant. A namedtuple would be a good choice anyway.
Your try-except construct is flawed: you expect the individual arrays in the return tuple to have an equal number of elements. But if, for example, an exception occurs after file_name_list.append(pdfFileObj), this is no longer the case, and your save_to_file will silently pair page and word counts with the wrong file names. My suggestion would be to move everything in the try-block to a new function (or several) that returns a single (named) tuple, which is added to an array at the end of the try-block. That array should be the only return value of main.
Your iterations (pdf pages and parsed pages) are a bit clumsy. Always prefer ready-to-use iterators over hand-made ones: Using for page in pdfReader.pages: makes two local variables obsolete. Or - as you asked if you can be more concise - directly use a list comprehension: saved_text = [page.extractText() for page in pdfReader.pages], and a generator expression for the word count: word_count_total = sum(len(page_text.split()) for page_text in saved_text).
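The last two points can be combined in a short sketch. The function names are hypothetical, and `read_page_texts` is an assumed callable that returns the text of each page for one file, passed in so the counting logic stays independent of the PDF library:

```python
def count_pages_and_words(page_texts):
    """Return (page_count, word_count) for an iterable of per-page strings."""
    saved_text = list(page_texts)
    word_count_total = sum(len(page_text.split()) for page_text in saved_text)
    return len(saved_text), word_count_total


def collect_rows(pdf_filepaths, read_page_texts):
    """Build one (file_name, pages, words) row per readable file."""
    rows = []
    for pdf_filepath in pdf_filepaths:
        try:
            pages, words = count_pages_and_words(read_page_texts(pdf_filepath))
        except Exception as exc:
            print(f"skipping {pdf_filepath}: {exc!r}")
            continue
        # Appended only after everything succeeded, so a failing file
        # never leaves a half-filled row behind.
        rows.append((pdf_filepath, pages, words))
    return rows
```

With the PyPDF2 calls from the question, `read_page_texts` could be something like `lambda path: (page.extractText() for page in PyPDF2.PdfFileReader(path).pages)`.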

general

You are using way too many comments for my taste. One good thing about Python is, that by choosing good names and structure, it's quite easy to make code explain itself. If you need lots of comments, you should start investing more time in structure (move parts to new methods with expressive names) and names (pdfFileObj is a filepath, why not pdf_filepath?). If your comments just state the obvious, they clutter the reader's view and should be removed (# create new var before word_count_total = 0, or # write to file using write method before worksheet.write(row, 0, a )).
The call to main at the end is probably not from your original code, as file_count is undefined. If it is, use an if __name__ == '__main__': block for it. Otherwise it's impossible to import the file for tests or code reuse without side effects.
Using ProcessPoolExecutor is a good approach to increase performance. But it won't work in your current structure. You need a function that cares about nothing else outside the current file. If you implement the proposal from my third point in main, this will be a lot easier (filename as call-argument, 3-tuple as return value).
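A sketch of how the last two points might fit together, under some assumptions: `analyse_pdf` and `analyse_all` are hypothetical names, the glob pattern is a placeholder, and the PyPDF2 calls are the ones used in the question (newer releases of the library have renamed them). The `executor_cls` parameter is an addition of this sketch so that a thread pool can be swapped in for experiments:

```python
import glob
from concurrent.futures import ProcessPoolExecutor


def analyse_pdf(pdf_filepath):
    """Return (file_name, pages, words) for one file; touches nothing else."""
    import PyPDF2  # imported lazily; API names as used in the question

    reader = PyPDF2.PdfFileReader(pdf_filepath)
    saved_text = [page.extractText() for page in reader.pages]
    words = sum(len(page_text.split()) for page_text in saved_text)
    return pdf_filepath, len(saved_text), words


def analyse_all(pdf_filepaths, worker=analyse_pdf, executor_cls=ProcessPoolExecutor):
    """Run worker over every path in a pool; results come back in input order."""
    pdf_filepaths = list(pdf_filepaths)
    if not pdf_filepaths:
        return []  # no point starting a pool for nothing
    with executor_cls() as executor:
        return list(executor.map(worker, pdf_filepaths))


if __name__ == "__main__":
    # The guard keeps this block from running when the module is imported,
    # which worker processes may do on platforms that spawn a fresh interpreter.
    # "pdfs/**/*.pdf" is a placeholder pattern, not the path from the question.
    for row in analyse_all(glob.glob("pdfs/**/*.pdf", recursive=True)):
        print(row)
```

A worker that raises makes `executor.map` re-raise when that result is reached, so in practice the worker would probably catch `Exception` itself and return an error marker for that file.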

Thanks for all the great feedback, looks like I have some coding to do. — Cam, Commented Jul 10, 2019 at 9:22
