I wrote this program to solve a work-related problem, with a focus on improving my Python skills. The program needed to do a word count of PDF files. The problem was that there were a large number of PDF files (over 1000) scattered across a drive. I made use of a number of modules to achieve this.
Quick description
The program uses glob.iglob to search through a directory tree for PDF files. Once it has a PDF file, it extracts the text of each page to a list, saved_text. Using len, it works out the number of pages in the PDF. It then uses the split() method to do a word count for the full document. It then returns to the top of the loop and repeats until all PDF files have been counted.
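To illustrate the counting step on its own, here is a minimal sketch that works on plain strings instead of a real PDF. The helper name count_pages_and_words and the sample text are made up for illustration and are not part of the program.

```python
def count_pages_and_words(saved_text):
    """Return (page_count, word_count) for a list of per-page text strings."""
    # One list entry per page, so the list length is the page count.
    total_pages = len(saved_text)
    # str.split() with no arguments splits on any run of whitespace, so the
    # length of the result is the number of whitespace-separated tokens.
    total_words = sum(len(page.split()) for page in saved_text)
    return total_pages, total_words


# Hypothetical stand-in for the text extracted from a two-page PDF.
sample_pages = ["The quick brown fox", "jumps over\nthe lazy   dog"]
pages, words = count_pages_and_words(sample_pages)
print(pages, words)  # prints: 2 9
```

Note that this counts whitespace-separated tokens, so any stray symbols or broken words produced by the text extraction are counted as words too.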
I have heavily commented my code.
Import, file paths input/output
    import PyPDF2
    import glob
    import xlsxwriter

    # ------------------------ Input file path ------------------------ #
    input_file_path = r"\\...pdf"
    input_file_count_path = r"\\.."

    # ------------------------ Output file path ----------------------- #
    output_file_path = "word_count.xlsx"
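The real input path is elided above, so as a rough sketch of how a glob pattern can reach PDF files at any depth, the example below builds a throwaway directory tree and searches it. The helper name find_pdfs and the file names are assumptions made for the demonstration. It relies on the "**" wildcard, which glob only treats as "any number of directory levels" when recursive=True is passed.

```python
import glob
import os
import tempfile


def find_pdfs(root):
    """Return an iterator over the paths of .pdf files under root, at any depth."""
    pattern = os.path.join(root, "**", "*.pdf")
    # recursive=True is what makes "**" match zero or more directory levels.
    return glob.iglob(pattern, recursive=True)


# Demonstration on a temporary directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "a", "b")
    os.makedirs(nested)
    for path in (
        os.path.join(root, "top.pdf"),
        os.path.join(nested, "deep.pdf"),
        os.path.join(nested, "notes.txt"),
    ):
        open(path, "w").close()  # create an empty placeholder file

    found = sorted(os.path.basename(p) for p in find_pdfs(root))
    print(found)
```

Because iglob returns an iterator, paths are produced one at a time rather than collected into a list first.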
def main
    def main(input_file_path, output_file_path, file_count):
        # lists
        total_words_list = []  # word count
        total_pages_list = []  # page count
        file_name_list = []

        print('PyPDF2 import complete')

        for pdfFileObj in glob.iglob(input_file_path):
            try:
                # To get a PdfFileReader object that represents this PDF, call
                # PyPDF2.PdfFileReader() and pass it pdfFileObj. Store this
                # PdfFileReader object in pdfReader.
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
                number_of_pages = pdfReader.getNumPages()
                saved_text = []

                # loop over every page in the document and extract its text
                for page_number in range(number_of_pages):
                    page = pdfReader.getPage(page_number)
                    # Once you have your Page object, call its extractText()
                    # method to return a string of the page's text. The text
                    # extraction isn't perfect.
                    page_content = page.extractText()
                    # append to list
                    saved_text.append(page_content)

                # count the total number of pages in the document
                total_pages = len(saved_text)
                # print number of pages
                print("total number of pages are: ", total_pages)

                # create new var
                word_count_total = 0
                # loop to count words and output total words in the document
                for i in saved_text:
                    # The split() method splits a string into a list, which
                    # means 'len' can be used to count the number of words per page.
                    word_count = len(i.split())  # Might be some counting issues, need to test further.
                    # Totals up the count of words for all pages in the PDF.
                    word_count_total = word_count + word_count_total

                print("Total word count for your file is: ", word_count_total)
                print(pdfFileObj)

                # Record the results only after the whole file has been read,
                # so the three lists stay aligned when a file is skipped.
                file_name_list.append(pdfFileObj)
                total_pages_list.append(total_pages)
                total_words_list.append(word_count_total)

            except Exception:
                # Deals with ENCRYPTED files.
                print("------there was a problem with the file, it will be skipped-------")
                continue

        return (total_words_list, total_pages_list, file_name_list, input_file_path, output_file_path)
A separate function writes the data lists to an Excel file using xlsxwriter.
def save_to_file
    # ----------------- function to save to file ------------------------ #
    def save_to_file(total_words_list, total_pages_list, file_name_list, input_file_path, output_file_path):
        print("output to excel started")
        # makes a new excel file
        workbook = xlsxwriter.Workbook(output_file_path)
        # add_worksheet method called on workbook object
        worksheet = workbook.add_worksheet()
        # start at row index 1 (rows are zero-indexed, so the first row stays empty)
        row = 1
        # loop over page count, word count and file location, then write to file
        for a, b, c in zip(total_pages_list, total_words_list, file_name_list):
            # write to file using write method
            worksheet.write(row, 0, a)
            worksheet.write(row, 1, b)
            worksheet.write(row, 2, c)
            row += 1
        workbook.close()
        print("---output to excel file complete, program all finished---")


    main(input_file_path, output_file_path, file_count)
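The way zip pairs up the three lists into spreadsheet rows can be seen without xlsxwriter. The following is only a sketch: the helper name build_rows, the header labels and the sample values are invented for illustration.

```python
def build_rows(total_pages_list, total_words_list, file_name_list):
    """Return a header row followed by one (pages, words, path) tuple per document."""
    header = ("pages", "words", "file")  # hypothetical column labels
    # zip() takes the n-th item from each list and stops at the shortest list,
    # so the three lists must have one entry per document to stay in step.
    return [header, *zip(total_pages_list, total_words_list, file_name_list)]


# Made-up results for two documents.
rows = build_rows([3, 10], [1200, 4800], ["a.pdf", "b.pdf"])
for row_index, row in enumerate(rows):
    print(row_index, row)
```

Each tuple corresponds to one spreadsheet row, with the page count, word count and file path in columns 0, 1 and 2, the same column order that save_to_file writes.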
Areas of concern:
- Is my code clean? Could it be more concise?
- Being new to Python, I am sure I have overlooked something simple.
- The program is very slow when running on deep directory trees, e.g. depth 4+.