I'm trying to parse a file of about 200 MB. I initially decided to use Python's re module for this task. However, after some further study, I found that the BNF grammar-based PyParsing provides what I'm looking for.
To test my code, I used a 5 MB file and, to my surprise, the code takes more than 3 minutes to parse it. Could somebody please review my code and see if I'm making any mistakes?
Here is the file content:
16:31:19.321 xxxTIM23 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xxxdummy L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 unwanted L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xrandom3 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 xtext345 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
Here's the file content that I'm interested in:
16:31:19.321 xxxTIM23 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xxxdummy L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 unwanted L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xrandom3 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
I'm sorry that I can't share the exact log file, as the content is confidential to my company. Further below, I've shared a second code example together with a publicly available sample data file.
```python
from pyparsing import *

FILE_PATH = "C:\\User\\Sam\\Desktop\\log.txt"

# base grammar
space = Literal(" ")
digits = Word(nums)
timestamps = Word(nums + ":.")
tags = Word(alphanums + "_-")
brackets = oneOf("{ } [ ] ( )")

# custom grammar
badTchTag = Literal("L1_Rx_TCH")
badTchText = Literal("BAD TCH BLOCK")
badTchExt = Literal("send_report_btch")

# query
query = Combine(timestamps + space + Suppress(SkipTo(badTchTag)) + badTchTag
                + space + Suppress(SkipTo(badTchText)) + badTchText
                + brackets + digits + brackets + space + badTchExt)

# parse
# the with block closes the file, so no explicit close() is needed
with open(FILE_PATH) as file_ptr:
    try:
        output = query.searchString(file_ptr.read())
        for line in output:
            print(line)
    except ParseException:
        print('not found!')
```
The code does a number of other things as well, but this is the most important part. The call to query.searchString() alone takes more than 3 minutes, even for a 5 MB file.
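For comparison, the same filter can be sketched with the re module mentioned at the start. This is only a minimal sketch, assuming one log entry per line and parentheses around the block count (the pyparsing grammar accepts any bracket type). The names `BAD_TCH` and `find_bad_tch` are illustrative and not part of the real code.

```python
import re

# Illustrative pattern mirroring the pyparsing query: timestamp, anything up to
# the L1_Rx_TCH tag, anything up to "BAD TCH BLOCK(n)", then send_report_btch.
# Assumption: the count is always wrapped in parentheses, as in the sample data.
BAD_TCH = re.compile(
    r"^(?P<time>[\d:.]+) .*?(?P<tag>L1_Rx_TCH) .*?"
    r"BAD TCH BLOCK\((?P<count>\d+)\) send_report_btch"
)


def find_bad_tch(lines):
    """Yield (timestamp, block count) for each line matching BAD_TCH."""
    for line in lines:
        match = BAD_TCH.match(line)
        if match:
            yield match.group("time"), int(match.group("count"))


sample = [
    "16:31:19.321 xxxTIM23 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch",
    "16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj",
]
print(list(find_bad_tch(sample)))
```

Because `find_bad_tch` accepts any iterable of lines, an open file object could be passed in directly, which avoids reading the whole file into memory first.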
[Edit] Another Example:
This dataset (http://opensource.indeedeng.io/imhotep/files/nasa_19950630.22-19950728.12.tsv.gz) contains a single log file (about 141 MB once unzipped). I unzipped this file and opened it in Python to test PyParsing.
Below is a snapshot of the file contents:
unicomp6.unicomp.net - 804571214 GET /shuttle/countdown/count.gif 200 40310
unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif 200 786
unicomp6.unicomp.net - 804571214 GET /images/KSC-logosmall.gif 200 1204
d104.aa.net - 804571215 GET /shuttle/countdown/count.gif 200 40310
d104.aa.net - 804571215 GET /images/NASA-logosmall.gif 200 786
d104.aa.net - 804571215 GET /images/KSC-logosmall.gif 200 1204
129.94.144.152 - 804571217 GET /images/ksclogo-medium.gif 304 0
199.120.110.21 - 804571217 GET /images/launch-logo.gif 200 1713
I wrote the Python code below to extract the lines that match (* GET /images/*), like these:
unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif
unicomp6.unicomp.net - 804571214 GET /images/KSC-logosmall.gif
d104.aa.net - 804571215 GET /images/NASA-logosmall.gif
d104.aa.net - 804571215 GET /images/KSC-logosmall.gif
129.94.144.152 - 804571217 GET /images/ksclogo-medium.gif
199.120.110.21 - 804571217 GET /images/launch-logo.gif
```python
from pyparsing import *

digits = Word(nums)
first = Word(alphanums + "._-")
space = OneOrMore(" ")
dash = Literal("-")
REQ1 = Literal("GET")
REQ2 = Literal("/images/")

query = Combine(first + space + dash + space + digits + space
                + REQ1 + space + REQ2 + first)

try:
    # read the whole file into memory and scan it in one go
    with open("C:\\Users\\Sam\\Desktop\\19950630.23-19950801.00.tsv",
              encoding="Latin-1") as log_file:
        result = query.searchString(log_file.read())
    for item in result:
        print(item)  # print each match, not the whole result list
except Exception as e:
    print(str(e))
```
This code also takes forever to run, and I have to kill the execution prematurely. Could you please help me identify what I'm doing wrong?
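As a point of comparison for the second example, here is a rough sketch that reads the log one line at a time instead of loading it into a single string. It uses the re module rather than PyParsing, so it is a swapped-in technique, not a fix for the grammar above. The names `IMAGE_GET` and `image_requests` are hypothetical, and the pattern assumes whitespace-separated fields (spaces or tabs). It also matches any non-space characters after /images/, which is slightly broader than the `first` token.

```python
import io
import re

# Assumed line layout: host, "-", numeric timestamp, GET, path under /images/.
IMAGE_GET = re.compile(r"^(\S+)\s+-\s+(\d+)\s+GET\s+(/images/\S+)")


def image_requests(handle):
    """Yield 'host - timestamp GET path' for GET requests under /images/."""
    for line in handle:  # iterating a file object reads one line at a time
        match = IMAGE_GET.match(line)
        if match:
            yield "{} - {} GET {}".format(*match.groups())


# An in-memory stand-in for the real log file, built from the snapshot above.
sample = io.StringIO(
    "unicomp6.unicomp.net - 804571214 GET /shuttle/countdown/count.gif 200 40310\n"
    "unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif 200 786\n"
    "129.94.144.152 - 804571217 GET /images/ksclogo-medium.gif 304 0\n"
)
for hit in image_requests(sample):
    print(hit)
```

With the real file, the `io.StringIO` stand-in would be replaced by the object returned from `open(..., encoding="Latin-1")`.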
Literal, Words, nums and so on. Can you add the necessary imports for us?
SkiptTo and ParceException look like typos to me. Please fix the post so that the code is runnable.