Searching for a value from one CSV file in another CSV file

Question

I am writing a script that takes one CSV file, searches for a value in another CSV file, then writes an output depending on the result it finds.

I have been using Python's csv DictReader and DictWriter. I have it working, but it is very inefficient because it loops through the two sets of data until it finds a result.

There are a few bits in the code which are specific to my setup (file locations etc), but I'm sure people can see around this.

    import csv

    # Set all csv attributes
    cache = {}
    in_file = open(sript_path + '/cp_updates/' + update_file, 'r')
    reader = csv.DictReader(in_file, delimiter= ',')
    out_file = open(sript_path + '/cp_updates/' + update_file + '.new', 'w')
    out_file.write("StockNumber,SKU,ChannelProfileID\n")
    writer = csv.DictWriter(out_file, fieldnames=('StockNumber', 'SKU', 'ChannelProfileID'), delimiter=',')
    check_file = open(sript_path + '/feeds/' + feed_file, 'r')
    ch_file_reader = csv.DictReader(check_file, delimiter=',')

    #loop through the csv's, find stock levels and update file
    for row in reader:
        #print row
        check_file.seek(0)
        found = False
        for ch_row in ch_file_reader:
            #if row['SKU'] not in cache:
            if ch_row['ProductCode'] == row['SKU']:
                Stk_Lvl = int(ch_row[stk_lvl_header])
                if Stk_Lvl > 0:
                    res = 3746
                elif Stk_Lvl == 0:
                    res = 3745
                else:
                    res = " "
                found = True
                print ch_row
                print res
                cache[row['SKU']] = res
        if not found:
            res = " "
            #print ch_row
            #print res
            cache[row['SKU']] = res
        row['ChannelProfileID'] = cache[row['SKU']]
        writer.writerow(row)

These are a few lines from my in_file. The out_file has the same structure; the script just updates the ChannelProfileID depending on the results found.

    "StockNumber","SKU","ChannelProfileID"
    "10m_s-vid#APTIIAMZ","2VV-10",3746
    "10m_s-vid#CSE","2VV-10",3746
    "1RR-01#CSE","1RR-01",3746
    "1RR-01#PCAWS","1RR-01",3746
    "1m_s-vid_ext#APTIIAMZ","2VV-101",3746

These are a few lines from the check_file:

    ProductCode, Description, Supplier, CostPrice, RRPPrice, Stock, Manufacturer, SupplierProductCode, ManuCode, LeadTime
    2VV-03,3MTR BLACK SVHS M - M GOLD CABLE - B/Q 100,Cables Direct Ltd,0.43,,930,CDL,2VV-03,2VV-03,1
    2VV-05,5MTR BLACK SVHS M - M GOLD CABLE - B/Q 100,Cables Direct Ltd,0.54,,1935,CDL,2VV-05,2VV-05,1
    2VV-10,10MTR BLACK SVHS M - M GOLD CABLE - B/Q 50,Cables Direct Ltd,0.86,,1991,CDL,2VV-10,2VV-10,1

You can see it selects the first line from the in_file, looks up the SKU in the check_file, then writes the out_file in the same format as the in_file, changing the ChannelProfileID depending on what it finds in the Stock field of the check_file. It then goes back to the first line in the check_file and does the same for the next line of the in_file.

As I say, this script is working and outputs exactly what I want, but I believe it is slow and inefficient because it has to keep looping through the check_file until it finds a result.

What I'm after is suggestions on how to improve the efficiency. I'm guessing there's a better way to find the data than to keep looping through the check_file.

Blair · Accepted Answer · 2011-12-23 00:30:15Z

What you want is a mapping from the product code (the key) to a stock level/result code (the value). In Python this is known as a dictionary. The way to do it is to go through your check file at the start, and use the information in it to create a dictionary containing all the stock level details. You then go through your input file, read in the product code, and retrieve the stock code from the dictionary you created earlier.

I've rewritten your code to do this, and it works for the example files you gave. I have commented it fairly thoroughly, but if there is anything unclear in it just post a comment and I'll try to clarify.

    import csv

    # Open the check file in a context manager. This ensures the file will be closed
    # correctly if an error occurs.
    with open('checkfile.csv', 'rb') as checkfile:
        checkreader = csv.DictReader(checkfile)

        # Create a function which maps the stock level to the result code.
        def result_code(stock_level):
            if stock_level > 0:
                return 3746
            if stock_level == 0:
                return 3745
            return " "

        # This does the real work. The middle line is a generator expression which
        # iterates over each line in the check file. The product code and stock
        # level are extracted from each line, the stock level converted into the
        # result, and the two values put together in a tuple. This is then converted
        # into a dictionary. This dictionary has the product codes as its keys and
        # their result code as its values.
        product_result = dict(
            (v['ProductCode'], result_code(int(v[' Stock'])))
            for v in checkreader
        )

    # Open the input and output files.
    with open('infile.csv', 'rb') as infile:
        with open('outfile.csv', 'wb') as outfile:
            reader = csv.DictReader(infile)

            # Use the same field names for the output file.
            writer = csv.DictWriter(outfile, reader.fieldnames)
            writer.writeheader()

            # Iterate over the products in the input.
            for product in reader:
                # Find the stock level from the dictionary we created earlier. Using
                # the get() method allows us to specify a default value if the SKU
                # does not exist in the dictionary.
                result = product_result.get(product['SKU'], " ")

                # Update the product info.
                product['ChannelProfileID'] = result

                # Write it to the output file.
                writer.writerow(product)

Great Thanks, really clear and simple with excellent comments! - gingebot, Commented Dec 26, 2011 at 17:46

Great Thanks, really clear and simple with excellent comments, has really helped my learning. It's also taken a script that took about 30 mins to complete with a full dataset to a matter of seconds! One slight change I made is:

    def result_code(stock_level):
        try:
            stock_level = int(stock_level)
        except:
            stock_level = 0
        if stock_level > 0:
            return 3746
        if stock_level == 0:
            return 3745
        return " "

    product_result = dict(
        (v['item code'], result_code(v['stock']))
        for v in checkreader
    )

Because now and again a blank space is used instead of 0 for a stock level of 0. - gingebot, Commented Dec 26, 2011 at 17:52
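For readers on Python 3, the dictionary approach and the blank-stock tweak from the comment above might be combined roughly as follows. This is only a sketch under some assumptions: the sample rows are made up and trimmed to three columns, the helper name `update_profiles` is hypothetical, and the data is read from in-memory strings rather than real files. It also passes `skipinitialspace=True` to `csv.DictReader`, so the header is read as `Stock` rather than `' Stock'`.

```python
import csv
import io

# Hypothetical, trimmed-down sample data in the shape used in the question.
# Note the space after each comma in the check file.
CHECK_CSV = (
    "ProductCode, Description, Stock\n"
    "2VV-10, 10MTR CABLE, 1991\n"
    "1RR-01, OUT OF STOCK ITEM, 0\n"
    "9ZZ-99, BLANK STOCK ITEM, \n"
)
IN_CSV = (
    '"StockNumber","SKU","ChannelProfileID"\n'
    '"10m_s-vid#CSE","2VV-10",3746\n'
    '"1RR-01#CSE","1RR-01",3746\n'
    '"9ZZ-99#CSE","9ZZ-99",3746\n'
    '"unknown#CSE","NOPE-1",3746\n'
)


def result_code(raw_stock):
    """Map a raw stock field to a result code, treating blanks as zero."""
    try:
        stock_level = int(raw_stock)
    except (TypeError, ValueError):
        stock_level = 0  # blank or missing stock is treated as 0
    if stock_level > 0:
        return 3746
    if stock_level == 0:
        return 3745
    return " "


def update_profiles(check_text, in_text):
    """Return the updated input CSV as a string (hypothetical helper)."""
    # skipinitialspace=True drops the space after each delimiter, so the
    # header key is 'Stock' instead of ' Stock'.
    check_reader = csv.DictReader(io.StringIO(check_text), skipinitialspace=True)
    product_result = {
        row["ProductCode"]: result_code(row["Stock"]) for row in check_reader
    }

    reader = csv.DictReader(io.StringIO(in_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for product in reader:
        # One dictionary lookup per input row; unknown SKUs get a blank.
        product["ChannelProfileID"] = product_result.get(product["SKU"], " ")
        writer.writerow(product)
    return out.getvalue()


if __name__ == "__main__":
    print(update_profiles(CHECK_CSV, IN_CSV))
```

With real files, the `io.StringIO` objects would be replaced by files opened in text mode with `newline=''`, which is what the Python 3 `csv` documentation recommends.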

jcollado · Accepted Answer · 2011-12-22 23:58:23Z

One thing that you should avoid is reading the same file multiple times. There isn't any detail in the question about how big your files are, so I'll assume they fit in memory. In that case, I'd recommend reading the files once, working on the data in memory, and then writing the result file.

Aside from that, as you read the data, there should be some way to improve the search time later. It looks like the columns in which you're interested are the ones related to the ProductCode. So maybe you could create a dictionary of lists that can be accessed using the ProductCode as key. As I said, this should speed up the search.
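A dictionary of lists keyed by ProductCode could look something like the sketch below. It is written for Python 3 and makes a few assumptions: the function name `index_by_product_code` is hypothetical, the sample rows are invented (including a second supplier row, added only to show why a list per key can be useful), and the header has no spaces after the commas.

```python
import csv
import io
from collections import defaultdict

# Invented sample data; the duplicate ProductCode is deliberate.
CHECK_CSV = (
    "ProductCode,Supplier,Stock\n"
    "2VV-10,Cables Direct Ltd,1991\n"
    "2VV-10,Other Supplier,0\n"
    "2VV-05,Cables Direct Ltd,1935\n"
)


def index_by_product_code(text):
    """Read the CSV once and group its rows by ProductCode."""
    rows_by_code = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        rows_by_code[row["ProductCode"]].append(row)
    return rows_by_code


index = index_by_product_code(CHECK_CSV)

# Each lookup is now a single dictionary access instead of a file scan.
total_stock = sum(int(row["Stock"]) for row in index["2VV-10"])

# .get() avoids creating an empty entry for a code that is not present.
missing = index.get("NOPE-1", [])
```

If every ProductCode is known to be unique, a plain dictionary of rows is simpler; the list only matters when a code can appear more than once.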

If there's some reason why a dictionary isn't appropriate, you can try a database like SQLite through the sqlite3 module, which is part of the standard library, and store your data in memory so that you can run SQL queries to get the data you need faster.
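As a rough illustration of the sqlite3 route, the sketch below loads the check data into an in-memory database and looks a SKU up with a query. It targets Python 3; the table name `stock`, the column names, and the helper names `load_stock_table` and `stock_for` are assumptions made for this example, and the sample rows are trimmed to two columns.

```python
import csv
import io
import sqlite3

# Trimmed sample data; only the two columns needed for the lookup.
CHECK_CSV = (
    "ProductCode, Stock\n"
    "2VV-03, 930\n"
    "2VV-05, 1935\n"
    "2VV-10, 1991\n"
)


def load_stock_table(text):
    """Create an in-memory database holding one row per product code."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stock (product_code TEXT PRIMARY KEY, stock INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    conn.executemany(
        "INSERT INTO stock (product_code, stock) VALUES (?, ?)",
        ((row["ProductCode"], int(row["Stock"])) for row in reader),
    )
    conn.commit()
    return conn


def stock_for(conn, sku):
    """Return the stock level for sku, or None if it is not in the table."""
    row = conn.execute(
        "SELECT stock FROM stock WHERE product_code = ?", (sku,)
    ).fetchone()
    return None if row is None else row[0]


conn = load_stock_table(CHECK_CSV)
level = stock_for(conn, "2VV-10")
unknown = stock_for(conn, "NOPE-1")
```

The primary key gives the table an index on `product_code`, so each lookup avoids a full scan. For a simple key-to-value lookup like this one, a dictionary is likely to be simpler and at least as fast; a database starts to pay off when the queries become more involved.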

I hope this helps.

MDTMDT · Accepted Answer · 2011-12-23 00:06:49Z

I think I have come up with something along the lines of what you want, using dicts:

    import csv

    in_file = open("in.csv", 'r')
    reader = csv.DictReader(in_file, delimiter=',')
    out_file = open("out.txt", 'w')
    out_file.write("StockNumber,SKU,ChannelProfileID\n")
    check_file = open("check.csv", 'r')
    check = csv.DictReader(check_file, delimiter=',')

    # Collect the SKUs from the input file and remember their stock numbers.
    prodid = set()
    prod_sn = dict()
    for row in reader:
        prodid.add(row["SKU"])
        prod_sn[row["SKU"]] = row["StockNumber"]
        print(row["SKU"])

    # Map each product code in the check file to its stock level.
    stocknums = dict()
    for row in check:
        stocknums[row["ProductCode"]] = row[" Stock"]
        print(row["ProductCode"])

    # Write one line per SKU: ref is 1 if the product is in stock, else 0.
    for product in prodid:
        ref = 0
        if product in stocknums:
            # The csv module returns strings, so convert before comparing.
            if int(stocknums[product]) > 0:
                ref = 1
        out_file.write(str(prod_sn[product]) + ',' + str(product) + ',' + str(ref) + "\n")

    in_file.close()
    check_file.close()
    out_file.close()

pyInTheSky · Accepted Answer · 2011-12-24 19:31:48Z

This, I hope, meets all your needs. It lets you hold your CSV in dict form, do lookups and modifications, and write it back out in preserved order. You can also change which column you want to use as your lookup column (making sure that there is a unique id for every row of that column). My usage example assumes that both classes are contained within the same file, named 'CustomDictReader.py'. So in the end, you can create two CSVRW objects, set the lookup column for each one, do your swapping/compare/lookup, and then do the final write once you are done going through what you need.

-- File 'CustomDictReader.py' --

    import csv, collections, copy

    '''
    # CSV TEST FILE 'test.csv'

    TBLID,DATETIME,VAL
    C1,01:01:2011:00:01:23,5
    C2,01:01:2012:00:01:23,8
    C3,01:01:2013:00:01:23,4
    C4,01:01:2011:01:01:23,9
    C5,01:01:2011:02:01:23,1
    C6,01:01:2011:03:01:23,5
    C7,01:01:2011:00:01:23,6
    C8,01:01:2011:00:21:23,8
    C9,01:01:2011:12:01:23,1

    #usage

    >>> import CustomDictReader
    >>> import pprint
    >>> test = CustomDictReader.CSVRW()
    >>> success, thedict = test.createCsvDict('TBLID',',',None,'test.csv')
    >>> pprint.pprint(dict(thedict))
    {'C1': OrderedDict([('TBLID', 'C1'), ('DATETIME', '01:01:2011:00:01:23'), ('VAL', '5')]),
     'C2': OrderedDict([('TBLID', 'C2'), ('DATETIME', '01:01:2012:00:01:23'), ('VAL', '8')]),
     'C3': OrderedDict([('TBLID', 'C3'), ('DATETIME', '01:01:2013:00:01:23'), ('VAL', '4')]),
     'C4': OrderedDict([('TBLID', 'C4'), ('DATETIME', '01:01:2011:01:01:23'), ('VAL', '9')]),
     'C5': OrderedDict([('TBLID', 'C5'), ('DATETIME', '01:01:2011:02:01:23'), ('VAL', '1')]),
     'C6': OrderedDict([('TBLID', 'C6'), ('DATETIME', '01:01:2011:03:01:23'), ('VAL', '5')]),
     'C7': OrderedDict([('TBLID', 'C7'), ('DATETIME', '01:01:2011:00:01:23'), ('VAL', '6')]),
     'C8': OrderedDict([('TBLID', 'C8'), ('DATETIME', '01:01:2011:00:21:23'), ('VAL', '8')]),
     'C9': OrderedDict([('TBLID', 'C9'), ('DATETIME', '01:01:2011:12:01:23'), ('VAL', '1')])}
    '''

    class CustomDictReader(csv.DictReader):
        '''
            override the next() function and use an
            ordered dict in order to preserve writing back
            into the file
        '''

        def __init__(self, f, fieldnames = None, restkey = None, restval = None, dialect ="excel", *args, **kwds):
            # Forward the caller's arguments instead of hard-coded defaults.
            csv.DictReader.__init__(self, f, fieldnames = fieldnames, restkey = restkey, restval = restval, dialect = dialect, *args, **kwds)

        def next(self):
            if self.line_num == 0:
                # Used only for its side effect.
                self.fieldnames
            row = self.reader.next()
            self.line_num = self.reader.line_num

            # unlike the basic reader, we prefer not to return blanks,
            # because we will typically wind up with a dict full of None
            # values
            while row == []:
                row = self.reader.next()
            d = collections.OrderedDict(zip(self.fieldnames, row))

            lf = len(self.fieldnames)
            lr = len(row)
            if lf < lr:
                d[self.restkey] = row[lf:]
            elif lf > lr:
                for key in self.fieldnames[lr:]:
                    d[key] = self.restval
            return d

    class CSVRW(object):

        def __init__(self):
            self.file_name = ""
            self.csv_delim = ""
            self.csv_dict  = collections.OrderedDict()

        def setCsvFileName(self, name):
            '''
                @brief stores csv file name
                @param name- the file name
            '''
            self.file_name = name

        def getCsvFileName(self):
            '''
                @brief getter
                @return returns the file name
            '''
            return self.file_name

        def getCsvDict(self):
            '''
                @brief getter
                @return returns a deep copy of the csv as a dictionary
            '''
            return copy.deepcopy(self.csv_dict)

        def clearCsvDict(self):
            '''
                @brief resets the dictionary
            '''
            self.csv_dict = collections.OrderedDict()

        def updateCsvDict(self, newCsvDict):
            '''
                creates a deep copy of the dict passed in and
                sets it to the member one
            '''
            self.csv_dict = copy.deepcopy(newCsvDict)

        def createCsvDict(self,dictKey, delim, handle = None, name = None, readMode = 'rb', **kwargs):
            '''
                @brief create a dict from a csv file where:
                    the top level keys are the first line in the dict, overrideable w/ **kwargs
                    each row is a dict
                    each row can be accessed by the value stored in the column associated w/ dictKey

                    that is to say, if you want to index into your csv file based on the contents of the
                    third column, pass the name of that col in as 'dictKey'

                @param dictKey  - row key whose value will act as an index
                @param delim    - csv file deliminator
                @param handle   - file handle (leave as None if you wish to pass in a file name)
                @param name     - file name   (leave as None if you wish to pass in a file handle)
                @param readMode - 'r' || 'rb'
                @param **kwargs - additional args allowed by the csv module
                @return (bool, data) - (True, deep copy of the dict) on success,
                                       (False, error message) on failure
            '''
            retVal = (False, None)
            self.csv_delim = delim
            try:
                reader = None
                # 'file' is the Python 2 built-in file type.
                if isinstance(handle, file):
                    self.setCsvFileName(handle.name)
                    # Pass the delimiter by keyword; the second positional
                    # argument of DictReader is fieldnames.
                    reader = CustomDictReader(handle, delimiter = delim, **kwargs)
                else:
                    if None == name:
                        name = self.getCsvFileName()
                    else:
                        self.setCsvFileName(name)
                    reader = CustomDictReader(open(name, readMode), delimiter = delim, **kwargs)
                for row in reader:
                    self.csv_dict[row[dictKey]] = row
                retVal = (True, self.getCsvDict())
            except IOError:
                retVal = (False, 'Error opening file')
            return retVal

        def createCsv(self, writeMode, outFileName = None, delim = None):
            '''
                @brief create a csv from self.csv_dict
                @param writeMode   - 'w' || 'wb'
                @param outFileName - file name
                @param delim       - csv deliminator
                @return none
            '''
            if None == outFileName:
                outFileName = self.file_name
            if None == delim:
                delim = self.csv_delim
            with open(outFileName, writeMode) as fout:
                # Write the header once, using the keys of the first row.
                for key in self.csv_dict.values():
                    fout.write(delim.join(key.keys()) + '\n')
                    break
                for key in self.csv_dict.values():
                    fout.write(delim.join(key.values()) + '\n')
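On Python 3.7 and later, plain dictionaries keep insertion order, so much of the same idea (rows keyed by a lookup column, written back in their original order) can be sketched with the standard `csv` classes alone. The function names `read_keyed` and `write_keyed` below are hypothetical, and the data is read from a string taken from the test file in the docstring above rather than from disk.

```python
import csv
import io

# First two rows of the 'test.csv' example from the docstring above.
TEST_CSV = (
    "TBLID,DATETIME,VAL\n"
    "C1,01:01:2011:00:01:23,5\n"
    "C2,01:01:2012:00:01:23,8\n"
)


def read_keyed(text, key_column):
    """Return (fieldnames, rows), where rows maps key_column values to rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = {}
    for row in reader:
        key = row[key_column]
        if key in rows:
            # The lookup column must hold a unique id for every row.
            raise ValueError("duplicate key: %r" % key)
        rows[key] = row
    return reader.fieldnames, rows


def write_keyed(fieldnames, rows):
    """Serialise the rows back to CSV text in their original order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows.values())
    return out.getvalue()


fieldnames, table = read_keyed(TEST_CSV, "TBLID")
table["C2"]["VAL"] = "9"  # look a row up by its key and modify it
round_trip = write_keyed(fieldnames, table)
```

Unlike the class above, this sketch raises an error on a duplicate key instead of silently overwriting the earlier row; which behaviour is preferable depends on the data.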
