
Every week I have to upload a CSV file to SQL Server, and I do the job using Python 3. The problem is that the upload takes too long (around 30 minutes) for a table of 49,000 rows and 80 columns.

Here is the relevant piece of the code, where I also have to transform the date format and strip quotes. I have already tried it with pandas, but that took even longer.

    import csv
    import os
    import pyodbc
    import time

    srv = 'server_name'
    db = 'database'
    tb = 'table'
    conn = pyodbc.connect('Trusted_Connection=yes', DRIVER='{SQL Server}', SERVER=srv, DATABASE=db)
    c = conn.cursor()
    csvfile = 'file.csv'
    with open(csvfile, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=';')
        cnt = 0
        for row in reader:
            if cnt > 0:
                for r in range(0, len(row)):
                    # this is the part where I transform the date format from dd/mm/yyyy to yyyy-mm-dd
                    if (len(row[r]) == 10 or len(row[r]) == 19) and row[r][2] == '/' and row[r][5] == '/':
                        row[r] = row[r][6:10] + '-' + row[r][3:5] + '-' + row[r][0:2]
                    # here I replace the quote to nothing, since it is not important for the report
                    if row[r].find("'") > 0:
                        row[r] = row[r].replace("'", "")
                # at this part I query the index to increment by 1 on the table
                qcnt = "select count(1) from " + tb
                resq = c.execute(qcnt)
                rq = c.fetchone()
                rq = str(rq[0])
                # here I insert each row into the table that already exists
                insrt = ("insert into " + tb + " values(" + rq + ",'" + ("', '".join(row)) + "')")
                if cnt > 0:
                    res = c.execute(insrt)
                    conn.commit()
            cnt += 1
    conn.close()

Any help will be appreciated. Thanks!

Comments:
  • What is reader? – vnp, Feb 7, 2019 at 21:57
  • Sorry, I copied this from my code but forgot to insert it here. It comes at this part: with open(csvfile,'r') as csvfile: reader = csv.reader(csvfile, delimiter=';'). Just edited it now. – Feb 8, 2019 at 13:23

1 Answer


First of all, when in doubt, profile.
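For example, the standard library's cProfile will show where the 30 minutes actually go. A minimal sketch, assuming the whole job is wrapped in a function called upload() (hypothetical; it is not in the question's code):

    import cProfile
    import pstats

    # Run the job under the profiler and keep the raw stats on disk.
    cProfile.run('upload()', 'upload.prof')

    # Print the ten most expensive calls, sorted by cumulative time.
    pstats.Stats('upload.prof').sort_stats('cumulative').print_stats(10)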

Now a not-so-wild guess. Most of the time is wasted in

    qcnt = "select count(1) from " + tb
    resq = c.execute(qcnt)
    rq = c.fetchone()
    rq = str(rq[0])

In fact, rq just grows by one with each successful insert. Better to fetch it once and increment it locally:

    qcnt = "select count(1) from " + tb
    resq = c.execute(qcnt)
    rq = c.fetchone()[0]
    for row in reader:
        ....
        insert = ....
        c.execute(insert)
        rq += 1
        ....

Another guess is that committing each insert separately also does not help with performance. Do it once, after the loop. In any case, you must check the success of each commit.
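A minimal sketch of that change, with insrt built exactly as in the question:

    for row in reader:
        # ... build insrt exactly as in the original loop ...
        c.execute(insrt)
    conn.commit()  # commit once, after all rows are inserted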


Notice that if there is more than one client updating the table simultaneously, there is a data race (clients fetching the same rq), both with the original design, and with my suggestion. Moving rq into a column of its own may help; I don't know your DB design and requirements.
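For example, assuming the table can be altered, SQL Server can generate the counter itself with an IDENTITY column, so no client ever needs to fetch or track rq. The table and column names below are placeholders, not the question's schema:

    # One-time schema change (placeholder names). SQL Server then assigns
    # the id atomically on every insert, so concurrent clients cannot clash.
    c.execute("""
        CREATE TABLE my_table (
            id INT IDENTITY(1,1) PRIMARY KEY
            -- ... the 80 data columns ...
        )
    """)
    conn.commit()
    # Subsequent inserts simply omit the id column:
    #   INSERT INTO my_table (col1, col2) VALUES (?, ?)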

Consider a single INSERT ... VALUES statement covering all the rows, wrapped in one transaction, instead of thousands of independent inserts.
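With pyodbc the closest convenient equivalent is a parameterized executemany, which also removes the string concatenation (and the SQL-injection risk that comes with it). A sketch, assuming the rows have already been cleaned and the id is handled by the database (e.g. the IDENTITY column above), so only the CSV fields are inserted:

    # Collect all cleaned rows first, then send them in a single batch.
    rows = []
    for row in reader:
        # ... date fix and quote stripping as in the original loop ...
        rows.append(row)

    placeholders = ", ".join("?" * len(rows[0]))
    sql = "INSERT INTO " + tb + " VALUES (" + placeholders + ")"

    c.fast_executemany = True  # pyodbc >= 4.0.19: one round trip per batch
    c.executemany(sql, rows)
    conn.commit()              # a single transaction for the whole file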


Testing for cnt > 0 is also wasteful. Read and discard the first line; then loop for the remaining rows.
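With the csv module that is a one-liner; a sketch:

    reader = csv.reader(csvfile, delimiter=';')
    next(reader, None)  # discard the header row once
    for row in reader:  # every remaining row is data; no cnt check needed
        ...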


Figuring out at run time whether a field represents a date seems strange. You should know that in advance.
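If the date columns are known up front, convert only those and let datetime validate the values. The indices below are made-up examples, and this handles only the plain dd/mm/yyyy form:

    from datetime import datetime

    DATE_COLUMNS = (2, 15)  # hypothetical indices of the dd/mm/yyyy fields

    for i in DATE_COLUMNS:
        # strptime validates the input; strftime emits yyyy-mm-dd.
        row[i] = datetime.strptime(row[i], "%d/%m/%Y").strftime("%Y-%m-%d")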

Comments:
  • Thanks man, I made two modifications and the elapsed time was cut in half (mainly from the increment part). Awesome!!! – Feb 8, 2019 at 17:17
