I'm trying to parse values from a CSV file into a SQLite DB; however, the file is quite large (~2,500,000 lines). I ran my program for a few hours, printing its progress, but by my calculation the file would have taken about 100 hours to parse completely, so I stopped it.
I'm going to have to run this program in a separate thread at least once a week, on a new CSV file that is around 90% similar to the previous one. I have ideas about how my code might be improved (listed below), but I'm not sure that the changes I have in mind would result in significant performance improvements.
Is there a more efficient way to read a CSV file than what I have already?
Is instantiating an `ObjectOutputStream` and storing the serialized bean as a BLOB significantly computationally expensive? I could add the values directly instead, but I use the BLOB later, so storing it now saves me from instantiating a new one multiple times.

Would connection pooling, or changing the way I use the `Connection` in some other way, be more efficient?
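One way to answer the serialization question empirically is to time it in isolation. A rough sketch, with a hypothetical `EntryBean` standing in for `DBEntryBean` (the class name, fields, and `toBytes` helper are all invented for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class BlobCostSketch {

    // Hypothetical stand-in for DBEntryBean; any bean stored via
    // ObjectOutputStream must implement Serializable.
    static class EntryBean implements Serializable {
        private static final long serialVersionUID = 1L;
        String title;
        String url;

        EntryBean(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    // Serialize a bean into the byte[] that would go into the BLOB column.
    static byte[] toBytes(Serializable bean) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(bean);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        EntryBean b = new EntryBean("Some title", "http://example.com/item");
        int n = 100_000;
        long start = System.nanoTime();
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += toBytes(b).length;
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " serializations took " + ms + " ms ("
                + (total / n) + " bytes each)");
    }
}
```

If the per-bean time reported here is small relative to the time each row takes to insert, the serialization is not the bottleneck.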
Given the similarities of any new CSV files to the previous one would it be significantly faster to diff the two most recent ones and then run my program on the diff? (Of course this doesn't help for populating the DB initially)
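If the weekly file really is ~90% identical, one cheap way to approximate that diff is to insert only the lines that are new. A minimal sketch, assuming records are compared as whole lines and that the old file's lines fit in heap as a `HashSet` (`CsvDiff` and `addedLines` are names invented for this example; note it does not detect rows that were *removed* from the new file):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CsvDiff {

    // Lines present in newCsv but not in oldCsv. A record is compared as a
    // whole line; ~2.5M short lines in a HashSet is typically a few hundred
    // MB of heap at worst.
    static List<String> addedLines(Path oldCsv, Path newCsv) throws IOException {
        Set<String> old = new HashSet<>(Files.readAllLines(oldCsv, StandardCharsets.UTF_8));
        List<String> added = new ArrayList<>();
        for (String line : Files.readAllLines(newCsv, StandardCharsets.UTF_8)) {
            if (!old.contains(line)) {
                added.add(line);
            }
        }
        return added;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CsvDiff old.csv new.csv
        for (String line : addedLines(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(line);
        }
    }
}
```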
I'm setting the URL column as UNIQUE so I can use `INSERT OR IGNORE`, but testing this on smaller datasets (~10,000 lines) indicates that there is no performance gain compared to dropping the table and repopulating it. Is there a faster way to add only unique values?
Are there any obvious mistakes I'm making? (Again, I know very little about databases)
```java
public class DataBase {

    public static void main(String[] args) {
        Connection c = connect("db.db");
        createTable(c);
        addCSVToDatabase(c, "test.csv");
        disconnect(c);
    }

    public static void createTable(Connection c) {
        Statement stmt;
        String sql = "CREATE TABLE results("
                + "ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
                + "TITLE TEXT NOT NULL, "
                + "URL TEXT NOT NULL UNIQUE, "
                + "SELLER TEXT NOT NULL, "
                ...
                ...
                + "BEAN BLOB);";
        try {
            stmt = c.createStatement();
            stmt.executeUpdate(sql);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    public static void addCSVToDatabase(Connection c, String csvFile) {
        BufferedReader reader = null;
        DBEntryBean b;
        String[] vals;
        PreparedStatement pstmt = null;
        String sql = "INSERT OR IGNORE INTO results("
                + "TITLE, "
                + "URL, "
                ...
                ...
                + "SELLER, "
                + "BEAN"
                + ");";
        try {
            pstmt = c.prepareStatement(sql);
            reader = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), "UTF-8"));
            for (String line; (line = reader.readLine()) != null; ) {
                // Each line takes the form: "title|URL|...|...|SELLER"
                vals = line.split("\\|"); // '|' is a regex metacharacter, so it must be escaped
                b = new DBEntryBean();
                b.setTitle(vals[0]);
                b.setURL(vals[1]);
                ...
                ...
                b.setSeller(vals[n]);
                insert(b, pstmt);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            if (pstmt != null) {
                try {
                    pstmt.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void insert(DBEntryBean b, PreparedStatement pstmt) throws SQLException {
        pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());
        pstmt.setString(Constants.DB_COL_URL, b.getURL());
        ...
        ...
        pstmt.setString(Constants.DB_COL_SELLER, b.getSeller());
        // ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // oos = new ObjectOutputStream(baos);
        // oos.writeObject(b);
        // byte[] bytes = baos.toByteArray();
        // pstmt.setBytes(Constants.DB_COL_BEAN, bytes);
        pstmt.executeUpdate();
    }

    private static Connection connect(String path) {
        String url = "jdbc:sqlite:" + path;
        Connection conn = null;
        try {
            Class.forName("org.sqlite.JDBC");
            conn = DriverManager.getConnection(url);
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        return conn;
    }

    private static void disconnect(Connection c) {
        try {
            if (c != null) {
                c.close();
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
```
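The biggest win here is usually transactional: with autocommit on (the JDBC default), every `INSERT` in the loop runs in its own SQLite transaction, each of which must sync to disk. Wrapping the whole load in a single transaction and using JDBC batching typically changes the runtime by orders of magnitude. A sketch under the assumptions of a reduced column set (`TITLE`, `URL`, `SELLER`) and an already-parsed `records` iterable (both placeholders for the real column list and CSV loop):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsert {

    // Insert all records in one transaction, flushing batches periodically.
    static void insertAll(Connection c, Iterable<String[]> records) throws SQLException {
        String sql = "INSERT OR IGNORE INTO results(TITLE, URL, SELLER) VALUES (?, ?, ?)";
        boolean oldAutoCommit = c.getAutoCommit();
        c.setAutoCommit(false);                 // one transaction for the whole file
        try (PreparedStatement pstmt = c.prepareStatement(sql)) {
            int pending = 0;
            for (String[] vals : records) {
                pstmt.setString(1, vals[0]);
                pstmt.setString(2, vals[1]);
                pstmt.setString(3, vals[2]);
                pstmt.addBatch();
                if (++pending % 10_000 == 0) {  // flush periodically to bound memory
                    pstmt.executeBatch();
                }
            }
            pstmt.executeBatch();               // flush the remainder
            c.commit();                         // commit BEFORE closing the connection
        } catch (SQLException e) {
            c.rollback();                       // don't leave the transaction half-open
            throw e;
        } finally {
            c.setAutoCommit(oldAutoCommit);
        }
    }
}
```

Note that if an exception (or an early return) skips `c.commit()`, none of the rows are persisted; the run just looks faster while inserting nothing, so make sure the commit is actually reached on the success path.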
Comments:

- Every `INSERT` that you have within the loop will get its own transaction. That will be slow.
- I added `c.setAutoCommit(false);` just before my loop, and `c.commit()` in the `finally` block just after the loop, and although it's definitely faster, nothing gets inserted into the database. Do you have an example you could link? Or if you are able to modify my code and post it as an answer, I'll mark it correct if it speeds up my program.
- … `commit` correctly.
- … `sed` the csv file beforehand. Thanks again.