I'm trying to parse values from a CSV file into a SQLite DB; however, the file is quite large (~2,500,000 lines). I ran my program for a few hours, printing its progress, but by my calculation the file would have taken about 100 hours to parse completely, so I stopped it.
I'm going to have to run this program in a separate thread at least once a week, on a new CSV file that is around 90% similar to the previous one. I have ideas about how my code might be improved (listed below), but I'm not sure that the changes I have in mind would result in significant performance improvements.
Is there a more efficient way to read a CSV file than what I have already?
Is instantiating an `ObjectOutputStream` and storing the serialized bean as a BLOB significantly computationally expensive? I could add the values directly instead, but I use the BLOB later, so storing it now saves me from instantiating a new one multiple times.

Would connection pooling, or changing the way I use the `Connection` in some other way, be more efficient?
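One way to answer the serialization question empirically is to time it in isolation. A rough sketch, with a hypothetical `EntryBean` standing in for `DBEntryBean` (the class name, fields, and `toBytes` helper are all invented for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class BlobCostSketch {

    // Hypothetical stand-in for DBEntryBean; any bean stored via
    // ObjectOutputStream must implement Serializable.
    static class EntryBean implements Serializable {
        private static final long serialVersionUID = 1L;
        String title;
        String url;

        EntryBean(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    // Serialize a bean into the byte[] that would go into the BLOB column.
    static byte[] toBytes(Serializable bean) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(bean);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        EntryBean b = new EntryBean("Some title", "http://example.com/item");
        int n = 100_000;
        long start = System.nanoTime();
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += toBytes(b).length;
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " serializations took " + ms + " ms ("
                + (total / n) + " bytes each)");
    }
}
```

If the per-bean time reported here is small relative to the time each row takes to insert, the serialization is not the bottleneck.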
Given the similarities of any new CSV files to the previous one would it be significantly faster to diff the two most recent ones and then run my program on the diff? (Of course this doesn't help for populating the DB initially)
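If the weekly file really is ~90% identical, one cheap way to approximate that diff is to insert only the lines that are new. A minimal sketch, assuming records are compared as whole lines and that the old file's lines fit in heap as a `HashSet` (`CsvDiff` and `addedLines` are names invented for this example; note it does not detect rows that were *removed* from the new file):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CsvDiff {

    // Lines present in newCsv but not in oldCsv. A record is compared as a
    // whole line; ~2.5M short lines in a HashSet is typically a few hundred
    // MB of heap at worst.
    static List<String> addedLines(Path oldCsv, Path newCsv) throws IOException {
        Set<String> old = new HashSet<>(Files.readAllLines(oldCsv, StandardCharsets.UTF_8));
        List<String> added = new ArrayList<>();
        for (String line : Files.readAllLines(newCsv, StandardCharsets.UTF_8)) {
            if (!old.contains(line)) {
                added.add(line);
            }
        }
        return added;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CsvDiff old.csv new.csv
        for (String line : addedLines(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(line);
        }
    }
}
```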
I'm setting the URL column as UNIQUE so I can use `INSERT OR IGNORE`, but testing this on smaller datasets (~10,000 lines) indicates that there is no performance gain compared to dropping the table and repopulating it. Is there a faster way to add only unique values?
Are there any obvious mistakes I'm making? (Again, I know very little about databases)
```java
public class DataBase {

    public static void main(String[] args) {
        Connection c = connect("db.db");
        createTable(c);
        addCSVToDatabase(c, "test.csv");
        disconnect(c);
    }

    public static void createTable(Connection c) {
        Statement stmt;
        String sql = "CREATE TABLE results("
                + "ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
                + "TITLE TEXT NOT NULL, "
                + "URL TEXT NOT NULL UNIQUE, "
                + "SELLER TEXT NOT NULL, "
                ...
                ...
                + "BEAN BLOB);";
        try {
            stmt = c.createStatement();
            stmt.executeUpdate(sql);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    public static void addCSVToDatabase(Connection c, String csvFile) {
        BufferedReader reader = null;
        DBEntryBean b;
        String[] vals;
        PreparedStatement pstmt = null;
        String sql = "INSERT OR IGNORE INTO results("
                + "TITLE, "
                + "URL, "
                ...
                ...
                + "SELLER, "
                + "BEAN"
                + ");";
        try {
            pstmt = c.prepareStatement(sql);
            reader = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), "UTF-8"));
            for (String line; (line = reader.readLine()) != null; ) {
                // Each line takes the form: "title|URL|...|...|SELLER"
                vals = line.split("\\|"); // '|' is a regex metacharacter, so it must be escaped
                b = new DBEntryBean();
                b.setTitle(vals[0]);
                b.setURL(vals[1]);
                ...
                ...
                b.setSeller(vals[n]);
                insert(b, pstmt);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            if (pstmt != null) {
                try {
                    pstmt.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void insert(DBEntryBean b, PreparedStatement pstmt) throws SQLException {
        pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());
        pstmt.setString(Constants.DB_COL_URL, b.getURL());
        ...
        ...
        pstmt.setString(Constants.DB_COL_SELLER, b.getSeller());
        // ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // oos = new ObjectOutputStream(baos);
        // oos.writeObject(b);
        // byte[] bytes = baos.toByteArray();
        // pstmt.setBytes(Constants.DB_COL_BEAN, bytes);
        pstmt.executeUpdate();
    }

    private static Connection connect(String path) {
        String url = "jdbc:sqlite:" + path;
        Connection conn = null;
        try {
            Class.forName("org.sqlite.JDBC");
            conn = DriverManager.getConnection(url);
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        return conn;
    }

    private static void disconnect(Connection c) {
        try {
            if (c != null) {
                c.close();
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
```
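The biggest win here is usually transactional: with autocommit on (the JDBC default), every `INSERT` in the loop runs in its own SQLite transaction, each of which must sync to disk. Wrapping the whole load in a single transaction and using JDBC batching typically changes the runtime by orders of magnitude. A sketch under the assumptions of a reduced column set (`TITLE`, `URL`, `SELLER`) and an already-parsed `records` iterable (both placeholders for the real column list and CSV loop):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsert {

    // Insert all records in one transaction, flushing batches periodically.
    static void insertAll(Connection c, Iterable<String[]> records) throws SQLException {
        String sql = "INSERT OR IGNORE INTO results(TITLE, URL, SELLER) VALUES (?, ?, ?)";
        boolean oldAutoCommit = c.getAutoCommit();
        c.setAutoCommit(false);                 // one transaction for the whole file
        try (PreparedStatement pstmt = c.prepareStatement(sql)) {
            int pending = 0;
            for (String[] vals : records) {
                pstmt.setString(1, vals[0]);
                pstmt.setString(2, vals[1]);
                pstmt.setString(3, vals[2]);
                pstmt.addBatch();
                if (++pending % 10_000 == 0) {  // flush periodically to bound memory
                    pstmt.executeBatch();
                }
            }
            pstmt.executeBatch();               // flush the remainder
            c.commit();                         // commit BEFORE closing the connection
        } catch (SQLException e) {
            c.rollback();                       // don't leave the transaction half-open
            throw e;
        } finally {
            c.setAutoCommit(oldAutoCommit);
        }
    }
}
```

Note that if an exception (or an early return) skips `c.commit()`, none of the rows are persisted; the run just looks faster while inserting nothing, so make sure the commit is actually reached on the success path.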
Comments:

- Every `INSERT` that you have within the loop will get its own transaction. That will be slow.
- I added `c.setAutoCommit(false);` just before my loop, and `c.commit()` in the `finally` block just after the loop, and although it's definitely faster, nothing gets inserted into the database. Do you have an example you could link? Or if you are able to modify my code and post it as an answer, I'll mark it correct if it speeds up my program.
- … `commit` correctly.
- … `sed` the csv file beforehand. Thanks again.