
I've found some resources about PHP performance optimization here and there, but I don't know where to start. I send 130k CSV rows in approximately 7 minutes. My script reads the CSV file and sends $batch_size rows per query, while doing some string operations such as changing dates to the correct format or replacing , with . for the floats.

$db = new PDO("mysql:host=$db_host;port=$db_port;charset=utf8","$db_username", "$db_password");
$csv_path = "path/to/myfile.csv";
$table = "TableName";

//Start of the query
$query="USE ".$db_name.";";
$FilePath = $csv_path;
$is_not_title_row=FALSE;//Useful not to take the first row
$row_batch_count=0;//Counter for batches
//1 data = 1 part of the row
$number_of_data=0;
$data='';

// Open file in read-only mode
$handle = fopen($FilePath, "r");
while (($data = fgetcsv($handle, 2000, ";" , chr(8))) !== FALSE) {
    $number_of_data = count($data)-1;
    // Generate MySQL query
    if($is_not_title_row) {
        $query .= "INSERT INTO ".$table." VALUES (";
        //We read all the data of one row
        for ($c=0; $c <= $number_of_data; $c++) {
            $data[$c]=utf8_decode($data[$c]);
            $data[$c]=str_replace('/', '-', $data[$c]);
            $data[$c]=str_replace(',', '.', $data[$c]);
            //If it's a date, we convert it to the right date format
            if (DateTime::createFromFormat('d-m-Y', $data[$c]) !== FALSE) {
                $data[$c] = date("Y-m-d", strtotime($data[$c]));
                $query .= "'" . mres($data[$c]) ."',";
            }
            //If there is nothing, we send NULL value
            else if(mres($data[$c])==NULL){
                $query .= "NULL,";
            }
            else{
                $query .= "'" . mres($data[$c]) ."',";
            }
            //If this is the end of the INSERT INTO we remove the comma
            if($c==$number_of_data) {
                $query=substr_replace($query ,"", -1);
            }
        }
        $query .= ");";
        $row_batch_count++;
        //If we are at the end of the batch, we send the query then start the creation of another
        if($row_batch_count==$batch_size) {
            echo "There is ".$row_batch_count." row sent<br/>";
            echo "<br/><br/>";
            $query_to_execute = $db->prepare($query);
            $query_to_execute->execute();
            $query="USE ".$db_name.";";
            $row_batch_count=0;
        }
    }
    //Useful to skip the first row
    if(!$is_not_title_row){
        $is_not_title_row=TRUE;
    }
}
//With no more batches left, we execute the remaining query (=we are at the end of the file)
$query_to_execute = $db->prepare($query);
$query_to_execute->execute();
$query_to_execute = NULL;

Solution

Instead of doing one INSERT INTO per row, I simply inserted multiple rows in a single INSERT INTO. The output query looks like this:

INSERT INTO Table VALUES (x,x,x),(x,x,x),(x,x,x),(... 

The CSV insertion is now 30 times faster 🚀
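For illustration, assembling such a batched query could look something like the following sketch (hypothetical: $rows holds one batch of already-escaped value arrays, while $table and $db are the variables from the script above; NULL handling is omitted):

$groups = [];
foreach ($rows as $row) {
    // Each row becomes one "(...)" group in the VALUES list
    $groups[] = "('" . implode("','", $row) . "')";
}
$query = "INSERT INTO " . $table . " VALUES " . implode(',', $groups) . ";";
$db->prepare($query)->execute();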

  • Welcome to Code Review! I changed the title so that it describes what the code does per site goals: "State what your code does in your title, not your main concerns about it." Feel free to edit and give it a different title if there is something more appropriate. – Commented Mar 4, 2021 at 16:50
  • Please provide more complete code - e.g. a complete function/method/class. It would be useful to know more about variables like $handle, $table, etc. "There are significant pieces of the core functionality missing, and we need you to fill in the details. Excerpts of large projects are fine, but if you have omitted too much, then reviewers are left imagining how your program works." – Commented Mar 4, 2021 at 18:09
  • I'm also missing a bit of input data. We are asked to understand and improve code for which we don't have the most important bits: input & output. I can certainly give some pointers to improve this code, but it would be nice to also provide a bit of code that actually works. – Commented Mar 4, 2021 at 18:43
  • I've added the input, which is the path to the CSV file. It seemed to me that the query execution was enough for the output, or should I add something to make it clearer? – Commented Mar 5, 2021 at 8:05
  • By "input" I meant some real data. Without real data there's, for instance, no way for us to understand why you chose chr(8) in fgetcsv(). That looks weird to me. It's also difficult to understand why you process the rows the way you do, without any idea of what the data looks like, and what it has to become. – Commented Mar 5, 2021 at 8:32

2 Answers


What I found greatly improves the speed of insert queries is inserting multiple rows using a single query. You seem to have found the same solution, but it is implemented in a very roundabout way.

Your code could also benefit a lot from using functions. So, let's start with the most important one: How to insert multiple rows. It could look something like this:

function insertRows($db, $table, $rows) {
    $values = '(' . implode(',', array_fill(0, count($rows[0]), '?')) . ')';
    $query  = 'INSERT INTO ' . $table . ' VALUES ' .
              implode(',', array_fill(0, count($rows), $values));
    $statement = $db->prepare($query);
    $index = 1;
    foreach ($rows as $row) {
        foreach ($row as $value) {
            $statement->bindValue($index++, $value);
        }
    }
    $statement->execute();
}

This bit of code does only one thing: insert a certain number of rows into the database as quickly as possible. It doesn't read the data or change it in any way, but it does use parameter binding for extra security. That's probably not needed in this case, but it also doesn't hurt. No more problems with quotes.
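For illustration, a hypothetical call could look like this (the table name and values are made up):

insertRows($db, 'TableName', [
    ['2021-03-04', '1.5', 'abc'],
    ['2021-03-05', '2.7', 'def'],
]);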

You can insert quite a few rows using this function, somewhere around 50 to 150 rows, depending on what your input looks like. Above a certain number the speed simply won't increase anymore, so there's no point in going to extremely large numbers of rows.
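If you want to find the sweet spot for your own data, a rough timing sketch could look like this (untested; it assumes a hypothetical $allRows array holding all reformatted rows, and you would want to empty the table between runs):

foreach ([25, 50, 100, 150, 300] as $batchSize) {
    $start = microtime(true);
    // Split the rows into batches of $batchSize and insert each batch
    foreach (array_chunk($allRows, $batchSize) as $chunk) {
        insertRows($db, $table, $chunk);
    }
    // Report how long this batch size took; truncate the table before the next run
    echo $batchSize . " rows/batch: " . round(microtime(true) - $start, 2) . "s\n";
}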

I would create another function for changing the row data from the format you received to the format you want to insert:

function reformatRow($row) {
    // <...put your code here...>
    return $row;
}
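Based on the string operations in the question, one possible body could be the following (untested; whether empty fields should become NULL is an assumption taken from the original NULL handling):

function reformatRow($row) {
    foreach ($row as $key => $value) {
        // Same transformations as the original script
        $value = str_replace(['/', ','], ['-', '.'], utf8_decode($value));
        // If it looks like a date, normalise it to Y-m-d
        if (DateTime::createFromFormat('d-m-Y', $value) !== FALSE) {
            $value = date('Y-m-d', strtotime($value));
        }
        // Assumption: empty fields become NULL, as in the original query building
        $row[$key] = ($value === '') ? NULL : $value;
    }
    return $row;
}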

We can now also implement a function that does the above for multiple rows:

function reformatRows($rows) {
    foreach ($rows as $key => $row) {
        $rows[$key] = reformatRow($row);
    }
    return $rows;
}

Functions don't have to be complex to be useful. I actually will not use this function, but I thought it was nice to show it.

All that's now left to do is to read the input, so let's do that. I will use parts of your code:

const MAX_ROWS = 100;

fgetcsv($handle, 2000, ";", chr(8)); // read the header row and discard it
$rows = [];
while (($row = fgetcsv($handle, 2000, ";", chr(8))) !== FALSE) {
    $rows[] = reformatRow($row);
    if (count($rows) >= MAX_ROWS) {
        insertRows($db, $table, $rows);
        $rows = [];
    }
}
if (count($rows) > 0) {
    insertRows($db, $table, $rows);
}

I haven't fully tested all this code.

I also don't think this code will be any quicker than yours. There's also no way for me to test any optimization, since I don't have your data, so I'm afraid I can't help you there. Reading data into a database just takes time. It is possible that fgetcsv() is not the optimal way to read a big CSV file, but I haven't tested that.

What I have tried to make clear is that by splitting your code into functional parts it becomes easier to understand, and more reusable. The insertRows() function, for instance, is partly reused from my own code. Once you've written a function like that, and it works, you can reuse it everywhere.

  • Unless you specifically don't want to explain it, you could suggest using the row index instead of starting with $index = 1, so the foreach would be foreach ($rows as $index => $row) { and then, instead of incrementing it after using it, use $index + 1. – Commented Mar 4, 2021 at 19:54
  • @SᴀᴍOnᴇᴌᴀ I edited my answer. I didn't quite get what you were trying to say, but I hope this is an improvement. – Commented Mar 5, 2021 at 8:30

Initial thoughts

I agree with the ideas in the answer by KIKOSoftware (which is deleted - hopefully only temporarily). Creating functions can not only clean up the code but also allow for better testing. Using bound parameters is always a great idea. And using arrays to hold the strings and imploding them will eliminate the need to remove trailing commas.

Standards for readability

Consider adhering to PHP Standards Recommendations - especially PSR-12. The biggest concern I have for readability with this code is whitespace - e.g. around operators like =:

6.2. Binary operators

All binary arithmetic, comparison, assignment, bitwise, logical, string, and type operators MUST be preceded and followed by at least one space:

if ($a === $b) {
    $foo = $bar ?? $a ?? $b;
} elseif ($a > $b) {
    $foo = $a + $b * $c;
}


replacing characters

Instead of calling str_replace() multiple times - e.g.

$data[$c]=str_replace('/', '-', $data[$c]);
$data[$c]=str_replace(',', '.', $data[$c]);

Consider passing arrays for the search and replace arguments:

$data[$c]=str_replace(['/', ','], ['-', '.'], $data[$c]); 

You might also explore using strtr() instead of str_replace() to see if it performs better - in some cases it does, in others it doesn't.
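For reference, the equivalent strtr() call takes an array of from => to pairs:

$data[$c] = strtr($data[$c], ['/' => '-', ',' => '.']);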

Comparison operators

There are two places that use loose equality comparisons:

 else if(mres($data[$c])==NULL){ 

and

//If this is the end of the INSERT INTO we remove the comma
if($c==$number_of_data)

There are other places where strict equality comparisons are used - i.e. === and !== - and it is advisable to always use those unless you are sure you want to allow types to be coerced.
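To illustrate the difference with plain PHP values (unrelated to the script itself):

var_dump('' == NULL);   // bool(true)  - empty string loosely equals NULL
var_dump(0 == NULL);    // bool(true)  - zero loosely equals NULL too
var_dump('' === NULL);  // bool(false) - strict comparison also checks the type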

bonus - an external webpage

I happened to search the web for "CSV to SQL" and found this tool that could replace the script...

