
I wrote a program to scan files, extract text, build a data set and export it to a CSV. My initial program was ugly, but it worked. I have since refactored it, but the latest version seems to run a lot slower than its previous rendition.

A brief rundown of the program:

  • GetFiles()
    • The program gets a folder to scan
    • It then sets a CSV file to write the data to
  • ReadFiles()
    • foreach loop - each file in the folder
      • creates a StreamReader
      • extracts each key/value pair in the file and adds it to a list
      • closes the StreamReader
  • AddToTable()
    • Creates a DataTable
    • foreach loop - each key/value pair in the list
      • if a key does not exist in the DataTable columns, add it as a column
      • build each row, based on the key/value
  • SaveFiles()
    • Creates a StreamWriter
    • builds the CSV based on the information in the DataTable.

```csharp
namespace FlirAdept_Plugin
{
    class FLIR_Datatable
    {
        public void GetFiles()
        {
            FolderBrowserDialog fbd = new FolderBrowserDialog();
            if (fbd.ShowDialog() == DialogResult.OK)
            {
                string filesToScan = fbd.SelectedPath;
                SaveFileDialog sfd = new SaveFileDialog();
                sfd.Filter = "CSV Files (*.csv)|*.csv|All Files (*.*)|*.*";
                sfd.FilterIndex = 1;
                sfd.RestoreDirectory = true;
                if (sfd.ShowDialog() == DialogResult.OK)
                {
                    Stream fileCheck = null;
                    if ((fileCheck = sfd.OpenFile()) != null)
                    {
                        fileCheck.Close();
                        string csvFile = sfd.FileName.ToString();
                        if (!csvFile.EndsWith(".csv"))
                            csvFile += ".csv";
                        List<Dictionary<string, string>> dictionary = ReadFiles(filesToScan);
                        DataTable table = AddToTable(dictionary);
                        SaveFiles(table, csvFile);
                    }
                }
            }
        }

        List<Dictionary<string, string>> ReadFiles(string filesToScan)
        {
            string[] files = Directory.GetFiles(filesToScan);
            List<string> errorFiles = new List<string>();
            Dictionary<string, string> record = new Dictionary<string, string>();
            List<Dictionary<string, string>> recordList = new List<Dictionary<string, string>>();
            foreach (string file in files)
            {
                string fileName = Path.GetFileName(file);
                string findText = ".0 index";
                string match = @"(\.\d index)|(\.\d\.label)|(\.\d\.value)";
                StreamReader reader = new StreamReader(file);
                string header = null;
                string data = null;
                List<string> fileData = new List<string>();
                record = new Dictionary<string, string>();
                if (file.Contains(".jpg"))
                {
                    reader = new StreamReader(file);
                    string line = null;
                    record.Add("Filename", fileName);
                    while ((line = reader.ReadLine()) != null)
                    {
                        if (line.Contains(findText))
                        {
                            break;
                        }
                    }
                    try
                    {
                        // Look for ".n" where 'n' is a digit
                        Match m = Regex.Match(line, match, RegexOptions.IgnoreCase);
                        // Read file, and split text to identify "Label" and "Value"
                        while (m.Success && line != null)
                        {
                            var result = Regex.Replace(line, match, $"{Environment.NewLine}$&");
                            var lines = result.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
                            foreach (string s in lines)
                            {
                                // Adds only the "metadata" lines to the List
                                if ((Regex.Match(s, match)).Success)
                                    fileData.Add(s);
                            }
                            line = reader.ReadLine();
                        }
                    }
                    catch (Exception e)
                    {
                        // If a file cannot be read, it is added to a list
                        // These files do not contain metadata
                        errorFiles.Add(fileName);
                    }
                    // read "Label" and compare to header
                    foreach (string s in fileData)
                    {
                        int start;
                        int end;
                        if (s.Contains(".label"))
                        {
                            start = s.IndexOf('"');
                            end = s.LastIndexOf('"');
                            if (start >= 0 && end > start)
                            {
                                header = s.Substring(start + 1, end - start - 1);
                                continue;
                            }
                        }
                        else if (s.Contains(".value"))
                        {
                            start = s.IndexOf('"');
                            end = s.LastIndexOf('"');
                            if (start >= 0 && end > start)
                            {
                                data = s.Substring(start + 1, end - start - 1).Replace(",", ".");
                                record.Add(header, "," + data);
                            }
                        }
                    }
                }
                recordList.Add(record);
                reader.Close();
            }
            return recordList;
        }

        DataTable AddToTable(List<Dictionary<string, string>> dataList)
        {
            DataTable table = new DataTable();
            DataColumn column;
            DataRow row;
            foreach (var item in dataList)
            {
                row = table.NewRow();
                foreach (var record in item)
                {
                    try
                    {
                        if (!table.Columns.Contains(record.Key))
                        {
                            column = new DataColumn();
                            column.ColumnName = record.Key.ToString();
                            column.DefaultValue = "";
                            table.Columns.Add(column);
                        }
                        row[record.Key.ToString()] = record.Value.ToString();
                    }
                    catch (Exception e)
                    {
                        MessageBox.Show(e.Message);
                    }
                }
                table.Rows.Add(row);
            }
            return table;
        }

        void SaveFiles(DataTable table, string csvFile)
        {
            StreamWriter writer = new StreamWriter(csvFile);
            string headerRow = "";
            string dataRow = "";
            foreach (DataColumn col in table.Columns)
            {
                headerRow += col + ",";
            }
            writer.WriteLine(headerRow.TrimEnd(','));
            foreach (DataRow row in table.Rows)
            {
                dataRow = "";
                foreach (string s in row.ItemArray)
                {
                    dataRow += s;
                }
                writer.WriteLine(dataRow);
            }
            writer.Close();
        }
    }
}
```

I don't know for certain, but I believe it might have something to do with the disposal of the StreamReader in my ReadFiles method. I do know for sure that it's something in that loop.

What is causing the program to slow down, and how can I fix it?

  • "Better than guessing is to use the profiler. It'll clearly show you any bottlenecks." – t3chb0t, Jun 4, 2018 at 6:15
  • "@t3chb0t that's how I was able to identify the bottleneck." – Ben, Jun 4, 2018 at 6:24
  • "Well, then it'd be great if you could add these benchmarks and screenshots to your question. If you identified them, then you also know exactly what is running slow... could you reveal your findings?" – t3chb0t, Jun 4, 2018 at 6:30
  • "@t3chb0t ok, will do." – Ben, Jun 4, 2018 at 6:30
  • ""seems to run a lot slower than its previous rendition" - have you measured the runtime using something like the time command or automated testing?" – Freiheit, Jun 4, 2018 at 13:10

2 Answers


You are creating the reader twice!

```csharp
StreamReader reader = new StreamReader(file);
reader = new StreamReader(file);
```

The first is not used.

You are concerned with reader not getting disposed. Put it in a using block:

```csharp
using (StreamReader rdr = new StreamReader(file))
{
}
```

You are creating strings in the loop. Move that outside the loop.

```csharp
string findText = ".0 index";
string match = @"(\.\d index)|(\.\d\.label)|(\.\d\.value)";
```

You are creating regex in the loop. That is expensive. Move that outside the loop.

```csharp
string pat = @"(\w+)\s+(car)";
// Instantiate the regular expression object once, outside the loop.
Regex r = new Regex(pat, RegexOptions.IgnoreCase);

// Inside the loop, only match the pattern against the text string:
Match m = r.Match(text);
```

You open the file then immediately close it. But then you open it in SaveFiles.

```csharp
if ((fileCheck = sfd.OpenFile()) != null)
{
    fileCheck.Close();
    // ...
    SaveFiles(table, csvFile);
```

Put this `StreamWriter writer = new StreamWriter(csvFile);` in a using block as well.

As mentioned in another answer DataTable is not very efficient.

I don't know how this Regex.Match(s, match) is finding a match. You replaced match just a few lines above.

Why add to fileData with fileData.Add(s) just to loop over it later? You already have s; process it right away.
Break that processing out into a method. List is efficient, but still.

It is not efficient to read all the file names into an array and then loop over them.

```csharp
string[] files = Directory.GetFiles(filesToScan);
```

Enumerate FileInfo instead; you then also get information about each file:

```csharp
public static List<Dictionary<string, string>> ParseJPGfiles(string directory, string csvFilename)
{
    string line;
    string header = null;
    string data;
    string findText = ".0 index";
    string lineMatchExpression = @"(\.\d index)|(\.\d\.label)|(\.\d\.value)";
    Regex lineMatch = new Regex(lineMatchExpression, RegexOptions.IgnoreCase);
    List<Dictionary<string, string>> recordList = new List<Dictionary<string, string>>();
    DirectoryInfo directoryInfo = new DirectoryInfo(directory);
    foreach (FileInfo fi in directoryInfo.EnumerateFiles())
    {
        if (fi.Extension == ".jpg")
        {
            string fileName = fi.Name;
            // a fresh record per file (reusing one dictionary would throw on the duplicate "FileName" key)
            Dictionary<string, string> record = new Dictionary<string, string>();
            record.Add("FileName", fileName);
            //filesParsed.Add(fileName);
            using (StreamReader reader = fi.OpenText())
            {
                // do all the processing here and be done with it
                while ((line = reader.ReadLine()) != null)
                {
                    if (line.Contains(findText))
                    {
                        break;
                    }
                }
                // line is null when findText was never found; guard before matching
                Match m = (line != null) ? lineMatch.Match(line) : Match.Empty;
                while (m.Success && line != null)
                {
                    var result = lineMatch.Replace(line, $"{Environment.NewLine}$&");
                    foreach (string s in result.Split(new[] { Environment.NewLine }, StringSplitOptions.None))
                    {
                        // Process only the "metadata" lines
                        if (lineMatch.Match(s).Success)
                        {
                            int start;
                            int end;
                            if (s.Contains(".label"))
                            {
                                start = s.IndexOf('"');
                                if (start >= 0)
                                {
                                    end = s.LastIndexOf('"');
                                    if (end > start)
                                    {
                                        header = s.Substring(start + 1, end - start - 1);
                                    }
                                }
                            }
                            else if (s.Contains(".value"))
                            {
                                start = s.IndexOf('"');
                                end = s.LastIndexOf('"');
                                if (start >= 0 && end > start)
                                {
                                    data = s.Substring(start + 1, end - start - 1).Replace(",", ".");
                                    record.Add(header, "," + data);
                                }
                            }
                        }
                    }
                    line = reader.ReadLine();
                }
                recordList.Add(record);
            }
        }
    }
    return recordList;
}
```
\$\endgroup\$
  • "Wow! You were right about the regex! It cut the whole run time in half!" – Ben, Jun 5, 2018 at 4:43
\$\begingroup\$

UI and logic

You're mixing UI and logic. What's the role of that FolderBrowserDialog there? Move UI things outside your logic and you'll be able to test it without user interaction. Minor note:

```csharp
if (value == something)
{
    // Do things...
}
```

Might be rewritten to:

```csharp
if (value != something)
    return;

// Do something
```

You'll then reduce indentation and increase readability.

Classes and methods

Your class is not static and it does not contain static methods, yet you do not have any instance data. If this is your real code then both your class and your methods should be static (or should take advantage of instance data to simplify them, if it makes sense).

Error handling

catch (Exception) is almost never the proper way to handle errors. Catch the exceptions you expect, and only them. Is ignoring OutOfMemoryException the right thing to do here? Under normal conditions I/O functions throw IOException and UnauthorizedAccessException; catch and deal with those, but do not swallow everything else, because that just hides bugs and erratic run-time behaviour.

Dispose resources properly

Do not close a stream directly with .Close(); use the using statement instead and it'll be closed correctly even when an error occurs. As it stands, if something fails while reading a file then you'll keep that file open until your program terminates.

Readability

In C# you do not need to declare all variables at the beginning of a method. You can also use var to simplify long type names. For example, record is declared at the beginning and then reassigned in the loop. The compiler is smart enough not to waste memory; declare it as late as possible (or - better - Clear() the dictionary instead of creating a new one).

Split your logic into multiple functions; it'll be much easier to read! Don't forget to use proper names: GetFiles() suggests that it's fetching the list of files to process, but it actually performs everything (find, read, process and write data).

In your question you felt the need to explain what each function does; the code should speak for itself.

Don't forget to follow C# naming conventions (see the next example). In this example the names are still generic, but with better domain knowledge you absolutely must pick better, more specific names.

```csharp
static class FlirConverter
{
    public static IEnumerable<string> FindFilesToProcess()
    {
        // FIND the files to process, nothing else
    }

    public static void ExtractMetadataAndWriteOutput(
        IEnumerable<string> inputPaths, string outputPath)
    {
        if (inputPaths == null)
            throw new ArgumentNullException(nameof(inputPaths));
        if (outputPath == null)
            throw new ArgumentNullException(nameof(outputPath));
        if (String.IsNullOrWhiteSpace(outputPath))
            throw new ArgumentException("...", nameof(outputPath));

        var metadata = ExtractMetadata(inputPaths);
        SaveDataTableAsCsv(CreateDataTable(metadata), outputPath);
    }

    static List<Dictionary<string, string>> ExtractMetadata(
        IEnumerable<string> inputPaths)
    {
        var data = new List<Dictionary<string, string>>();
        foreach (var path in inputPaths)
        {
            try
            {
                ExtractMetadata(path, data);
            }
            catch (IOException e)
            {
                // Do something...
            }
            catch (UnauthorizedAccessException e)
            {
                // Do something...
            }
        }
        return data;
    }

    static void ExtractMetadata(string path, List<Dictionary<string, string>> metadata)
    {
        // More...
    }
}
```

It's a little bit of a corner case, but List&lt;Dictionary&lt;string, string&gt;&gt; strongly urges me to extract a small private type to hide this implementation detail behind just one or two public methods. It's a matter of taste, I suppose.

I suppose you understand what I mean. The code is still far from perfect, but we're working on it. You may now want to apply a retry pattern for common IOExceptions (the file might be in use, for example).

Now you may want to remove some arguments by introducing instance fields.

When there is a standard function... use it: to check the file extension you can use Path.GetExtension(). Also do not forget that the Windows file system is case insensitive (but it stores the correct case), so .JPG and .jPg are both JPEG files. Use String.Equals() for the comparison.
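A minimal sketch of that check (the helper name is made up for illustration). Note that Path.GetExtension() also avoids the false positive a plain Contains(".jpg") gives for a name like notes.jpg.txt:

```csharp
using System;
using System.IO;

public static class ExtensionCheck
{
    // Hypothetical helper: case-insensitive JPEG extension test
    public static bool IsJpeg(string path)
    {
        string ext = Path.GetExtension(path); // includes the dot, e.g. ".JPG"
        return String.Equals(ext, ".jpg", StringComparison.OrdinalIgnoreCase);
    }
}
```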

You keep checking for line != null; do it once and do not repeat the check over and over (see later).
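The canonical read loop performs that check exactly once, in the while condition (a minimal sketch; the counting method is just a made-up stand-in for the per-line work):

```csharp
using System;
using System.IO;

public static class ReadLoop
{
    // The null check happens once, in the while condition;
    // the loop body never needs to test line again.
    public static int CountNonEmptyLines(TextReader reader)
    {
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length > 0)
                count++;
        }
        return count;
    }
}
```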

Performance

I/O is slow, and multi-threading may (or may not) speed things up, but it's extremely difficult to tune in a reliable manner. I seldom suggest going parallel unless there is a proven need and it's reasonably easy to test. After you have fixed everything else you may consider spreading the work over two threads with a simple (bounded!) Parallel.ForEach(). The right number of concurrent I/O threads isn't simple to determine (one per CPU? more or less with SSDs? what about other system I/O operations and other applications?) but 2-4 is usually a good starting point for tuning this value.
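If you do experiment with this, a minimal sketch of a bounded Parallel.ForEach might look as follows; the per-item work here is a trivial stand-in for per-file processing, and 2 is only a starting value to tune:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ParallelSketch
{
    // Bound the degree of parallelism explicitly instead of letting
    // the scheduler saturate the disk with I/O-bound work.
    public static ConcurrentBag<int> ProcessAll(int[] items)
    {
        var results = new ConcurrentBag<int>(); // thread-safe collector
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
        Parallel.ForEach(items, options, item =>
        {
            results.Add(item * 2); // stand-in for the real per-file work
        });
        return results;
    }
}
```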

Input

Here I'd first start with some simple things:

Do not process the full JPEG file! I'm not sure about what you're searching for but your current method is:

  • Not reliable because the matched pattern may be anywhere, even in the image data.

  • Reading image data as text may produce thousands of encoding exceptions.

  • Extremely slow because you'll process the image data when (I suppose) everything you need is only in the header. If you're just looking for metadata then this alone is probably the major performance hit.

  • You're reinventing the wheel. There is a plethora of good, easy-to-use and fast JPEG readers (even the plain System.Drawing.Bitmap has a decoder for JPEG). Use one.

When switching to a proper JPEG decoder you won't need these comments but, in general:

You're using Regex.Replace() to add newlines because you want to split with String.Split()... that's slow; just go through the matches directly.
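A sketch of that idea, using the pattern from the question (the method name is made up for illustration): the regex has already located every metadata token, so you can collect the matched substrings without any Replace()/Split() round-trip.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchWalk
{
    // Collect the matched substrings directly instead of injecting
    // newlines with Replace() and splitting the result afterwards.
    public static List<string> MetadataTokens(string line, Regex pattern)
    {
        var tokens = new List<string>();
        foreach (Match m in pattern.Matches(line))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```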

You search for a character with IndexOf() and then for the end with LastIndexOf(); for long strings it's faster to search forward for the closing character starting just after the first one, using the String.IndexOf(Char, Int32) overload.

Output

DataTable isn't the fastest thing out there. A dedicated CSV writer is probably much faster. Don't write one by hand unless the data you have to write is extremely simple and stable: CSV is slightly more complicated than you may think.
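For instance, even a hand-rolled writer needs at least RFC 4180-style field escaping, which the code in the question skips entirely. A minimal sketch (not a full CSV implementation):

```csharp
using System;

public static class Csv
{
    // RFC 4180-style escaping: quote any field that contains a comma,
    // a quote, or a line break, and double every embedded quote.
    public static string EscapeField(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
        if (!needsQuotes)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }
}
```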

\$\endgroup\$
  • "Thanks for the response! Unfortunately the way the information is searched for and handled and built for output is how it needs to be done. This handles all the potential situations for where, and how, the information is stored. As for your other suggestions, I'll look into them. Thanks again!" – Ben, Jun 4, 2018 at 22:58
