
I have k CSV files (5 CSV files, for example); each file has m fields that form a key and n value fields. I need to produce a single CSV file with the aggregated data.

I'm looking for the most efficient solution to this problem, mainly in terms of speed. I don't think we will have memory issues, by the way. I would also like to know whether hashing is really a good solution, because we would have to use a 64-bit hash to keep the chance of a collision below 1% (we have around 30,000,000 rows per aggregation).
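As a rough sanity check of that figure, assuming a well-distributed 64-bit hash over about 3 × 10^7 distinct keys, the birthday bound gives:

    p ≈ n^2 / (2 · 2^64) ≈ (3 × 10^7)^2 / (2 · 1.84 × 10^19) ≈ 2.4 × 10^-5

so a 64-bit hash should keep the overall collision probability well below the 1% target.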

For example:

file 1:
f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,50,60,70,80
a3,b2,c4,60,60,80,90

file 2:
f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,30,50,90,40
a3,b2,c4,30,70,50,90

result:
f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,80,110,160,120
a3,b2,c4,90,130,130,180

Algorithms we have thought of so far:

  1. hashing (using a ConcurrentHashMap; see the sketch at the end of the question)

  2. merge sorting the files

  3. DB: using MySQL, Hadoop, or Redis.

The solution needs to be able to handle a huge amount of data (each file has more than two million rows).

A better example:

file 1:

country,city,peopleNum
england,london,1000000
england,coventry,500000

file 2:

country,city,peopleNum
england,london,500000
england,coventry,500000
england,manchester,500000

merged file:

country,city,peopleNum
england,london,1500000
england,coventry,1000000
england,manchester,500000

The key is country,city. This is just an example; my real key has 6 columns and there are 8 data columns, for a total of 14 columns.

We would like the solution to be the fastest in terms of data processing.
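To make option 1 concrete, here is a minimal sketch of the kind of hash-based aggregation we have in mind, written for the two-column country/city example (the class name, file handling, and the assumption of long-valued counters are just placeholders for illustration):

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    public class CsvAggregator {
        // key = "country,city", value = running sum of peopleNum
        private final ConcurrentHashMap<String, LongAdder> sums = new ConcurrentHashMap<>();

        public void aggregateFile(Path file) throws Exception {
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                reader.readLine();                        // skip the header line
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(",");
                    String key = cols[0] + "," + cols[1]; // leading columns form the key
                    long value = Long.parseLong(cols[2]); // remaining columns are summed
                    sums.computeIfAbsent(key, k -> new LongAdder()).add(value);
                }
            }
        }

        public void print() {
            System.out.println("country,city,peopleNum");
            sums.forEach((key, sum) -> System.out.println(key + "," + sum.sum()));
        }

        public static void main(String[] args) throws Exception {
            CsvAggregator agg = new CsvAggregator();
            for (String name : args) {                    // one input file per argument
                agg.aggregateFile(Path.of(name));
            }
            agg.print();
        }
    }

The same structure would extend to our real layout (a 6-column key and 8 value columns) by storing a small array of counters per key, and the files could then be processed in parallel, one thread per file.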

  • Is this line number based? Should line 10 from file1 be aggregated with line 10 from file2? Commented Aug 7, 2013 at 15:01
  • for hashing, see Which hashing algorithm is best for uniqueness and speed? – gnat, Commented Aug 7, 2013 at 15:12
  • How many rows are we talking? Will it be practical to hold all the keys in memory at once? (What real performance requirements are we talking; would batch processing be adequate?) Commented Dec 22, 2013 at 16:00

2 Answers


Since the key fields are always the first columns, I would sort the source rows (without the header) by the keys and then advance through both files line by line, similar to the merge algorithm used in merge sort.

But instead of merging the two sorted CSV lists into one, you compute the sum for elements that appear in both lists. Elements that appear in only one list are simply copied.

The algorithm looks similar to this:

While (NOT EndOfFile(left-item) OR NOT EndOfFile(right-item))
    if (EndOfFile(right-item) OR left-item.key < right-item.key)
        store(left-item)
        advance left-item
    else if (EndOfFile(left-item) OR left-item.key > right-item.key)
        store(right-item)
        advance right-item
    else   // left-item.key = right-item.key
        store(sum(left-item, right-item))
        advance left-item
        advance right-item
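A minimal Java sketch of that merge step, assuming both inputs are already sorted by the same string comparison of their key columns and use the two-column country/city layout from the question (the key/sum helpers are simplified placeholders for illustration):

    import java.io.BufferedReader;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class SortedCsvMerger {

        public static void main(String[] args) throws Exception {
            try (BufferedReader left = Files.newBufferedReader(Path.of(args[0]));
                 BufferedReader right = Files.newBufferedReader(Path.of(args[1]));
                 PrintWriter out = new PrintWriter(Files.newBufferedWriter(Path.of(args[2])))) {

                out.println(left.readLine());   // copy the header from the left file
                right.readLine();               // skip the header of the right file

                String l = left.readLine();
                String r = right.readLine();
                while (l != null || r != null) {
                    int cmp = (l == null) ? 1 : (r == null) ? -1 : key(l).compareTo(key(r));
                    if (cmp < 0) {              // key present only in the left file: copy
                        out.println(l);
                        l = left.readLine();
                    } else if (cmp > 0) {       // key present only in the right file: copy
                        out.println(r);
                        r = right.readLine();
                    } else {                    // same key: sum the value columns
                        out.println(sum(l, r));
                        l = left.readLine();
                        r = right.readLine();
                    }
                }
            }
        }

        // the leading columns form the key (2 in this example)
        private static String key(String line) {
            String[] c = line.split(",");
            return c[0] + "," + c[1];
        }

        // sum the value columns of two rows that share a key (1 value column here)
        private static String sum(String a, String b) {
            String[] ca = a.split(",");
            String[] cb = b.split(",");
            long total = Long.parseLong(ca[2]) + Long.parseLong(cb[2]);
            return ca[0] + "," + ca[1] + "," + total;
        }
    }

With more than two input files you would either merge them pairwise or keep one cursor per file in a small priority queue ordered by key.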

I would use a database to store the results (and handle the transactions), and then write a small script that reads a file and inserts and/or updates the data accordingly (line by line). You can then run the script in parallel on all your input files, forking off as many processes as your machine can take, or even run it on multiple machines.

The database will be the bottleneck, though.
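A minimal sketch of such a loader script for the country/city example, assuming a MySQL table with a unique key on (country, city) and a plain JDBC batch upsert (the table, column, and connection details are made up for illustration):

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class CsvUpsertLoader {
        public static void main(String[] args) throws Exception {
            String upsert = "INSERT INTO population (country, city, peopleNum) VALUES (?, ?, ?) "
                          + "ON DUPLICATE KEY UPDATE peopleNum = peopleNum + VALUES(peopleNum)";

            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost/stats", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(upsert);
                 BufferedReader reader = Files.newBufferedReader(Path.of(args[0]))) {

                reader.readLine();                   // skip the header line
                String line;
                int batched = 0;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(",");
                    stmt.setString(1, cols[0]);      // country
                    stmt.setString(2, cols[1]);      // city
                    stmt.setLong(3, Long.parseLong(cols[2]));
                    stmt.addBatch();
                    if (++batched % 10_000 == 0) {   // flush in batches to limit round trips
                        stmt.executeBatch();
                    }
                }
                stmt.executeBatch();                 // flush the remainder
            }
        }
    }

Once every input file has been loaded this way, the merged CSV is just an export of a single SELECT over the table.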

  • Not sure this will scale. With hundreds of millions of rows per file this seems a better use case for breaking the files into smaller subsets and using map-reduce to make processing highly parallel. Hadoop would be a better solution. – maple_shaft, Commented Dec 21, 2013 at 13:20
