I have k CSV files (5 CSV files, for example). Each file has m fields that form a key and n value columns. I need to produce a single CSV file with the aggregated data.

I'm looking for the most efficient solution to this problem, mainly in terms of speed; I don't think we will have memory issues. I would also like to know whether hashing is really a good solution, since we would have to use a 64-bit hash to keep the chance of a collision below 1% (we have around 30,000,000 rows per aggregation).
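My rough estimate for the collision chance (a birthday-bound approximation, assuming a well-distributed 64-bit hash and one distinct key per row in the worst case): with n ≈ 3×10^7 keys and 2^64 possible hash values, the probability of any collision is about n^2 / (2 · 2^64) ≈ 9×10^14 / 3.7×10^19 ≈ 2.4×10^-5, i.e. roughly 0.002%, well under the 1% target.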
For example:

file 1:
f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,50,60,70,80
a3,b2,c4,60,60,80,90

file 2:
f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,30,50,90,40
a3,b2,c4,30,70,50,90

result:
f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,80,110,160,120
a3,b2,c4,90,130,130,180
Algorithms we have thought of so far:

1. hashing (using a concurrent hash table, e.g. Java's ConcurrentHashMap) - see the sketch after this list
2. merge-sorting the files
3. a DB: MySQL, Hadoop, or Redis
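To make option 1 concrete, here is a minimal sketch of what we have in mind. It assumes the column layout of the first example; the class name CsvAggregator, the merged.csv output path, and the column-count constants are placeholders, and it assumes the values fit in a long and that fields never contain embedded commas or quotes:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

// Minimal sketch of option 1: every input file is processed in parallel and
// the per-key sums are accumulated in one shared ConcurrentHashMap.
public class CsvAggregator {
    static final int KEY_COLS = 3;    // f1,f2,f3 in the first example above
    static final int VALUE_COLS = 4;  // v1,v2,v3,v4

    public static void main(String[] args) throws IOException {
        ConcurrentHashMap<String, long[]> totals = new ConcurrentHashMap<>();

        // One file per worker thread via a parallel stream over the file paths.
        Arrays.stream(args).parallel()
              .forEach(path -> aggregateFile(Paths.get(path), totals));

        // Write the merged result; key order is not guaranteed here.
        try (var out = Files.newBufferedWriter(Paths.get("merged.csv"))) {
            out.write("f1,f2,f3,v1,v2,v3,v4\n");
            for (Map.Entry<String, long[]> e : totals.entrySet()) {
                StringBuilder row = new StringBuilder(e.getKey());
                for (long v : e.getValue()) row.append(',').append(v);
                out.write(row.append('\n').toString());
            }
        }
    }

    static void aggregateFile(Path file, ConcurrentHashMap<String, long[]> totals) {
        try (Stream<String> lines = Files.lines(file)) {
            lines.skip(1).forEach(line -> {              // skip the header row
                String[] cols = line.split(",");
                // The first KEY_COLS columns form the composite key.
                String key = String.join(",", Arrays.copyOfRange(cols, 0, KEY_COLS));
                long[] values = new long[VALUE_COLS];
                for (int i = 0; i < VALUE_COLS; i++) {
                    values[i] = Long.parseLong(cols[KEY_COLS + i]);
                }
                // merge() adds this row into the running totals atomically per key.
                totals.merge(key, values, (oldV, newV) -> {
                    long[] sum = new long[oldV.length];
                    for (int i = 0; i < sum.length; i++) sum[i] = oldV[i] + newV[i];
                    return sum;
                });
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that this sketch keys the map on the raw joined key string rather than on a 64-bit hash of it, which sidesteps the collision question entirely at the cost of larger keys.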
The solution needs to be able to handle a huge amount of data (each file has more than two million rows).
A better example:

file 1:
country,city,peopleNum
england,london,1000000
england,coventry,500000

file 2:
country,city,peopleNum
england,london,500000
england,coventry,500000
england,manchester,500000

merged file:
country,city,peopleNum
england,london,1500000
england,coventry,1000000
england,manchester,500000
The key here is country,city. This is just an example; my real key has 6 columns and there are 8 data columns, for a total of 14 columns.

We would like the solution to be as fast as possible in terms of data processing.