I have multiple structured binary files, each about 2 GB, which I currently read in pairs using np.memmap in order to cross-correlate them. I want to minimise the time this I/O step takes in my code.
I am implementing this as a Cython function. Copying the memmap array into a NumPy array is quite fast (~3 sec) when the same set of files is processed a second time, but it takes much longer (~71 sec) when new files are read, possibly because of the page cache. The same happens with numpy.fromfile.
What is the most efficient way to copy a memmap into a NumPy array?
Any suggestions are appreciated.
Code used:
    comf = np.memmap(file_name, dtype = dt, mode = 'c')
    comf1 = np.memmap(file_name1, dtype = dt, mode = 'c')
    cdef np.ndarray tempcomf = np.zeros((templen, 1024), dtype = np.int8)
    cdef np.ndarray tempcomf1 = np.zeros((templen1, 1024), dtype = np.int8)
    tempcomf = comf['data']
    tempcomf1 = comf1['data']
EDIT:
Here is the function used:
    cpdef tuple decrypt_file(file_name, file_name1):
        cdef long long int templen = 0
        cdef long long int templen1 = 0
        cdef np.ndarray tempcomf = np.zeros((templen, 1024), dtype=np.int8)
        cdef np.ndarray tempcomf1 = np.zeros((templen1, 1024), dtype=np.int8)
        dt = np.dtype([('header', 'S8'), ('Source', 'S10'), ('header_rest', 'S10'),
                       ('Packet', '>u4'), ('data', '>i1', 1024)])
        comf = np.memmap(file_name, dtype = dt, mode = 'c')
        comf1 = np.memmap(file_name1, dtype = dt, mode = 'c')
        templen = comf['Packet'][-1] - comf['Packet'][0]
        templen1 = comf1['Packet'][-1] - comf1['Packet'][0]
        t_1 = time.time()
        tempcomf = comf['data']
        tempcomf1 = comf1['data']
        print('Time take for memarray copy...' + str(time.time() - t_1))
        tempcomf = tempcomf.ravel()
        tempcomf1 = tempcomf1.ravel()
        tempcomf_X = np.array(tempcomf[1::2], order = 'F')
        tempcomf_Y = np.array(tempcomf[0::2], order = 'F')
        tempcomf1_X = np.array(tempcomf1[1::2], order = 'F')
        tempcomf1_Y = np.array(tempcomf1[0::2], order = 'F')
        return tempcomf_X, tempcomf_Y, tempcomf1_X, tempcomf1_Y
Input data structure: each record in the binary file has a 32-byte header followed by 1024 bytes of data. The focus is on reading the data portion from the memmap array into a NumPy array.
This is the function where the files are read and the data is separated from the header. If the same file set is given twice, the memory copy takes ~2 sec, but when a different set of files is given, the copy takes ~72 sec.
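To make the record layout concrete, here is a minimal sketch that rebuilds the dtype from the function above and checks the header and record sizes. The variable names `header_bytes` and `record_bytes` are only illustrative and do not appear in the original code.

```python
import numpy as np

# Same structured dtype as in decrypt_file above.
dt = np.dtype([('header', 'S8'), ('Source', 'S10'), ('header_rest', 'S10'),
               ('Packet', '>u4'), ('data', '>i1', 1024)])

# dt.fields maps each field name to (dtype, byte offset).
header_bytes = dt.fields['data'][1]   # bytes that precede 'data' in a record
record_bytes = dt.itemsize            # header plus 1024 data bytes

print(header_bytes, record_bytes)
```

Because every 1024-byte data block is preceded by its own header, the 'data' field of consecutive records is not contiguous in the file: each row starts `record_bytes` after the previous one.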
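When timing this, it may help to check which statement actually performs the copy. The sketch below uses a tiny synthetic file (the file name `demo.bin` and the record count are made up for illustration) and suggests that indexing a field of a memmap gives a strided view onto the mapping, while the bytes are only pulled into a separate array by an explicit copy such as `np.ascontiguousarray`. This is a sketch under those assumptions, not a measurement of the real files.

```python
import os
import tempfile
import numpy as np

dt = np.dtype([('header', 'S8'), ('Source', 'S10'), ('header_rest', 'S10'),
               ('Packet', '>u4'), ('data', '>i1', 1024)])

# Hypothetical tiny file with 4 records, written only for this demonstration.
n_records = 4
records = np.zeros(n_records, dtype=dt)
records['Packet'] = np.arange(n_records)
records['data'] = (np.arange(n_records * 1024) % 100).astype(np.int8).reshape(n_records, 1024)

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
records.tofile(path)

comf = np.memmap(path, dtype=dt, mode='c')   # copy-on-write mapping, as in the question

view = comf['data']                          # field access: no data copied yet
is_view = np.shares_memory(view, comf)       # the view still points into the mapping
contig = view.flags['C_CONTIGUOUS']          # rows are separated by the 32-byte headers

copied = np.ascontiguousarray(view)          # this step reads the pages and copies them

print(is_view, contig, view.strides, copied.flags['C_CONTIGUOUS'])
```

If this holds for the real files too, the disk reads would be triggered by the first operation that touches every element (for example `ravel()` on the non-contiguous view), so that is the statement worth timing.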
EDIT - MORE INFORMATION
After further investigation, I found that this problem does stem from the page cache. As a test I cleared the cache (echo 3 > /proc/sys/vm/drop_caches), which made the copy of the memmap array to a NumPy array (into volatile memory) take longer.
To confirm this, I pre-cached the binary files into memory using vmtouch; the copy (memmap to NumPy array) then takes ~3 sec.
The problem is not yet solved, though, because the pre-caching itself takes ~52 sec with vmtouch. Still, the cause is clearly related to the page cache.
vmtouch output:
    vmtouch -vt /data/ch01_SOURCE_Binary_20201011_110101.bin
    [OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 522720/522720

               Files: 1
         Directories: 0
       Touched Pages: 522720 (1G)
             Elapsed: 52.403 seconds
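As a rough sanity check on these numbers, the effective read rate can be worked out from the vmtouch output. The 4096-byte page size is an assumption (vmtouch does not print it here), and the calculation treats the last page as full.

```python
# Figures taken from the vmtouch output above.
touched_pages = 522720
elapsed_s = 52.403

PAGE_SIZE = 4096           # assumption: 4 KiB pages
record_bytes = 32 + 1024   # header plus data, per the record layout

total_bytes = touched_pages * PAGE_SIZE
n_records, leftover = divmod(total_bytes, record_bytes)
throughput_mb_s = total_bytes / elapsed_s / 1e6

print(f"{total_bytes} bytes, {n_records} records (+{leftover} bytes), "
      f"{throughput_mb_s:.1f} MB/s")
```

Under these assumptions the file is read at roughly 41 MB/s, which would point to the time being spent waiting on the storage device rather than in the memmap-to-NumPy copy itself.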