\$\begingroup\$

I have multiple structured binary files, each 2 GB in size, which I currently read in pairs using memmap in order to cross-correlate them. I want to minimise the time this I/O step takes in the code.

I am implementing this as a Cython function. Copying the memmap array to a NumPy array is quite fast (~3 sec) when the same set of files is processed a second time, but it takes much longer (~71 sec) when new files are read, possibly because the new files are not yet in the page cache. The same happens with numpy fromfile.

What is the most efficient way of copying a memmap to a NumPy array?

Any suggestions are appreciated.

Code used:

    comf = np.memmap(file_name, dtype = dt, mode = 'c')
    comf1 = np.memmap(file_name1, dtype = dt, mode = 'c')
    cdef np.ndarray tempcomf = np.zeros((templen, 1024), dtype = np.int8)
    cdef np.ndarray tempcomf1 = np.zeros((templen1, 1024), dtype = np.int8)
    tempcomf = comf['data']
    tempcomf1 = comf1['data']

EDIT:

Here is the function used:

    import time

    import numpy as np
    cimport numpy as np

    cpdef tuple decrypt_file(file_name, file_name1):
        cdef long long int templen = 0
        cdef long long int templen1 = 0
        cdef np.ndarray tempcomf = np.zeros((templen, 1024), dtype=np.int8)
        cdef np.ndarray tempcomf1 = np.zeros((templen1, 1024), dtype=np.int8)

        # One record = 32-byte header + 1024 bytes of data
        dt = np.dtype([('header', 'S8'), ('Source', 'S10'), ('header_rest', 'S10'),
                       ('Packet', '>u4'), ('data', '>i1', 1024)])
        comf = np.memmap(file_name, dtype=dt, mode='c')
        comf1 = np.memmap(file_name1, dtype=dt, mode='c')

        templen = comf['Packet'][-1] - comf['Packet'][0]
        templen1 = comf1['Packet'][-1] - comf1['Packet'][0]

        t_1 = time.time()
        tempcomf = comf['data']
        tempcomf1 = comf1['data']
        print('Time taken for memarray copy...' + str(time.time() - t_1))

        tempcomf = tempcomf.ravel()
        tempcomf1 = tempcomf1.ravel()

        # De-interleave the samples: odd indices are X, even indices are Y
        tempcomf_X = np.array(tempcomf[1::2], order='F')
        tempcomf_Y = np.array(tempcomf[0::2], order='F')
        tempcomf1_X = np.array(tempcomf1[1::2], order='F')
        tempcomf1_Y = np.array(tempcomf1[0::2], order='F')

        return tempcomf_X, tempcomf_Y, tempcomf1_X, tempcomf1_Y
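As a rough sketch of one way the X/Y split in the function above might be done with fewer intermediate copies: the 1024 samples of each record can be viewed as 512 (Y, X) pairs, so that each channel is copied out of the mapped file once, without a full `ravel()` copy first. The file `demo.bin` and its random contents are made up for illustration, and whether this is actually faster on the real files would need measuring.

```python
import os
import tempfile

import numpy as np

# Record layout taken from the function above.
dt = np.dtype([('header', 'S8'), ('Source', 'S10'), ('header_rest', 'S10'),
               ('Packet', '>u4'), ('data', '>i1', 1024)])

# Made-up test file with 4 records of random samples (not the real data).
rng = np.random.default_rng(0)
records = np.zeros(4, dtype=dt)
records['data'] = rng.integers(-128, 128, size=(4, 1024), dtype=np.int8)
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
records.tofile(path)

comf = np.memmap(path, dtype=dt, mode='c')

# View the 1024 samples of each record as 512 (Y, X) pairs; this is still a view.
pairs = comf['data'].reshape(len(comf), 512, 2)

# One strided copy per channel, straight from the mapped file into RAM.
Y = np.ascontiguousarray(pairs[:, :, 0]).reshape(-1)   # same samples as flat[0::2]
X = np.ascontiguousarray(pairs[:, :, 1]).reshape(-1)   # same samples as flat[1::2]

print(X.shape, Y.shape)
```

This only changes how many times the data is copied in RAM; it does not change how long the first read from disk takes.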

Input data structure: each record in the binary file has a 32-byte header followed by 1024 bytes of data. The focus is on copying the data part from the memmap array into a NumPy array.
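The layout described above can be checked with a small, self-contained sketch. The dtype is the one from the function; the file `demo.bin` and its three records are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Record layout from the function above: 32-byte header + 1024 data bytes.
dt = np.dtype([('header', 'S8'), ('Source', 'S10'), ('header_rest', 'S10'),
               ('Packet', '>u4'), ('data', '>i1', 1024)])

header_bytes = dt.fields['data'][1]   # byte offset at which the 'data' field starts
record_bytes = dt.itemsize            # total size of one record in bytes

# Hypothetical test file with three records (not the real data).
records = np.zeros(3, dtype=dt)
records['Packet'] = [10, 11, 12]
records['data'] = (np.arange(3 * 1024) % 100).astype(np.int8).reshape(3, 1024)

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
records.tofile(path)

# Map the file with the same dtype and mode as in the question.
comf = np.memmap(path, dtype=dt, mode='c')
print(header_bytes, record_bytes, comf.shape, comf['data'].shape)
```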

This is the function in which the files are read and the data is separated from the header. If the same set of files is given twice, the memory copy takes ~2 sec, but when a different set of files is given, the copy takes ~72 sec.
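One detail that may matter when timing the copy: in NumPy, selecting a field from a structured array (a memmap included) returns a view, so the file contents are only read once something forces a real copy. The sketch below illustrates this on a made-up four-record file. `np.ascontiguousarray` is used here as one way of forcing the copy, not necessarily the fastest one.

```python
import os
import tempfile

import numpy as np

dt = np.dtype([('header', 'S8'), ('Source', 'S10'), ('header_rest', 'S10'),
               ('Packet', '>u4'), ('data', '>i1', 1024)])

# Made-up file with four empty records, for illustration only.
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
np.zeros(4, dtype=dt).tofile(path)

comf = np.memmap(path, dtype=dt, mode='c')

view = comf['data']                          # strided view into the mapping, no copy
copy = np.ascontiguousarray(comf['data'])    # forces the read and the copy into RAM

print(np.shares_memory(comf, view))          # does the view alias the mapping?
print(np.shares_memory(comf, copy))          # does the copy alias the mapping?
print(view.flags['C_CONTIGUOUS'], copy.flags['C_CONTIGUOUS'])
```

If that applies here, the time measured around the two field assignments may not include the disk read, which would then happen in the later `ravel()` and `np.array` calls.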

EDIT - MORE INFORMATION

After further investigation, I found that this problem does indeed stem from the page cache. As a test, I cleared the cache (echo 3 > /proc/sys/vm/drop_caches), after which copying the memmap array to a NumPy array (into volatile memory) takes the longer time again.

To confirm this, I pre-cached the binary files into memory using vmtouch; the copy (memmap to NumPy array) then takes ~3 sec.

So the cause of the problem is the page cache, but a solution has not yet been found, because the pre-caching itself takes ~52 sec when done with vmtouch.
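For comparison, the pre-caching step can also be done from Python by reading the file sequentially in large chunks, which has a similar effect to `vmtouch -t`. The helper name `warm_cache` and the chunk size below are hypothetical choices, and the tiny `demo.bin` stands in for the real 2 GB files.

```python
import os
import tempfile


def warm_cache(path, chunk_bytes=16 * 1024 * 1024):
    """Read `path` sequentially so its pages end up in the OS page cache.

    Hypothetical helper; returns the number of bytes read.
    """
    buf = bytearray(chunk_bytes)      # reused buffer, so nothing large is allocated per chunk
    view = memoryview(buf)
    total = 0
    with open(path, 'rb', buffering=0) as f:   # unbuffered: read straight into `buf`
        while True:
            n = f.readinto(view)
            if not n:                 # 0 bytes read means end of file
                break
            total += n
    return total


# Tiny stand-in file: three 1056-byte records.
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(path, 'wb') as f:
    f.write(b'\x00' * (3 * 1056))

bytes_read = warm_cache(path, chunk_bytes=1024)
print(bytes_read)
```

This would not make the disk itself any faster. Running it in a background thread for the next pair of files while the current pair is being correlated might hide some of the wait, but that is untested here.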

vmtouch OUTPUT:

    vmtouch -vt /data/ch01_SOURCE_Binary_20201011_110101.bin
    [OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 522720/522720

               Files: 1
         Directories: 0
       Touched Pages: 522720 (1G)
             Elapsed: 52.403 seconds
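The read rate implied by the vmtouch output can be worked out as follows. The 4096-byte page size is an assumption (it is the usual value on x86-64 Linux and is consistent with a 2 GB file); the other two numbers are taken from the output.

```python
PAGE_BYTES = 4096      # assumption: typical x86-64 Linux page size
pages = 522720         # "Touched Pages" from the vmtouch output
elapsed_s = 52.403     # "Elapsed" from the vmtouch output

total_bytes = pages * PAGE_BYTES
throughput_mb_s = total_bytes / elapsed_s / 1e6   # decimal megabytes per second

print(f"{total_bytes / 1e9:.2f} GB in {elapsed_s} s -> {throughput_mb_s:.1f} MB/s")
```

That comes to roughly 40 MB/s, which suggests the first read is limited by the storage device rather than by the memmap-to-NumPy copy.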
\$\endgroup\$
  • \$\begingroup\$Read this article, which may help you: pythonspeed.com/articles/reduce-memory-array-copies\$\endgroup\$
    – camp0
    Commented Nov 5, 2020 at 20:19
  • 1
    \$\begingroup\$I feel that this question is more suitable for stackoverflow.com\$\endgroup\$
    – user228914
    Commented Nov 5, 2020 at 20:32
  • 2
    \$\begingroup\$@AryanParekh I disagree in this case: issues of performance are expressly permitted on CR. This question needs work, but for reasons of missing context, not due to its subject.\$\endgroup\$
    Commented Nov 5, 2020 at 20:53
  • 2
    \$\begingroup\$And I agree with @Reinderien that the question is missing context; there is no problem with asking performance-related questions on CR.\$\endgroup\$
    – pacmaninbw
    Commented Nov 5, 2020 at 20:55
  • 1
    \$\begingroup\$I have added the complete code.\$\endgroup\$
    Commented Nov 5, 2020 at 21:34
