
I am working on a relatively simple binning program, where I take a 5D array and bin it based on two 3D arrays to create a contour plot. See the sample code below. In actuality, my arrays are of size [27, 150, 20, 144, 288], so running a 4-nested for loop as shown below takes a LONG time.

Is there any way to speed up this loop and ideally avoid the need for all these loops? Apologies in advance if this isn't clear - I am new to all this!

```python
import numpy as np
import matplotlib.pyplot as plt

S_mean = np.random.rand(5, 10, 10, 10)
T_mean = np.random.rand(5, 10, 10, 10)
Volume_mean = np.random.rand(2, 5, 10, 10, 10)
T_bins = np.linspace(0, 1, 36)
S_bins = np.linspace(0, 1, 100)

int_temp = []
int_sal = []
for i in range(5):
    int_temp.append(np.digitize(np.ndarray.flatten(T_mean[i,:,:,:]), T_bins))
    int_sal.append(np.digitize(np.ndarray.flatten(S_mean[i,:,:,:]), S_bins))

volume_sum = np.zeros((2, 5, S_bins.size, T_bins.size))

# This is the problem loop
for k in range(2):
    for l in range(5):
        for i in range(T_bins.size):
            for j in range(S_bins.size):
                volume_sum[k,l,j,i] = (np.nansum(
                    np.ndarray.flatten(Volume_mean[k,l,:,:,:])
                    [np.argwhere(np.logical_and(int_temp[l] == i, int_sal[l] == j))]))

# The output I am trying to get
plt.pcolormesh(T_bins, S_bins, volume_sum[0,0,:,:])
plt.show()
```

1 Answer


    "Relatively simple" is in the eye of the beholder. Some ideas though:

    This code

```python
int_temp = []
int_sal = []
for i in range(5):
    int_temp.append(np.digitize(np.ndarray.flatten(T_mean[i,:,:,:]), T_bins))
    int_sal.append(np.digitize(np.ndarray.flatten(S_mean[i,:,:,:]), S_bins))
```

can be transformed into two list comprehensions:

```python
int_temp = np.array([np.digitize(T_mean[i, :, :, :].flatten(), T_bins)
                     for i in range(5)])
int_sal = np.array([np.digitize(S_mean[i, :, :, :].flatten(), S_bins)
                    for i in range(5)])
```

I also converted the results into numpy arrays, for reasons I'll come back to later. Also take note that np.ndarray.flatten(T_mean[i,:,:,:]) was rewritten as T_mean[i, :, :, :].flatten(), which is the usual way to use np.ndarray.&lt;fun&gt;: call whatever.&lt;fun&gt;() (&lt;fun&gt; is meant to be a placeholder here, not actual code).
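A quick way to convince yourself that the two spellings are equivalent, using a small made-up array:

```python
import numpy as np

# np.ndarray.flatten is the unbound method; calling it with the array as
# argument is the same as calling .flatten() on the array directly.
a = np.arange(8).reshape(2, 4)

flat_unbound = np.ndarray.flatten(a)  # unbound method, array passed explicitly
flat_method = a.flatten()             # the usual spelling

print(np.array_equal(flat_unbound, flat_method))  # True
```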

    The next part of the code that caught my eye was

```python
np.ndarray.flatten(volume_mean[k, l, :, :, :])[np.argwhere(
    np.logical_and(int_temp[l] == i, int_sal[l] == j))]
```

    This line is very hard to read, but can easily be rewritten as:

```python
indices = np.argwhere(np.logical_and(int_temp[l] == i, int_sal[l] == j))
volume_sum[k, l, j, i] = np.nansum(volume_mean[k, l, :, :, :].flatten()[indices])
```

which looks much less frightening. You can even get rid of np.argwhere and use np.logical_and(int_temp[l] == i, int_sal[l] == j) directly as a boolean mask:

```python
mask = np.logical_and(int_temp[l] == i, int_sal[l] == j)
volume_sum[k, l, j, i] = (
    np.nansum(volume_mean[k, l, :, :, :].flatten()[mask])
)
```
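If you want to convince yourself that indexing with np.argwhere and indexing with the boolean mask select the same elements, here is a small sketch with made-up data:

```python
import numpy as np

# Made-up flat values and digitized index arrays, standing in for the real ones.
rng = np.random.default_rng(0)
values = rng.random(20)
int_temp = rng.integers(0, 3, 20)
int_sal = rng.integers(0, 3, 20)

mask = np.logical_and(int_temp == 1, int_sal == 2)
via_mask = np.nansum(values[mask])                # boolean-mask indexing
via_argwhere = np.nansum(values[np.argwhere(mask)])  # index-array indexing

print(np.isclose(via_mask, via_argwhere))  # True
```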

    To see if there is a difference in performance, I put the original code and the refactored version into functions:

```python
def compute_volume_sum(volume_mean, s_mean, t_mean, s_bins, t_bins):
    """Original implementation"""
    int_temp = []
    int_sal = []
    for i in range(5):
        int_temp.append(np.digitize(np.ndarray.flatten(t_mean[i,:,:,:]), t_bins))
        int_sal.append(np.digitize(np.ndarray.flatten(s_mean[i,:,:,:]), s_bins))
    volume_sum = np.zeros((2, 5, s_bins.size, t_bins.size))
    for k in range(2):
        for l in range(5):
            for i in range(t_bins.size):
                for j in range(s_bins.size):
                    volume_sum[k, l, j, i] = (np.nansum(
                        np.ndarray.flatten(volume_mean[k, l, :, :, :])[np.argwhere(
                            np.logical_and(int_temp[l] == i, int_sal[l] == j))]))
    return volume_sum


def compute_volume_sum_ref(volume_mean, s_mean, t_mean, s_bins, t_bins):
    """Refactored implementation"""
    int_temp = np.array([np.digitize(t_mean[i, :, :, :].flatten(), t_bins)
                         for i in range(5)])
    int_sal = np.array([np.digitize(s_mean[i, :, :, :].flatten(), s_bins)
                        for i in range(5)])
    volume_sum = np.zeros((2, 5, s_bins.size, t_bins.size))
    for k in range(2):
        for l in range(5):
            for i in range(t_bins.size):
                for j in range(s_bins.size):
                    mask = np.logical_and(int_temp[l] == i, int_sal[l] == j)
                    volume_sum[k, l, j, i] = (
                        np.nansum(volume_mean[k, l, :, :, :].flatten()[mask])
                    )
    return volume_sum
```

BTW: You can confirm that both functions return the same values using np.allclose.
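np.allclose compares two arrays elementwise within a small tolerance, which is the right tool here because floating-point summation order can differ slightly between implementations. A minimal illustration:

```python
import numpy as np

# Two arrays that differ only by a tiny floating-point perturbation.
a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9

print(np.allclose(a, b))  # True: differences are within the default tolerance
```

For the real check, you would call np.allclose on the results of both functions evaluated on the same inputs.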

    I then measured how long it takes to run each function ten times. The results were as follows:

```
original:   16.860085s
refactored: 11.849922s
```

    Not bad, but those loops still hurt.

Enter numba, a just-in-time Python compiler. It works nicely with numpy and can make loops quite a bit faster since they are compiled to native code. I recommend using a scientific Python distribution such as Anaconda if you want to try it.

    Unfortunately, numba does not support the full feature set of Python (e.g. list comprehensions are not supported), so I had to resort to a little bit of trickery to get it working:

```python
import numba


def compute_volume_sum_nb(volume_mean, s_mean, t_mean, s_bins, t_bins):
    """numba version"""
    int_temp = np.array([np.digitize(t_mean[i, :, :, :].flatten(), t_bins)
                         for i in range(5)])
    int_sal = np.array([np.digitize(s_mean[i, :, :, :].flatten(), s_bins)
                        for i in range(5)])
    return _numba_inner(volume_mean, s_bins, t_bins, int_temp, int_sal)


@numba.njit()
def _numba_inner(volume_mean, s_bins, t_bins, int_temp, int_sal):
    volume_sum = np.zeros((2, 5, s_bins.size, t_bins.size))
    for k in range(2):
        for l in range(5):
            for i in range(t_bins.size):
                for j in range(s_bins.size):
                    mask = np.logical_and(int_temp[l] == i, int_sal[l] == j)
                    volume_sum[k, l, j, i] = (
                        np.nansum(volume_mean[k, l, :, :, :].flatten()[mask])
                    )
    return volume_sum
```

numba.njit() is a decorator that marks a function body for JIT compilation. The first time compute_volume_sum_nb is called, numba triggers the compilation process, so be sure to exclude the first call from any timings you do.
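One way to set up such a measurement is with a small warm-up-aware helper (a sketch; `timed` is a name I made up, and the real benchmark would pass compute_volume_sum_nb and its arguments instead of the cheap stand-in used here):

```python
import time

def timed(fn, *args, repeats=10):
    """Call fn once to warm up (triggering any JIT compilation),
    then time `repeats` further calls."""
    fn(*args)  # warm-up call: excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return time.perf_counter() - start

# Example with a cheap stand-in function.
elapsed = timed(sum, range(1000))
print(elapsed >= 0.0)  # True
```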

Let's see what the timings look like now:

```
original:   16.860085s
refactored: 11.849922s
numba:       0.833529s
```

    That's quite a bit faster, isn't it?

    I'm almost sure that some of the loops could be replaced by clever indexing and use of vectorized functions in numpy, but I don't have the time to dive deeper into the problem.
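For completeness, here is one possible direction (a sketch under the assumption that NaNs should simply be ignored, as np.nansum does; `compute_volume_sum_vec` is my own name): combining the two digitized index arrays into a single flat bin index lets np.bincount, with the volumes as weights, replace the two innermost loops entirely.

```python
import numpy as np

def compute_volume_sum_vec(volume_mean, s_mean, t_mean, s_bins, t_bins):
    """Replace the two innermost loops with one weighted bincount per (k, l)."""
    n_s, n_t = s_bins.size, t_bins.size
    volume_sum = np.zeros((2, 5, n_s, n_t))
    for k in range(2):
        for l in range(5):
            int_temp = np.digitize(t_mean[l].ravel(), t_bins)
            int_sal = np.digitize(s_mean[l].ravel(), s_bins)
            # Combine the two bin indices into one flat index; digitize can
            # return indices up to bins.size, hence the (n_t + 1) stride.
            flat = int_sal * (n_t + 1) + int_temp
            # Zero out NaNs so the weighted sum matches np.nansum behaviour.
            weights = np.nan_to_num(volume_mean[k, l].ravel())
            counts = np.bincount(flat, weights=weights,
                                 minlength=(n_s + 1) * (n_t + 1))
            # Drop the out-of-range bins, as the original loop bounds do.
            volume_sum[k, l] = counts.reshape(n_s + 1, n_t + 1)[:n_s, :n_t]
    return volume_sum
```

The remaining k/l loops run only ten times in total, so the bincount calls dominate; this avoids building a full boolean mask for every (i, j) pair.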


A few non-performance related notes: be sure to have a look at the Style Guide for Python Code (aka PEP 8). The code sometimes already follows those guidelines, but sometimes doesn't, e.g. the uppercase letters in variable names.

Maybe also think about how you can modularize your code, e.g. by using functions. Those separated parts are also easier to document. You want to write """documentation""" to save future you some headache, don't you?

• "This is great. Just wanted to point out that numpy.searchsorted is faster than numpy.digitize for large datasets. See here for more information." – user182606, Mar 13, 2020 at 5:31
• "@allthemikeysaretaken: That was fixed in 2014, which is also mentioned at the end of the issue you linked." – Graipher, Mar 14, 2020 at 16:12
