
I tried to model the voxels of a 3D cylinder with the following code:

    import math
    import numpy as np

    R0 = 500
    hz = 1
    x = np.arange(-1000, 1000, 1)
    y = np.arange(-1000, 1000, 1)
    z = np.arange(-10, 10, 1)
    xx, yy, zz = np.meshgrid(x, y, z)

    def density_f(x, y, z):
        r_xy = math.sqrt(x ** 2 + y ** 2)
        if r_xy <= R0 and -hz <= z <= hz:
            return 1
        else:
            return 0

    density = np.vectorize(density_f)(xx, yy, zz)

and it took many minutes to compute.

Equivalent, similarly suboptimal Java code runs in 10-15 seconds.

How can I make Python compute the voxels at a comparable speed? Where should I optimize?

  • Your bio lists MATLAB. Originally that was a wrapper for Fortran matrix code; to get good speed you had to think in terms of whole arrays. New MATLAB has a JIT compiler that lets you 'cheat' and write iterative code. numpy is more like the older MATLAB. To get around that you have to use added packages like cython and numba that create custom compiled code. I assume you know enough computer science to know the difference between C, Java, and Python.
    – hpaulj
    Commented Oct 27, 2018 at 0:30
  • The question of how do a fast vectorization comes up often. Here's a simpler recent example: stackoverflow.com/questions/53016316/…
    – hpaulj
    Commented Oct 27, 2018 at 4:23
  • Slightly unrelated to your question, but compare the square of the xy distance with the precomputed square of the radius - you can save a sqrt in each iteration, which should make a big difference.
    Commented Oct 29, 2018 at 12:00

3 Answers


Please do not use np.vectorize(...): it is not efficient, since it still does the processing at the Python level. np.vectorize should only be used as a last resort, for example when the function cannot be calculated in "bulk" because its "structure" is too complex.

But you do not need np.vectorize here; you can implement the function to work over whole arrays:

    r_xy = np.sqrt(xx ** 2 + yy ** 2)
    density = (r_xy <= R0) & (-hz <= zz) & (zz <= hz)

or even a bit faster:

    r_xy = xx * xx + yy * yy
    density = (r_xy <= R0 * R0) & (-hz <= zz) & (zz <= hz)

This will construct a 2000×2000×20 array of booleans. We can use:

    intdens = density.astype(int)

to construct an array of ints.

Printing the array here is quite cumbersome, but it contains a total of 2,356,047 ones:

    >>> density.astype(int).sum()
    2356047

Benchmarks: If I run this locally 10 times, I get:

    >>> timeit(f, number=10)
    18.040479518999973
    >>> timeit(f2, number=10)  # f2 is the optimized variant
    13.287886952000008

So on average, we calculate this matrix (including casting it to ints) in 1.3-1.8 seconds.
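The wrappers f and f2 are not defined in the answer; here is a plausible reconstruction (the names, the exact structure, and the use of number=1 instead of number=10 to keep the run short are my assumptions):

```python
from timeit import timeit

import numpy as np

R0 = 500
hz = 1
x = np.arange(-1000, 1000, 1)
y = np.arange(-1000, 1000, 1)
z = np.arange(-10, 10, 1)
xx, yy, zz = np.meshgrid(x, y, z)  # three (2000, 2000, 20) arrays: memory-heavy

def f():
    # straightforward variant: take the square root, then cast to int
    r_xy = np.sqrt(xx ** 2 + yy ** 2)
    return ((r_xy <= R0) & (-hz <= zz) & (zz <= hz)).astype(int)

def f2():
    # optimized variant: compare squared distances, avoiding the sqrt
    r_xy = xx * xx + yy * yy
    return ((r_xy <= R0 * R0) & (-hz <= zz) & (zz <= hz)).astype(int)

print(timeit(f, number=1))   # the answer used number=10
print(timeit(f2, number=1))
```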

  • What about more complex cases with ifs? Can I apply functions elementwise efficiently?
    – Dims
    Commented Oct 26, 2018 at 21:53
  • P.S. Why are you using single ampersands?
    – Dims
    Commented Oct 26, 2018 at 21:54
  • @Dims: because `and` in Python evaluates truthiness, and an array has no truthiness. & is the elementwise "and" of two matrices. It can require some "creativity" to write things in this form. This is something one "masters" over time with experience.
    Commented Oct 26, 2018 at 21:55
  • Thanks, but please explain what to do with more complex functions, containing ifs, exceptions, etc., which I can't write as a single vectorized expression?
    – Dims
    Commented Oct 26, 2018 at 22:01
  • @Dims: there are no ifs, etc.; you need to encode this in a logical way. The same goes for exceptions. It thus requires some "mathematical engineering".
    Commented Oct 26, 2018 at 22:04
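As a concrete illustration of that comment (my example, not from the thread): a scalar function with an if/else can usually be rewritten with np.where, which evaluates both branches for the whole array and then selects elementwise (np.select handles more than two cases):

```python
import numpy as np

# scalar version, usable only on single numbers:
def g_scalar(v):
    if v >= 0:
        return v
    return -2 * v

# array version: both branches are computed for every element,
# then np.where picks the right one per element
def g_array(v):
    return np.where(v >= 0, v, -2 * v)

x = np.arange(-5, 6)
print(g_array(x))  # elementwise: 10, 8, 6, 4, 2, 0, 1, 2, 3, 4, 5
```

Note that because both branches are evaluated everywhere, branches that would raise (e.g. a sqrt of a negative number) need to be masked or clipped first.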

You can also use a compiled version of the function to calculate density, with either cython or numba. I use numba to JIT-compile the density calculation function in this answer, as it is as easy as adding a decorator.

Pros :

  • You can write if conditions, as mentioned in your comments
  • Slightly faster than the numpy version in the answer by @Willem Van Onsem, since that version has to iterate over the boolean array again to compute the sum in density.astype(int).sum()

Cons:

  • You have to write an ugly three-level loop, losing the beauty of the one-liner numpy solution.

Code:

    import numba as nb

    @nb.jit(nopython=True, cache=True)
    def calc_density(xx, yy, zz, R0, hz):
        threshold = R0 * R0
        dimensions = xx.shape
        density = 0
        for i in range(dimensions[0]):
            for j in range(dimensions[1]):
                for k in range(dimensions[2]):
                    r_xy = xx[i][j][k] ** 2 + yy[i][j][k] ** 2
                    if r_xy <= threshold and -hz <= zz[i][j][k] <= hz:
                        density += 1
        return density

Running times:

Willem Van Onsem's solution, f2 variant: 1.28 s without the sum, 2.01 s with the sum.

Numba solution (calc_density, on the second run, to discount the compile time): 0.48 s.

As suggested in the comments, we need not calculate the meshgrid at all; we can pass x, y, z to the function directly. Thus:

    @nb.jit(nopython=True, cache=True)
    def calc_density2(x, y, z, R0, hz):
        threshold = R0 * R0
        dimensions = len(x), len(y), len(z)
        density = 0
        for i in range(dimensions[0]):
            for j in range(dimensions[1]):
                for k in range(dimensions[2]):
                    r_xy = x[i] ** 2 + y[j] ** 2
                    if r_xy <= threshold and -hz <= z[k] <= hz:
                        density += 1
        return density

Now, for a fair comparison, we also include the time of np.meshgrid in @Willem Van Onsem's answer. Running times:

Willem Van Onsem's solution, f2 variant (np.meshgrid time included): 2.24 s.

Numba solution (calc_density2, on the second run, to discount the compile time): 0.079 s.

  • 1) Don't use global variables! Within Numba, every "global" variable is a compile-time constant. If you change a global variable and run your function again, Numba won't recognize the change, which leads to hard-to-debug errors. 2) A small change (passing x, y, z directly) leads to a factor-of-10 better performance. 3) Parallelization would give another factor of the number of cores.
    – max9111
    Commented Oct 29, 2018 at 13:58
  • @max9111 can you please explain or give an example of (2) and (3)?
    – Dims
    Commented Oct 29, 2018 at 14:16
  • @max9111, thanks. 1) Yes, indeed. Slipped my mind. Edited to incorporate 2). Can't seem to get 3) to help. I guess the bookkeeping required for parallelization dwarfs the gains at this specific size.
    Commented Oct 29, 2018 at 14:42
  • @Deepak Saini I have posted quite a duplicate answer, apart from the parallelization. I think it would be better to add the parallelized approach to your answer, and I'll delete mine afterwards... BTW: cache=True does not work if parallel is set to True.
    – max9111
    Commented Oct 29, 2018 at 15:38
  • @max9111, sure. Will do.
    Commented Oct 29, 2018 at 15:46

This is meant as a lengthy comment on the answer of Deepak Saini.

The main change is to avoid the coordinate arrays generated by np.meshgrid, which contain unnecessary repetitions. Avoid them when you can, in terms of both memory usage and performance.

Code

    import numba as nb
    import numpy as np

    @nb.jit(nopython=True, parallel=True)
    def calc_density_2(x, y, z, R0, hz):
        threshold = R0 * R0
        density = 0
        for i in nb.prange(y.shape[0]):
            for j in range(x.shape[0]):
                r_xy = x[j] ** 2 + y[i] ** 2
                for k in range(z.shape[0]):
                    if r_xy <= threshold and -hz <= z[k] <= hz:
                        density += 1
        return density

Timings

    R0 = 500
    hz = 1
    x = np.arange(-1000, 1000, 1)
    y = np.arange(-1000, 1000, 1)
    z = np.arange(-10, 10, 1)
    xx, yy, zz = np.meshgrid(x, y, z)

    # after the first call (compilation overhead):
    # calc_density_2            9.7 ms
    # calc_density_2 parallel   3.9 ms
    # @Deepak Saini             115 ms
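A middle ground between the two styles (my addition, not from the answers): np.meshgrid accepts sparse=True, which returns broadcastable arrays holding a single copy of each axis instead of the fully repeated grids, so the numpy one-liner from the first answer works without the memory overhead of dense grids:

```python
import numpy as np

R0 = 500
hz = 1
x = np.arange(-1000, 1000, 1)
y = np.arange(-1000, 1000, 1)
z = np.arange(-10, 10, 1)

# sparse=True keeps one copy of each axis; broadcasting
# expands them only logically during the comparison
xs, ys, zs = np.meshgrid(x, y, z, sparse=True)
print(xs.shape, ys.shape, zs.shape)  # (1, 2000, 1) (2000, 1, 1) (1, 1, 20)

# same boolean result as with the dense grids
density = (xs * xs + ys * ys <= R0 * R0) & (-hz <= zs) & (zs <= hz)
print(density.sum())
```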
