
This is a computation for laser physics. The speed is quite slow, and most of the delay seems to come from the three nested for loops with indices j, k, and l (marked with a comment below). The computation time also appears to grow with the fourth power of nx, but I want to increase nx to ~100.

I also suspect that the 7-index NumPy array S drags the computation down a lot. It would be better if I could remove the three nested for loops, but that seems impossible. Are there any ways to improve the speed?

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import matplotlib as mpl
import math
from scipy import linalg, optimize
from scipy.integrate import quad
import scipy.integrate as integrate
import scipy.special as special
from matplotlib.colors import LogNorm
import time
from datetime import datetime
import copy

nz = 70
nx = 3
ny = nx
ntau = 70
gamsp = 1/(160.*10.**-12)
Rincrease = 10**0
R = Rincrease*400*10.**-9
nnu = 1.6*10.**25*Rincrease**-2
c = 3.*10.**8
n = nnu*math.pi*R**2
lambda0 = 12.4/0.88*10**-10
delo = 4.*10.**-6
tau_max = 2./gamsp
z_max = 3.*10**-4
L = z_max
dtau = tau_max/ntau
dx = 2*R/nx
dy = dx
dz = z_max/nz
Omega = 2*math.pi*c/lambda0
alpha = np.arctan(R/L)

S = np.empty((nx,ny,nz,nx,ny,nz,ntau), dtype='complex_')
S[:,:,:,:,:,:,0] = 0
rho = np.empty((nx,ny,nz,ntau), dtype='complex_')
for i in range(nx):
    for j in range(ny):
        rho[i,j,:,0] = 1 if math.sqrt((i-(nx-1)/2)**2+(j-(ny-1)/2)**2) < nx/2 else 0

Intensitytrans = np.zeros((nx,ny,nz,ntau), dtype='complex_')
Intensity = np.zeros((nz,ntau), dtype='complex_')
Intensityn = np.zeros((nz,ntau), dtype='complex_')
photonn = np.zeros((nz), dtype='complex_')
g = np.zeros((2*(nx-1)+1, 2*(ny-1)+1, 2*(nz-1)+1), dtype='complex_')
g2 = np.zeros((2*(nx-1)+1, 2*(ny-1)+1, 2*(nz-1)+1), dtype='complex_')

for i in range(2*(nx-1)+1):  # store numerical values in g
    if i > nx-1:
        g[i] = copy.copy(np.conjugate(g[2*nx-2-i]))
    else:
        for j in range(2*(ny-1)+1):
            if j > ny-1:
                g[i,j] = copy.copy(np.conjugate(g[i,2*ny-2-j]))
            else:
                for k in range(2*(nz-1)+1):
                    if k > nz-1:
                        g[i,j,k] = copy.copy(np.conjugate(g[i,j,2*nz-2-k]))
                    else:
                        xvalue = copy.copy(-2*R+4*R*i/(2*(nx-1)))
                        yvalue = copy.copy(-2*R+4*R*j/(2*(nx-1)))
                        radius = copy.copy(math.sqrt(xvalue**2+yvalue**2))
                        zvalue = copy.copy(-L+2*L*k/(2*(nz-1)))
                        Funcre = "2*math.pi*np.sin(x)*(np.cos(x))**2*np.cos(Omega/c*({})*(np.cos(x)-1))*special.jv(0,Omega/c*{}*np.sin(x))".format(zvalue, radius)
                        Funcim = "2*math.pi*np.sin(x)*(np.cos(x))**2*np.sin(Omega/c*({})*(np.cos(x)-1))*special.jv(0,Omega/c*{}*np.sin(x))".format(zvalue, radius)
                        def fre(x):
                            fre = eval(Funcre)
                            return fre
                        def fim(x):
                            fim = eval(Funcim)
                            return fim
                        g[i,j,k] = quad(fre, 0, alpha)[0] + 1j*quad(fim, 0, alpha)[0]

for i in range(2*(nz-1)+1):
    g2[:,:,i] = copy.copy(g[:,:,i])*(0 if i > nz else 1)

for i in range(ntau):
    i1 = np.zeros((nx,ny,nz), dtype='complex_')
    i2 = np.zeros((nx,ny,nz), dtype='complex_')
    if i < ntau-1:
        snoise1 = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
        snoise2 = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
        snoise = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
        s1 = np.zeros((nx,ny,nz), dtype='complex_')
        s2 = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
        s3 = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
    for j in range(nx):
        for k in range(ny):
            for l in range(nz):  # THESE LOOPS MAY SLOW THE COMPUTATION
                gc = copy.copy(g[(nx-1)-j:2*(nx-1)+1-j, (ny-1)-k:2*(ny-1)+1-k, (nz-1)-l:2*(nz-1)+1-l])
                g2c = copy.copy(g2[(nx-1)-j:2*(nx-1)+1-j, (ny-1)-k:2*(ny-1)+1-k, (nz-1)-l:2*(nz-1)+1-l])
                i1[j,k,l] = np.sum(S[:,:,:,:,:,:,i]*np.tensordot(np.conjugate(g2c), g2c, axes=0))*(dx*dy*dz)**2
                i2[j,k,l] = np.sum((1+rho[:,:,:,i])/2*np.absolute(g2c)**2)*dx*dy*dz
                if i < ntau-1:
                    s1 += S[:,:,:,j,k,l,i]*np.conjugate(g2c)*dx*dy*dz
                    s2 += np.tensordot(np.conjugate(g2c)*rho[:,:,:,i], S[j,k,l,:,:,:,i], axes=0)*dx*dy*dz
                    s3 += np.tensordot(S[:,:,:,j,k,l,i], g2c*rho[:,:,:,i], axes=0)*dx*dy*dz
                    snoise1[:,:,:,j,k,l] = rho[:,:,:,i]*(1+rho[j,k,l,i])/2*np.conjugate(gc)
                    snoise2[j,k,l,:,:,:] = rho[j,k,l,i]*(1+rho[:,:,:,i])/2*np.conjugate(gc)
    Intensitytrans[:,:,:,i] = 3/(32*math.pi*lambda0**2*delo)*gamsp*(nnu**2*i1+nnu*i2)
    if i < ntau-1:
        for j in range(nz):
            for k in range(nz):
                snoise[:,:,j,:,:,k] = snoise1[:,:,j,:,:,k] if j > k else snoise2[:,:,j,:,:,k]
        rho[:,:,:,i+1] = rho[:,:,:,i] + gamsp*(-2*(1+rho[:,:,:,i])/2 - 3*nnu/(8*math.pi)*2*np.real(s1))*dtau
        S[:,:,:,:,:,:,i+1] = S[:,:,:,:,:,:,i] + gamsp*(-S[:,:,:,:,:,:,i] + 3/(16*math.pi)*(nnu*(s2+s3)+snoise))*dtau

for i in range(ntau):
    for jz in range(nz):
        Intensity[jz,i] = np.sum(Intensitytrans[:,:,jz,i])/(nx*ny)

peak = np.empty((nz,2))
for i in range(nz):
    photonn[i] = np.sum(Intensity[i,:])/ntau*tau_max*delo*math.pi*R**2
    peak[i,0] = np.argmax(Intensity[i,:])/ntau
    peak[i,1] = i/nz
    if max(Intensity[i,:]) != 0:
        Intensityn[i,:] = Intensity[i,:]/max(Intensity[i,:])
  • "Why are your fre and fim functions redefined in a nested loop? As an eval, no less. Looping that much and that deep while you're working with numpy is probably a red flag by itself, but I'm quite confused what made you go this particular route for fre and fim. Have you profiled any of this?" – Mast, Feb 8, 2022 at 15:25
  • "I intended to store numeric values for the array of g first, then use the value in the main loop where the comment is." – user16308, Feb 8, 2022 at 15:32
  • "I don't think the loops containing eval increase the computation time that much." – user16308, Feb 8, 2022 at 15:37
  • "You shouldn't THINK. You should TEST." – Feb 8, 2022 at 15:47
  • "The only output I need is 'Intensityn', 'peak', and 'photonn'. Any other intermediate values can be discarded!" – user16308, Feb 8, 2022 at 18:01

1 Answer


The speed is quite slow, and most of the delay seems to come from the three nested for loops with indices j, k, and l

This is a good start to finding out which part of the code is the bottleneck. Guessing which part is slow is useful for building up an intuition, but let's test the code and see where the slowdown actually is.

We can measure the performance of the code in a few ways. Since I have no understanding of the code, I would begin by throwing a general profiler at it. Python comes with a profiler called cProfile. Another profiler I find useful is pyinstrument.
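(If you try pyinstrument, it can be pointed straight at the script, e.g. pyinstrument cr_laser_physics.py, assuming it is installed via pip, and it prints an indented call tree when the program exits. The profiles below all come from cProfile and line_profiler.)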

Running the code through cProfile with python -m cProfile -s time cr_laser_physics.py and killing it after a minute gave me a profile. The top of it looks like this:

801051 function calls (786162 primitive calls) in 62.441 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   45.036   45.036   62.804   62.804 cr_laser_physics.py:1(<module>)
13081/8182   13.425    0.001   15.052    0.002 {built-in method numpy.core._multiarray_umath.implement_array_function}
    26476    1.239    0.000    1.612    0.000 {built-in method builtins.eval}
    12644    1.195    0.000    1.195    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     4719    0.226    0.000   13.868    0.003 numeric.py:943(tensordot)
  104/102    0.201    0.002    0.203    0.002 {built-in method _imp.create_dynamic}
      448    0.073    0.000    0.073    0.000 {method 'read' of '_io.FileIO' objects}
      448    0.072    0.000    0.072    0.000 {built-in method marshal.loads}
    14217    0.057    0.000    0.057    0.000 {method 'reshape' of 'numpy.ndarray' objects}
        1    0.034    0.034    0.034    0.034 {built-in method mkl._py_mkl_service.get_version}
      ...

Unfortunately, this doesn't look all that helpful. The majority of the time (45 seconds) is attributed to the module as a whole, and the next biggest entry is a numpy internal that we aren't calling directly, so we haven't learned much. We can definitely say that the code overall is slow.

A Google search led me to this question. The problem they are having looks similar to ours: cProfile isn't being specific about where the performance problem is. The suggested answer is to split the code into more functions, so that cProfile can say which ones are slow.

So this looks like a good time to go through the code, and make whatever improvements we can. Some nice things to do would be

  1. Remove unused imports. They add a small bit of noise to the profiler. Pylance suggests that most of the imports are not needed; you can always add them back later. The special import is falsely marked as unused, because Pylance cannot tell that the strings of code passed to eval will use it. That is a good indicator that the code is doing something in a very strange way. Can this be simplified? (A sketch follows this list.)
  2. Put things into functions. Every function call has some overhead, but we need the structure so that cProfile can report per-function timings.
  3. Split long lines into smaller chunks. This will make it easier to see what parts of the code depend on what.
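On point 1, here is a minimal sketch of what removing the eval strings could look like. Since zvalue and radius are plain floats, ordinary closures can capture them directly, with no string formatting or eval involved (the make_integrands name is mine, and this is untested, but it follows the original formulas):

def make_integrands(zvalue, radius):
    # Closures capture zvalue and radius; no string templates or eval needed.
    def fre(x):
        return (2*math.pi*np.sin(x)*np.cos(x)**2
                * np.cos(Omega/c*zvalue*(np.cos(x)-1))
                * special.jv(0, Omega/c*radius*np.sin(x)))
    def fim(x):
        return (2*math.pi*np.sin(x)*np.cos(x)**2
                * np.sin(Omega/c*zvalue*(np.cos(x)-1))
                * special.jv(0, Omega/c*radius*np.sin(x)))
    return fre, fim

# In the g-filling loop:
fre, fim = make_integrands(zvalue, radius)
g[i, j, k] = quad(fre, 0, alpha)[0] + 1j*quad(fim, 0, alpha)[0]

This also lets Pylance (and human readers) see that special really is used.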

I won't implement all of these suggestions, but I'll try and outline what the file contents will look like.

# Imports
import copy
import math

import numpy as np
import scipy.special as special
from scipy.integrate import quad

# Global constants. (The usual convention is to make these all uppercase.)
nz = 70
ny = nx = 3
ntau = 70
...

def main():
    S = np.empty((nx,ny,nz,nx,ny,nz,ntau), dtype='complex_')
    S[:,:,:,:,:,:,0] = 0
    rho = np.empty((nx,ny,nz,ntau), dtype='complex_')
    ...

if __name__ == "__main__":
    main()

From this point, I would pick at the contents of the main method until it is split into as many functions as make sense. The hard part here is tracking which variables depend on which other variables.

Doing so, I end up with a main method that looks something like

def main():
    rho = make_rho()
    g = make_g()
    g2 = make_g2(g)
    Intensitytrans = big_loop(g, g2, rho)
    Intensity = make_intensity(Intensitytrans)
    Intensityn, photonn = smaller_loop(Intensity)

Here I've kept the variable names the same, but moved as much code out into functions as I could. Each function follows a similar pattern: create variables -> loop to fill them with values -> return the variables referenced later. As an example, here is make_g2.

def make_g2(g):
    g2 = np.zeros((2*(nx-1)+1, 2*(ny-1)+1, 2*(nz-1)+1), dtype='complex_')
    for i in range(2*(nz-1)+1):
        g2[:,:,i] = copy.copy(g[:,:,i])*(0 if i > nz else 1)
    return g2
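As an aside (an observation, not something I have profiled): the copy.copy here is redundant, because the multiplication already allocates a new array, and the loop is only zeroing the z-slices with index greater than nz. Assuming that reading of the code is right, the whole function reduces to:

def make_g2(g):
    g2 = g.copy()
    g2[:, :, nz+1:] = 0  # zero every z-slice with index > nz
    return g2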

Running the profiler on the code now gives the following (cut to show just the five most expensive functions):

516776 function calls (504160 primitive calls) in 60.096 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   44.271   44.271   59.010   59.010 cr_laser_physics.py:75(big_loop)
19107/11944   12.793    0.001   14.640    0.001 {built-in method numpy.core._multiarray_umath.implement_array_function}
    19102    1.385    0.000    1.385    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    26460    0.803    0.000    1.060    0.000 {built-in method builtins.eval}
     7163    0.236    0.000   13.265    0.002 numeric.py:943(tensordot)

This tells us two pieces of information. You were right: most of the time (45 of the 60 seconds the code ran for) is spent in the heavily nested loops. And the other functions aren't noticeably slow compared to those loops, so we don't need to worry about them for now.

The slow function looks like this

@profile
def big_loop(g, g2, rho):
    # This has a lot going on, can it be simplified?
    S = np.empty((nx,ny,nz,nx,ny,nz,ntau), dtype='complex_')
    S[:,:,:,:,:,:,0] = 0
    Intensitytrans = np.zeros((nx,ny,nz,ntau), dtype='complex_')
    for i in range(ntau):
        i1 = np.zeros((nx,ny,nz), dtype='complex_')
        i2 = np.zeros((nx,ny,nz), dtype='complex_')
        if i < ntau-1:
            snoise1 = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
            snoise2 = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
            snoise = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
            s1 = np.zeros((nx,ny,nz), dtype='complex_')
            s2 = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
            s3 = np.zeros((nx,ny,nz,nx,ny,nz), dtype='complex_')
        for j in range(nx):
            for k in range(ny):
                for l in range(nz):  # THESE LOOPS MAY SLOW THE COMPUTATION
                    gc = copy.copy(g[(nx-1)-j:2*(nx-1)+1-j, (ny-1)-k:2*(ny-1)+1-k, (nz-1)-l:2*(nz-1)+1-l])
                    g2c = copy.copy(g2[(nx-1)-j:2*(nx-1)+1-j, (ny-1)-k:2*(ny-1)+1-k, (nz-1)-l:2*(nz-1)+1-l])
                    i1[j,k,l] = np.sum(S[:,:,:,:,:,:,i]*np.tensordot(np.conjugate(g2c), g2c, axes=0))*(dx*dy*dz)**2
                    i2[j,k,l] = np.sum((1+rho[:,:,:,i])/2*np.absolute(g2c)**2)*dx*dy*dz
                    if i < ntau-1:
                        s1 += S[:,:,:,j,k,l,i]*np.conjugate(g2c)*dx*dy*dz
                        s2 += np.tensordot(np.conjugate(g2c)*rho[:,:,:,i], S[j,k,l,:,:,:,i], axes=0)*dx*dy*dz
                        s3 += np.tensordot(S[:,:,:,j,k,l,i], g2c*rho[:,:,:,i], axes=0)*dx*dy*dz
                        snoise1[:,:,:,j,k,l] = rho[:,:,:,i]*(1+rho[j,k,l,i])/2*np.conjugate(gc)
                        snoise2[j,k,l,:,:,:] = rho[j,k,l,i]*(1+rho[:,:,:,i])/2*np.conjugate(gc)
        Intensitytrans[:,:,:,i] = 3/(32*math.pi*lambda0**2*delo)*gamsp*(nnu**2*i1+nnu*i2)
        if i < ntau-1:
            for j in range(nz):
                for k in range(nz):
                    snoise[:,:,j,:,:,k] = snoise1[:,:,j,:,:,k] if j > k else snoise2[:,:,j,:,:,k]
            rho[:,:,:,i+1] = rho[:,:,:,i] + gamsp*(-2*(1+rho[:,:,:,i])/2 - 3*nnu/(8*math.pi)*2*np.real(s1))*dtau
            S[:,:,:,:,:,:,i+1] = S[:,:,:,:,:,:,i] + gamsp*(-S[:,:,:,:,:,:,i] + 3/(16*math.pi)*(nnu*(s2+s3)+snoise))*dtau
    return Intensitytrans

Since splitting this function up doesn't look easy, the next thing to do is use a different profiler. A line-by-line profiler should be more helpful here.

I installed line_profiler through pip, added @profile to specify the function to profile, and ran it using kernprof -l cr_laser_physics.py, killing it after 60 seconds. I then viewed the profile using python -m line_profiler cr_laser_physics.py.lprof. Keeping only the lines which contribute more than 5% of the time shows three lines are really slow.

Timer unit: 1e-06 s

Total time: 60.4413 s
File: cr_laser_physics.py
Function: big_loop at line 74

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    74                                           @profile
    75                                           def big_loop(g, g2, rho):
    ...
    97      2433  22926489.0   9423.1     37.9          i1[j,k,l]=np.sum(S[:,:,:,:,:,:,i]*np.tensordot(np.conjugate(g2c),g2c,axes=0))*(dx*dy*dz)**2
   101      2433  18182798.0   7473.4     30.1          s2+=np.tensordot(np.conjugate(g2c)*rho[:,:,:,i],S[j,k,l,:,:,:,i],axes = 0)*dx*dy*dz
   102      2433  18384856.0   7556.5     30.4          s3+=np.tensordot(S[:,:,:,j,k,l,i],g2c*rho[:,:,:,i],axes = 0)*dx*dy*dz

Each of those lines is 'hot': it takes a long time per hit and is run many, many times. I'm not too confident about what can be changed, so I'll leave some suggestions below.

i1[j,k,l] = np.sum(S[:,:,:,:,:,:,i]*np.tensordot(np.conjugate(g2c),g2c,axes=0))*(dx*dy*dz)**2 

Both np.sum(S[:,:,:,:,:,:,i]) and (dx*dy*dz)**2 will always give the same result, even when j, k, and l change. They can be computed once, before the triple-nested loop. I suspect that computing the sum so often is very costly.

Two of the slow lines call np.conjugate(g2c). I would recommend computing it once per iteration and storing it in a variable.
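Here is a minimal sketch of hoisting what safely can be hoisted out of the j/k/l loops: the slice of S, the volume factor, and the conjugate. The names S_i, vol, and g2c_conj are mine, and this is untested:

# Excerpt from big_loop, with loop-invariant work hoisted.
S_i = S[:, :, :, :, :, :, i]  # a view; identical for every j, k, l
vol = dx * dy * dz            # constant volume element
for j in range(nx):
    for k in range(ny):
        for l in range(nz):
            # Read-only use, so the copy.copy from the original is not needed.
            g2c = g2[(nx-1)-j:2*(nx-1)+1-j, (ny-1)-k:2*(ny-1)+1-k, (nz-1)-l:2*(nz-1)+1-l]
            g2c_conj = np.conjugate(g2c)  # computed once per (j, k, l), reused below
            i1[j, k, l] = np.sum(S_i * np.tensordot(g2c_conj, g2c, axes=0)) * vol**2
            if i < ntau - 1:
                s1 += S[:, :, :, j, k, l, i] * g2c_conj * vol
                s2 += np.tensordot(g2c_conj * rho[:, :, :, i], S[j, k, l, :, :, :, i], axes=0) * vol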

The latter two slow lines both add a tensor product to a running sum. Could you specify more axes to np.tensordot in order to remove some of the loop nesting? In general, looping and summing in pure Python is slower than an equivalent call to a single numpy function.
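To illustrate that principle with a toy example (not a drop-in replacement here, because g2c is a different slice of g2 on every iteration), an accumulate-in-a-loop pattern like the one feeding s2 and s3 can often collapse into a single np.einsum call:

import numpy as np

nx, ny, nz = 3, 3, 8
a = np.random.rand(nx, ny, nz)              # stands in for a per-(j, k, l) weight
S = np.random.rand(nx, ny, nz, nx, ny, nz)  # stands in for S[..., i]

# Loop-and-accumulate, as in the original code:
out = np.zeros((nx, ny, nz))
for j in range(nx):
    for k in range(ny):
        for l in range(nz):
            out += a[j, k, l] * S[j, k, l, :, :, :]

# Equivalent single call; the summation over j, k, l happens in compiled code:
out2 = np.einsum('jkl,jklxyz->xyz', a, S)
assert np.allclose(out, out2)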


  • "Thank you for the detailed reply. np.sum(S[:,:,:,:,:,:,i]) is invariant under change of j, k, and l, but g2c is not. I changed the code to store the value of np.conjugate(g2c). For now, I don't think the axes to np.tensordot can be specified further. After the revision, the speed increased by a factor of ~2. I'm trying to increase it further." – user16308, Feb 8, 2022 at 23:18
  • "@user16308 that is good to hear. I failed to notice that the tensor product is inside np.sum, apologies. Since you can run the code to completion, maybe a follow-up question with a profile from the machine you run it on would be good. The profiles I've produced might not be representative of the code running on your machine (but hopefully they are decently close)." – spyr03, Feb 9, 2022 at 10:25
