
Given a set of discrete locations (e.g. "sites") that are pairwise related in some categorical way (e.g. general proximity) and that each carry local-level data (e.g. population size), I wish to efficiently compute the mean correlation coefficient between the local-level data of all location pairs that share the same relationship category.

For example, I assumed 100 sites and randomized their pairwise relations using values 1 to 25, yielding the upper-triangular relation matrix relations:

import numpy as np

sites = 100
categ = 25
relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations)   # keep only the upper triangle, i.e. one entry per unordered pair
np.fill_diagonal(relations, 0)   # ignore self-relation

I also have 5000 replicates of simulation results on each site:

sims = 5000
res = np.round(np.random.rand(sites, sims), 1)

To compute the mean pairwise correlation for each relation category, I first calculated, for each category i, the correlation coefficient rho[j] between the simulation results res of each unique site pair j, and then took the average across all pairs with relation i:

rho_list = np.ones(categ) * 99
for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i))  # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:,0]].ravel(), res[idr[:,1]].ravel()])  # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1))  # reshape above into pairwise comparisons of simulation results between unique site pairs
    rho = np.ones(len(idr)) * 99  # correlation coefficients of all unique site pairs of current relation category
    for j in range(len(idr)):  # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq != 0, axis=2)[j]].T  # shorten comparisons by removing pairs with a zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0, 1]
    rho_list[i-1] = np.nanmean(rho)

Although this script works, once I increase sites to 400 the entire computation takes more than 6 hours to finish, which leads me to question my use of array functions. What is the reason for this poor performance, and how can I optimize the algorithm?


    1 Answer


    What is the reason for this poor performance? And how can I optimize the algorithm?

    You mentioned that sites=100 is a "fast" run and sites=400 is a "slow" six-hour run, without posting profiling data. I don't know what "fast" is, but if it's around 22 minutes then you're just looking at quadratic growth of your data: going from 100 to 400 sites multiplies the number of site pairs by roughly 16, and 6 hours / 16 ≈ 22 minutes. I'm going to guess "fast" was less than 22 minutes, and then I'll speculate about your cache size: perhaps you blew out your L3 (or L2) last-level cache. That would suggest looking for a divide-and-conquer approach on overlapping neighborhoods, so each sub-problem is small enough to fit within cache.
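
    As a minimal sketch of keeping the working set small (the helper name chunked_pair_corr, the chunk size, and the way the zero-filtering is folded in are all my assumptions, not your exact algorithm), the per-pair correlations can be computed one fixed-size block of pairs at a time rather than one pair at a time:

        import numpy as np

        def chunked_pair_corr(res, pairs, chunk=256):
            """Pearson correlation for each site pair, processed in cache-sized blocks.

            res   : (sites, sims) array of simulation results
            pairs : (n_pairs, 2) array of site indices
            chunk : pairs per block; pick it so the block arrays fit in last-level cache
            """
            out = np.empty(len(pairs))
            for start in range(0, len(pairs), chunk):
                p = pairs[start:start + chunk]
                x = res[p[:, 0]]                    # (block, sims)
                y = res[p[:, 1]]
                mask = (x != 0) & (y != 0)          # drop sims where either site is zero
                n = mask.sum(axis=1)                # valid sims per pair (nan below if 0)
                mx = np.where(mask, x, 0.0).sum(axis=1) / n
                my = np.where(mask, y, 0.0).sum(axis=1) / n
                dx = np.where(mask, x - mx[:, None], 0.0)
                dy = np.where(mask, y - my[:, None], 0.0)
                cov = (dx * dy).sum(axis=1)
                out[start:start + chunk] = cov / np.sqrt((dx ** 2).sum(axis=1) *
                                                         (dy ** 2).sum(axis=1))
            return out

    With something like that in place, the per-category mean would be np.nanmean(chunked_pair_corr(res, np.transpose(np.where(relations == i)))), and each block only touches a few megabytes at a time.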

    Other than that, perhaps you can "cheat" by changing the problem and approximating the results. Consider a fixed budget of 10^4 pairs for which you're willing to compute correlation coefficients. In the sites=400 case, could you rank-order the pairs so the ones with negligible contribution sort to the bottom of the list, outside the budget? Then they would simply never be considered.
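
    How to rank the pairs is problem-specific, so as a sketch of the fixed-budget idea I'll just cap each relation category by uniform random subsampling (the helper name capped_pairs, the per-category split of the budget, and sampling instead of ranking are my assumptions):

        import numpy as np

        def capped_pairs(relations, category, budget=10_000, rng=None):
            """Return at most `budget` site-index pairs for one relation category."""
            rng = np.random.default_rng() if rng is None else rng
            idr = np.transpose(np.where(relations == category))  # all pairs in this category
            if len(idr) > budget:
                keep = rng.choice(len(idr), size=budget, replace=False)
                idr = idr[keep]                                   # random subset within budget
            return idr

    With sites=400 there are roughly 400·399/2 ≈ 80,000 unique pairs in total, so a 10^4 budget skips the bulk of them; whether the budget is global or per category, and which pairs deserve to stay inside it, are choices you'd make based on which pairs actually contribute.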

