
Here is my code for generating an edge list based on the Price model algorithm. I think there are two points where execution time can be reduced:

  1. use a proper data type
  2. use faster random sampling

Originally I used a list instead of a NumPy array, and then switched from growing the array up to size c*n to allocating a fixed-size array up front and overwriting its initial values (a quick sketch of why this helps follows the code below).

This made execution faster, but it is still much slower than the same implementation written in Matlab.

    import pickle
    from random import random, sample

    import numpy as np


    def gen_edge_list(a, c, n):
        """Generate an edge list based on 14.1.1."""
        edge_list = np.zeros(c*n)
        p = float(c) / (float(c)+a)  # probability of a preferential attachment edge
        edge_list[0] = 1  # edge from vertex 2 to 1
        idx = 1  # points to the next free index
        for t in range(3, n+1):
            if t <= c+1:
                # While the network is smaller than c+1 vertices,
                # connect the new vertex to all existing vertices.
                edge_list[idx:idx+t-1] = [i for i in range(1, t)]
                idx = idx+t-1
            else:
                # Decide between preferential attachment and uniformly
                # random attachment with probability p.
                n_pref = len([True for i in range(c) if random() < p])
                # filled entries so far are edge_list[0:idx]
                edge_list[idx:idx+n_pref] = sample(edge_list[0:idx], n_pref)
                idx += n_pref
                edge_list[idx:idx+c-n_pref] = sample(range(1, t+1), c-n_pref)
                idx = idx + c - n_pref
        return edge_list


    if __name__ == "__main__":
        a = [1.5]
        c = [3]
        n = 10**6
        edge_lists = []
        for i in range(len(a)):
            edge_lists.append(gen_edge_list(a[i], c[i], n))
        output = open('edge_list.pkl', 'wb')
        pickle.dump(edge_lists, output)
        output.close()
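As a quick illustration of the preallocation point (this sketch is mine, not part of the original code, and the helper names grow and prealloc are hypothetical):

    import timeit

    import numpy as np

    def grow(n):
        # Grow the array one element at a time; np.append copies the
        # whole array on every call, so total work is O(n^2).
        arr = np.array([])
        for i in range(n):
            arr = np.append(arr, i)
        return arr

    def prealloc(n):
        # Allocate the full array once, then write in place: O(n) total.
        arr = np.zeros(n)
        for i in range(n):
            arr[i] = i
        return arr

    print(timeit.timeit(lambda: grow(10000), number=3))
    print(timeit.timeit(lambda: prealloc(10000), number=3))

The gap between the two grows with n, which is why preallocation matters at n = 10**6.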

My biggest concern is this part in particular:

 """ decide preferential attachment or uniformly random attachment by given p """ n_pref = len([True for i in range(c) if random() < p]) edge_list[idx:idx+n_pref] = sample(edge_list[0:idx-1], n_pref) idx += n_pref edge_list[idx:idx+c-n_pref] = sample(range(1, t+1), c-n_pref) idx = idx + c - n_pref 

Here is my friend's code, written in Matlab:

    a = [1.5];
    c = [3];
    n = 10^6;
    edgelist = cell(numel(c), 1);
    for i = 1:numel(c)
        p = c(i)./(c(i)+a(i));
        edgelist{i} = zeros(1, c(i).*n);
        edgelist{i}(1) = 1;
        idx = 2;
        for t = 3:n
            if t <= c(i)+1
                edgelist{i}(idx:(idx+t-2)) = 1:(t-1);
                idx = idx+t-1;
            else
                pref_or_rand = rand(1, c(i)) < p;
                prefn = sum(pref_or_rand);
                edgelist{i}(idx:(idx+c(i)-1)) = [edgelist{i}(randi(idx-1, 1, prefn)) randi(t, 1, c(i)-prefn)];
                idx = idx+c(i);
            end
        end
    end

I don't know what causes this huge difference in execution time between them (about 40 seconds for the Matlab code on a MacBook Pro vs. 40 minutes for the Python code on a recent i5 machine running Debian).

If you have any ideas, please let me know.


2 Answers


I changed the sampling part and it now takes only 19 seconds. I realized that NumPy's random sampling is much faster than Python's built-in random sampling.

    import pickle

    import numpy as np
    from numpy.random import randint, rand, choice


    def gen_edge_list(a, c, n):
        """Generate an edge list based on 14.1.1."""
        edge_list = np.zeros(c*n)
        p = float(c) / (float(c)+a)  # probability of a preferential attachment edge
        edge_list[0] = 1  # edge from vertex 2 to 1
        idx = 1  # points to the next free index
        for t in range(3, n+1):
            if t <= c+1:
                # While the network is smaller than c+1 vertices,
                # connect the new vertex to all existing vertices.
                edge_list[idx:idx+t-1] = [i for i in range(1, t)]
                idx = idx+t-1
            else:
                # Decide between preferential attachment and uniformly
                # random attachment with probability p.
                n_pref = np.sum(rand(c) < p)
                edge_list[idx:idx+n_pref] = choice(edge_list[0:idx], n_pref)
                idx += n_pref
                edge_list[idx:idx+c-n_pref] = randint(1, t+1, c-n_pref)
                idx = idx + c - n_pref
        return edge_list
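For a rough idea of the per-call gap, here is a micro-benchmark sketch (mine, not from the original post; exact timings will vary by machine). Note a semantic difference: np.random.choice samples with replacement by default while random.sample does not, so the NumPy version actually matches the Matlab randi call more closely:

    import timeit

    import numpy as np
    from random import sample

    pool_list = list(range(10**5))  # random.sample wants a plain sequence
    pool_arr = np.arange(10**5)
    k = 3  # a handful of draws per call, as in the inner loop

    t_py = timeit.timeit(lambda: sample(pool_list, k), number=10**4)
    t_np = timeit.timeit(lambda: np.random.choice(pool_arr, k), number=10**4)
    print("random.sample:    %.3f s" % t_py)
    print("np.random.choice: %.3f s" % t_np)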

Some more suggestions on your latest version (a sketch applying most of them follows the list):

1. Assigning a NumPy array to a NumPy array slice is faster than assigning a list. So edge_list[idx:idx+t-1] = np.arange(1, t) is faster than edge_list[idx:idx+t-1] = [i for i in range(1, t)].
2. It is a little faster, and in my opinion cleaner, to use foo.sum() for NumPy arrays instead of np.sum(foo) (or, in your case, (foo < bar).sum() versus np.sum(foo < bar)).
3. You can use in-place operations more. So idx += t-1 instead of idx = idx+t-1.
4. In my opinion it would be cleaner to have two loops, one for t <= c+1 and the other for t > c+1. This saves you an indentation level and a comparison per iteration.
5. You add n_pref to idx, then immediately subtract it again. This seems redundant; I would use another variable there.
6. You redo some of the arithmetic several times. I think it would be better to store those results in variables.
7. You can tell ahead of time what all the idx values will be, so you can pre-compute them.
8. You can pre-compute the range and reuse it.
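Here is a minimal sketch folding most of these suggestions into the function (untested against the original output, so treat it as illustrative rather than a drop-in replacement):

    import numpy as np
    from numpy.random import rand, randint, choice

    def gen_edge_list(a, c, n):
        # Price-model edge list, restructured per the suggestions above.
        edge_list = np.zeros(c*n)
        p = float(c) / (c + a)  # probability of a preferential attachment edge
        edge_list[0] = 1        # edge from vertex 2 to 1
        idx = 1                 # next free index

        # First loop (suggestion 4): while the network has at most c+1
        # vertices, connect the new vertex to every existing vertex.
        t = 3
        while t <= min(c+1, n):
            edge_list[idx:idx+t-1] = np.arange(1, t)  # suggestion 1
            idx += t-1                                # suggestion 3
            t += 1

        # Second loop: mix of preferential and uniform attachment.
        for t in range(t, n+1):
            n_pref = (rand(c) < p).sum()  # suggestion 2
            nxt = idx + c                 # idx always advances by exactly c
            edge_list[idx:idx+n_pref] = choice(edge_list[:idx], n_pref)
            edge_list[idx+n_pref:nxt] = randint(1, t+1, c-n_pref)
            idx = nxt                     # suggestions 5-7
        return edge_list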
• You are right. np.arange(1, t) must be faster and makes it consistent. – Commented May 6, 2016 at 20:12
