
For a computational engineering model, I want to do a grid search over all feasible parameter combinations. Each parameter has a range of possible values, e.g. (0 … 100), and the parameter combination must fulfil the condition a + b + c = 100. An example:

    ranges = {'a': (95, 99), 'b': (1, 4), 'c': (1, 2)}
    increment = 1.0
    target = 100.0

So the combinations that fulfil the condition a + b + c = 100 are:

    [(95, 4, 1), (95, 3, 2), (96, 2, 2), (96, 3, 1), (97, 1, 2), (97, 2, 1), (98, 1, 1)]

This algorithm should run with any number of parameters, range lengths, and increments.

The solution I have come up with brute-forces the problem: it calculates all combinations and then discards the ones that do not fulfil the given condition. I have to use np.isclose(), because with floats, rounding errors mean the sum will often not be exactly equal to the target.
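A quick illustration of why the exact comparison is unreliable with floats:

    >>> 0.1 + 0.2 == 0.3
    False
    >>> import numpy as np
    >>> np.isclose(0.1 + 0.2, 0.3)
    True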

    import itertools

    import numpy as np
    import pandas as pd


    def solution(ranges, increment, target):
        combinations = []
        for parameter in ranges:
            combinations.append(list(np.arange(ranges[parameter][0],
                                               ranges[parameter][1],
                                               increment)))
            # np.arange() is exclusive of the upper bound, let's fix that
            if combinations[-1][-1] != ranges[parameter][1]:
                combinations[-1].append(ranges[parameter][1])
        result = []
        for combination in itertools.product(*combinations):
            # using np.isclose() so that the algorithm works for floats
            if np.isclose(sum(combination), target):
                result.append(combination)
        df = pd.DataFrame(result, columns=ranges.keys())
        return df

However, this quickly takes days to compute, so it is not viable for a large number of parameters and wide ranges. For instance, one set that I am trying to solve is (the combinations variable, already expanded):

    [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0],
     [22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0],
     [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0],
     [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
     [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
     [0.0, 1.0, 2.0],
     [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
     [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
     [0.0],
     [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0],
     [0.0]]

This results in memory use of more than 40 GB and a calculation time of more than 400 hours. Do you see a solution that is either faster or more intelligent, i.e. one that does not brute-force the problem?

  • Maybe a bit trivial, but I just found that doing a first check with sum() before doing the actual check with np.isclose() saves a lot of time.
    – n1000, Jan 2, 2019 at 12:14
  • Welcome to Code Review Stack Exchange! Do you know what the condition is going to be like? In your case, it seems to be a linear constraint. Will that always be the case?
    – SylvainD, Jan 2, 2019 at 12:21
  • Sorry - not sure if I understand the question. The condition is that the sum of the combination must equal the target variable, which is constant.
    – n1000, Jan 2, 2019 at 12:23
  • Ha, OK, fair enough - I didn't understand that the condition was always that the sum of the parameters should be equal to the target value.
    – SylvainD, Jan 2, 2019 at 12:27
  • Looks suspiciously similar to a subset sum problem. Don't hold your breath.
    – vnp, Jan 2, 2019 at 21:34

2 Answers


split

The method does a few things at the same time:

  • it expands the ranges
  • it generates the combinations
  • it takes the sums
  • it compares them to the target
  • it assembles a DataFrame

It would be better to split this into a few logical units. That will not only help free memory between the steps, but also help with debugging the intermediate values.

looping

Check out the talk "Looping like a pro".

You loop over the keys and end up with really long expressions like ranges[parameter][0]. If you use tuple unpacking, you can do this:

    def expand(ranges, increment):
        for start, end in ranges.values():
            expansion = list(np.arange(start, end, increment))
            # np.arange() is exclusive of the upper bound, let's fix that
            if expansion[-1] != end:
                expansion.append(end)
            yield expansion

For me, this is a lot clearer. It also has the benefit of keeping the lines shorter.

Since what you create are really the expansions of the ranges rather than the combinations, I renamed the method and the variables.

This also uses a generator instead of an intermediate list.

This alternative might be even simpler:

    def expand(ranges, increment):
        for start, end in ranges.values():
            # stopping just past `end` makes np.arange include the upper
            # bound without adding an extra step through float error
            yield np.arange(start, end + increment / 2, increment)
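For the example ranges this produces (a quick REPL check, not part of the original answer):

    >>> list(expand({'a': (95, 99), 'b': (1, 4), 'c': (1, 2)}, 1.0))
    [array([95., 96., 97., 98., 99.]), array([1., 2., 3., 4.]), array([1., 2.])]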

This can then be combined and summed like this:

    def combine_naive(expansions, target):
        for combination in itertools.product(*expansions):
            # using np.isclose() so that the algorithm works for floats
            if np.isclose(sum(combination), target):
                yield combination

I like your comment there. It explains why you use np.isclose instead of ==.

Alternative approaches

numpy

Instead of having itertools.product assemble the combinations, you could use numpy.meshgrid:

    def combine_numpy(expansions, target):
        # indexing='ij' keeps the axes in the same order as the expansions;
        # the default 'xy' swaps the first two axes, which would pair the
        # wrong indices with the wrong expansions below
        grid = np.meshgrid(*expansions, indexing='ij')
        summary = np.array(grid).sum(axis=0)
        indices = np.where(np.isclose(summary, target))
        return (
            [row[i] for row, i in zip(expansions, idx)]
            for idx in zip(*indices)
        )

This is approximately the same, but stays in the numpy universe. You can reduce the memory footprint here with the right choice of dtype and by reducing the number of intermediate variables.
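A sketch of both ideas combined - float32 grids plus meshgrid's sparse=True option, which returns small broadcastable views instead of full-size grids (the function name is mine, not from the original answer):

    def combine_numpy_small(expansions, target):
        # sparse=True yields broadcastable views instead of len(expansions)
        # full-size grids; only `total` is materialized at full size, in float32
        grids = np.meshgrid(*(np.asarray(e, dtype=np.float32) for e in expansions),
                            indexing='ij', sparse=True)
        total = np.zeros(tuple(len(e) for e in expansions), dtype=np.float32)
        for g in grids:
            total += g  # each sparse grid broadcasts over the full shape
        indices = np.where(np.isclose(total, target))
        return (
            [row[i] for row, i in zip(expansions, idx)]
            for idx in zip(*indices)
        )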

sorted

Or you could use the fact that the expansions are sorted, and break off as soon as a sum exceeds the target. This uses recursion, tuple unpacking, and the fact that an empty sequence is falsy.

    def combine_sorted(expansions, target, previous=()):
        series, *rest = expansions
        sum_max = sum(item[-1] for item in rest)
        for item in series:
            current = previous + (item,)
            sum_current = sum(current)
            if rest:
                if sum_current + sum_max < target and not np.isclose(
                    sum_current + sum_max, target
                ):
                    continue
                yield from combine_sorted(rest, target, previous=current)
            else:
                if np.isclose(sum_current, target):
                    yield current
                elif sum_current > target:
                    return

The sum_max check is there to skip, as early as possible, branches where even the largest remaining values cannot add up to the target.

Testing and timing with real data will show which option is more memory- and CPU-intensive.
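For example, with a minimal harness like this (my own sketch; tracemalloc only tracks allocations made through Python, so treat the memory numbers as indicative):

    import time
    import tracemalloc

    def benchmark(fn, expansions, target):
        tracemalloc.start()
        start = time.perf_counter()
        count = sum(1 for _ in fn(expansions, target))  # drain the generator
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"{fn.__name__}: {count} results in {elapsed:.2f} s, "
              f"peak {peak / 1e6:.1f} MB")

    for fn in (combine_naive, combine_numpy, combine_sorted):
        benchmark(fn, expansions, target)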

    if __name__ == "__main__":
        ranges = {"a": (95, 99), "b": (1, 4), "c": (1, 2)}
        increment = 1.0
        target = 100.0

        df = solution(ranges, increment, target)

        expansions = list(expand(ranges, increment))

        results = list(combine_naive(expansions, target))
        df1 = pd.DataFrame(results, columns=ranges.keys())
        print(df1)

        results_sorted = list(combine_sorted(expansions, target))
        print(results_sorted)
        df_sorted = pd.DataFrame(results_sorted, columns=ranges.keys())
        print(df_sorted)

        results_numpy = list(combine_numpy(expansions, target))
        print(results_numpy)
        df_numpy = pd.DataFrame(results_numpy, columns=ranges.keys())
        print(df_numpy)
  • Actually my first try was to put all combinations into a dataframe and drop all that were not equal to the target, but I ran into memory issues. Your solutions and feedback are nice! Here are calculation time and max. memory usage for a relatively large example: combine_naive(): 32.3 min, 60.3 MB; combine_sorted(): 6.1 min, 60.5 MB; combine_numpy(): 0.4 min, 6761 MB. It seems like combine_sorted() may be a great compromise between speed and memory.
    – n1000, Jan 4, 2019 at 16:24
  • You can do both: try combine_numpy and catch the MemoryError; if it doesn't succeed, fall back to combine_sorted.
    – Jan 7, 2019 at 8:38
  • Hmm. Interesting. I did not get a MemoryError, however - macOS started swapping like crazy. I am not sure it is easily possible to keep tabs on the memory consumption of a script. @Maarten Fabré
    – n1000, Jan 7, 2019 at 15:19
  • Another possibility would be to calculate the number of combinations (the product of the lengths of the ranges) and decide based on that.
    – Jan 7, 2019 at 17:00
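Combining both suggestions from the comments might look like this (a sketch; the threshold and the combine name are mine):

    import math

    def combine(expansions, target, max_cells=50_000_000):
        # number of grid cells = product of the range lengths
        n_cells = math.prod(len(e) for e in expansions)
        if n_cells > max_cells:  # the full grid clearly won't fit in RAM
            return combine_sorted(expansions, target)
        try:
            return combine_numpy(expansions, target)
        except MemoryError:      # the allocation still failed, fall back
            return combine_sorted(expansions, target)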

This is (nearly) the perfect application for ILP (integer linear programming), with a catch: whereas you want to enumerate all feasible combinations, ILP is for convex optimization within certain constraints, has a cost metric, and only outputs the single best combination (where you define "best").

You haven't given us a lot of detail on the "computation engineering model", but it's likely that not all of your output combinations are preferable, and there's some other selection process at work after your Python program has run. If you can capture this selection process as a linear cost metric, then ILP was made for your problem.

If you're married to Python, you can use NumPy and scipy.optimize.linprog, but it has a prominent shortcoming: it supports only LP (with floating-point values), not ILP. If you aren't married to Python, use something like glpk - either directly or through something like Octave's interface.

To illustrate, following along with Chapter 1.1 of the glpk documentation and matching it to your example (see the sketch after this list):

  • Your a, b, and c would form a three-element vector x that glpk is responsible for determining
  • You would have to create a vector c defining the cost ("objective") function against which glpk will optimize
  • Your ranges would be converted into the vectors l and u
  • The constraint coefficient matrix a would be set to a vector of all 1s
  • Your target would be converted into the "constraint auxiliary variable"
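As a rough sketch of the same mapping using scipy.optimize.linprog (LP only, so no integrality; the all-ones cost vector is a placeholder for a real objective - recent SciPy versions also provide scipy.optimize.milp if you do need integer solutions):

    import numpy as np
    from scipy.optimize import linprog

    ranges = {'a': (95, 99), 'b': (1, 4), 'c': (1, 2)}
    target = 100.0

    c = np.ones(len(ranges))          # placeholder cost; encode your real preference here
    A_eq = np.ones((1, len(ranges)))  # coefficient row of all 1s: a + b + c
    b_eq = [target]                   # the "constraint auxiliary variable"
    bounds = list(ranges.values())    # per-parameter (lower, upper) bounds

    result = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    print(result.x)                   # one optimal combination, not an enumeration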

If you actually need to constrain the output to whole numbers (i.e. your increment is 1), then use MIP ("mixed integer linear programming") mode directly.

If increment is not 1 and you still need to constrain the output to a linear space following those increments, you'll need a pre- and post-scaling operation so that you're still using MIP; see the sketch below.
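Something along these lines, where the helper names are mine and the actual MIP solve is elided:

    def scale_problem(ranges, target, increment):
        # pre-scale: one unit in the scaled problem equals one increment,
        # so integer MIP solutions land exactly on the increment grid
        bounds = [(lo / increment, hi / increment) for lo, hi in ranges.values()]
        return bounds, target / increment

    def unscale_solution(x, increment):
        # post-scale: map the integer solution back to the original units
        return [xi * increment for xi in x]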

If it actually doesn't matter that the numbers be constrained to increments at all, then just use LP mode.
