PERF Call KDTree in mutual_info to reduce memory footprint #17878

noelano · 2020-07-09T22:31:09Z

What does this implement/fix? Explain your changes.

The mutual_info functions can require a lot of memory, which makes it particularly tricky to run multiple MI computations for features in parallel.
The biggest contributor is the nn.radius_neighbors call, which loads the full nested arrays of all neighbours into memory before getting their sizes.

Since the algorithm is already being set to kd_tree anyway, we can explicitly call the KDTree class and use its query_radius method to get the neighbour counts without having to store the neighbours first.
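
To illustrate the idea described above, here is a minimal sketch (not the code from this PR) that compares the two ways of counting neighbours. The data, the fixed `radius` and the variable names are assumptions made for the example; the actual functions use a per-point radius derived from a k-nearest-neighbours query.

```python
import numpy as np
from sklearn.neighbors import KDTree, NearestNeighbors

rng = np.random.RandomState(0)
c = rng.random_sample(size=(1000, 1))  # one continuous feature, as a column
radius = 0.01  # hypothetical fixed radius, for illustration only

# Old approach: materialise an index array per point, then take its size.
nn = NearestNeighbors(algorithm="kd_tree").fit(c)
ind = nn.radius_neighbors(radius=radius, return_distance=False)
counts_old = np.array([i.size for i in ind])

# New approach: ask the tree for the counts only.
# The query points are the indexed points themselves, so each count
# includes the point itself; subtract 1 to match the result above.
tree = KDTree(c)
counts_new = tree.query_radius(c, radius, count_only=True) - 1

print(np.array_equal(counts_old, counts_new))
```

`query_radius` also accepts an array of radii, one per query point, which is how a per-point radius could be passed.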

jnothman · 2020-07-11T11:50:21Z

Thank you @noelano

Please add an |Efficiency| entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

thomasjpfan

Not blocking thought: I wonder if this metric can take advantage of n_jobs + chunking with a threading backend.

Thank you for the PR @noelano! LGTM

sklearn/feature_selection/_mutual_info.py

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

rth · 2020-07-15T11:59:00Z

I applied suggestions, but now I see that we still need to have the "what's new" entry to merge.

> I wonder if this metric can take advantage of n_jobs + chunking with a threading backend.

We could investigate, but maybe let's merge this PR first..

@noelano Would you have benchmark results before/after this change?


thomasjpfan · 2020-07-17T18:47:17Z

doc/whats_new/v0.24.rst

@@ -130,6 +130,11 @@ Changelog
 :pr:`17090` by :user:`Lisa Schwetlick <lschwetlick>` and
 :user:`Marija Vlajic Wheeler <marijavlajic>`.

+- |Efficiency| Reduce memory footprint in :func:`feature_selection._compute_mi_cd`
+ and :func:`feature_selection._compute_mi_cc` by calling :class:`neighbors.KDTree`
+ for counting nearest neighbors


Suggested change
- for counting nearest neighbors
+ for counting nearest neighbors.

noelano · 2020-07-17T18:52:38Z

As for benchmarks, I can't share the data I was testing with, but here's a quick illustration with 1M points using memory_profiler:

    import numpy as np
    from sklearn.feature_selection._mutual_info import _compute_mi_cd

    x = np.random.random(size=1000000)
    y = np.random.choice(['a', 'b'], size=1000000)
    _compute_mi_cd(x, y, 3)

before:

    Mem usage    Increment    Line Contents
    152.3 MiB      0.0 MiB    nn.set_params(algorithm='kd_tree')
    154.3 MiB      2.0 MiB    nn.fit(c)
    323.5 MiB    169.2 MiB    ind = nn.radius_neighbors(radius=radius, return_distance=False)
    331.8 MiB      0.1 MiB    m_all = np.array([i.size for i in ind])

after:

    Mem usage    Increment    Line Contents
    157.7 MiB      0.0 MiB    nn = KDTree(c)
    165.4 MiB      7.6 MiB    m_all = nn.query_radius(c, radius, count_only=True, return_distance=False)
    165.4 MiB      0.0 MiB    m_all = np.array(m_all) - 1.0
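
For readers who want to try a similar comparison without memory_profiler, the following is a rough sketch using the standard-library tracemalloc module. The helpers `peak_mib`, `count_old` and `count_new` are hypothetical names introduced for this example, and the data is much smaller than in the benchmark above. tracemalloc only sees allocations made through Python's allocators, so its numbers are not comparable to the process-level figures shown above.

```python
import tracemalloc

import numpy as np
from sklearn.neighbors import KDTree, NearestNeighbors


def peak_mib(func):
    """Run func() and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


def count_old(c, radius):
    # Materialise all neighbour index arrays, then take their sizes.
    nn = NearestNeighbors(algorithm="kd_tree").fit(c)
    ind = nn.radius_neighbors(radius=radius, return_distance=False)
    return np.array([i.size for i in ind])


def count_new(c, radius):
    # Count only; subtract 1 because each point finds itself.
    return KDTree(c).query_radius(c, radius, count_only=True) - 1


rng = np.random.RandomState(0)
c = rng.random_sample(size=(5000, 1))
radius = 0.01  # assumed fixed radius, for illustration only

counts_old, peak_old = peak_mib(lambda: count_old(c, radius))
counts_new, peak_new = peak_mib(lambda: count_new(c, radius))
print(f"before: {peak_old:.2f} MiB peak, after: {peak_new:.2f} MiB peak")
```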

Execution time also improves slightly:

    import timeit

    x = np.random.random(size=1000000)
    y = np.random.choice(['a', 'b'], size=1000000)
    # before
    print(timeit.timeit('mi = _compute_mi_cd(x, y, 3)', number=100, setup="from __main__ import x, y, _compute_mi_cd"))
    # after ("new" switches a local copy of _compute_mi_cd to the KDTree path)
    print(timeit.timeit('mi = _compute_mi_cd(x, y, 3, "new")', number=100, setup="from __main__ import x, y, _compute_mi_cd"))

gives (seconds for 100 runs, before then after)

    71.61950279999999
    64.1891042

As for n_jobs and chunking, I hadn't looked into it since the public functions don't expose those anyway. I guess it would be a separate task to look into adding that as an enhancement.
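
As a rough illustration of what such a follow-up might look like (it is not part of this PR), the sketch below splits the query points into chunks and counts neighbours per chunk with joblib's threading preference. The helper name `count_neighbors_chunked` and its `n_chunks` parameter are hypothetical, and whether threads give any real speed-up here is an assumption that would need benchmarking.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.neighbors import KDTree
from sklearn.utils import gen_even_slices


def count_neighbors_chunked(tree, X, radius, n_jobs=2, n_chunks=8):
    """Hypothetical helper: per-point neighbour counts, computed chunk by chunk."""
    slices = list(gen_even_slices(X.shape[0], n_chunks))
    counts = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(tree.query_radius)(X[s], radius[s], count_only=True)
        for s in slices
    )
    # Chunks are returned in order, so concatenation restores the row order.
    return np.concatenate(counts)


rng = np.random.RandomState(0)
c = rng.random_sample(size=(2000, 1))
radius = rng.uniform(0.005, 0.02, size=2000)  # assumed per-point radii

tree = KDTree(c)
chunked = count_neighbors_chunked(tree, c, radius)
direct = tree.query_radius(c, radius, count_only=True)
print(np.array_equal(chunked, direct))
```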

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

thomasjpfan · 2020-07-17T19:15:43Z

doc/whats_new/v0.24.rst

+- |Efficiency| Reduce memory footprint in :func:`feature_selection.mutual_info_classif`
+ and :func:`feature_selection.mutual_info_regression` by calling :class:`neighbors.KDTree`
+ for counting nearest neighbors


May need to rewrap to make this pass linting.

thomasjpfan · 2020-07-17T19:15:56Z

Thank you for the benchmark!

> In terms of n_jobs and chunking I hadn't looked into it since the public functions don't expose those anyway. I guess it'd be a separate task to look into adding that as an enhancement

This PR should not consider chunking. It was a passing thought when reviewing the code.

rth · 2020-07-17T22:08:29Z

Thank you @noelano and reviewers!


Use KDTree for mutual_info to reduce memory footprint
e5c9860

github-actionsbot added the module:feature_selection label Jul 9, 2020

jnothman approved these changes Jul 11, 2020

thomasjpfan approved these changes Jul 12, 2020

Apply suggestions from code review by Thomas
b94fc15
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

noelano added 2 commits July 17, 2020 19:29

Update release doc
a2a9e62

Merge branch 'mutual_info_kdtree' of https://github.com/noelano/sciki…
e701be3
…t-learn into mutual_info_kdtree

thomasjpfan reviewed Jul 17, 2020

Update doc/whats_new/v0.24.rst
b6118ad
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

thomasjpfan reviewed Jul 17, 2020

Merge conflict resolution
180c3e0

rth changed the title from "Explicitly call KDTree in mutual_info to reduce memory footprint" to "PERF Call KDTree in mutual_info to reduce memory footprint" Jul 17, 2020

rth merged commit 0b0afd2 into scikit-learn:master Jul 17, 2020

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
PERF Call KDTree in mutual_info to reduce memory footprint (scikit-le…
02bd8ae
…arn#17878) Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
