Skip to content

PERF Call KDTree in mutual_info to reduce memory footprint#17878

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 17, 2020

Conversation

noelano
Copy link
Contributor

What does this implement/fix? Explain your changes.

The mutual_info functions can end up requiring high amounts of memory which makes it particularly tricky to perform multiple MI computations for features in parallel.
The biggest impact is the nn.radius_neighbours call which loads the full nested arrays of all neighbours into memory before getting their sizes.

Since the algorithm is already being set to kd_tree anyway we can explicity call the KDTree class in order to use the query_radius method to get the count of neighbours without having to first store them.

@jnothman
Copy link
Member

Thank you @noelano

Please add an |Efficiency| entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

Copy link
Member

@thomasjpfanthomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking thought: I wonder if this metric can take advantage of n_jobs + chunking with a threading backend.

Thank you for the PR @noelano! LGTM

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@rth
Copy link
Member

rth commented Jul 15, 2020

I applied suggestions, but now I see that we still need to have the "what's new" entry to merge.

I wonder if this metric can take advantage of n_jobs + chunking with a threading backend.

We could investigate, but maybe let's merge this PR first..

@noelano Would you have benchmark results before/after this change?

@@ -130,6 +130,11 @@ Changelog
:pr:`17090` by :user:`Lisa Schwetlick <lschwetlick>` and
:user:`Marija Vlajic Wheeler <marijavlajic>`.

- |Efficiency| Reduce memory footprint in :func:`feature_selection._compute_mi_cd`
and :func:`feature_selection._compute_mi_cc` by calling :class:`neighbors.KDTree`
for counting nearest neighbors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for counting nearest neighbors
for counting nearest neighbors.
@noelano
Copy link
ContributorAuthor

noelano commented Jul 17, 2020

In terms of benchmark I can't share the data I was using to test but here's a quick illustration with 1m points using memory_profiler:

x = np.random.random(size=1000000) y = np.random.choice(['a', 'b'], size=1000000) _compute_mi_cd(x, y, 3) 

before:

152.3 MiB 0.0 MiB nn.set_params(algorithm='kd_tree')
154.3 MiB 2.0 MiB nn.fit(c)
323.5 MiB 169.2 MiB ind = nn.radius_neighbors(radius=radius, return_distance=False)
331.8 MiB 0.1 MiB m_all = np.array([i.size for i in ind])

after:
157.7 MiB 0.0 MiB nn = KDTree(c)
165.4 MiB 7.6 MiB m_all = nn.query_radius(c, radius, count_only=True, return_distance=False)
165.4 MiB 0.0 MiB m_all = np.array(m_all) - 1.0

Execution time is slight improvement also

x = np.random.random(size=1000000) y = np.random.choice(['a', 'b'], size=1000000) print(timeit.timeit('mi = _compute_mi_cd(x, y, 3)', number=100, setup="from __main__ import x, y, _compute_mi_cd")) print(timeit.timeit('mi = _compute_mi_cd(x, y, 3, "new")', number=100, setup="from __main__ import x, y, _compute_mi_cd")) 

gives
71.61950279999999
64.1891042

In terms of n_jobs and chunking I hadn't looked into it since the public functions don't expose those anyway. I guess it'd be a separate task to look into adding that as an enhancement

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Comment on lines 133 to 135
- |Efficiency| Reduce memory footprint in :func:`feature_selection.mutual_info_classif`
and :func:`feature_selection.mutual_info_regression` by calling :class:`neighbors.KDTree`
for counting nearest neighbors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need to rewrap to make this pass linting.

@thomasjpfan
Copy link
Member

Thank you for the benchmark!

In terms of n_jobs and chunking I hadn't looked into it since the public functions don't expose those anyway. I guess it'd be a separate task to look into adding that as an enhancement

This PR should not consider chunking. It was a passing thought when reviewing the code.

@rthrth changed the title Explicitly call KDTree in mutual_info to reduce memory footprintPERF Call KDTree in mutual_info to reduce memory footprintJul 17, 2020
@rthrth merged commit 0b0afd2 into scikit-learn:masterJul 17, 2020
@rth
Copy link
Member

rth commented Jul 17, 2020

Thank you @noelano and reviewers!

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment
4 participants
@noelano@jnothman@rth@thomasjpfan
close