- Notifications
You must be signed in to change notification settings - Fork 25.8k
PERF Call KDTree in mutual_info to reduce memory footprint#17878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thank you @noelano Please add an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking thought: I wonder if this metric can take advantage of n_jobs + chunking with a threading backend.
Thank you for the PR @noelano! LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
I applied suggestions, but now I see that we still need to have the "what's new" entry to merge.
We could investigate, but maybe let's merge this PR first.. @noelano Would you have benchmark results before/after this change? |
doc/whats_new/v0.24.rst Outdated
@@ -130,6 +130,11 @@ Changelog | |||
:pr:`17090` by :user:`Lisa Schwetlick <lschwetlick>` and | |||
:user:`Marija Vlajic Wheeler <marijavlajic>`. | |||
- |Efficiency| Reduce memory footprint in :func:`feature_selection._compute_mi_cd` | |||
and :func:`feature_selection._compute_mi_cc` by calling :class:`neighbors.KDTree` | |||
for counting nearest neighbors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for counting nearest neighbors | |
for counting nearest neighbors. |
In terms of benchmark I can't share the data I was using to test but here's a quick illustration with 1m points using memory_profiler:
before: 152.3 MiB 0.0 MiB nn.set_params(algorithm='kd_tree') after: Execution time is slight improvement also
gives In terms of n_jobs and chunking I hadn't looked into it since the public functions don't expose those anyway. I guess it'd be a separate task to look into adding that as an enhancement |
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
doc/whats_new/v0.24.rst Outdated
- |Efficiency| Reduce memory footprint in :func:`feature_selection.mutual_info_classif` | ||
and :func:`feature_selection.mutual_info_regression` by calling :class:`neighbors.KDTree` | ||
for counting nearest neighbors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May need to rewrap to make this pass linting.
Thank you for the benchmark!
This PR should not consider chunking. It was a passing thought when reviewing the code. |
Thank you @noelano and reviewers! |
…arn#17878) Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
What does this implement/fix? Explain your changes.
The mutual_info functions can end up requiring high amounts of memory which makes it particularly tricky to perform multiple MI computations for features in parallel.
The biggest impact is the
nn.radius_neighbours
call which loads the full nested arrays of all neighbours into memory before getting their sizes.Since the algorithm is already being set to
kd_tree
anyway we can explicity call the KDTree class in order to use thequery_radius
method to get the count of neighbours without having to first store them.