
ENH: spatial: remove redundant weight summing from hamming distance #20703

Draft · DavidM603 wants to merge 1 commit into main
Conversation

DavidM603

Instead of re-summing all the weights each time a pair of inputs is compared, sum the weights once and reuse the result. When I measured locally, trimming this extra work gave about 25% more throughput on chars and 5% on doubles. I'll leave this as a draft to start and seek review if the CI benchmarks corroborate the local results.
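
For context, the change amounts to hoisting the weight sum out of the per-pair loop. A rough Python-level sketch of the idea follows; the actual change is in the C++ distance code, and the function names here are only illustrative:

```python
import numpy as np

def weighted_hamming(u, v, w, w_sum=None):
    # Weighted Hamming distance: total weight of mismatching positions,
    # normalized by the total weight.
    if w_sum is None:
        w_sum = w.sum()
    return np.sum(w[u != v]) / w_sum

def pdist_hamming_before(X, w):
    # Current behaviour: w.sum() is recomputed inside every pairwise call.
    return [weighted_hamming(X[i], X[j], w)
            for i in range(len(X)) for j in range(i + 1, len(X))]

def pdist_hamming_after(X, w):
    # This PR: sum the weights once and reuse the result for every pair.
    w_sum = w.sum()
    return [weighted_hamming(X[i], X[j], w, w_sum)
            for i in range(len(X)) for j in range(i + 1, len(X))]
```

The saving is proportionally larger for cheap element types like chars, where the per-pair comparison itself is inexpensive relative to the redundant summation.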

Reference issue

n/a

What does this implement/fix?

Moderate performance improvements to existing functionality.

Additional information

n/a

Instead of re-summing all the weights each time a pair of inputs is compared, sum the weights once and re-use the result.
github-actions bot added the scipy.spatial, C/C++ (Items related to the internal C/C++ code base), and enhancement (A new feature or improvement) labels on May 12, 2024
@tylerjereddy (Contributor) left a comment


So far it doesn't seem to cross the threshold for the current benchmarks via, e.g., `asv continuous -e -b ".*Xdist.*" main hamming_distance_speedup` locally on x86_64 Linux.

That said, 25% for a specific data type might be promising. If you find that this data type is not covered by `benchmarks/benchmarks/spatial.py`, which I think focuses on float types, it would probably be fine either to expand those benchmarks to include the new type or to add another benchmark to cover the case. I believe string types may only be of interest for a subset of the distance metrics, so they may need a small separate benchmark in there, I suppose.

`XdistWeighted` may be the closest benchmark to what you are touching at the moment?
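
Something along these lines might be a starting point for a dtype-specific benchmark in `benchmarks/benchmarks/spatial.py`; the class, parameter values, and dtype choice below are made up for illustration rather than taken from the file:

```python
import numpy as np
from scipy.spatial.distance import pdist


class XdistWeightedChar:
    # Weighted pdist on small-integer data, aimed at the non-double
    # code path that the hamming change speeds up the most.
    params = ([10, 100], ['hamming'])
    param_names = ['num_points', 'metric']

    def setup(self, num_points, metric):
        rng = np.random.default_rng(123)
        self.points = rng.integers(0, 2, size=(num_points, 100)).astype(np.uint8)
        self.weights = rng.random(100)

    def time_pdist(self, num_points, metric):
        pdist(self.points, metric, w=self.weights)
```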

@DavidM603 (Author)

> So far it doesn't seem to cross the threshold for the current benchmarks via, e.g., `asv continuous -e -b ".*Xdist.*" main hamming_distance_speedup` locally on x86_64 Linux.
>
> That said, 25% for a specific data type might be promising. If you find that this data type is not covered by `benchmarks/benchmarks/spatial.py`, which I think focuses on float types, it would probably be fine either to expand those benchmarks to include the new type or to add another benchmark to cover the case. I believe string types may only be of interest for a subset of the distance metrics, so they may need a small separate benchmark in there, I suppose.
>
> `XdistWeighted` may be the closest benchmark to what you are touching at the moment?

I took a look at the benchmarking code in that file and found something on line 293: all of the distance measures use only 3 dimensions, but for some metrics that's a few orders of magnitude smaller than typical use, so it doesn't really provide a good sample. For example, in `XdistWeighted.time_pdist` the largest sample size for hamming distance (100 x 3) runs in 291 us, but the C part alone clocks in around 12 us at that scale, so ~95% of that benchmark is spent in the wrappers. I'd like to take on adding larger, but still reasonable, benchmarks for the metrics that warrant higher dimensions. Maybe later I could also add benchmarks for the metrics that take non-double types.
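
Roughly the shape of change I have in mind; the existing class layout is paraphrased from memory rather than copied, and the larger sizes are placeholders to be tuned against actual runtimes:

```python
import numpy as np
from scipy.spatial.distance import pdist


class XdistWeightedLarge:
    # Adds a feature-dimension axis so that cheap metrics such as hamming
    # spend most of their time in the C kernel rather than in the Python
    # wrappers that dominate the 100 x 3 case today.
    params = ([10, 100], [3, 1000], ['hamming', 'euclidean'])
    param_names = ['num_points', 'num_features', 'metric']

    def setup(self, num_points, num_features, metric):
        rng = np.random.default_rng(123)
        self.points = rng.random((num_points, num_features))
        self.weights = rng.random(num_features)

    def time_pdist(self, num_points, num_features, metric):
        pdist(self.points, metric, w=self.weights)
```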

@lucascolley changed the title from "ENH: remove redundant weight summing from hamming distance" to "ENH: spatial: remove redundant weight summing from hamming distance" on May 17, 2024