Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: nancorr_spearman #41857

Merged
merged 5 commits into from Jun 9, 2021
Merged

PERF: nancorr_spearman #41857

merged 5 commits into from Jun 9, 2021

Conversation

mzeitlin11
Copy link
Member

Think a lot more can be squeezed out, so not closing the issue. Initial results:

       before           after         ratio
     [dae5c597]       [d60902af]
     <master>         <perf/corr>
              69M            69.7M     1.01  stat_ops.Correlation.peakmem_corr_wide('spearman')
      1.47±0.02ms       1.60±0.6ms     1.09  stat_ops.Correlation.time_corr('spearman')
         690±20μs         629±90μs     0.91  stat_ops.Correlation.time_corr_series('spearman')
-        30.4±1ms       13.9±0.4ms     0.46  stat_ops.Correlation.time_corr_wide('spearman')
         589±60ms         528±20ms    ~0.90  stat_ops.Correlation.time_corr_wide_nans('spearman')
         12.0±2ms       9.02±0.9ms    ~0.75  stat_ops.Correlation.time_corrwith_cols('spearman')
         253±10ms          241±4ms     0.95  stat_ops.Correlation.time_corrwith_rows('spearman')

This speedup factor increases the larger the frame since more time gets spent in the ~O(N^2K) algo. With the nogil added, we could also explore using prange on the outer loop and exposing some new kwarg to allow cython parallelism.

@mzeitlin11 mzeitlin11 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Performance Memory or execution speed performance labels Jun 7, 2021
@jbrockmendel
Copy link
Member

exposing some new kwarg to allow cython parallelism

What is the scenario in which prange is available but the user would want to disable it?

@mzeitlin11
Copy link
Member Author

exposing some new kwarg to allow cython parallelism

What is the scenario in which prange is available but the user would want to disable it?

Sorry "allow" was a poor word choice. More specifically, I think it would make sense for prange to default to num_threads=1, with the user then able to specify using more threads.

By default prange would defer to OpenMP to choose the number of threads, which would usually set the value to the number of cores, which can be problematic for cases like batch computing. You could probably manually set an environment variable to get around that on the user-end, but having the default behavior to use all cores would probably be a surprising change that should be opt-in.

@jbrockmendel
Copy link
Member

which can be problematic for cases like batch computing

thanks, thats what i was missing.

Would any of this Just Work if this were implemented in numba instead of cython?

I don't have a strong opinion on this per se, but am wary of a) using a cython feature (prange) that we don't currently use and that i don't know i) how well supported it is or ii) if its behavior is e.g. platform-dependent and b) user-facing kwargs that will give us combinatorially more cases to test

@jreback
Copy link
Contributor

jreback commented Jun 8, 2021

numba will just work for parallelism

+1 on adding similar apis as we have for window functions

cc @mroeschke

@mzeitlin11
Copy link
Member Author

mzeitlin11 commented Jun 8, 2021

I don't have a strong opinion on this per se, but am wary of a) using a cython feature (prange) that we don't currently use and that i don't know i) how well supported it is or ii) if its behavior is e.g. platform-dependent and b) user-facing kwargs that will give us combinatorially more cases to test

Agree with the concerns, was just mentioning as something to consider for the future, not meant for this pr

EDIT: also agree numba would make things easier, do you know if there has been any exploration of using numba instead of cython?

@@ -99,6 +99,7 @@ class Correlation:
param_names = ["method"]

def setup(self, method):
np.random.seed(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this random seed. The imported setup function sets the random seed for all benchmarks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will remove

@mroeschke
Copy link
Member

EDIT: also agree numba would make things easier, do you know if there has been any exploration of using numba instead of cython?

All window aggregations and groupby transform/agg have API support for a engine argument that can accept 'numba' as an argument which will do the computation using numba. It essentially means rewriting the cython code back to python and adding the jit decorator.

You can see pandas/core/window/numba_.py and pandas/core/groupby/numba_.py for the pattern I was using to implement these.

@simonjayhawkins
Copy link
Member

EDIT: also agree numba would make things easier, do you know if there has been any exploration of using numba instead of cython?

see #40530, although that's a bit ambitious as the goal there is to remove cython completely to create a no-arch pandas build. ( I think would maybe be good for new contributors or for exploration of the libs api, see later). I'm working through slowly but still on 'a' for algos. (next is wrappers for take_2d in algos.)

A significant benefit I would see for using numba would be the native handling of datetimes by numba avoiding i8 conversions in the main code and passing is_datetime like flags to the cython libs. I originally started to this also, but makes keeping my branch in sync harder. So I'm just looking to swap out the libraries as a first steps and keep the same api for the libs.

numba will just work for parallelism

I think you still have the same issue regarding what the default setting should be. IIRC in #40530 where I made take_1d parallel, running the test suite with n=auto was a lot slower and needed to use the environment settings to turn off parallelism. although I have since changed that to only use parallelism when the number of rows > 10_000, so probably no longer an issue for the test suite.

@mzeitlin11
Copy link
Member Author

Thanks for explaining @mroeschke and @simonjayhawkins!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a question

result[xi, yi] = result[yi, xi] = NaN
else:
if not all_ranks:
with gil:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the gil here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rank_1d can't be called with nogil. Perhaps some refactoring could allow calling some nogil rank_1d helper instead, but that would be a larger change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i c, ok i think its worthile to make that nogil (but not in this PR), followon preferred.

# We need to slice back to nobs because rank_1d will
# require arrays of nobs length
labels_nobs = np.zeros(nobs, dtype=np.int64)
rankedx = rank_1d(np.array(maskedx)[:nobs],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah should really take a memory view (or have a helper function to do it)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some followon suggestions.

@jreback jreback added this to the 1.3 milestone Jun 9, 2021
@jreback jreback merged commit 63c20d2 into pandas-dev:master Jun 9, 2021
@mzeitlin11
Copy link
Member Author

some followon suggestions.

Will look into removing that gil section. The main complication is that rank_1d requires a np.lexsort step that would have to be done with the gil (or perhaps some kind of presorting could be done, will look into it)

@mzeitlin11 mzeitlin11 deleted the perf/corr branch June 9, 2021 00:39
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants