Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: parallelize DataFrame.corr #40956

Open
Vysybyl opened this issue Apr 14, 2021 · 6 comments
Open

ENH: parallelize DataFrame.corr #40956

Vysybyl opened this issue Apr 14, 2021 · 6 comments
Assignees
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff cov/corr Enhancement Multithreading Parallelism in pandas Performance Memory or execution speed performance

Comments

@Vysybyl
Copy link

Vysybyl commented Apr 14, 2021

Is your feature request related to a problem?

DataFrame.corr(method="spearman") is extremely slow.
method="pearson" is quite slow too.
I can see from my machine resource monitor that the implementation is single threaded. Is it a design choice? If so, there should be at least an optional argument to parallelize it (at C++ level, of course).
I did not check the actual code implementing this method.

Describe the solution you'd like

scipy.stats.spearmanr implements this computation on a numpy array in 1/20 of the time in my 6-core machine.

API breaking implications

None.

Describe alternatives you've considered

Add an optional argument (ex. "parallelize"=[True, False]) so that you give the user this option.
Then, the method should either be reimplemented from scratch at C++ level or we must use the existing scipy.stats function
on the DataFrame.values, wrapping the returned array in a new DataFrame.

Additional context

IMPORTANT: DataFrame.corr and spearmanr gives slightly different results (some kind of small rounding error of about 10e-15)

import numpy as np
from scipy.stats import spearmanr
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 2000))
pd_corr = df.corr(method='spearman')  # a few seconds
scipy_corr, p_value = spearmanr(df.values)  # <1 sec

np.equal(pd_corr.values, scipy_corr)  # False
np.sum(np.abs(corr_m.values - corr_m_sci) > 1e-15)  # 0
@Vysybyl Vysybyl added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 14, 2021
@jreback
Copy link
Contributor

jreback commented Apr 14, 2021

@Vysybyl you're welcome to contribute an implementation

@lithomas1 lithomas1 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 15, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Apr 15, 2021
@lithomas1
Copy link
Member

@Vysybyl We use cython for the corr code not C++. The relevant code can be found here.

def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None):

@mzeitlin11 mzeitlin11 self-assigned this Jun 7, 2021
@mzeitlin11 mzeitlin11 mentioned this issue Jun 7, 2021
3 tasks
@jreback jreback modified the milestones: Contributions Welcome, 1.3 Jun 8, 2021
@GF-Huang
Copy link

GF-Huang commented Jul 9, 2021

So how to parallel?

@mzeitlin11 mzeitlin11 mentioned this issue Jul 28, 2021
4 tasks
@astrojuanlu
Copy link

Was this fixed in #42761?

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Jan 23, 2023
@jbrockmendel
Copy link
Member

Looks like we have a min_periods keyword that scipy doesnt. Other than that i don't see why we couldn't do e.g

try:
    import scipy.stats
except ImportError:
    result = do_what_we_do_now()
else:
    result = scipy.stats.spearmanr(values)

@jbrockmendel jbrockmendel added the Multithreading Parallelism in pandas label Feb 25, 2023
@CangyuanLi
Copy link

Hi, I am interested in contributing to this issue! In my own project, I just modify the code slightly to remove the nested loop and use cython.parallel.prange. However, this requires OpenMP, which if I understand correctly Pandas doesn't rely on at the moment? If this isn't an issue, I would be happy to submit a PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff cov/corr Enhancement Multithreading Parallelism in pandas Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

9 participants