ENH: parallelize DataFrame.corr #40956

Vysybyl · 2021-04-14T22:47:18Z

Is your feature request related to a problem?

DataFrame.corr(method="spearman") is extremely slow.
method="pearson" is quite slow too.
I can see from my machine resource monitor that the implementation is single threaded. Is it a design choice? If so, there should be at least an optional argument to parallelize it (at C++ level, of course).
I did not check the actual code implementing this method.

Describe the solution you'd like

scipy.stats.spearmanr implements this computation on a numpy array in 1/20 of the time in my 6-core machine.

API breaking implications

None.

Describe alternatives you've considered

Add an optional argument (ex. "parallelize"=[True, False]) so that you give the user this option.
Then, the method should either be reimplemented from scratch at C++ level or we must use the existing scipy.stats function
on the DataFrame.values, wrapping the returned array in a new DataFrame.

Additional context

IMPORTANT: DataFrame.corr and spearmanr gives slightly different results (some kind of small rounding error of about 10e-15)

import numpy as np
from scipy.stats import spearmanr
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 2000))
pd_corr = df.corr(method='spearman')  # a few seconds
scipy_corr, p_value = spearmanr(df.values)  # <1 sec

np.equal(pd_corr.values, scipy_corr)  # False
np.sum(np.abs(corr_m.values - corr_m_sci) > 1e-15)  # 0

jreback · 2021-04-14T23:16:23Z

@Vysybyl you're welcome to contribute an implementation

lithomas1 · 2021-04-15T00:54:27Z

@Vysybyl We use cython for the corr code not C++. The relevant code can be found here.

pandas/pandas/_libs/algos.pyx

Line 324 in 84d9c5e

def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None):

GF-Huang · 2021-07-09T06:51:17Z

So how to parallel?

astrojuanlu · 2022-03-10T09:59:07Z

Was this fixed in #42761?

jbrockmendel · 2023-02-15T02:54:55Z

Looks like we have a min_periods keyword that scipy doesnt. Other than that i don't see why we couldn't do e.g

try:
    import scipy.stats
except ImportError:
    result = do_what_we_do_now()
else:
    result = scipy.stats.spearmanr(values)

CangyuanLi · 2024-04-28T03:49:55Z

Hi, I am interested in contributing to this issue! In my own project, I just modify the code slightly to remove the nested loop and use cython.parallel.prange. However, this requires OpenMP, which if I understand correctly Pandas doesn't rely on at the moment? If this isn't an issue, I would be happy to submit a PR!

Vysybyl added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 14, 2021

lithomas1 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 15, 2021

lithomas1 added this to the Contributions Welcome milestone Apr 15, 2021

mzeitlin11 self-assigned this Jun 7, 2021

mzeitlin11 mentioned this issue Jun 7, 2021

PERF: nancorr_spearman #41857

Merged

3 tasks

jreback modified the milestones: Contributions Welcome, 1.3 Jun 8, 2021

mzeitlin11 mentioned this issue Jun 9, 2021

PERF: nancorr_spearman fastpath #41885

Merged

mzeitlin11 mentioned this issue Jul 28, 2021

PERF: nancorr pearson #42761

Merged

4 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel added the Performance Memory or execution speed performance label Jan 23, 2023

jbrockmendel added the Multithreading Parallelism in pandas label Feb 25, 2023

jbrockmendel added the cov/corr label Mar 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: parallelize DataFrame.corr #40956

ENH: parallelize DataFrame.corr #40956

Vysybyl commented Apr 14, 2021 •

edited

jreback commented Apr 14, 2021

lithomas1 commented Apr 15, 2021

GF-Huang commented Jul 9, 2021

astrojuanlu commented Mar 10, 2022

jbrockmendel commented Feb 15, 2023

CangyuanLi commented Apr 28, 2024

ENH: parallelize DataFrame.corr #40956

ENH: parallelize DataFrame.corr #40956

Comments

Vysybyl commented Apr 14, 2021 • edited

Is your feature request related to a problem?

Describe the solution you'd like

API breaking implications

Describe alternatives you've considered

Additional context

jreback commented Apr 14, 2021

lithomas1 commented Apr 15, 2021

GF-Huang commented Jul 9, 2021

astrojuanlu commented Mar 10, 2022

jbrockmendel commented Feb 15, 2023

CangyuanLi commented Apr 28, 2024

Vysybyl commented Apr 14, 2021 •

edited