groupby and summarize is extremly slow for large number of groups #19

talegari · 2020-03-14T10:06:49Z

Why is groupby + summarize so slow compared to simple pandas. Please find the reproducible example.

import pandas as pd
import numpy as np
from plydata import *

df = pd.DataFrame({"a": np.random.choice(100000, 1000000, replace = True)
                   , "b": np.random.normal(10, 2, 1000000)}
                 )
df

%%time
# with plydata
res = (df >>
  group_by("a") >>
  summarise(c = "mean(b)" , d = "std(b)")
  )
res # took around ~ 7 mins

%%time
# plain pandas
df.groupby("a").agg(c = ("b", np.mean), d = ("b", np.std)).reset_index() # took 122 ms

The text was updated successfully, but these errors were encountered:

has2k1 · 2020-03-14T12:14:16Z

Thanks for the example. Yes this is slow and related to part 1 of this issue. The reason is pandas avoids creating group dataframes which is expensive. I plan to look at this problem after the next release v0.4.0.

talegari · 2020-03-17T06:34:23Z

Appreciate your commitment.

Currently, how is the groupby done? Would using the pandas's inherent groupby structure help, as it would mean translating pd.groupby.agg to group_by >> summarise, pd.groupby.assign to group_by >> mutate.

I would be happy to contribute.

has2k1 · 2020-03-17T15:23:19Z

Currently, how is the groupby done?

It is the straight forward (and naive) way, a loop over the group dataframes. For each group calculate the summary statistic, after which concatenate the individual summaries into the final result. So far this is the same process for all calculations that involve grouped data i.e mutate, create, do, arrange and summarise.

I do not know how easy it would be translate summarise to pd.groupby.agg because summarise is a more handy function, i.e. you can do this

summarise(c = "mean(a+b)" , d = "mean(np.sin(a+b))")

I do not think that is possible with groupby.agg!

The real problem is, completing a pandas groupby operation to obtain the group dataframes is very slow. groupby.agg labours to avoid it and where possible it substitutes and uses it's own cython functions to compute the aggregates i.e if your function is np.mean, pandas uses a private smarter implementation to compute the mean of the groups where by it does not fully partition the data.

I think part of the solution will be to recognise the simple uses of summarise and translate those to groupby.agg .

has2k1 added the enhancement label Mar 14, 2020

pwwang mentioned this issue Jul 9, 2021

Optimize DataFrameGroupBy.apply() pwwang/datar#27

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

groupby and summarize is extremly slow for large number of groups #19

groupby and summarize is extremly slow for large number of groups #19

talegari commented Mar 14, 2020

has2k1 commented Mar 14, 2020

talegari commented Mar 17, 2020

has2k1 commented Mar 17, 2020

groupby and summarize is extremly slow for large number of groups #19

groupby and summarize is extremly slow for large number of groups #19

Comments

talegari commented Mar 14, 2020

has2k1 commented Mar 14, 2020

talegari commented Mar 17, 2020

has2k1 commented Mar 17, 2020