Memory efficiency #4

has2k1 · 2017-11-09T19:55:10Z

1. Limit the creation of group_by dataframes. There are cases where the group indices are sufficient, e.g. for the summarise verbs. This may be possible for all verbs that cannot nest other verb operations, i.e. it is not possible with do.
2. Provide a way for a whole pipeline of manipulations to limit the copying of dataframes. Only the input dataframe needs to be preserved. The current approach (the modify_input_data option) comes close to this effect, but it is cumbersome and the user has to provide a copy of the input data. Maybe with a function:

ply(
    data,
    define(y='x'),
    define(z='2*y'),
    ...
)
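The idea behind such a function can be illustrated with a minimal, pure-Python sketch. It is not plydata's implementation: the "dataframe" here is a plain dict of lists, and both `ply` and `define` are toy stand-ins written only for this illustration. The point it shows is that the input is copied once and every verb then modifies that single working copy.

```python
import copy


def ply(data, *verbs):
    """Copy the input once, then let every verb modify that copy in place."""
    working = copy.deepcopy(data)  # the only copy made for the whole pipeline
    for verb in verbs:
        verb(working)              # each toy verb mutates `working` directly
    return working


def define(**columns):
    """Toy verb: add columns computed from expressions over existing columns."""
    def apply(table):
        n_rows = len(next(iter(table.values())))
        for name, expr in columns.items():
            # Evaluate the expression row by row against the current columns.
            table[name] = [
                eval(expr, {}, {col: values[i] for col, values in table.items()})
                for i in range(n_rows)
            ]
    return apply


data = {'x': [1, 2, 3]}
result = ply(data, define(y='x'), define(z='2*y'))
print(result)  # {'x': [1, 2, 3], 'y': [1, 2, 3], 'z': [2, 4, 6]}
print(data)    # {'x': [1, 2, 3]}  (the input is left untouched)
```

With this shape the user never has to call copy() or toggle an option; preserving the input is the function's responsibility.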


keeler · 2017-12-22T23:45:02Z

Thank you for building this package, it's great for me so far.

There's noticeable sluggishness for me when aggregating. I want to make the operations faster but keep the original dataframe unmodified (since I don't want to have to reload it if I mess up). After looking at the docs here's what I came up with:

truncate_to_hour = lambda ts: ts.replace(minute=0, second=0, microsecond=0)

options.set_option('modify_input_data', True)
df = (stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')
)
options.set_option('modify_input_data', False)

Maybe this feature could be something like the following?

with plydata.inplace():
    stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')

Or maybe just something like this?

inplace(stocks.copy()) >>
mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
group_by('timestamp') >>
summarize(high = 'max(high)', low = 'min(low)',
          open = 'first(open)', close = 'last(close)',
          volume = 'sum(volume)', trades = 'sum(trades)')

I was thinking mutate instead of inplace but mutate is already a verb of course.

has2k1 · 2017-12-23T06:53:20Z

There is already a way to use a context manager with the options:

from plydata.options import options

with options(modify_input_data=True):
    stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')

I think that is also not convenient enough. Wrapping the dataframe, i.e. inplace(stocks.copy()), and doing "magic" based on that is not an option. So far the best option is still the one noted in point 2 above; it does away with the >> operator, which may be a bonus.

keeler · 2017-12-25T01:12:39Z

Thanks for the link, using a context manager is much better than what I came up with.

Personally, I like that you borrowed the >> syntax from magrittr/dplyr. It denotes piping the data explicitly, and it's easy to translate "thinking in dplyr" into Python. However, I don't find it particularly elegant in Python due to its interaction with newlines. For example, I'm unable to run your example without either wrapping the plydata statements in parens (as in my first example above) or continuing the statement with \ at the end of each line. Otherwise I get this error:

stocks.copy() >>
                    ^
SyntaxError: invalid syntax

Is that why you're saying that it's a bonus to do away with >> by creating the ply() function? Or some other reason?

By the way, I notice you're the only contributor. What kind of contributions (if any) would be helpful to you/this project right now?

has2k1 · 2017-12-25T06:32:19Z

The ply() method will not replace the >> operator; it will be an alternative, albeit one with better performance.

You have to wrap multi-line statements in parens, and when I do this I always put >> at the beginning of the line (except when I slip up). Examples here and here.
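A small self-contained illustration of that layout follows. The `Verb` class is a toy written for this example, not a plydata class; it only overloads `>>` through `__rrshift__` so that the formatting can be shown without any dependencies. Inside parentheses Python ignores the newlines, so each step can start with `>>`.

```python
class Verb:
    """Toy pipeline step; `data >> Verb(f)` calls f(data)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Called for `data >> verb` when `data` does not handle `>>` itself.
        return self.func(data)


add_one = Verb(lambda values: [v + 1 for v in values])
double = Verb(lambda values: [v * 2 for v in values])

# The parentheses make this one expression spanning several lines.
result = (
    [1, 2, 3]
    >> add_one
    >> double
)
print(result)  # [4, 6, 8]
```

Without the parentheses, a line ending in `>>` is an incomplete statement, which is what produces the SyntaxError quoted earlier in the thread.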

Overloading the >> operator is "abusing Python", so if you care about such things you have a way out.

The API is rather complete, except if dplyr comes up with anything worth stealing or there is a solution to a common Pandas annoyance that is appropriate for this library. The way I have done it so far is to implement the features when I recognise/need the convenience they bring.

The pressing issues are those flagged as enhancement; at the moment, all of them have to do with performance. I think the 2 points raised by this issue would bring the most improvement. However, I want to have proper benchmarking in place first. Wherever there is an option, I do not want to sacrifice performance for convenience.

has2k1 · 2018-03-19T10:06:52Z

Benchmarks

has2k1 · 2020-02-02T20:35:10Z

The ply method (part 2 of the issue) has been added. Part 1 still needs some thought.
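As a rough picture of what part 1 is asking for, the sketch below aggregates columns using only the row positions of each group, without building a separate table per group. It is pure Python over plain lists; `group_indices` and `summarise` are hypothetical helpers for this illustration and do not reflect how plydata or Pandas implement grouping.

```python
from collections import defaultdict


def group_indices(keys):
    """Map each distinct key to the row positions where it occurs."""
    groups = defaultdict(list)
    for position, key in enumerate(keys):
        groups[key].append(position)
    return dict(groups)


def summarise(column, indices, func):
    """Aggregate `column` per group by reading rows through the indices."""
    return {key: func(column[i] for i in rows) for key, rows in indices.items()}


timestamp = ['09:00', '09:00', '10:00', '10:00', '10:00']
high = [5, 7, 6, 9, 8]
volume = [10, 20, 30, 40, 50]

indices = group_indices(timestamp)      # computed once, reused per column
print(indices)                          # {'09:00': [0, 1], '10:00': [2, 3, 4]}
print(summarise(high, indices, max))    # {'09:00': 7, '10:00': 9}
print(summarise(volume, indices, sum))  # {'09:00': 30, '10:00': 120}
```

A verb such as do, which can nest other verbs, needs a real per-group dataframe to hand to them, which is why this shortcut would not apply there.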

has2k1 added enhancement feature labels Nov 25, 2017

has2k1 mentioned this issue Mar 14, 2020

groupby and summarize is extremly slow for large number of groups #19

Open
