Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory efficiency #4

Open
has2k1 opened this issue Nov 9, 2017 · 6 comments
Open

Memory efficiency #4

has2k1 opened this issue Nov 9, 2017 · 6 comments

Comments

@has2k1
Copy link
Owner

has2k1 commented Nov 9, 2017

  1. Limit the creation of group_by dataframes. There are cases where the group indices are sufficient, e.g. for the summarise verbs. This may be possible for all verbs that cannot nest other verb operations, i.e not possible with do.

  2. Provide a way for a whole pipeline of manipulations to limit the copying of dataframes. Only the input dataframe needs to be preserved. The current way (modify_input_data option) that approaches this effect is cumbersome and the user has to provide a copy of the input data. Maybe with a function.

ply(
    data,
    define(y='x'),
    define(z='2*y'),
    ...
)
@keeler
Copy link

keeler commented Dec 22, 2017

Thank you for building this package, it's great for me so far.

There's noticeable sluggishness for me when aggregating. I want to make the operations faster but keep the original dataframe unmodified (since I don't want to have to reload it if I mess up). After looking at the docs here's what I came up with:

truncate_to_hour = lambda ts: ts.replace(minute=0, second=0, microsecond=0)

options.set_option('modify_input_data', True)
df = (stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')
)
options.set_option('modify_input_data', False)

Maybe this feature could be something like the following?

with plydata.inplace():
    stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')

Or maybe just something like this?

inplace(stocks.copy()) >>
mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
group_by('timestamp') >>
summarize(high = 'max(high)', low = 'min(low)',
          open = 'first(open)', close = 'last(close)',
          volume = 'sum(volume)', trades = 'sum(trades)')

I was thinking mutate instead of inplace but mutate is already a verb of course.

@has2k1
Copy link
Owner Author

has2k1 commented Dec 23, 2017

There is already a way to use a context manager with the options

from plydata.options import options

with options(modify_input_data=True):
    stocks.copy() >>
    mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
    group_by('timestamp') >>
    summarize(high = 'max(high)', low = 'min(low)',
              open = 'first(open)', close = 'last(close)',
              volume = 'sum(volume)', trades = 'sum(trades)')

I think that is also not convenient enough. Wrapping the dataframe i.e inplace(stocks.copy()) and doing "magic" base on that is not an option. So far the best option is still as noted in point 2 above, it does away with the >> operator which may be a bonus.

@keeler
Copy link

keeler commented Dec 25, 2017

Thanks for the link, using a context manager is much better than what I came up with.

Personally, I like that you borrowed the >> syntax from magrittr/dplyr. It denotes piping the data explicitly, and it's easy to translate "thinking in dplyr" into Python. However, I don't find it particularly elegant in Python due to its interaction with newlines. For example, I'm unable to run your example without either wrapping the plydata statements in parens (as in my first example above) or continuing the statement with \ at the end of each line. Otherwise I get this error:

stocks.copy() >>
                    ^
SyntaxError: invalid syntax

Is that why you're saying that it's a bonus to do away with >> by creating the ply() function? Or some other reason?

By the way, I notice you're the only contributor. What kind of contributions (if any) would be helpful to you/this project right now?

@has2k1
Copy link
Owner Author

has2k1 commented Dec 25, 2017

The ply() method will not replace the >> operator, it will be an alternative albeit with better performance.

You have to wrap multi-line statements in parens, and when I do this I always put >> at the beginning of the line (except when I slip up). Examples here and here.

Overloading the >> operator is "abusing python" so if you care for such things you have a way out.

The API is rather complete, except if dplyr comes up with anything worth stealing or there is a solution to a common Pandas annoyance that is appropriate for this library. The way I have done it so far is to implement the features when I recognise/need the convenience they bring.

The pressing issues are those flagged as enhancement, at the moment all have to do with performance. I think the 2 points raised by this issue would bring the most improvement. However, I want to have proper bench-marking in place first. Wherever there is an option, I do not want to sacrifice performance for convenience.

@has2k1
Copy link
Owner Author

has2k1 commented Mar 19, 2018

Benchmarks

@has2k1
Copy link
Owner Author

has2k1 commented Feb 2, 2020

ply method (part 2 of the issue) has been added. Need to think about part 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants