Memory efficiency #4
Thank you for building this package; it's been great for me so far. There's noticeable sluggishness for me when aggregating. I want to make the operations faster but keep the original dataframe unmodified (since I don't want to have to reload it if I mess up). After looking at the docs, here's what I came up with:

```python
truncate_to_hour = lambda ts: ts.replace(minute=0, second=0, microsecond=0)

options.set_option('modify_input_data', True)
df = (stocks.copy() >>
      mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
      group_by('timestamp') >>
      summarize(high='max(high)', low='min(low)',
                open='first(open)', close='last(close)',
                volume='sum(volume)', trades='sum(trades)'))
options.set_option('modify_input_data', False)
```

Maybe this feature could be something like the following?

```python
with plydata.inplace():
    (stocks.copy() >>
     mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
     group_by('timestamp') >>
     summarize(high='max(high)', low='min(low)',
               open='first(open)', close='last(close)',
               volume='sum(volume)', trades='sum(trades)'))
```

Or maybe just something like this?

```python
(inplace(stocks.copy()) >>
 mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
 group_by('timestamp') >>
 summarize(high='max(high)', low='min(low)',
           open='first(open)', close='last(close)',
           volume='sum(volume)', trades='sum(trades)'))
```

I was thinking
There is already a way to use a context manager with the options:

```python
from plydata.options import options

with options(modify_input_data=True):
    (stocks.copy() >>
     mutate(timestamp='timestamp.apply(truncate_to_hour)') >>
     group_by('timestamp') >>
     summarize(high='max(high)', low='min(low)',
               open='first(open)', close='last(close)',
               volume='sum(volume)', trades='sum(trades)'))
```

I think that is also not convenient enough. Wrapping the dataframe, i.e.
Thanks for the link; using a context manager is much better than what I came up with. Personally, I like that you borrowed the

Is that why you're saying that it's a bonus to do away with

By the way, I notice you're the only contributor. What kind of contributions (if any) would be helpful to you/this project right now?
The

You have to wrap multi-line statements in parens, and when I do this I always put

Overloading the

The API is rather complete, except if dplyr comes up with anything worth stealing, or there is a solution to a common Pandas annoyance that is appropriate for this library. The way I have done it so far is to implement features when I recognise/need the convenience they bring.

The pressing issues are those flagged as enhancements; at the moment all have to do with performance. I think the two points raised by this issue would bring the most improvement. However, I want to have proper benchmarking in place first. Wherever there is an option, I do not want to sacrifice performance for convenience.
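On the benchmarking point: even a rough micro-benchmark makes the copy overhead visible. The sketch below is illustrative only (it is not plydata's benchmark suite); it uses `timeit` to compare a mutation that copies the frame first against one that modifies it in place:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(1_000_000)})

def with_copy():
    # What the default (modify_input_data=False) behaviour implies
    d = df.copy()
    d["x"] = d["x"] * 2
    return d

def in_place():
    # What modify_input_data=True allows: no defensive copy
    df["x"] = df["x"] * 2
    return df

t_copy = timeit.timeit(with_copy, number=20)
t_inplace = timeit.timeit(in_place, number=20)
print(f"copy: {t_copy:.3f}s  in-place: {t_inplace:.3f}s")
```

The gap grows with the number of columns touched and the number of verbs in the pipeline, since each verb may otherwise trigger its own copy.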
- Limit the creation of `group_by` dataframes. There are cases where the group indices are sufficient, e.g. for the summarise verbs. This may be possible for all verbs that cannot nest other verb operations, i.e. not possible with `do`.
- Provide a way for a whole pipeline of manipulations to limit the copying of dataframes. Only the input dataframe needs to be preserved. The current way (the `modify_input_data` option) that approaches this effect is cumbersome, and the user has to provide a copy of the input data. Maybe with a function.