
two dataframes with different values hitting cache incorrectly #7

monstrorivas opened this issue Apr 15, 2020 · 3 comments

@monstrorivas

Are pandas dataframes supported as function arguments in a @cached decorated function?

I tried to simplify this example with a smaller dataframe, but @cached does seem to behave as one would expect for smaller dataframes.

However, when I tried the minimal code below with the attached data, I ran into a problem where two clearly different dataframes are interpreted as identical by the @cached decorated function. Thus, df2 never makes it through which_df; instead it gets the value from the cache, since the cache assumes df2 is equal to df1 (and it is not!).

This is the test to replicate. Please use the attached data to get the unexpected behavior explained in this issue.

import pandas as pd
from memoization import cached

@cached()
def which_df(df):
    # print("got inside function")
    return df.name

df1 = pd.read_pickle('memoization_test.pkl')
df1.name = "This is DF No. 1"
df2 = df1.interpolate()
df2.name = "This is DF No. 2"

df1.equals(df2)   # ==> False, since they are not identical
print(which_df(df1) + ', and it should be DF No. 1')
print(which_df(df2) + ', BUT it should be DF No. 2')

memoization_test.zip

@lonelyenvoy lonelyenvoy added bug Something isn't working help wanted Extra attention is needed labels May 12, 2020
@lonelyenvoy
Owner

Hi monstrorivas,

Thanks for your issue. This was a bug: in order to memoize a pandas dataframe, memoization converted it to a string using str() and assumed that this string exactly represented the internal state of the dataframe. That is true for built-in types, but not for dataframes, because a dataframe omits part of its contents when converted to a string. Take your data for example:

>>> print(str(df1))
                            with_nans
2019-01-12 03:20:30-06:00  655.559113
2019-01-12 03:21:00-06:00  658.763224
2019-01-12 03:21:30-06:00  655.639191
2019-01-12 03:22:00-06:00  651.353745
2019-01-12 03:22:30-06:00  648.590169
...                               ...
2019-02-11 13:18:00-06:00  668.615855
2019-02-11 13:18:30-06:00  673.101573
2019-02-11 13:19:00-06:00  675.024038
2019-02-11 13:19:30-06:00  676.706156
2019-02-11 13:20:00-06:00  663.969849

[87600 rows x 1 columns]

So two different dataframes can be considered equal when you merely compare them via str(), as long as their first 5 rows and last 5 rows are the same.

To address this issue, I have published a new release v0.3.1. Please run pip install --upgrade memoization to upgrade and read the tutorial about custom cache keys so that pandas dataframes can be properly cached. Feel free to ask for help if needed.
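For illustration, a key maker for dataframes could look something like the sketch below. This is just one option: hashing the contents with pandas.util.hash_pandas_object is my choice here, not something the library requires.

import hashlib
import pandas as pd
from memoization import cached

def df_key_maker(df):
    # Hash the full contents (values and index), so two dataframes that
    # differ anywhere produce different cache keys.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.md5(row_hashes.values.tobytes()).hexdigest()

@cached(custom_key_maker=df_key_maker)
def which_df(df):
    return df.name

The key maker receives the same arguments as the decorated function and must return a hashable value; the hex digest above satisfies that.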

@monstrorivas
Author

OK, that makes sense. As a workaround, I pickle the dataframe before passing it to the memoized function and then deserialize it inside the function. Would something like that work in your implementation, instead of using str()?

Could you give me an example of what to use for the custom_key_maker for a dataframe?

@judahrand

judahrand commented Oct 19, 2021

Why not assemble all arguments into a single dictionary and pickle.dumps it? The result can then be hashed with hashlib.md5, xxhash.xxh3_64, or similar.
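A rough sketch of that idea (sorting kwargs so the key does not depend on keyword-argument order is my own addition):

import hashlib
import pickle

def make_key(*args, **kwargs):
    # Serialize every argument into one deterministic structure, then hash the bytes.
    payload = pickle.dumps((args, sorted(kwargs.items())), protocol=pickle.HIGHEST_PROTOCOL)
    return hashlib.md5(payload).hexdigest()

Note that pickle output is only guaranteed to be stable within a single process and Python version, which is enough for an in-memory cache like this.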
