
two dataframes with different values hitting cache incorrectly #7

monstrorivas opened this issue Apr 15, 2020 · 3 comments

@monstrorivas

Are pandas dataframes supported as function arguments in a @cached decorated function?

I tried to simplify this example with a smaller dataframe, but @cached does seem to behave as one would expect for smaller dataframes.

However, when I tried the minimal code below with the attached data, I ran into a problem where two clearly different dataframes are interpreted as identical by the @cached decorated function. Thus, df2 never makes it through which_df; instead it gets the value from the cache, since the cache assumes df2 is equal to df1 (and it is not!).

This is the test to replicate. Please use the attached data to get the unexpected behavior explained in this issue.

import pandas as pd
from memoization import cached

@cached()
def which_df(df):
    # print("got inside function")
    return df.name

df1 = pd.read_pickle('memoization_test.pkl')
df1.name = "This is DF No. 1"
df2 = df1.interpolate()
df2.name = "This is DF No. 2"

df1.equals(df2)   # ==> False, since they are not identical
print(which_df(df1) + ', and it should be DF No. 1')
print(which_df(df2) + ', BUT it should be DF No. 2')

memoization_test.zip

@lonelyenvoy lonelyenvoy added bug Something isn't working help wanted Extra attention is needed labels May 12, 2020
@lonelyenvoy
Owner

Hi monstrorivas,

Thanks for your issue. This was a bug: in order to memoize a pandas dataframe, memoization converted it to a string using str() and assumed that this string exactly represented the internal state of the dataframe. That is true for built-in types, but not for dataframes, because a dataframe omits part of its contents when converted to a string. Take your data for example:

>>> print(str(df1))
                            with_nans
2019-01-12 03:20:30-06:00  655.559113
2019-01-12 03:21:00-06:00  658.763224
2019-01-12 03:21:30-06:00  655.639191
2019-01-12 03:22:00-06:00  651.353745
2019-01-12 03:22:30-06:00  648.590169
...                               ...
2019-02-11 13:18:00-06:00  668.615855
2019-02-11 13:18:30-06:00  673.101573
2019-02-11 13:19:00-06:00  675.024038
2019-02-11 13:19:30-06:00  676.706156
2019-02-11 13:20:00-06:00  663.969849

[87600 rows x 1 columns]

So two different dataframes can be considered equal when you merely compare them via str(), as long as their first 5 rows and last 5 rows are the same.

To address this issue, I have published a new release v0.3.1. Please run pip install --upgrade memoization to upgrade and read the tutorial about custom cache keys so that pandas dataframes can be properly cached. Feel free to ask for help if needed.
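For illustration, a key maker for dataframes could look something like the sketch below. This is just one option: hashing the contents with pandas.util.hash_pandas_object is my choice here, not something the library requires.

import hashlib
import pandas as pd
from memoization import cached

def df_key_maker(df):
    # Hash the full contents (values and index), so two dataframes that
    # differ anywhere produce different cache keys.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.md5(row_hashes.values.tobytes()).hexdigest()

@cached(custom_key_maker=df_key_maker)
def which_df(df):
    return df.name

The key maker receives the same arguments as the decorated function and must return a hashable value; the hex digest above satisfies that.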

@monstrorivas
Author

OK, that makes sense. As a workaround, I pickle the dataframe before passing it to the memoized function and then deserialize it inside the function. Would something like that work in your implementation, instead of using str()?

Could you give me an example of what to use for the custom_key_maker for a dataframe?

@judahrand

judahrand commented Oct 19, 2021

Why not assemble all arguments into a single dictionary and pickle.dumps it? The result can then be hashed with hashlib.md5, xxhash.xxh3_64, or similar.
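A rough sketch of that idea (sorting kwargs so the key does not depend on keyword-argument order is my own addition):

import hashlib
import pickle

def make_key(*args, **kwargs):
    # Serialize every argument into one deterministic structure, then hash the bytes.
    payload = pickle.dumps((args, sorted(kwargs.items())), protocol=pickle.HIGHEST_PROTOCOL)
    return hashlib.md5(payload).hexdigest()

Note that pickle output is only guaranteed to be stable within a single process and Python version, which is enough for an in-memory cache like this.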
