Memory leak on writes/merges #2522

Open
echai58 opened this issue May 17, 2024 · 7 comments

Labels
question Further information is requested

Comments

echai58 commented May 17, 2024

Environment

Binding: python


Bug

What happened:
We're noticing steadily rising memory in our processes that write to deltalake. I wrote a minimal reproduction that writes to deltalake in a loop, and the memory usage pattern seems to indicate a leak.

What you expected to happen:
Memory to be reclaimed after writes.

How to reproduce it:
This is the script I tested with:

import datetime
from deltalake import write_deltalake, DeltaTable
import numpy as np
import pandas as pd
import pyarrow as pa

schema = pa.schema(
    [
        ("date", pa.date32()),
        ("c1", pa.string()),
        ("c2", pa.int32()),
        ("v1", pa.float64()),
        ("v2", pa.float64()),
        ("v3", pa.int32()),
    ]
)

def write_table(dt: DeltaTable, date: datetime.date):
    num_rows = 150
    data = pd.DataFrame(
        {
            "date": np.repeat(date, num_rows),
            "c1": [f"str_{i}" for i in np.random.randint(0, 100, num_rows)],
            "c2": np.random.randint(1, 100, num_rows),
            "v1": np.random.rand(num_rows) * 100,
            "v2": np.random.rand(num_rows) * 100,
            "v3": np.random.randint(0, 100, num_rows),
        }
    )
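    # Upsert: update rows matching on (date, c1, c2), insert the rest.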
    dt.merge(
        pa.Table.from_pandas(data, schema=schema),
        "s.date = t.date and s.c1 = t.c1 and s.c2 = t.c2",
        source_alias="s",
        target_alias="t",
    ).when_matched_update_all().when_not_matched_insert_all().execute()
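    # Checkpoint every 10th table version.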
    if dt.version() % 10 == 0:
        dt.create_checkpoint()

def main():
    DeltaTable.create("test", schema)

    for date in pd.date_range("2022-01-01", periods=25):
        for _ in range(50):
            write_table(DeltaTable("test"), date.date())

if __name__ == "__main__":
    main()

More details:
Here's the memray graph:
[memray graph for the merge loop]
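(These graphs can be reproduced by running the script under memray, e.g. memray run repro.py followed by memray flamegraph on the resulting capture file; repro.py is a placeholder name.)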

I also tested this with just write_deltalake(mode="append"), and the issue persists:
[memray graph for the append-only run]
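
The append variant swapped the merge for a plain append, roughly like this (a sketch, using the same imports as the script above; build_data stands in for the same random-DataFrame construction as in write_table):

def write_table_append(date: datetime.date):
    data = build_data(date)  # hypothetical helper: builds the same random DataFrame as write_table
    write_deltalake("test", pa.Table.from_pandas(data, schema=schema), mode="append")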

I saw #2068 and tried setting the env var mentioned there, and got the following (it doesn't seem to help):
[memray graph with the env var set]

echai58 added the bug (Something isn't working) label May 17, 2024
ion-elgreco (Collaborator) commented May 17, 2024

Can you do the following: change your script to sleep for 30 seconds first, then create the table, then sleep again for 30 seconds, and only then start writing in the loop those 50 times?

I think the slow increase in resident size might just be because the table state is updated at the end of every commit, and it includes more info after each write.
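
E.g. something like this, as a sketch on top of your original script (using time.sleep):

import time

def main():
    time.sleep(30)                 # idle baseline before the table exists
    DeltaTable.create("test", schema)
    time.sleep(30)                 # baseline after creation, before any writes
    for _ in range(50):
        write_table(DeltaTable("test"), datetime.date(2022, 1, 1))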

rtyler added the binding/python (Issues for the Python package) label May 17, 2024
echai58 (Author) commented May 20, 2024

Here's the memory graph with the 30-second sleeps:

[memray graph with 30-second sleeps]

echai58 (Author) commented May 20, 2024

@ion-elgreco

This is a graph for 250 merges:
[memray graph: 250 merges]

and for 250 appends:
[memray graph: 250 appends]

I think the slow increase over the 250 appends would correspond to the growing table state metadata. But memory use rises much faster with merges, so maybe it is something particular to merges?

ion-elgreco (Collaborator) commented May 20, 2024

Merge operations probably hold more info, so this looks normal to me

ion-elgreco added the question (Further information is requested) label and removed the bug (Something isn't working) and binding/python (Issues for the Python package) labels May 20, 2024
echai58 (Author) commented May 21, 2024

@ion-elgreco The _last_checkpoint file says:
{"size":360,"size_in_bytes":75770,"version":100}

for the last checkpoint after running a script with 1000 merges, resulting in the following memory increase:
[memray graph after 1000 merges]

The amount the memory increased by seems much larger than the metadata.
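
(For reference, those numbers come straight from the log; a minimal sketch of the check, assuming the same test table path from the repro script. _last_checkpoint lives in the table's _delta_log directory as part of the Delta log layout.)

import json

# Read the pointer to the latest checkpoint from the Delta log.
with open("test/_delta_log/_last_checkpoint") as f:
    print(json.load(f))  # {"size": 360, "size_in_bytes": 75770, "version": 100}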

ion-elgreco (Collaborator) commented May 21, 2024

@echai58 the checkpoint is compressed, and it would never translate 1:1 from disk to memory afaik

echai58 (Author) commented May 21, 2024

@ion-elgreco profiling a script that just instantiates the same delta table gives the following:
[memray graph: table instantiation only]

~13 MB, which is still much less than the >100 MB seen from the merge script.
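
The instantiation-only script was essentially just this (a sketch):

from deltalake import DeltaTable

# Load the existing table, pulling its full state into memory, then exit.
if __name__ == "__main__":
    DeltaTable("test")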
