Memory consumed for a larger dataset is not freed post vegalite_to_png call #143

Open
manurampandit opened this issue Dec 6, 2023 · 11 comments


manurampandit commented Dec 6, 2023

Issue
A memory leak was observed while converting a Vega-Lite spec to a PNG image with vegalite_to_png.

Issue description
The overall memory usage inside the container increased and then settled at that higher level, i.e. it did not decrease. I tried explicit gc.collect() calls for the variables used, but it did not help much.
Since this behavior is specific to a particular kind of data, a script to randomly generate similar data is included below.
The steps to reproduce list the two steps required. I will see if I can link a video recording showing the overall memory footprint increase.
My guess is that the GIL is not freeing up the memory held during the execution (or there is some leakage there).
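
For reference, the cleanup attempt was along these lines (a sketch; the exact code was not shared, and the variable names here are hypothetical):

import gc

# Hypothetical names for the large objects created during the conversion.
del vl_spec, json_result, png_data
gc.collect()  # forces a collection pass; the process RSS still did not drop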

Steps to reproduce:

  1. Generate the random dataset with the following Python script:
import random
import pandas as pd

def generate_random_ref_id():
    return random.randint(10000000000, 99999999999)

def generate_random_text():
    letters = 'abcdefghijklmnopqrstuvwxyz'
    return ''.join(random.choice(letters) for _ in range(10))

def generate_random_number():
    return random.randint(1, 100)

def generate_row():
    param1 = f"(refId: {generate_random_ref_id()}, refIdR: '{generate_random_text()}:{generate_random_text()}:{generate_random_ref_id()}', paramA: None, Param2: None, time: {generate_random_ref_id()}, address: '{generate_random_text()}.{generate_random_text()}.{generate_random_text()}', param3: '{generate_random_text()}', context: '{generate_random_text()}', locP: b'\\x{generate_random_text()}\\x{generate_random_text()}', id: {generate_random_ref_id()}, paramC: None, paramD: None, paramE: None, paramF: '{generate_random_text()}-{generate_random_text()}', paramF: None, paramG: (ser: {generate_random_ref_id()}, serB: '{generate_random_text()}:{generate_random_text()}', locA: '{generate_random_text()}', locB: '{generate_random_text()}', cvasd: {generate_random_number()}, locD: '{generate_random_text()}', locEParam: '{generate_random_text()}.{generate_random_text()}.{generate_random_text()}-{generate_random_ref_id()}'), paramH: (context: '{generate_random_text()}'), ohterApp: (testLoc: '{generate_random_text()}:(cn{generate_random_text()}-{generate_random_ref_id()})', testV: '{generate_random_ref_id()}.{generate_random_ref_id()}.{generate_random_ref_id()}'), paramJ: None, paramK: '{generate_random_text()}:{generate_random_text()}-{generate_random_text()}-{generate_random_text()}', paramL: None, paramM: None, paramNnumbergoinghere: None, otherTimeDuration: {generate_random_ref_id()}, paramOsdfadfa: None, paramPsdfasdf: None, paramQsdasdfa: None)"

    param8 = f"(XId: '{generate_random_text()}-{generate_random_text()}-{generate_random_text()}-{generate_random_text()}-{generate_random_text()}', parama: None, paramB: None, paramC: '{generate_random_text()}', paramD: 'https://www.google.com/search?q=react+docs&otherparam={generate_random_text()}', paramE: '{generate_random_text()}', paramF: None, paramF: '{generate_random_text()}', paramG: 'https://www.google.com/search?q=react+docs&otherparam={generate_random_text()}', paramH: '{generate_random_text()} {generate_random_text()}. {generate_random_text()} ({generate_random_text()}, {generate_random_text()}) : {generate_random_text()}', paramH: None, paramI: None, paramJ: None, paramK: None, paramL: None, paramM: None)"

    param5 = random.choice(['some', 'thing'])

    param6 = random.choice(['other', 'sense'])
    param7 = random.choice(['cute', 'not_cute'])

    paramXsdfa = str(generate_random_ref_id())

    param9 = random.randint(-1, 10)  # -1 to 10
    param10 = "2024-10-10-08"

    row = [param1, param8, param5, param6, param7, {'paramXsdfa': paramXsdfa}, param9, param10]

    return row

# Generate 50,000 rows
data = []
for _ in range(50000):
    data.append(generate_row())

# Create a DataFrame
df = pd.DataFrame(data, columns=['param1', 'param8', 'param5', 'param6', 'param7', 'paramXsdfa', 'param9', 'param10'])

# Export to CSV
df.to_csv('git/generated_data_50k.csv', index=False)

  2. Run the following code to render the image with vegalite_to_png:
import pandas as pd
from vl_convert import vegalite_to_png
import json
spec = '''
{
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "description": "A simple scatterplot chart with embedded data.",
    "width": "container",
    "height": "container",
    "selection": {
        "grid": {
            "type": "interval",
            "bind": "scales"
        }
    },
    "mark": "point",
    "encoding": {
        "x": {
            "field": "param1",
            "type": "nominal",
            "axis": {
                "labelAngle": 0
            },
            "sort": null
        },
        "y": {
            "field": "param5",
            "type": "nominal",
            "sort": null
        }
    },
    "data": {
        "values": [
            
        ]
    }
}
'''

vl_spec = json.loads(spec)
# the spec sets width/height to 'container', which has no meaning for static image export, so set explicit pixel dimensions
vl_spec['width'] = 800
vl_spec['height'] = 600

data = pd.read_csv('git/generated_data_50k.csv')  # the file written in step 1

json_result = json.loads(data.to_json(orient='table'))
vl_spec['data'] = {'values': json_result['data']}


png_data = vegalite_to_png(vl_spec)
print(png_data)
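
To quantify the effect, a before/after RSS measurement can be wrapped around the call (a sketch; psutil is an extra dependency not used in the original report):

import os
import psutil
from vl_convert import vegalite_to_png

proc = psutil.Process(os.getpid())

rss_before = proc.memory_info().rss
png_data = vegalite_to_png(vl_spec)
del png_data
rss_after = proc.memory_info().rss

# The delta stays high even after the result is dropped.
print(f"RSS grew by {(rss_after - rss_before) / 1024**2:.1f} MiB")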


Temporary hack to get around this:

  1. Create a file (named 'my_vl_process.py') that runs vegalite_to_png in a separate process:
import sys
import json
from vl_convert import vegalite_to_png

def main(vl_spec_file_path):
    with open(vl_spec_file_path, 'r') as file:
        vl_spec = json.load(file)

    # Render in this short-lived process; all of its memory is returned
    # to the OS when the process exits.
    png_data = vegalite_to_png(vl_spec)
    # Write the raw PNG bytes to stdout (print() would emit the bytes repr).
    sys.stdout.buffer.write(png_data)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: python my_vl_process.py <vl_spec_file_path>")

    main(sys.argv[1])

  2. Run the above script via the subprocess module:
import subprocess
import json
import tempfile

# Write the spec to a temporary file that outlives the 'with' block.
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as temp_file:
    json.dump(vl_spec, temp_file)

# Get the file path of the temporary file
vl_spec_file_path = temp_file.name

# Define the command to run as a list
command = ["python", "git/my_vl_process.py", vl_spec_file_path]

# Run the command and capture its stdout as raw bytes
# (text=True would mangle the binary PNG data)
result = subprocess.run(command, stdout=subprocess.PIPE)

png_data = result.stdout
print(f"received {len(png_data)} bytes of PNG data")

This overcomes the memory problem by doing the conversion in a separate process, whose memory is fully returned to the OS when it exits; in the default in-process case, the memory remains held in the calling process.
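
An equivalent sketch using the standard multiprocessing module avoids the helper file and temp file entirely (this variant is my own suggestion, not from the original report):

import multiprocessing as mp
from vl_convert import vegalite_to_png

def _render(vl_spec, queue):
    # Runs in a child process; its memory is reclaimed by the OS on exit.
    queue.put(vegalite_to_png(vl_spec))

def vegalite_to_png_isolated(vl_spec):
    queue = mp.Queue()
    proc = mp.Process(target=_render, args=(vl_spec, queue))
    proc.start()
    png_data = queue.get()  # raw PNG bytes; read before join() to avoid blocking
    proc.join()
    return png_data

On platforms that use the spawn start method (e.g. macOS, Windows), the call must happen under an if __name__ == "__main__": guard.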

@manurampandit (Author)

vl-convert-clip.mov

The video is clipped in the middle to fit GitHub's 10 MB limit.

@jonmmease (Collaborator)

Hi @manurampandit, thanks for the report. What version of vl-convert-python do you have installed?

import vl_convert as vlc
print(vlc.__version__)

@manurampandit (Author)

I tried this on 0.14.0. I can try the latest one too.

@jonmmease (Collaborator)

Yeah, try the latest. There was a memory leak fix that went into 0.14.0, so that's not the issue. We have updated the JavaScript runtime since 0.14.0, so there's a chance that will help.

@manurampandit (Author)

I tried this with the latest 1.2.0 version too. There is slightly less memory leakage, but it's still significant. Running the above with a 100k-row dataset resulted in around 1 GB held.

@jonmmease (Collaborator)

I think the reason this example is so slow to render is that the x-axis is a string with type nominal, so we end up needing to draw a tick label for every row of the dataset. Was this your intention? If not, it might be helpful to create an example with a large dataset that doesn't need to render so much text.
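
For example, a large purely quantitative dataset along these lines would stress the data path without drawing a tick label per row (a sketch, not from the thread):

import random
from vl_convert import vegalite_to_png

# 50,000 numeric points: quantitative axes only need a handful of tick labels.
values = [{"x": random.random(), "y": random.random()} for _ in range(50000)]

vl_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "width": 800,
    "height": 600,
    "mark": "point",
    "encoding": {
        "x": {"field": "x", "type": "quantitative"},
        "y": {"field": "y", "type": "quantitative"},
    },
    "data": {"values": values},
}

png_data = vegalite_to_png(vl_spec)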

@manurampandit (Author)

I understand the reason for the slowness (data not in the right format), which I expect some users may cause accidentally. My primary intention is to get rid of the memory leak. As I mentioned earlier, there is a workaround for now: invoking in a separate process. I tried looking into the GIL, but I don't have much experience there. I would love to contribute back, but I need some pointers on where to look.

@jonmmease (Collaborator)

I did play with this a bit, and wasn't able to track down where the memory is being held.

What's notable is that, from my testing, this isn't a leak as repeated usage doesn't keep increasing the memory consumption. But the memory used doesn't reduce after each usage has completed. This may suggest there's a cache somewhere, or that the V8 JavaScript runtime isn't performing garbage collection after each usage.
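
A loop like the following illustrates that distinction: RSS jumps on the first conversion and then stays roughly flat across repeats, rather than growing linearly (a sketch; psutil is assumed for measurement):

import os
import psutil
from vl_convert import vegalite_to_png

proc = psutil.Process(os.getpid())

for i in range(5):
    vegalite_to_png(vl_spec)  # result discarded immediately
    rss_mib = proc.memory_info().rss / 1024**2
    # Expected pattern from the observations above: a large jump after
    # call 0, then near-constant RSS on subsequent calls.
    print(f"after call {i}: RSS = {rss_mib:.0f} MiB")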

I think a next step would be to see if we can reproduce this using the vl-convert-rs Rust crate directly to take Python out of the equation.

@manurampandit (Author)

Agreed on the behavior: the memory consumed remains almost constant (after the first call). I'll try to isolate the issue by calling vl-convert-rs directly and will update this thread.

@jonmmease (Collaborator)

Playing with this a little more, I think the cause may be that the memory allocator the Rust code is using is not releasing memory back to the OS.

What operating system are you using?

@jonmmease (Collaborator)

On macOS, I'm seeing that a lot of the memory is released around 3 minutes after the increase caused by the image export. Could you check whether you see this behavior as well? I'm not sure there's anything we can do to speed this up.
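
One way to check for this delayed release is to sample RSS for a few minutes after the export (a sketch; psutil assumed):

import os
import time
import psutil

proc = psutil.Process(os.getpid())

# After vegalite_to_png() returns, watch RSS for 5 minutes;
# on macOS a large drop was observed roughly 3 minutes in.
for elapsed in range(0, 301, 15):
    print(f"t={elapsed:3d}s  RSS = {proc.memory_info().rss / 1024**2:.0f} MiB")
    time.sleep(15)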
