Use a manager/proxy interface to access temporary storage (temporary directory) #16234

Open
nameexhaustion opened this issue May 15, 2024 · 2 comments
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature P-goal Priority: aligns with long-term Polars goals

@nameexhaustion
Collaborator

nameexhaustion commented May 15, 2024

Description

We currently access temporary storage directly through its paths on disk. We want to introduce an interface that makes it easier to:

  • Perform file cleanups in a structured manner.
  • Use temporary storage as a local cache for downloaded files/datasets.
    • We need to figure out how to handle invalidation (e.g. when the remote file has been updated)
      • Use cloud file metadata field?
      • Maybe have a config to fully invalidate all caches e.g. POLARS_INVALIDATE_CACHES
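
To make the invalidation question concrete, here is a minimal sketch of one possible check. It is not an existing Polars API: `CacheEntryMeta`, `invalidate_all_requested` and `is_cache_valid` are hypothetical names, and the accepted values for `POLARS_INVALIDATE_CACHES` ("1" or "true") are an assumption. The idea is to compare a version tag from the cloud file metadata when both sides have one, and fall back to the last-modified time otherwise.

```rust
use std::env;

/// Hypothetical metadata stored next to each cached file.
#[derive(Debug, Clone, PartialEq)]
pub struct CacheEntryMeta {
    /// Version tag reported by the remote store (e.g. an ETag), if available.
    pub remote_version: Option<String>,
    /// Remote last-modified time, in seconds since the Unix epoch.
    pub remote_last_modified: u64,
}

/// Returns true if the global invalidation switch is set.
/// Assumption: "1" or "true" (case-insensitive) enables it.
pub fn invalidate_all_requested() -> bool {
    match env::var("POLARS_INVALIDATE_CACHES") {
        Ok(v) => v == "1" || v.eq_ignore_ascii_case("true"),
        Err(_) => false,
    }
}

/// Decide whether a cached copy can be reused for the current remote file.
pub fn is_cache_valid(
    cached: &CacheEntryMeta,
    remote: &CacheEntryMeta,
    force_invalidate: bool,
) -> bool {
    if force_invalidate {
        return false;
    }
    match (&cached.remote_version, &remote.remote_version) {
        // Both sides carry a version tag: trust it.
        (Some(a), Some(b)) => a == b,
        // Otherwise fall back to comparing last-modified times.
        _ => cached.remote_last_modified == remote.remote_last_modified,
    }
}
```

A caller would pass `invalidate_all_requested()` as `force_invalidate`, so the environment variable overrides any per-file check.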

Usage scenarios

There are a few different ways we might use this:

  • Caching downloaded cloud files
  • Spilling from operators (e.g. from out-of-core group-by)
@nameexhaustion nameexhaustion added enhancement New feature or an improvement of an existing feature accepted Ready for implementation P-goal Priority: aligns with long-term Polars goals labels May 15, 2024
@ritchie46
Member

This might run on our tokio runtime. Then we could have a static task (one that runs for the duration of the Polars process) that sleeps most of the time and garbage collects once in a while.
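
As a rough illustration of the "sleep, then occasionally garbage collect" loop, here is a sketch that uses a plain `std::thread` instead of a tokio task, purely to keep the example dependency-free. `spawn_gc_worker` is a hypothetical name, and the `collect` closure stands in for whatever the real cleanup would do.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawn a background worker that calls `collect` once per `interval`.
/// The worker exits when the returned sender is used or dropped.
pub fn spawn_gc_worker<F>(interval: Duration, mut collect: F) -> (Sender<()>, JoinHandle<()>)
where
    F: FnMut() + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || loop {
        // Sleep for `interval`, but wake up early on a stop signal.
        match stop_rx.recv_timeout(interval) {
            // Nothing received within the interval: time to collect.
            Err(RecvTimeoutError::Timeout) => collect(),
            // Explicit stop, or the sender was dropped: shut down.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    });
    (stop_tx, handle)
}
```

In a tokio-based version the same shape would presumably become a spawned task with a timer in place of `recv_timeout`; that variant is not shown here.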

@nameexhaustion nameexhaustion added P-medium Priority: medium and removed P-goal Priority: aligns with long-term Polars goals labels May 17, 2024
@ritchie46 ritchie46 added P-goal Priority: aligns with long-term Polars goals and removed P-medium Priority: medium labels May 17, 2024
@ritchie46
Member

Alright, did a brainstorm. I think we have got some ideas.

Assuming our spill/cache directory is ~/.polars/.

We can dump spilled files under a folder whose name combines the process id and the current datetime. This folder can also hold future spill files.
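
A minimal sketch of how such a folder could be named and created. The `<pid>_<unix-seconds>` format is an assumption (the exact datetime encoding is not specified here), and `spill_dir_name` / `create_spill_dir` are hypothetical helpers.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

/// Build a folder name of the form "<pid>_<unix-seconds>".
/// Assumption: seconds since the Unix epoch stand in for "datetime".
pub fn spill_dir_name(pid: u32, created: SystemTime) -> String {
    let secs = created
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0); // clock set before 1970: fall back to 0
    format!("{pid}_{secs}")
}

/// Create the spill folder for the current process under `root`.
pub fn create_spill_dir(root: &Path) -> io::Result<PathBuf> {
    let dir = root.join(spill_dir_name(process::id(), SystemTime::now()));
    fs::create_dir_all(&dir)?;
    Ok(dir)
}
```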

For the caching of the files we should provide a time-to-live, TTL. This TTL can for instance be 1 day for files downloaded from the internet.

During startup we create a task that checks for old pid_datetime folders whose process is no longer alive (e.g. an interrupted process) and for cached files that have surpassed their TTL, and cleans them up.

~/.polars/
    # Spills from the streaming engine. For future reference
    pid_datetime/
    pid_datetime/
    # files with a TTL
    cache/
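
The startup cleanup over this layout could look roughly like the sketch below. It is only an illustration: `clean_root` and `pid_from_dir_name` are hypothetical names, file age is taken from the filesystem modification time (an assumption; a real implementation might store the download time separately), and the liveness check is passed in as a closure because checking whether a pid is alive is platform-specific.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Parse the pid out of a "<pid>_<datetime>" folder name.
pub fn pid_from_dir_name(name: &str) -> Option<u32> {
    let (pid, rest) = name.split_once('_')?;
    if rest.is_empty() {
        return None;
    }
    pid.parse().ok()
}

/// Remove spill folders whose owning process is gone, and files in
/// `cache/` older than `ttl`. Returns the number of removed entries.
pub fn clean_root(
    root: &Path,
    ttl: Duration,
    now: SystemTime,
    is_alive: impl Fn(u32) -> bool,
) -> io::Result<usize> {
    let mut removed = 0;
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        if !entry.file_type()?.is_dir() {
            continue;
        }
        let name = entry.file_name().to_string_lossy().into_owned();
        if name == "cache" {
            for file in fs::read_dir(entry.path())? {
                let file = file?;
                if !file.file_type()?.is_file() {
                    continue;
                }
                let modified = file.metadata()?.modified()?;
                // A modification time in the future counts as age zero.
                let age = now.duration_since(modified).unwrap_or(Duration::ZERO);
                if age > ttl {
                    fs::remove_file(file.path())?;
                    removed += 1;
                }
            }
        } else if let Some(pid) = pid_from_dir_name(&name) {
            if !is_alive(pid) {
                fs::remove_dir_all(entry.path())?;
                removed += 1;
            }
        }
    }
    Ok(removed)
}
```

Passing `now` and `is_alive` as arguments keeps the function testable; one caveat of the scheme itself is that pids can be reused by the OS, which is presumably why the datetime is part of the folder name.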

The spill manager can be a static struct that initially only deals with the downloads, caching and cleanup. I think that we should set an in-process bit during downloading so that we don't start duplicate downloads.
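
One way to picture the static struct and the per-download marker is sketched below. This is an assumption-laden illustration, not a proposed final design: `SpillManager`, `DownloadGuard` and `try_begin_download` are hypothetical names, and the "bit" is modelled as membership in a mutex-protected set keyed by URI, cleared automatically when the guard is dropped.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Hypothetical process-wide manager; only the download marker is shown.
pub struct SpillManager {
    in_progress: Mutex<HashSet<String>>,
}

static MANAGER: OnceLock<SpillManager> = OnceLock::new();

impl SpillManager {
    /// Access the single static instance, creating it on first use.
    pub fn global() -> &'static SpillManager {
        MANAGER.get_or_init(|| SpillManager {
            in_progress: Mutex::new(HashSet::new()),
        })
    }

    /// Mark `uri` as being downloaded. Returns a guard if this caller
    /// now owns the download, or `None` if one is already running.
    pub fn try_begin_download(&self, uri: &str) -> Option<DownloadGuard<'_>> {
        let mut set = self.in_progress.lock().unwrap();
        if set.insert(uri.to_string()) {
            Some(DownloadGuard { manager: self, uri: uri.to_string() })
        } else {
            None
        }
    }
}

/// Clears the marker when the download finishes or fails.
pub struct DownloadGuard<'a> {
    manager: &'a SpillManager,
    uri: String,
}

impl Drop for DownloadGuard<'_> {
    fn drop(&mut self) {
        // Ignore a poisoned lock here to avoid panicking inside drop.
        if let Ok(mut set) = self.manager.in_progress.lock() {
            set.remove(&self.uri);
        }
    }
}
```

A caller that gets `None` would need some way to wait for the other download to finish (for example a condition variable or a shared future); that part is left out of the sketch.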
